Machine Learning - Clustering
Why would we want to cluster data? For one, clusters in some feature space may indicate closeness of data on a semantic level. Also, clusters allow us to compress data, since it suffices to transmit only the cluster centers instead of the data belonging to the clusters.
Distance measures
There are several options for measuring the distance between clusters. For clusters \(X\) and \(Y\) you might use one of the following measures (a code sketch follows the list):
- \(D_{min}(X, Y) = \min_{\vec{x}\in X, \vec{y}\in Y} d(\vec{x}, \vec{y})\) this is the minimum distance
- \(D_{max}(X, Y) = \max_{\vec{x}\in X, \vec{y}\in Y} d(\vec{x}, \vec{y})\) this is the maximum distance
- \(D_{mean}(X, Y) = 1/{\vert X \vert \vert Y \vert}\sum_{\vec{x}\in X, \vec{y}\in Y} d(\vec{x}, \vec{y})\) the mean of all pairwise distances
- \(D_{centroid}(X, Y) = d(1/{\vert X \vert} \sum_{\vec{x}\in X} \vec{x},\ 1/{\vert Y \vert} \sum_{\vec{y}\in Y} \vec{y})\) the distance between the cluster centers
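As a small illustration, these measures could be computed as follows (a sketch assuming numpy and Euclidean \(d\); the toy points are made up):

import numpy as np

def d(x, y):
    # Euclidean distance, one possible choice for d
    return np.linalg.norm(x - y)

def D_min(X, Y):
    return min(d(x, y) for x in X for y in Y)

def D_max(X, Y):
    return max(d(x, y) for x in X for y in Y)

def D_mean(X, Y):
    return sum(d(x, y) for x in X for y in Y) / (len(X) * len(Y))

def D_centroid(X, Y):
    return d(np.mean(X, axis=0), np.mean(Y, axis=0))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[3.0, 4.0], [4.0, 4.0]])
print(D_min(X, Y), D_max(X, Y), D_mean(X, Y), D_centroid(X, Y))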
Given a data distribution, it is not clear what clusters an algorithm should find. The definition of clusters depends on scale.
Bias in clustering
All clustering algorithms have some kind of bias:
- a certain cluster model is preferred
- the model comprises scale and shape of clusters
- usually bias is an implicit part of the algorithm
- adjustable parameters are usually processing parameters (of the algorithm), not model parameters
- the connection between the parameters and the implicit cluster model usually needs to be inferred from the way the algorithm works
- hierarchical clustering solves the problem for the scale parameter insofar as all solutions on different scales are presented in an ordered way
Hierarchical clustering
There are two complementary methods:
- Agglomerative clustering:
- start with each data point as a cluster
- merge clusters recursively bottom-up
- Divisive clustering
- start with all data points as a single cluster
- split clusters recursively top-down
The result is a dendrogram representing all data in a hierarchy of clusters.
Agglomerative clustering:
n = number of clusters
C = assign each of the n data elements to its own cluster C_i, i = 1...n
while n > 2:
    find the pair of clusters C_i and C_j, i < j, that optimizes the linkage criterion
    merge C_i <- C_i ∪ C_j
    if j < n:
        C_j = C_n
    n -= 1
Single linkage clustering employs the minimum cluster distance.
Complete linkage clustering employs the maximum cluster distance.
Single linkage clustering tends toward chaining, while complete linkage clustering prefers compact clusters. We can also use average linkage clustering, also known as UPGMA (Unweighted Pair Group Method with Arithmetic mean). Centroid clustering uses the centroid distance; it requires real-valued attributes for the centroid computation, and when two clusters are joined, the resulting centroid is dominated by the cluster with more members.
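A minimal sketch of agglomerative clustering with single linkage, merging until a single cluster remains (assuming numpy; this is the naive variant that recomputes all distances, so it is far less efficient than SLINK):

import numpy as np

def single_linkage(X, Y):
    # D_min: smallest pairwise distance between the two clusters
    return min(np.linalg.norm(x - y) for x in X for y in Y)

def agglomerate(data, linkage=single_linkage):
    clusters = [[x] for x in data]      # start with one cluster per data point
    merges = []                         # record of merges (the dendrogram structure)
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage value
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

data = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]]
for a, b in agglomerate(data):
    print(len(a), "+", len(b))

Swapping single_linkage for a complete- or average-linkage function changes the bias of the result without touching the merge loop.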
Ward’s minimum variance clustering:
Merge the pair of clusters for which the increase in total variance is minimized.
\(E = \sum_i \sum_{\vec{x} \in C_i} \|\vec{x} - \vec{\mu_i}\|^2\)
\(\vec{\mu_i} = 1/{\vert C_i \vert} \sum_{\vec{x} \in C_i}\vec{x}\)
In contrast to the previous approaches, this one is optimization based. It can also be implemented by a distance measure:
\(D_{ward} = D_{centroid}(X, Y) / (1/{\vert X \vert} + 1/{\vert Y \vert})\)
Properties:
- prefers spherical clusters and clusters of similar size
- robust against noise but not against outliers
Properties of hierarchical clustering
- any distance measure can be used
- we need only the distance matrix (not the data)
- no parameters
- efficiency:
- agglomerative: \(O(n^3)\) in naive approach, \(O(n^2)\) SLINK-algorithm
- divisive: \(O(2^n)\) in naive approach, \(O(n^2)\) CLINK-algorithm
- in general, efficiency can be increased by avoiding unnecessary re-computation of distances
- resulting dendrogram offers alternative clusterings
- the dendrogram needs to be analyzed
- cutting off at different levels of the dendrogram may be necessary to get comparable clusters (see the sketch after this list)
- outliers are fully incorporated
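In practice these linkage criteria are available in scipy; the following sketch builds a dendrogram and cuts it at two different levels (the toy data and the choice of Ward linkage are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# method can be 'single', 'complete', 'average', 'centroid' or 'ward',
# matching the linkage criteria discussed above
Z = linkage(data, method='ward')

# cutting the dendrogram at different levels yields alternative clusterings
labels_2 = fcluster(Z, t=2, criterion='maxclust')   # two clusters
labels_4 = fcluster(Z, t=4, criterion='maxclust')   # four clusters
print(np.bincount(labels_2)[1:], np.bincount(labels_4)[1:])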
Optimization based clustering
The idea of optimization-based clustering is to maximize some kind of “goodness function” which assigns a “goodness value” to any partitioning of the data.
Basic maximization algorithm:
C = somehow partition the data into clusters C_1...C_n
while not stop_condition():
    choose an example data point x at random, denote its cluster as C(x)
    select a random target cluster C_i
    ∆E = change of the goodness function = E(x in C_i) - E(x in C(x))
    if ∆E > 0:
        move x from C(x) to C_i
    else:
        move x from C(x) to C_i with probability exp(β∆E)
    increase β
- may get caught in local maxima
- depends on the initial partitioning
- to escape local maxima, downhill steps are accepted with probability exp(β∆E)
- initially small β allows frequent downhill steps
- increasing β makes downhill steps less likely until the process “freezes” (simulated annealing)
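A minimal sketch of this scheme (assumptions: numpy, two clusters, and the negative quadratic error as the goodness function):

import numpy as np

def goodness(data, labels, K):
    # negative quadratic error: higher is better
    total = 0.0
    for k in range(K):
        members = data[labels == k]
        if len(members) > 0:
            total -= np.sum((members - members.mean(axis=0)) ** 2)
    return total

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
K = 2
labels = rng.integers(0, K, len(data))      # random initial partitioning
beta = 0.1

for step in range(5000):
    i = rng.integers(len(data))             # random example data point
    target = rng.integers(K)                # random target cluster
    old = labels[i]
    E_old = goodness(data, labels, K)
    labels[i] = target
    dE = goodness(data, labels, K) - E_old
    # uphill steps are always accepted, downhill steps with probability exp(beta*dE)
    if dE <= 0 and rng.random() >= np.exp(beta * dE):
        labels[i] = old                     # reject the downhill move
    beta *= 1.001                           # annealing: make downhill steps rarer

print(np.bincount(labels))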
Compression by clustering
We are given a data set \(D = \{ \vec{x_1}, \vec{x_2}, ...\}\) of d-dimensional vectors \(\vec{x_i} \in \Re^d\). The number of bits per data point depends on d and the required precision (and the statistics of the distribution). A clustering algorithm will yield a number \(K < |D|\) of cluster centers \(\vec{w_j}\in\Re^d\) (also called nodes, reference vectors or codewords). A data vector \(\vec{x_i}\) can now be approximated by its best-match cluster center \(\vec{w_m}\) where \(m(\vec{x_i}) = argmin_j \|\vec{x_i} - \vec{w_j}\|\). This means that we have to transmit the set \(\{\vec{w_j}\}\) once and after that only the index of the best-match cluster center for each data point.
- small K: high compression ratio, bad approximation
- large K: low compression ratio, good approximation
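A rough back-of-the-envelope sketch of this trade-off (the bit widths, \(K\) and \(\vert D \vert\) below are assumed values):

import math

# assumptions: d = 3 dimensions at 8 bits each, |D| = 1_000_000 points, K = 64 centers
d, bits, n, K = 3, 8, 1_000_000, 64

raw_bits = n * d * bits                            # transmit every vector directly
codebook_bits = K * d * bits                       # transmit the K cluster centers once
index_bits = n * math.ceil(math.log2(K))           # plus one codeword index per point
print(raw_bits / (codebook_bits + index_bits))     # compression ratio, here about 4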
K-means clustering
The term K-means clustering was coined by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1956.
The algorithm works by dividing \(D\) into clusters \(C_1 ... C_K\), which are represented by their K centers of gravity (means) \(\vec{w_1} ... \vec{w_K}\). The algorithm minimizes the quadratic error measure: \(E(D, \{ \vec{w_i}\}) = 1/{\vert D\vert} \sum_{i=1...\vert D\vert}\|\vec{x_i} - \vec{w_{m(\vec{x_i})}}\|^2\).
Iterative K-means clustering:
- start with randomly chosen reference vectors
- assign each data point to its best-match reference vector
- update the reference vectors by shifting them to the mean/center of their cluster
- stop if no cluster center has moved by more than \(\epsilon\), else repeat from the assignment step
D = data set
t = 0
create K reference vectors w_1...w_K, chosen randomly within a suitable bounding box in R^d
while some w_k has moved by more than epsilon:
    C = K empty cluster containers
    for x in D:
        k = index of the reference vector w_k closest to x
        add x to C[k]
    t += 1
    for k in 1...K:
        w_k = mean of all points in C[k]
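A compact runnable version of this loop (a sketch with numpy; initializing the reference vectors on randomly chosen data points instead of a bounding box is an assumption):

import numpy as np

def kmeans(data, K, eps=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # initial reference vectors: K randomly chosen data points (assumption)
    w = data[rng.choice(len(data), K, replace=False)].copy()
    while True:
        # assignment step: index of the best-match reference vector for each point
        m = np.argmin(((data[:, None, :] - w[None, :, :]) ** 2).sum(axis=2), axis=1)
        # update step: shift each reference vector to the mean of its cluster
        w_new = np.array([data[m == k].mean(axis=0) if np.any(m == k) else w[k]
                          for k in range(K)])
        if np.max(np.linalg.norm(w_new - w, axis=1)) <= eps:
            return w_new, m
        w = w_new

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, labels = kmeans(data, K=2)
print(centers)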
The number of clusters K implicitly defines the scale and the resulting shape of the clusters. K-means optimizes greedily and can therefore end up in a local optimum. Which optimum we end up in depends on the initial conditions (the initial cluster centers).
K-means clustering for color compression
In a digital image, colors can be represented as RGB triples encoded with 3x8 bits if not compressed. This gives \(2^{3 \cdot 8} \approx 16.7\) million different colors. Reducing this number to K prototypic colors and replacing the original colors with these prototypical colors lets us drastically cut down the number of bits to transmit. A data vector \(\vec{x_i} \in \Re^3\) is the color triple of pixel number i. The cluster centers \(\vec{w_i} \in \Re^3\) are the prototypic colors.
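A sketch of color compression with scikit-learn's KMeans (a random image stands in for real image data here, which would normally be loaded with an imaging library):

import numpy as np
from sklearn.cluster import KMeans

# img: uint8 array of shape (height, width, 3); a random stand-in for a real image
img = np.random.default_rng(0).integers(0, 256, (64, 64, 3)).astype(np.uint8)

K = 16                                            # number of prototypic colors
pixels = img.reshape(-1, 3).astype(float)         # one RGB triple per pixel
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(pixels)

# replace every pixel by its best-match prototypic color
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
print(compressed.shape, "distinct colors:", K)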
Soft clustering
So far we have described clusters as sets of data points or by their centers, which means that they were disjoint. This is called hard clustering because each data point is assigned to exactly one cluster. There is no way to express uncertainty about the assignment to a cluster.
In soft clustering we assign a data point to clusters by probabilities. This allows us to express uncertainty about the assignment, or a gradual assignment. Clusters do not have hard boundaries. We will assign a Gaussian to each cluster center.
The probability density of the data distribution \(D = \{\vec{x_1}, \vec{x_2}, ...\}, \vec{x_i} \in \Re^d\) is a linear superposition of K Gaussians: \(P(\vec{x}) = \sum_{k=1...K}g_k N(\vec{x}, \vec{\mu_k}, C_k)\) where \(N(.,.,.)\) is a Gaussian with mean \(\vec{\mu}\) and covariance matrix \(C\). The “amplitude” assigned to the Gaussian centred at \(\vec{\mu_k}\) is \(g_k\), which is the a priori probability that a data point belongs to cluster \(k\).
So \(0 \leq g_k \leq 1\) and \(\sum_{k=1...K}g_k = 1\) must hold.
If we want to generate a data point according to \(P(\vec{x}) = \sum_{k=1...K}g_k N(\vec{x}, \vec{\mu_k}, C_k)\) then we could either:
- regard \(P(\vec{x})\) as a whole
- first select one of the Gaussians with probability \(g_k\), then generate a random \(\vec{x}\) with probability \(N(\vec{x}, \vec{\mu_k}, C_k)\).
Let's look at the latter option. The prior (a priori probability) that an example drawn at random from \(D\) belongs to cluster \(k\) is \(g_k\). The a posteriori probability that a given data point \(\vec{x}\) belongs to cluster \(k\) is \(P^{*}_{k}(\vec{x}) = g_k N(\vec{x}, \vec{\mu_k}, C_k) / \sum_{l=1...K}g_l N(\vec{x}, \vec{\mu_l}, C_l)\). To find the best mixture of \(K\) Gaussians to fit a given data set D, the parameters \(\{ g_k, \vec{\mu_k}, C_k \}\) must be found by the EM-algorithm. The derivation of the procedure is left out here because the constraint \(\sum_{k=1...K}g_k = 1\) requires the Lagrange multiplier method.
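A small numpy/scipy sketch of this generation scheme and of the posterior \(P^{*}_{k}(\vec{x})\) (the mixture parameters below are made up):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
g  = np.array([0.3, 0.7])                          # priors g_k
mu = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]  # means mu_k
C  = [np.eye(2), 2.0 * np.eye(2)]                  # covariance matrices C_k

# generate a data point: first pick Gaussian k with probability g_k, then sample from it
k = rng.choice(len(g), p=g)
x = rng.multivariate_normal(mu[k], C[k])

# a posteriori probability that x belongs to each cluster
num = np.array([g[j] * multivariate_normal.pdf(x, mu[j], C[j]) for j in range(len(g))])
print(k, num / num.sum())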
EM for a mixture of Gaussians
K = number of Gaussians
t = 0  # step counter
choose initial values {g_k(0), mu_k(0), C_k(0)}
while not stop_condition():
    # Expectation step (E-step): responsibility of Gaussian k for each data point x_i
    P_k(t+1, x_i) = g_k(t) * N(x_i, mu_k(t), C_k(t)) / sum(g_l(t) * N(x_i, mu_l(t), C_l(t)) for l in range(K))
    # Maximization step (M-step): re-estimate the parameters of each Gaussian k
    N_k = sum(P_k(t+1, x_i) for i in range(len(D)))
    g_k(t+1) = N_k / len(D)
    mu_k(t+1) = 1/N_k * sum(P_k(t+1, x_i) * x_i for i in range(len(D)))
    C_k(t+1) = 1/N_k * sum(P_k(t+1, x_i) * (x_i - mu_k(t+1)) * (x_i - mu_k(t+1))^T for i in range(len(D)))
    t += 1
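A runnable sketch of these two steps with numpy/scipy (initializing the means on randomly chosen data points and adding a small regularization term to the covariances, so that no Gaussian collapses, are assumptions):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(data, K, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = data.shape
    g = np.full(K, 1.0 / K)                               # priors g_k
    mu = data[rng.choice(n, K, replace=False)].copy()     # initial means mu_k
    C = [np.cov(data.T) + 1e-6 * np.eye(d) for _ in range(K)]
    for _ in range(steps):
        # E-step: responsibilities P_k(x_i), shape (n, K)
        P = np.array([g[k] * multivariate_normal.pdf(data, mu[k], C[k])
                      for k in range(K)]).T
        P /= P.sum(axis=1, keepdims=True)
        # M-step: re-estimate g_k, mu_k, C_k from the responsibilities
        N = P.sum(axis=0)                                 # effective cluster sizes N_k
        g = N / n
        mu = (P.T @ data) / N[:, None]
        for k in range(K):
            diff = data - mu[k]
            C[k] = (P[:, k, None] * diff).T @ diff / N[k] + 1e-6 * np.eye(d)
    return g, mu, C

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
g, mu, C = em_gmm(data, K=2)
print(g, mu)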
The EM-algorithm yields only a local optimum. It is computationally much more expensive than K-means. Also, measures have to be taken so that a Gaussian does not collapse onto a single data point. K-means can provide a useful initialization for the \(\vec{\mu_k}\), and local PCA for the \(C_k\).
It is difficult to say whether an achieved clustering is good. We may test on different data subsets to check whether a clustering is robust. We might also check the distribution of averaged distances in k-neighbor clusters, or compare intra-cluster distances (distances of data points to their cluster center) and inter-cluster distances (distances between cluster centers).
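A small sketch of the last check, comparing intra-cluster and inter-cluster distances (assuming numpy; the centers and labels come from a made-up two-cluster example):

import numpy as np

def intra_inter(data, centers, labels):
    # intra-cluster: average distance of data points to their own cluster center
    intra = np.mean(np.linalg.norm(data - centers[labels], axis=1))
    # inter-cluster: average distance between distinct cluster centers
    K = len(centers)
    inter = np.mean([np.linalg.norm(centers[i] - centers[j])
                     for i in range(K) for j in range(i + 1, K)])
    return intra, inter

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers = np.array([data[:50].mean(axis=0), data[50:].mean(axis=0)])
labels = np.array([0] * 50 + [1] * 50)
print(intra_inter(data, centers, labels))   # a good clustering has intra << inter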
Conceptual clustering: Cobweb
Supervised classification:
- pre-defined classes
- example set of pairs (object, class)
Unsupervised classification:
- classes are not fixed a priori
- classes (also: categories, concepts) are formed from the examples
- examples are sorted into the formed categories
- the bias of the system lies in the preferences of category formation
Conceptual clustering is a paradigm of unsupervised classification. Its distinguishing property is that it generates a concept description for each generated class.
Cobweb (Fisher 1987) is the best-known algorithm for conceptual clustering. It is partly motivated by some drawbacks of ID3:
- continuous attributes require thresholding
- no flexibility in case of errors
- disjoint learning phase (building the tree) and application phase (classifying data) are unnatural
- each learning step divides data only along one dimension of the attribute space
- defines categories by propositional logic
Ideas that make up COBWEB:
- unsupervised learning
- incremental learning, no separation of training and test phase
- probabilistic representation: gradual assignment of objects to categories
- no a priori fixed number of categories
One important aspect of COBWEB is the global utility function, which determines the number of categories, the number of hierarchy levels and the assignment of objects to categories. The global utility function for categories \(C_1...C_N\) and attributes \(A_i\) with values \(v_{ij}\) is \(S = 1/N \sum_{n=1...N}\sum_{i,j}P(A_i = v_{ij}) \cdot P(A_i = v_{ij} \vert C_n) \cdot P(C_n \vert A_i = v_{ij})\).
Interpretation:
- \(1/N\): Prefers fewer categories
- \(P(A_i = v_{ij} \vert C_n)\): Predictability, the probability that an object of category \(C_n\) has value \(v_{ij}\) for attribute \(A_i\), i.e. the average number of correctly predicted values \(v_{ij}\) for attribute \(A_i\) if you know its category \(C_n\) (intra-category similarity)
- \(P(C_n \vert A_i = v_{ij})\): Predictiveness, the probability that an object with value \(v_{ij}\) for attribute \(A_i\) belongs to category \(C_n\) (inter-category dissimilarity)
- \(P(A_i = v_{ij})\): stronger weighting of frequent attribute values
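To make the utility function concrete, here is a small sketch that estimates the probabilities by relative frequencies and computes \(S\) for a given partition (the toy objects are made up):

from collections import Counter

def utility(partition):
    # partition: list of categories, each a list of objects (dicts attribute -> value)
    objects = [o for cat in partition for o in cat]
    N, M = len(partition), len(objects)
    total = Counter((a, v) for o in objects for a, v in o.items())
    S = 0.0
    for cat in partition:
        in_cat = Counter((a, v) for o in cat for a, v in o.items())
        for (a, v), n_av in in_cat.items():
            p_av     = total[(a, v)] / M        # P(A_i = v_ij)
            p_av_cat = n_av / len(cat)          # P(A_i = v_ij | C_n), predictability
            p_cat_av = n_av / total[(a, v)]     # P(C_n | A_i = v_ij), predictiveness
            S += p_av * p_av_cat * p_cat_av
    return S / N

birds  = [{"color": "brown", "flies": "yes"}, {"color": "white", "flies": "yes"}]
fishes = [{"color": "grey", "flies": "no"}, {"color": "grey", "flies": "no"}]
print(utility([birds, fishes]))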
A tree is learned by:
- creating a new terminal node
- merging two nodes
- splitting a node
when presented with a new example such that \(S\) is maximized!