This is work in progress... (c) 2021 Z. Gajarska and H. Lohninger



k-Means Clustering

One of the oldest and most robust cluster analysis methods is k-means, which finds a predetermined number of clusters iteratively, starting from random cluster centers. k-means is one of the simpler exchange methods. It is also suitable for larger datasets, since it is relatively fast to compute as long as the number of cluster centers is small.

From a mathematical point of view, k-means clustering corresponds to an optimization which minimizes the objective function

J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2

where \| x_i^{(j)} - c_j \|^2 is the squared distance between data point i of cluster j and the cluster center c_j, n_j is the number of data points assigned to cluster j, and k is the number of expected clusters.
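Evaluating this objective for a given assignment can be sketched in a few lines of Python. This is a minimal illustration for two-dimensional data; the function name `kmeans_objective` and the example numbers are hypothetical, not part of ImageLab:

```python
def kmeans_objective(points, centers, labels):
    """Sum of squared distances of each data point to its assigned cluster center."""
    total = 0.0
    for (x, y), j in zip(points, labels):
        cx, cy = centers[j]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Hypothetical example: three points, two clusters
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centers = [(0.5, 0.0), (10.0, 10.0)]
labels = [0, 0, 1]
print(kmeans_objective(points, centers, labels))  # 0.5
```

The k-means iteration never increases this value, so it can also be used to compare runs started from different initial centers.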

Algorithm
1. Initialization The k predetermined cluster centers are set to k randomly selected data points. Alternatively, the first k data points can be used. It is important that all k cluster centers occupy different positions in the p-dimensional space. The best results are achieved when the initial cluster centers are positioned so that the distances between them are maximal. A unique class number (1 to k) is assigned to each cluster center.
2. Classification Find the closest cluster center for each data point and assign the class number of this cluster center to the data point.
3. Recalculate cluster centers Recalculate the position of each cluster center by averaging all data points assigned to its class.
4. Iteration Repeat from step 2 until the classification is stable.
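The four steps above can be sketched in Python as follows. This is a minimal illustration assuming two-dimensional data points given as tuples; the name `kmeans` and its parameters are hypothetical, not ImageLab's implementation:

```python
import math
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # Step 1: initialization - centers at k randomly selected data points
    # (assumes the points are distinct, so the centers differ in position)
    centers = rng.sample(points, k)
    labels = None
    while True:
        # Step 2: classification - assign each point to its nearest center
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        # Step 4: iteration - stop once the classification is stable
        if new_labels == labels:
            return centers, labels
        labels = new_labels
        # Step 3: recalculate each center as the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty class keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
```

For well-separated clusters the loop typically stabilizes after a few iterations; the `seed` parameter makes it easy to repeat the run with different starting positions.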

The following figure shows the displacement of the cluster centers for a simple two-dimensional example:
Training of a simple dataset with three clusters. 1: unclassified dataset; 2: randomly selected cluster centers; 3: assignment of the classes; 4: recalculation of the centers; 5: assignment of the classes; 6: recalculation of the centers; 7: assignment of the classes; 8: recalculation of the centers (from here on, the assignment is stable).

Disadvantages of the procedure

In addition to the advantage of being quick and easy to implement, the k-means method also has a few disadvantages that should not be overlooked:

  • The result of clustering depends on the starting positions. You should therefore always make a few attempts with different starting positions.
  • Empty classes can occur, i.e. no data points may be assigned to a particular cluster center.
  • Choosing the wrong k (number of expected clusters) leads to wrong class assignments.
  • The process tends to produce clusters of similar size. In the case of unequal cluster sizes, this inevitably results in incorrect assignments if the distances between the clusters are not significantly greater than the distances within the clusters.
  • The classification is not scaling-invariant. As a remedy, standardized data can be used, provided the data and the question allow it.
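The standardization remedy mentioned in the last point can be sketched as a z-score transformation. This is a minimal illustration; `standardize` is a hypothetical helper and assumes that no variable is constant (otherwise the division by the standard deviation would fail):

```python
import statistics

def standardize(points):
    """Scale each variable to zero mean and unit standard deviation (z-scores)."""
    cols = list(zip(*points))                     # transpose to per-variable columns
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]    # population standard deviation
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds))
            for p in points]

# Hypothetical data where the second variable would otherwise dominate the distances
data = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
print(standardize(data))
```

After this transformation both variables contribute equally to the Euclidean distances, so the cluster assignment no longer depends on the original measurement scales.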