k-Means Algorithm (Lloyd’s Algorithm)

Most popular algorithm for clustering / unsupervised learning

k-Means Clustering Problem
Assumption: We can express any data point as a list (vector) of continuous values
Dissimilarity measure: squared Euclidean distance
Data point: finite number of features
k-means: expect k number of clusters
Global dissimilarity (k-means objective function): sum of dissimilarity for each cluster, for each data point in the kth cluster, for each feature

k-Means Algorithm
1st iteration: assign each data point to the cluster with the closest centre
2nd iteration: recalculate cluster centres by computing the mean

1. Visualisation
2. Silhouette coefficient
3. Split data set into 2 data sets

More Effective k-Means
1. Triangle inequality (ignore cluster centres that are relatively far from a given data point)
2. Local optimum versus global optimum (run k-means for different random initialisations)
3. k-means++