Beyond k-Means Algorithm

Clustering: grouping data according to similarity

Hard clustering (each data point belongs to exactly one cluster) versus soft clustering (each data point has a degree of membership in every cluster)
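A minimal sketch of the difference, using only the standard library. The centres, the point, and the `stiffness` parameter are made up for illustration, and the exponential weighting is just one possible way to turn distances into soft memberships, not the only one.

```python
import math

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hard_assign(point, centres):
    """Hard clustering: return the index of the single nearest centre."""
    dists = [squared_distance(point, c) for c in centres]
    return dists.index(min(dists))

def soft_assign(point, centres, stiffness=1.0):
    """Soft clustering: return one membership weight per centre, summing to 1.

    Assumption: weights are proportional to exp(-stiffness * squared distance).
    """
    dists = [squared_distance(point, c) for c in centres]
    nearest = min(dists)  # subtracted for numerical stability
    weights = [math.exp(-stiffness * (d - nearest)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

centres = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
point = (1.5, 0.0)                  # hypothetical data point

hard = hard_assign(point, centres)  # a single cluster index
soft = soft_assign(point, centres)  # a weight for every cluster
```

The hard assignment commits the point to its nearest centre, while the soft assignment keeps a small non-zero weight on the farther centre.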

Squared Euclidean distance (larger clusters contribute more to the k-means objective and smaller clusters contribute less) versus Gaussian mixture models (each cluster can have a different covariance) versus k-medoids (each cluster centre is a medoid, i.e. an actual data point, instead of the mean) versus radial similarity
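A small sketch contrasting a mean with a medoid for a single cluster. The four points are invented for illustration, and the medoid here is taken to be the data point with the smallest total Euclidean distance to the others, which is one common definition.

```python
import math

def mean_point(points):
    """Coordinate-wise mean; usually not one of the data points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid_point(points):
    """The data point with the smallest total distance to all other points."""
    def total_distance(candidate):
        return sum(math.dist(candidate, other) for other in points)
    return min(points, key=total_distance)

# Three nearby points plus one outlier (hypothetical data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]

centre_mean = mean_point(points)      # pulled towards the outlier
centre_medoid = medoid_point(points)  # stays on an actual data point
```

The outlier drags the mean away from the bulk of the data, whereas the medoid remains one of the three nearby points.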

1. Is your data featurised?
2. Is each feature a continuous number?
3. Are these numbers commensurate?
 -if not, standardise or normalise
4. Are there too many features?
 -Principal Component Analysis (PCA) can be used as a preprocessing step before k-means to reduce the number of features
5. Are there any domain-specific reasons to change the features?
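As an illustration of question 3, the sketch below standardises each feature to zero mean and unit standard deviation so that features on different scales become commensurate. The rows (height in cm, income) are invented, and z-scoring is only one of several possible rescalings.

```python
import statistics

def standardise(rows):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std dev
    return [
        # A constant feature (std 0) carries no information; map it to 0.0.
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical data: (height in cm, annual income).
rows = [(150.0, 20000.0), (160.0, 35000.0), (170.0, 50000.0), (180.0, 95000.0)]
scaled = standardise(rows)
```

Without this step the income column would dominate any squared Euclidean distance simply because its numbers are larger.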

Big Data
Non-parametric Bayesian methods (allow clusters to grow as data grow)
 -non-parametric: the number of parameters is not fixed in advance and can grow with the data (potentially infinitely many)
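One way to see clusters growing with the data is the Chinese restaurant process, a standard construction behind Dirichlet process mixtures. It is chosen here purely as an illustration and is not named in these notes; the function name, `alpha`, and `seed` are assumptions of this sketch.

```python
import random

def chinese_restaurant_process(n_points, alpha, seed=0):
    """Assign points to clusters one at a time.

    Each new point joins an existing cluster with probability proportional
    to that cluster's size, or opens a new cluster with probability
    proportional to alpha (alpha must be positive).
    Returns the final cluster sizes and the cluster count after each point.
    """
    rng = random.Random(seed)
    cluster_sizes = []
    clusters_so_far = []
    for _ in range(n_points):
        # The last weight is the option "open a new cluster".
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[choice] += 1
        clusters_so_far.append(len(cluster_sizes))
    return cluster_sizes, clusters_so_far

sizes, counts = chinese_restaurant_process(n_points=1000, alpha=2.0)
```

The number of clusters is never set in advance: `counts` can only stay the same or increase as more points arrive.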