Beyond the k-Means Algorithm
Clustering: grouping data according to similarity
Grouping
Hard clustering (each data point belongs to exactly one cluster) versus soft clustering (each data point has a degree of membership in every cluster)
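A minimal sketch of the two kinds of assignment, assuming the cluster centres are already known. The function names, the example centres and the `temperature` parameter are illustrative assumptions, not part of these notes; the soft memberships here are a softmax of negative squared distances, which is one common choice among several.

```python
import math

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hard_assign(point, centres):
    """Hard clustering: index of the single nearest centre."""
    distances = [squared_distance(point, c) for c in centres]
    return distances.index(min(distances))

def soft_assign(point, centres, temperature=1.0):
    """Soft clustering: one membership weight per centre, summing to 1.

    `temperature` is a hypothetical knob: larger values give flatter memberships.
    """
    scores = [-squared_distance(point, c) / temperature for c in centres]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

centres = [(0.0, 0.0), (4.0, 0.0)]  # assumed, e.g. taken from a previous k-means run
point = (1.5, 0.0)

hard = hard_assign(point, centres)
soft = soft_assign(point, centres, temperature=4.0)
print(hard, soft)
```

The hard assignment returns one index, while the soft assignment keeps some weight on the farther cluster as well.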
Similarity
Squared Euclidean distance (as in k-means: larger clusters contribute higher values to the k-means objective and smaller clusters contribute lower values) versus Gaussian mixture models (each cluster can have a different covariance) versus k-medoids (each centre is a medoid, i.e. an actual data point, instead of the mean) versus radial similarity
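A small sketch of how a mean centre and a medoid centre can differ on the same cluster. The data and the function names are made up for illustration, and squared Euclidean distance is assumed as the dissimilarity, although k-medoids works with any dissimilarity.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean: the centre used by k-means."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points, dissimilarity=squared_distance):
    """The member of `points` with the smallest total dissimilarity to all others."""
    return min(points, key=lambda c: sum(dissimilarity(c, p) for p in points))

# Hypothetical cluster: four nearby points and one outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]

centre_mean = mean_point(cluster)
centre_medoid = medoid(cluster)
print(centre_mean, centre_medoid)
```

The mean is pulled towards the outlier and is not itself a data point, whereas the medoid is always one of the observed points.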
Data
1. Is your data featurised?
2. Is each feature a continuous number?
3. Are these numbers commensurate?
-standardise or normalise
4. Are there too many features?
-Principal Component Analysis (PCA) can be used as a preprocessing step for k-means to reduce the number of features
5. Are there any domain-specific reasons to change the features?
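Questions 3 and 4 of the checklist can be sketched in a few lines, assuming NumPy is available. The three features, their scales and the choice of two components are invented for illustration; PCA is computed here through an SVD of the standardised data, which is one standard way to do it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points, 3 features on very different scales.
height_cm = rng.normal(170.0, 10.0, size=200)
income = rng.normal(30000.0, 8000.0, size=200)
age = rng.normal(40.0, 12.0, size=200)
X = np.column_stack([height_cm, income, age])

# Standardise: zero mean and unit variance per feature, so that no single
# feature dominates the squared Euclidean distance.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

# PCA via SVD of the (already centred) standardised data.
# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
n_components = 2  # assumed target dimension
X_reduced = Z @ Vt[:n_components].T

# Fraction of total variance captured by each principal component.
explained = S ** 2 / np.sum(S ** 2)
print(X_reduced.shape, explained)
```

`X_reduced` would then be passed to k-means in place of `X`.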
Big Data
Non-parametric Bayesian methods (allow clusters to grow as data grow)
-non-parametric: the number of parameters is not fixed in advance and can grow with the data (potentially infinitely many parameters)
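One way to see clusters growing with the data is to simulate the Chinese restaurant process, a prior over partitions commonly used in non-parametric Bayesian clustering. These notes do not name a specific method, so this choice, the function name and the concentration parameter `alpha` are assumptions made for illustration. The sketch only samples cluster assignments from the prior; it does not fit a model to data.

```python
import random

def chinese_restaurant_process(n_points, alpha, seed=0):
    """Sample cluster sizes for n_points from a Chinese restaurant process.

    Point i (0-based) starts a new cluster with probability alpha / (i + alpha)
    and joins existing cluster k with probability size_k / (i + alpha).
    Returns the final cluster sizes and the cluster count after each point.
    """
    rng = random.Random(seed)
    cluster_sizes = []
    n_clusters_so_far = []
    for i in range(n_points):
        r = rng.uniform(0.0, i + alpha)
        if r < alpha or not cluster_sizes:
            cluster_sizes.append(1)  # open a new cluster
        else:
            r -= alpha
            for k, size in enumerate(cluster_sizes):
                if r < size:
                    cluster_sizes[k] += 1  # join existing cluster k
                    break
                r -= size
            else:
                cluster_sizes[-1] += 1  # guard against floating-point edge cases
        n_clusters_so_far.append(len(cluster_sizes))
    return cluster_sizes, n_clusters_so_far

sizes, history = chinese_restaurant_process(n_points=1000, alpha=1.0)
print(len(sizes), history[9], history[99], history[999])
```

The number of clusters is never fixed beforehand: it can only stay the same or increase as more points arrive, and a larger `alpha` tends to produce more clusters.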