k-Means Clustering Application: Understanding The Genetic Code

By Alexander Gorban and Andrei Zinovyev

Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?

1. Gather some real DNA, written in the letters A, C, G, T (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break the full sequence into non-overlapping strings of 300 letters each
3. Treat each string as one data point (1017 data points in total)
4. Let m = length of word
5. Divide each string into substrings of length m
6. Count how many times each possible substring occurs
 For m=2: 4^2=16 possible words: AA, AC, AG,…, TT
 For m=3: 4^3=64 possible words
 For m=4: 4^4=256 possible words
7. Run PCA (principal component analysis) on the word counts
8. Pick out the top 2 principal components
9. Plot each data point on the top 2 principal components
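Steps 2 to 9 can be sketched in code. The following is a minimal, standard-library-only illustration, not the authors' implementation. The function names (`fragments`, `word_counts`, `top_components`) are hypothetical. It assumes words within a fragment do not overlap, and it computes the principal components by power iteration with deflation, chosen here only to avoid external dependencies. It also assumes the data vary in at least two independent directions.

```python
import itertools
import math
import random

ALPHABET = "ACGT"

def fragments(sequence, size=300):
    """Split a sequence into non-overlapping fragments, dropping a short tail."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]

def word_counts(fragment, m):
    """Count non-overlapping words of length m; returns a vector of 4**m counts."""
    words = ["".join(p) for p in itertools.product(ALPHABET, repeat=m)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(0, len(fragment) - m + 1, m):
        word = fragment[i:i + m]
        if word in index:  # ignore words containing letters outside ACGT
            counts[index[word]] += 1
    return counts

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_components(rows, k=2, iters=200, seed=0):
    """Project rows onto their top-k principal components.

    Assumption of this sketch: power iteration with deflation on the
    centred data, instead of a library eigen-solver.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    X = [[x - mu for x, mu in zip(row, means)] for row in rows]
    cols = list(zip(*X))
    rng = random.Random(seed)
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            scores = [_dot(row, v) for row in X]
            w = [_dot(scores, col) for col in cols]  # w = X^T (X v)
            for c in comps:  # remove directions already found
                proj = _dot(w, c)
                w = [a - proj * b for a, b in zip(w, c)]
            norm = math.sqrt(_dot(w, w))
            if norm < 1e-12:
                break
            v = [a / norm for a in w]
        comps.append(v)
    # One (PC1, PC2, ...) coordinate pair per data point, ready to plot
    return [[_dot(row, c) for c in comps] for row in X]
```

With a real sequence loaded into a string `dna`, `top_components([word_counts(f, 3) for f in fragments(dna)])` would give the two plot coordinates for each 300-letter fragment with m=3. The sign of each component is arbitrary, so a plot may appear mirrored between runs with different seeds.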

Observation: m=3 shows clear structure and symmetry

10. Run k-means (after normalisation)
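The clustering step could look roughly like the sketch below. It assumes "normalisation" means scaling each word-count column to zero mean and unit variance, and it uses plain Lloyd's iterations with randomly chosen starting centroids. The function names and the parameters `k`, `iters`, and `seed` are hypothetical. The notes do not state how many clusters were used, so `k` is left to the caller.

```python
import math
import random

def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - mu) ** 2 for x in c) / n) or 1.0
            for c, mu in zip(cols, means)]
    return [[(x - mu) / s for x, mu, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```

A call such as `kmeans(standardize(count_vectors), k)` would then label each 300-letter fragment with a cluster, where `count_vectors` stands for the per-fragment word counts from step 6. Because the starting centroids are random, different seeds can give different clusterings, so it is worth trying several.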

1. DNA is composed of words of 3 letters each, as well as non-coding regions
2. The 3 letter words are known as codons
3. They encode amino acids