k-Means Clustering Application: Understanding The Genetic Code

By Alexander Gorban and Andrei Zinovyev

Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?

Steps:
1. Take a real DNA text over the alphabet A, C, G, T (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break the full sequence into non-overlapping strings of 300 letters each
3. Each string = one data point (therefore, 1017 data points in total)
4. Let $m = \text{length of word}$
5. Divide each 300-letter string into substrings (words) of length $m$
6. Count how many times each word occurs in the string
For $m=2$: $4^2=16$ possible words: AA, AC, AG,…, TT
For $m=3$: $4^3=64$ possible words
For $m=4$: $4^4=256$ possible words
7. Run PCA (principal component analysis)
8. Pick out the top 2 principal components
9. Plot each data point in the plane of these 2 principal components
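
Steps 2 to 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `fragments` and `word_counts` are hypothetical, words are read without overlap, and the toy sequence is invented (it is not the bacterial DNA used in the study).

```python
from itertools import product

def fragments(sequence, size=300):
    """Split a sequence into non-overlapping fragments, dropping a short tail."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]

def word_counts(fragment, m):
    """Count non-overlapping words of length m; returns a vector of 4**m counts."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=m)]
    counts = dict.fromkeys(vocabulary, 0)
    for i in range(0, len(fragment) - m + 1, m):
        word = fragment[i:i + m]
        if word in counts:  # ignore words containing symbols other than A, C, G, T
            counts[word] += 1
    return [counts[w] for w in vocabulary]

# Toy sequence, invented for illustration only.
toy = "ACGTAC" * 100  # 600 letters, so two fragments of 300 letters
rows = [word_counts(f, 3) for f in fragments(toy)]
print(len(rows), len(rows[0]))  # number of data points, number of features
```

Each row is one data point with $4^m$ coordinates, which is the table that PCA is then run on.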

Observation: $m=3$ shows clear structure and symmetry
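
The projection behind this observation (steps 7 to 9) can be sketched as follows. To stay within the standard library, this sketch finds the two leading eigenvectors of the covariance matrix by power iteration with deflation, which is a simple stand-in for a library PCA routine and not necessarily what the authors used. The function names and the toy data are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract the column means so that every feature has zero mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - mu for x, mu in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (features x features) of centered data."""
    cols = list(zip(*centered))
    n = len(centered)
    return [[dot(ci, cj) / (n - 1) for cj in cols] for ci in cols]

def top_component(matrix, iterations=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    v = [1.0 / (i + 1) for i in range(len(matrix))]  # fixed start vector
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return v, eigenvalue

def pca_2d(rows):
    """Coordinates of every row on the top 2 principal components."""
    centered = center(rows)
    cov = covariance(centered)
    v1, l1 = top_component(cov)
    d = len(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    v2, _ = top_component(deflated)
    return [(dot(row, v1), dot(row, v2)) for row in centered]

# Toy data, invented for illustration: most variance lies along the first axis.
toy_rows = [[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0]]
for x, y in pca_2d(toy_rows):
    print(round(x, 3), round(y, 3))
```

Plotting the returned pairs as a scatter plot, one plot per value of $m$, gives the pictures being compared in the observation.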

10. Run k-means (after normalisation)
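
A minimal sketch of this step, assuming Lloyd's algorithm for k-means and a z-score scaling of each column as the normalisation (the notes do not say which normalisation was used, so this choice is an assumption). The function names, the value of `k`, and the toy points are illustrative only.

```python
import random

def standardize(rows):
    """Scale each column to zero mean and unit variance (assumed normalisation)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [(sum((x - mu) ** 2 for x in c) / n) ** 0.5 or 1.0
            for c, mu in zip(cols, means)]
    return [[(x - mu) / s for x, mu, s in zip(row, means, stds)] for row in rows]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy points, invented for illustration: two well-separated groups.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(standardize(toy_points), k=2)
print(labels)
```

On the real data the input rows would be the word-frequency vectors for the chosen $m$, and the cluster labels can be used to colour the points in the PCA plot.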

Conclusion:
1. DNA is composed of 3-letter words, along with non-coding regions
2. The 3-letter words are known as codons
3. They encode amino acids