k-Means Clustering Application: Understanding The Genetic Code
By Alexander Gorban and Andrei Zinovyev
Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?
Steps:
1. Take a real DNA sequence over the letters A, C, G, T (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break the full sequence into non-overlapping strings of 300 letters each
3. Each string = one data set (giving 1017 data points in total)
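4. Let m be the candidate word length (for example m = 2, 3, 4)
5. Divide the DNA string into substrings of length m
6. Count how many times each substring occurs
For m = 2: 4^2 = 16 possible words: AA, AC, AG, …, TT
For m = 3: 4^3 = 64 possible words
For m = 4: 4^4 = 256 possible words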
7. Run PCA (principal component analysis)
8. Pick out the top 2 principal components
9. Plot each data set as a point, using its top 2 principal components as coordinates
Observation: the plot for word length 3 shows clear structure and symmetry
10. Run k-means (after normalisation)
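The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' code. Assumptions: the function names (word_counts, standardise, top_two_components, k_means) are hypothetical; words are counted without overlap; "normalisation" is taken to mean centring each word count and scaling it to unit variance; k-means is run on the full normalised vectors; and k = 4 is only a placeholder, since the notes do not state the number of clusters. The input is synthetic random DNA, so no structure should be expected in the result. Plotting is left out.

```python
import random
from itertools import product

import numpy as np


def word_counts(fragment, m):
    """Count non-overlapping words of length m in one fragment (steps 5-6)."""
    words = ["".join(p) for p in product("ACGT", repeat=m)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for start in range(0, len(fragment) - m + 1, m):
        word = fragment[start:start + m]
        if word in index:  # skip words with letters other than A, C, G, T
            counts[index[word]] += 1
    return counts


def standardise(X):
    """Assumed normalisation: zero mean and unit variance per column."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (X - X.mean(axis=0)) / std


def top_two_components(Z):
    """Project centred data onto its first two principal axes (steps 7-8)."""
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[:2].T


def k_means(points, k, n_iter=50, seed=0):
    """Plain Lloyd iteration; returns the labels of the last assignment step."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return labels, centres


# Synthetic stand-in for the real sequence: 40 fragments of 300 random letters.
random.seed(0)
sequence = "".join(random.choices("ACGT", k=300 * 40))
fragments = [sequence[i:i + 300] for i in range(0, len(sequence), 300)]

X = np.array([word_counts(f, 3) for f in fragments])  # 40 x 64 count matrix
Z = standardise(X)
projection = top_two_components(Z)   # 40 x 2 coordinates for the scatter plot
labels, centres = k_means(Z, k=4)    # k = 4 is a placeholder value
```

With a real sequence, the same code would be run once per word length m, and the scatter plots of projection compared across m.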
Conclusion:
1. DNA is composed of words of 3 letters each, together with non-coding regions
2. The 3-letter words are known as codons
3. They encode amino acids
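As a small illustration of conclusions 2 and 3, the sketch below reads a string as 3-letter codons and looks each one up in a table. Assumptions: the function names codons and translate are hypothetical, the example sequence is made up, and CODON_TABLE lists only six of the 64 codons of the standard genetic code, so anything else is shown as "?".

```python
# Partial codon table (standard genetic code); only six of 64 codons are listed.
CODON_TABLE = {
    "ATG": "Met",
    "TTT": "Phe",
    "GGC": "Gly",
    "GCC": "Ala",
    "AAA": "Lys",
    "TAA": "Stop",
}


def codons(seq, frame=0):
    """Split seq into complete 3-letter words, starting at offset frame."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def translate(seq, frame=0):
    """Look up each codon; codons missing from the partial table become '?'."""
    return [CODON_TABLE.get(c, "?") for c in codons(seq, frame)]


example = "ATGTTTGGCTAA"  # made-up sequence for illustration
print(codons(example))            # ['ATG', 'TTT', 'GGC', 'TAA']
print(translate(example))         # ['Met', 'Phe', 'Gly', 'Stop']
print(codons(example, frame=1))   # ['TGT', 'TTG', 'GCT']
```

Shifting the starting offset by one letter gives a completely different list of words, so the same string can be read in three different ways depending on where counting starts.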