k-Means Clustering Application: Understanding The Genetic Code

By Alexander Gorban and Andrei Zinovyev

Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?

Steps:
1. Take a real DNA sequence over the alphabet A, C, G, T (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break the full sequence into non-overlapping strings of 300 letters each
3. Each string = one data point (giving 1017 data points)
4. Let m = length of word
5. Divide each 300-letter string into substrings (words) of length m
6. Count how many times each possible word occurs in each string; the counts form that string's feature vector
 For m=2: 4^2=16 possible words: AA, AC, AG,…, TT
 For m=3: 4^3=64 possible words
 For m=4: 4^4=256 possible words
7. Run PCA (principal component analysis) on the word-count vectors
8. Pick out the top 2 principal components
9. Plot each 300-letter string in the plane of these 2 principal components
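
Steps 2 to 9 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (fragment, word_counts, top_components, project) are hypothetical, a random toy sequence stands in for the real bacterial DNA, and the principal components are estimated by power iteration with deflation only to avoid external dependencies. A real analysis would more likely use a numerical library.

```python
import itertools
import random
from math import sqrt

ALPHABET = "ACGT"

def fragment(seq, size=300):
    """Split seq into non-overlapping fragments of `size` letters; a short tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def word_counts(frag, m):
    """Count non-overlapping words of length m; returns 4**m counts in alphabetical word order."""
    words = ["".join(p) for p in itertools.product(ALPHABET, repeat=m)]
    index = {w: i for i, w in enumerate(words)}
    vec = [0] * len(words)
    for i in range(0, len(frag) - m + 1, m):
        j = index.get(frag[i:i + m])  # None for words with letters outside ACGT
        if j is not None:
            vec[j] += 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_components(rows, k=2, iters=100, seed=0):
    """Centre the rows, then estimate the top-k principal directions by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    X = [[x - mu for x, mu in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in X) / n for j in range(d)] for i in range(d)]
    rng = random.Random(seed)
    comps = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            for c in comps:  # deflation: remove directions already found
                overlap = dot(w, c)
                w = [a - overlap * b for a, b in zip(w, c)]
            norm = sqrt(dot(w, w)) or 1.0  # guard against a zero vector
            v = [a / norm for a in w]
        comps.append(v)
    return X, comps

def project(X, comps):
    """Coordinates of each centred row along each component (the points to plot)."""
    return [[dot(x, c) for c in comps] for x in X]

# Toy usage: a random sequence, NOT real DNA, so no codon structure is expected here.
toy_rng = random.Random(1)
toy_seq = "".join(toy_rng.choice(ALPHABET) for _ in range(3000))
vectors = [word_counts(f, 3) for f in fragment(toy_seq)]
X, comps = top_components(vectors)
coords = project(X, comps)
print(len(vectors), len(vectors[0]), coords[0])
```

Each entry of coords is one 300-letter string reduced to two numbers, which is what step 9 plots. Repeating the run with m=2, 3 and 4 gives the plots that are compared in the observation below.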

Observation: m=3 shows clear structure and symmetry

10. Run k-means (after normalisation)
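
A minimal sketch of step 10, again in plain Python. The notes do not say which normalisation or which number of clusters k is used, so both are assumptions here: each coordinate is scaled to zero mean and unit variance, and k is left as a parameter. The names standardize, sq_dist and kmeans are hypothetical, and the clustering loop is the standard Lloyd iteration (assign each point to its nearest centroid, then recompute centroids).

```python
import random
from math import sqrt

def standardize(rows):
    """Scale each coordinate to zero mean and unit variance (assumed normalisation)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant coordinate has zero spread; divide by 1.0 instead of 0.
    stds = [sqrt(sum((x - mu) ** 2 for x in c) / n) or 1.0 for c, mu in zip(cols, means)]
    return [[(x - mu) / s for x, mu, s in zip(row, means, stds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # k data points as initial centroids
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy usage: two well-separated groups of 2-D points.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centroids = kmeans(standardize(toy_points), k=2)
print(toy_labels)
```

For the DNA data, points would be the word-count vectors for m=3 (one per 300-letter string), and the resulting labels can be used to colour the points in the principal component plot.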

Conclusion:
1. DNA is composed of 3-letter words, together with non-coding regions
2. The 3-letter words are known as codons
3. Codons encode amino acids
