CATEGORY / MIT Data Science: Data To Insights

Data Analysis Application: Human-Generated Text Data

Examples: presenting news articles (based on relevance), search engines presenting results (based on topic rather than popularity)

Mixed membership model: a model in which each document can exhibit multiple topics

Using LDA (latent Dirichlet allocation) to analyse what MIT EECS faculty are working on in their research
By Julia Lack

1. Assemble abstracts from each professor’s published papers (over 900 abstracts)
2. Pre-processing: remove most common words / least common words
3. Choose $k=5$ for number of topics and run stochastic variational inference
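
The pre-processing in step 2 can be sketched in plain Python. This is a minimal illustration, not the code used in the study: the function name `filter_vocabulary`, the two thresholds and the toy abstracts are all assumptions made for the example.

```python
from collections import Counter

def filter_vocabulary(documents, max_doc_fraction=0.8, min_doc_count=2):
    """Drop words that appear in too many or too few documents.

    The thresholds are illustrative assumptions, not values from the study.
    """
    tokenised = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    n_docs = len(tokenised)
    vocabulary = {
        word for word, count in doc_freq.items()
        if count >= min_doc_count and count / n_docs <= max_doc_fraction
    }
    # Keep only the surviving words, preserving word order in each document.
    return [[w for w in tokens if w in vocabulary] for tokens in tokenised]

# Toy stand-ins for paper abstracts (hypothetical data).
abstracts = [
    "we study the convergence of the network",
    "we study the design of the circuit",
    "the network protocol improves the throughput",
    "the circuit layout reduces the power",
]
filtered = filter_vocabulary(abstracts)
print(filtered)
```

Here "the" is dropped because it appears in every abstract, and words such as "convergence" are dropped because they appear in only one. The filtered word lists would then be passed to an LDA implementation for step 3.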

k-Means Clustering Application: Understanding The Genetic Code

By Alexander Gorban and Andrei Zinovyev

Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?

1. Gather some real DNA, written in the letters A, C, G and T (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break full sequence into strings of 300 letters each and make sure they do not overlap
3. Each string = one data point (therefore, 1,017 data points in total)
4. Let $m = \text{length of word}$
5. Divide each DNA string into non-overlapping substrings (words) of length $m$
6. Count how many times each substring occurs
 For $m=2$: $4^2=16$ possible words: AA, AC, AG,…, TT
 For $m=3$: $4^3=64$ possible words
 For $m=4$: $4^4=256$ possible words
7. Run PCA (principal component analysis)
8. Pick out the top 2 principal components
9. Plot 2 principal components for each data set

Observation: $m=3$ shows clear structure and symmetry

10. Run k-means (after normalisation)
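
Steps 2 to 6 (cutting the sequence into fragments and counting words of length $m$) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `split_fragments` and `word_counts` and the short repeated toy sequence are invented for the example, and words are counted without overlap from the start of each fragment.

```python
from collections import Counter
from itertools import product

def split_fragments(sequence, fragment_length=300):
    """Cut the sequence into non-overlapping fragments, dropping any short tail."""
    n_full = len(sequence) // fragment_length
    return [
        sequence[i * fragment_length:(i + 1) * fragment_length]
        for i in range(n_full)
    ]

def word_counts(fragment, m):
    """Count non-overlapping words of length m; one count per possible word."""
    # All 4**m possible words, in a fixed order: AA, AC, AG, ..., TT for m = 2.
    words = ["".join(letters) for letters in product("ACGT", repeat=m)]
    counts = Counter(
        fragment[i:i + m] for i in range(0, len(fragment) - m + 1, m)
    )
    return [counts[w] for w in words]

# Hypothetical toy sequence; the study used about 300,000 letters of real DNA.
sequence = "ACGTTGCA" * 150  # 1,200 letters
fragments = split_fragments(sequence)
features = [word_counts(fragment, m=3) for fragment in fragments]
print(len(fragments), len(features[0]))
```

Each fragment becomes one data point with $4^m$ features, which is the table that PCA and k-means are then run on.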

1. DNA is composed of words of 3 letters each as well as non-coding bits
2. The 3 letter words are known as codons
3. They encode amino acids

Beyond Clustering

Problem with clustering: each data point needs to belong to only one group or cluster
Solution: feature allocation (mixed membership) instead of clustering

1. each document in a corpus may belong to multiple categories
2. individual’s DNA may belong to multiple ancestral groups
3. individual votes may represent a number of different ideologies
4. individual interactions on a social network represent various different personal identities

Latent Dirichlet allocation (LDA): mixed membership model for large amounts of text data

Beyond k-Means Algorithm

Clustering: grouping data according to similarity

Hard clustering (each data point belongs to exactly one cluster) versus soft clustering (each data point has a degree of membership in each cluster)

Squared Euclidean distance (larger clusters have higher values of the k-means objective and smaller clusters have lower values) versus Gaussian mixture models (clusters can have different covariances) versus k-medoids (uses an actual data point, the medoid, as each cluster centre instead of the mean) versus radial similarity

1. Is your data featurised?
2. Is each feature a continuous number?
3. Are these numbers commensurate?
 -standardise or normalise
4. Are there too many features?
 -Principal Component Analysis (PCA) is a preprocessing step for k-means
5. Are there any domain-specific reasons to change the features?
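
Standardising (question 3) can be sketched as below: a minimal example, assuming the data is a list of rows with one number per feature. The function name `standardise` and the toy height and income values are made up for illustration.

```python
from statistics import mean, pstdev

def standardise(data):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    # Guard against constant features, whose standard deviation is 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - mu) / sd for x, mu, sd in zip(row, means, stds)]
        for row in data
    ]

# Hypothetical data: height in metres and income in dollars.
data = [[1.60, 30000.0], [1.75, 52000.0], [1.90, 74000.0]]
scaled = standardise(data)
print(scaled)
```

Without this step the income column would dominate every squared Euclidean distance, simply because its numbers are larger.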

Big Data
Non-parametric Bayesian methods (allow the number of clusters to grow as the data grow)
 -non-parametric: infinitely many parameters

k-Means Algorithm (Lloyd’s Algorithm)

Most popular algorithm for clustering / unsupervised learning

k-Means Clustering Problem
Assumption: We can express any data point as a list (vector) of continuous values
Dissimilarity measure: squared Euclidean distance
Data point: finite number of features
k-means: expects k clusters, with k chosen in advance
Global dissimilarity (k-means objective function): sum of dissimilarity for each cluster, for each data point in the kth cluster, for each feature
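
Written out as a formula, the global dissimilarity might look like this. The symbols are assumptions introduced here, not notation from the course: $K$ clusters, $S_k$ the set of data points in cluster $k$, $\mu_k$ the centre of cluster $k$, and $x_{n,d}$ feature $d$ (out of $D$) of data point $n$.

$$
\text{dis}_{\text{global}} = \sum_{k=1}^{K} \; \sum_{n \,:\, x_n \in S_k} \; \sum_{d=1}^{D} \left( x_{n,d} - \mu_{k,d} \right)^2
$$

The innermost sum is the squared Euclidean distance between one data point and the centre of its cluster.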

k-Means Algorithm
Step 1: assign each data point to the cluster with the closest centre
Step 2: recalculate each cluster centre as the mean of the data points assigned to it
Repeat steps 1 and 2 until the assignments stop changing
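
The two alternating steps can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the names `k_means` and `squared_distance`, the iteration cap, and the toy points and starting centres are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centres, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and update until stable."""
    centres = [list(c) for c in initial_centres]  # do not modify the input
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to the cluster with the closest centre.
        new_assignments = [
            min(range(len(centres)), key=lambda k: squared_distance(p, centres[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centre to the mean of its assigned points.
        for k in range(len(centres)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its old centre
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centres

# Hypothetical toy data: two well-separated groups in 2 dimensions.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
labels, centres = k_means(points, initial_centres=[[0.0, 0.0], [6.0, 6.0]])
print(labels, centres)
```

The result depends on the starting centres, which is why choosing them well matters.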

1. Visualisation
2. Silhouette coefficient
3. Split data set into 2 data sets
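
The silhouette coefficient can be computed directly from its definition. The sketch below is a minimal example under stated assumptions: the function name and toy data are invented here, distances are plain Euclidean, and single-point clusters score 0 by convention.

```python
from math import dist

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the distance from p to itself is 0, so it adds nothing to the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups in 2 dimensions.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
good = silhouette_coefficient(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Values close to 1 mean points sit much closer to their own cluster than to any other, so the clustering that matches the two groups scores higher than the one that mixes them.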

More Effective k-Means
1. Triangle inequality (ignore cluster centres that are relatively far from a given data point)
2. Local optimum versus global optimum (run k-means for different random initialisations and keep the run with the lowest objective value)
3. k-means++
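
k-means++ picks the starting centres so that they tend to be spread out: the first centre is a random data point, and each later centre is a data point sampled with probability proportional to its squared distance from the nearest centre chosen so far. A minimal sketch follows. The function name, the `seed` parameter and the toy points are assumptions for illustration, and the sketch assumes there are at least $k$ distinct points.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_plus_plus(points, k, seed=None):
    """Choose k starting centres using the k-means++ seeding rule."""
    rng = random.Random(seed)
    # First centre: a data point chosen uniformly at random.
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Squared distance from each point to its nearest chosen centre.
        weights = [min(squared_distance(p, c) for c in centres) for p in points]
        # Next centre: sampled with probability proportional to that distance,
        # so already-chosen points (weight 0) are never picked again.
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres

# Hypothetical toy data: two well-separated groups in 2 dimensions.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
initial_centres = k_means_plus_plus(points, k=2, seed=0)
print(initial_centres)
```

The chosen centres are then used as the starting point for the usual assignment and update steps.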


Clustering: unsupervised problem of assigning each data point to exactly one group
Classification: supervised learning when labels are categorical

Clustering finds hidden groupings in data

Supervised & Unsupervised Learning

Machine learning in statistics: find hidden patterns in data
Supervised learning: learn from data but we have labels for all the data we’ve seen so far
Unsupervised learning: learn from data but we don’t have any labels
Data set: collection of data points that help us learn

Examples of supervised learning
1. Deciding whether emails are spam
 a. Data set: all the emails sent to user
 b. Data point: single email
 c. Labels: spam / not spam

Examples of unsupervised learning
1. Sorting emails into topics
 a. If no labels are given, the machine needs to intelligently sort the emails into different categories
2. Google News
3. Facebook trending stories

Introduction To MIT Data Science: Data To Insights

I have started these threads about what I have learnt from the MIT Data Science: Data To Insights course.

I highly recommend that you take up the course to learn more about the theoretical aspects of Data Science.

Week 1 – Module 1: Making sense of unstructured data


  1. What is unsupervised learning, and why is it challenging?
  2. Examples of unsupervised learning


  1. What is clustering?
  2. When to use clustering
  3. K-means preliminaries
  4. The K-means algorithm
  5. How to evaluate clustering
  6. Beyond K-means: what really makes a cluster?
  7. Beyond K-means: other notions of distance
  8. Beyond K-means: data and pre-processing
  9. Beyond K-means: big data and nonparametric Bayes
  10. Beyond clustering

Spectral Clustering, Components and Embeddings

  1. What if we do not have features to describe the data, or not all are meaningful?
  2. Finding the principal components in data, and applications
  3. The magic of eigenvectors I
  4. Clustering in graphs and networks
  5. Features from graphs: the magic of eigenvectors II
  6. Spectral clustering
  7. Modularity Clustering
  8. Embeddings: new features and their meaning

Week 2 – Module 2: Regression and Prediction

Classical Linear and Nonlinear Regression and Extensions

  1. Linear regression with one and several variables
  2. Linear regression for prediction
  3. Linear regression for causal inference
  4. Logistic and other types of nonlinear regression

Modern Regression with High-Dimensional Data

  1. Making good predictions with high-dimensional data; avoiding overfitting by validation and cross-validation
  2. Regularization by Lasso, Ridge, and their modifications
  3. Regression Trees, Random Forest, Boosted Trees

The Use of Modern Regression for Causal Inference

  1. Randomized Control Trials
  2. Observational Studies with Confounding

Week 3 – MODULE 3.1: Classification and Hypothesis Testing

Hypothesis Testing and Classification:

  1. What are anomalies? What is fraud? Spam?
  2. Binary Classification: False Positive/Negative, Precision / Recall, F1-Score
  3. Logistic and Probit regression: statistical binary classification
  4. Hypothesis testing: Ratio Test and Neyman-Pearson
  5. p-values: confidence
  6. Support vector machine: non-statistical classifier
  7. Perceptron: simple classifier with elegant interpretation

Week 4 – MODULE 3.2: Deep Learning

Deep Learning

  1. What is image classification? Introduce ImageNet and show examples
  2. Classification using a single linear threshold (perceptron)
  3. Hierarchical representations
  4. Fitting parameters using back-propagation
  5. Non-convex functions
  6. How interpretable are its features?
  7. Manipulating deep nets (ostrich example)
  8. Transfer learning
  9. Other applications I: Speech recognition
  10. Other applications II: Natural language processing

Week 5 – MODULE 4: Recommendation Systems

Recommendations and ranking

  1. What does a recommendation system do?
  2. So what is the recommendation prediction problem? And what data do we have?
  3. Using population averages
  4. Using population comparisons and ranking

Collaborative filtering

  1. Personalization using collaborative filtering using similar users
  2. Personalization using collaborative filtering using similar items
  3. Personalization using collaborative filtering using similar users and items

Personalized Recommendations

  1. Personalization using comparisons, rankings and users-items
  2. Hidden Markov Model / Neural Nets, Bipartite graph and graphical model
  3. Using side-information
  4. 20 questions and active learning
  5. Building a system: algorithmic and system challenges


  1. Guidelines on building system
  2. Parting remarks and challenges

Week 6 – MODULE 5: Networks and Graphical Models


  1. Introduction to networks
  2. Examples of networks
  3. Representation of networks


  1. Centrality measures: degree, eigenvector, and PageRank
  2. Closeness and betweenness centrality
  3. Degree distribution, clustering, and small world
  4. Network models: Erdős-Rényi, configuration model, preferential attachment
  5. Stochastic models on networks for spread of viruses or ideas
  6. Influence maximization

Graphical models

  1. Undirected graphical models
  2. Ising and Gaussian models
  3. Learning graphical models from data
  4. Directed graphical models
  5. V-structures, “explaining away”, and learning directed graphical models
  6. Inference in graphical models: marginals and message passing
  7. Hidden Markov Model (HMM)
  8. Kalman filter
