Regression Analysis

Definition: discovering how the outcome y relates to the set of regressors x (features)

y: real random variable
x: vector of random variables X=(X_1,\dots,X_p)'

Example
For wages, suppose
y: hourly wage
x: regressors (experience, gender, education)

X=(D,W')'
D: target regressor
W: vector of controls (the remaining components of X)

Prediction: how can we use X to predict Y well?
Inference: how does the predicted value of Y change if we change a component of X holding the rest of the components of X fixed?

Cost Function

Linear regression: solve a minimisation problem

minimise the cost over the parameters: \min_{\theta_0,\, \theta_1} J(\theta_0, \theta_1)

cost function: J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2

Model Representation

m: number of training examples
x: input variables / features
y: output variables / target variables
(x,y): single training example
(x_i,y_i): i^{th} training example

Pipeline: training set → learning algorithm → hypothesis h; h maps an input (e.g. size of house) to an estimated output (e.g. price)

h_\theta (x) = \theta_0 + \theta_1 x
\theta_i: parameters

Linear regression in one variable = univariate linear regression
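
As a concrete illustration, a minimal sketch of univariate linear regression fitted by batch gradient descent; the synthetic data, the learning rate alpha and the iteration count are illustrative assumptions, not values from the notes.

import numpy as np

# Hypothetical training set (all constants here are assumptions)
rng = np.random.default_rng(0)
m = 50
x = rng.uniform(0, 10, size=m)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=m)

def h(theta0, theta1, x):
    # Hypothesis: h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def J(theta0, theta1):
    # Cost: J(theta_0, theta_1) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

# Batch gradient descent on J
theta0, theta1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    err = h(theta0, theta1, x) - y
    theta0 -= alpha * err.mean()           # dJ/dtheta_0 = mean(err)
    theta1 -= alpha * (err * x).mean()     # dJ/dtheta_1 = mean(err * x)

print(theta0, theta1, J(theta0, theta1))   # recovers roughly (2.0, 1.5)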

Modularity Clustering

Method: define modularity score that we aim to maximise

Characteristic idea: compare communities to a random baseline graph that shares some properties with the actual graph, such as the number of edges and the degree of the nodes

1. Compute the number of edges within a community, then subtract the expected number of edges as per the baseline model (A_{ij}-P_{ij})
2. A_{ij}=1: if there is an edge between i and j
A_{ij}=0: if there is no edge
P_{ij}: expected number of edges between i and j under the baseline model (e.g. P_{ij}=0.26)

Modularity score: \text{edges in community 1}-\text{baseline expected edges in community 1}+\text{edges in community 2}-\text{baseline expected edges in community 2}
If the score is large, the communities are internally dense relative to the baseline (a computation sketch follows the process list below)

Process:
1. Start with some partitioning, then move nodes between the groups to see if it improves the score
2. Use eigenvectors
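
A minimal sketch of the modularity score in Python. The degree-preserving baseline P_{ij} = k_i k_j / (2m) is the standard choice matching the description above (same number of edges, same degrees); the two-triangle example graph is an illustrative assumption.

import numpy as np

def modularity(A, labels):
    # Q = (1/2m) * sum_ij (A_ij - P_ij) * [i and j in the same community],
    # with the degree-preserving baseline P_ij = k_i * k_j / (2m)
    k = A.sum(axis=1)                 # node degrees
    two_m = A.sum()                   # 2 * number of edges
    P = np.outer(k, k) / two_m        # expected edges under the baseline
    same = labels[:, None] == labels[None, :]
    return ((A - P) * same).sum() / two_m

# Two triangles joined by a single edge: communities {0,1,2} and {3,4,5}
A = np.zeros((6, 6))
for i, j in [(0,1), (0,2), (1,2), (3,4), (3,5), (4,5), (2,3)]:
    A[i, j] = A[j, i] = 1.0

labels = np.array([0, 0, 0, 1, 1, 1])
print(modularity(A, labels))          # positive: denser than the baseline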

Spectral Clustering

1. Create neighbourhood graph
2. Compute the Laplacian
3. Compute the eigenvectors of the Laplacian; the ones with the smallest eigenvalues give the new features
4. Run k-means clustering on new features
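
A sketch of the four steps, assuming 2-D input points and a symmetrised k-nearest-neighbour graph (the data and the choice of k are illustrative assumptions):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

# Two hypothetical point clouds in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5.0])

# Step 1: neighbourhood graph (symmetrised k-nearest-neighbour graph)
W = kneighbors_graph(X, n_neighbors=10).toarray()
W = np.maximum(W, W.T)                 # make the graph undirected

# Step 2: (unnormalised) Laplacian L = D - W
L = np.diag(W.sum(axis=1)) - W

# Step 3: eigenvectors with the smallest eigenvalues are the new features
vals, vecs = np.linalg.eigh(L)         # eigh returns ascending eigenvalues
features = vecs[:, :2]

# Step 4: k-means on the spectral features
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)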

Eigenvectors

Eigenvectors are very useful for capturing the global connectivity structure of a graph. Clustering based on them is called spectral clustering.

Spectrum of matrix: set of eigenvalues
Matrix: Laplacian of graph

Labelled example graph with 6 nodes (figure: 6n-graf.svg)

Adjacency matrix: \left(\begin{array}{rrrrrr}  0 &  1 &  0 &  0 &  1 &  0\\  1 &  0 &  1 &  0 &  1 &  0\\  0 &  1 &  0 &  1 &  0 &  0\\  0 &  0 &  1 &  0 &  1 &  1\\  1 &  1 &  0 &  1 &  0 &  0\\  0 &  0 &  0 &  1 &  0 &  0\\ \end{array}\right)

Laplacian matrix: \left(\begin{array}{rrrrrr}  2 & -1 &  0 &  0 & -1 &  0\\ -1 &  3 & -1 &  0 & -1 &  0\\  0 & -1 &  2 & -1 &  0 &  0\\  0 &  0 & -1 &  3 & -1 & -1\\ -1 & -1 &  0 & -1 &  3 &  0\\  0 &  0 &  0 & -1 &  0 &  1\\ \end{array}\right)
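
The matrices above can be checked directly; a small sketch (node ordering follows the matrices above):

import numpy as np

# Adjacency matrix of the labelled 6-node graph above
A = np.array([[0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0]])

# Laplacian = degree matrix minus adjacency matrix
L = np.diag(A.sum(axis=1)) - A    # reproduces the Laplacian shown above

vals, vecs = np.linalg.eigh(L)    # ascending eigenvalues
print(vals)                       # smallest eigenvalue is 0 (constant
                                  # eigenvector); the next-smallest
                                  # eigenvectors give the cluster features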

Clusters

Cluster: points that are well-connected with each other (lots of edges)
Volume(C): number of edges within C
Density: volume per node
Cut value: measures how separated a cluster is from the rest
Cut(C): number of edges between C and the rest of the graph (e.g. Cut(C)=1 if exactly one edge crosses)

1st criterion: the density must be large
2nd criterion: there should not be too many edges between different clusters

Normalised Cut
\text{Ncut}(C)=\frac{\text{Cut}(C)}{\text{Volume}(C)\times \text{Volume}(V\setminus C)}

Conductance
\text{conductance}(C)=\frac{\text{Cut}(C)}{\min\{\text{Volume}(C),\,\text{Volume}(V\setminus C)\}}

Good clusters are not too small, are internally well-connected, and are separated from the rest of the nodes
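
The quantities above in code, following these notes' convention that Volume(C) counts the edges inside C (another common convention uses the sum of node degrees instead); the graph is the 6-node example from the spectral-clustering section, and the choice of C is illustrative.

import numpy as np

# The 6-node graph from the spectral-clustering section above
A = np.array([[0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0]])

def cut(A, C):
    # Edges with one endpoint inside C and one outside
    inside = np.isin(np.arange(len(A)), list(C))
    return int(A[inside][:, ~inside].sum())

def volume(A, C):
    # Volume as in these notes: edges with both endpoints inside C
    inside = np.isin(np.arange(len(A)), list(C))
    return int(A[inside][:, inside].sum()) // 2

def conductance(A, C):
    rest = [i for i in range(len(A)) if i not in C]
    return cut(A, C) / min(volume(A, C), volume(A, rest))

C = [0, 1, 4]                          # the triangle of nodes 1, 2, 5
print(cut(A, C), conductance(A, C))    # 2 crossing edges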

Principal Component Analysis

Patterns = principal components = vectors
Finds major axis of variation in data
Each data point expressed as a linear combination of patterns

Ax=\lambda x
\text{Matrix}\times\text{eigenvector}=\text{eigenvalue}\times\text{eigenvector}
Eigenvectors capture the major directions inherent in the matrix
The larger the eigenvalue, the more important the eigenvector
The covariance matrix contains a term for every possible pair of features
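
A sketch of PCA via the covariance eigendecomposition; the synthetic data is an illustrative assumption.

import numpy as np

# Hypothetical data: 3 features with very different variances
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# Centre the data and form the covariance matrix
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)       # one term for every pair of features

# Eigendecomposition: eigenvectors = principal components (patterns)
vals, vecs = np.linalg.eigh(C)     # ascending eigenvalues
order = np.argsort(vals)[::-1]     # larger eigenvalue = more important
components = vecs[:, order]

# Each data point expressed in terms of the principal components
scores = Xc @ components
print(vals[order])                 # variance captured by each component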

Data Analysis Application: Human-Generated Text Data

Examples: presenting news articles (ranked by relevance), a search engine presenting results (organised by topic rather than by popularity)

Mixed membership model: a model in which each document exhibits multiple topics

Using LDA (latent Dirichlet allocation) to analyse what MIT EECS professors are working on in their research
By Julia Lack

Steps:
1. Assemble abstracts from each professor’s published papers (over 900 abstracts)
2. Pre-processing: remove most common words / least common words
3. Choose k=5 for number of topics and run stochastic variational inference
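
A sketch of the same pipeline with scikit-learn; the three toy abstracts stand in for the ~900 real abstracts, and scikit-learn's online variational Bayes is used as a stand-in for the stochastic variational inference mentioned above.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the real abstracts
abstracts = [
    "deep learning models for image recognition",
    "quantum circuits and error correction codes",
    "convex optimisation methods for control systems",
]

# Pre-processing: drop very common and very rare words
vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=1)
counts = vec.fit_transform(abstracts)

# k = 5 topics, fitted with online (stochastic) variational inference
lda = LatentDirichletAllocation(n_components=5, learning_method="online",
                                random_state=0)
doc_topics = lda.fit_transform(counts)   # mixed membership: each abstract
                                         # gets a distribution over topics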

k-Means Clustering Application: Understanding The Genetic Code

By Alexander Gorban and Andrei Zinovyev

Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?

Steps:
1. Gather some real DNA text over the alphabet ACGT (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break the full sequence into non-overlapping strings of 300 letters each
3. Each string = one data point (therefore 1017 data points in total)
4. Let m = \text{length of word}
5. Divide each string into substrings of length m
6. Count how many times each substring occurs
 For m=2: 4^2=16 possible words: AA, AC, AG,…, TT
 For m=3: 4^3=64 possible words
 For m=4: 4^4=256 possible words
7. Run PCA (principal component analysis)
8. Pick out the top 2 principal components
9. Plot 2 principal components for each data set

Observation: m=3 shows clear structure and symmetry

10. Run k-means on the 2 principal components (after normalisation); see the sketch below
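
A sketch of steps 1-10; the randomly generated sequence is a stand-in for the real bacterial DNA (and so will not show the m=3 structure), and the number of clusters is an illustrative assumption.

import numpy as np
from itertools import product
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in sequence; a real bacterial genome fragment goes here
rng = np.random.default_rng(0)
dna = "".join(rng.choice(list("ACGT"), size=300_000))

m = 3                                           # candidate word length
words = ["".join(p) for p in product("ACGT", repeat=m)]   # 4^m words
index = {w: i for i, w in enumerate(words)}

# Non-overlapping 300-letter fragments, one frequency vector each
fragments = [dna[i:i + 300] for i in range(0, len(dna) - 299, 300)]
X = np.zeros((len(fragments), len(words)))
for r, frag in enumerate(fragments):
    for j in range(0, 300 - m + 1, m):          # step through in words of m
        X[r, index[frag[j:j + m]]] += 1

# Normalise, project onto the top 2 principal components, then k-means
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
Z = PCA(n_components=2).fit_transform(Xn)
clusters = KMeans(n_clusters=7, n_init=10).fit_predict(Z)   # 7 is illustrative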

Conclusion:
1. DNA is composed of words of 3 letters each as well as non-coding bits
2. The 3 letter words are known as codons
3. They encode amino acids

