Linear regression: solve a minimisation problem
minimise $\theta_0, \theta_1$ for $J(\theta_0, \theta_1)$
cost function: $J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
$m$: number of training examples
$x$: input variables / features
$y$: output variables / target variables
$(x,y)$: single training example
$(x_i,y_i)$: $i^{th}$ training example
Training set is fed to the learning algorithm, which outputs a hypothesis $h$; $h$ maps an input (e.g. size of house) to an estimated price
$h_\theta (x) = \theta_0 + \theta_1 x$
$\theta_i$: parameters
Linear regression in one variable = univariate linear regression
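A minimal numpy sketch of the hypothesis and cost function above; the toy training set is made up for illustration.

```python
import numpy as np

def h(theta0, theta1, x):
    """Hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1 / 2m) * sum of squared errors."""
    m = len(x)
    errors = h(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy training set (hypothetical sizes vs prices)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
print(cost(0.0, 1.0, x, y))  # J at theta0 = 0, theta1 = 1
```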
Method: define a modularity score that we aim to maximise
Characteristic idea: compare communities to a random baseline graph that shares some properties with the actual graph, such as the number of edges and the degree of the nodes
1. Compute the number of edges within a community, then subtract the expected number of edges as per the baseline model ($A_{ij}-P_{ij}$)
2. $A_{ij}=1$ if there is an edge between $i$ and $j$; $A_{ij}=0$ if there is no edge
$P_{ij}$: expected number of edges between $i$ and $j$ under the baseline model (e.g. $P_{ij}=0.26$)
Modularity score: $\text{edges in community 1}-\text{baseline expected edges in community 1}+\text{edges in community 2}-\text{baseline expected edges in community 2}$
If score is large, then community is dense
Process:
1. Start with some partitioning, then move nodes between the groups to see if it improves the score
2. Use eigenvectors
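A sketch of the score for a fixed partition, assuming the standard configuration-model baseline $P_{ij}=k_i k_j / 2m$ (the notes above do not spell out $P_{ij}$) and the usual $1/2m$ normalisation; the adjacency matrix and the split into two groups are taken from the spectral-clustering example below and are illustrative assumptions.

```python
import numpy as np

def modularity(A, labels):
    """Sum of (A_ij - P_ij) over pairs in the same community,
    with P_ij = k_i * k_j / (2m), normalised by 2m."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                       # node degrees
    two_m = A.sum()                         # 2 * number of edges
    P = np.outer(k, k) / two_m              # expected edges under the baseline
    same = np.equal.outer(labels, labels)   # True where i, j share a community
    return ((A - P) * same).sum() / two_m

A = [[0,1,0,0,1,0],
     [1,0,1,0,1,0],
     [0,1,0,1,0,0],
     [0,0,1,0,1,1],
     [1,1,0,1,0,0],
     [0,0,0,1,0,0]]
labels = np.array([0, 0, 1, 1, 0, 1])       # two communities: {0,1,4} and {2,3,5}
print(modularity(A, labels))
```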
1. Create neighbourhood graph
2. Compute the Laplacian
3. Compute eigenvectors of the Laplacian; the ones with the smallest eigenvalues give the new features
4. Run k-means clustering on new features
Eigenvectors are really useful for capturing global connectivity structure; the result is spectral clustering.
Spectrum of matrix: set of eigenvalues
Matrix: Laplacian of graph
Adjacency matrix: $\left(\begin{array}{rrrrrr}
0 & 1 & 0 & 0 & 1 & 0\\
1 & 0 & 1 & 0 & 1 & 0\\
0 & 1 & 0 & 1 & 0 & 0\\
0 & 0 & 1 & 0 & 1 & 1\\
1 & 1 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0\\
\end{array}\right)$
Laplacian matrix: $\left(\begin{array}{rrrrrr}
2 & -1 & 0 & 0 & -1 & 0\\
-1 & 3 & -1 & 0 & -1 & 0\\
0 & -1 & 2 & -1 & 0 & 0\\
0 & 0 & -1 & 3 & -1 & -1\\
-1 & -1 & 0 & -1 & 3 & 0\\
0 & 0 & 0 & -1 & 0 & 1\\
\end{array}\right)$
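A sketch of steps 2–4 on the adjacency matrix above, assuming the neighbourhood graph is already given; using the two smallest non-trivial eigenvectors and $k=2$ is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Adjacency matrix from the notes
A = np.array([[0,1,0,0,1,0],
              [1,0,1,0,1,0],
              [0,1,0,1,0,0],
              [0,0,1,0,1,1],
              [1,1,0,1,0,0],
              [0,0,0,1,0,0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # (unnormalised) graph Laplacian

# Eigenvectors with the smallest eigenvalues become the new features
vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
features = vecs[:, 1:3]              # skip the trivial constant eigenvector

# Run k-means on the embedded points (k = 2 clusters)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```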
Cluster: points that are well-connected with each other (lots of edges)
Number of edges = volume
Volume per node = density
Cut value: separation between clusters
Cut(C): number of edges between the clusters (e.g. Cut(C) = 1)
1st criterion: density must be large
2nd criterion: there should not be too many edges between different clusters
Normalised Cut
$\text{Ncut}(C)=\dfrac{\text{Cut}(C)}{\text{Volume}(C)\times \text{Volume}(V\setminus C)}$
Conductance
$\text{conductance}(C)=\dfrac{\text{Cut}(C)}{\min\{\text{Volume}(C),\text{Volume}(V\setminus C)\}}$
Good clusters are not too small, internally well-connected and separated from the rest of the nodes
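A sketch of these quantities for one candidate cluster $C$; volume is taken here as the sum of node degrees in $C$ (a common convention for the notes' "number of edges"), and the example graph and partition are illustrative assumptions.

```python
import numpy as np

def cut(A, C):
    """Number of edges between C and the rest of the graph."""
    mask = np.zeros(len(A), dtype=bool)
    mask[list(C)] = True
    return A[mask][:, ~mask].sum()

def volume(A, C):
    """Sum of degrees of the nodes in C."""
    return A[list(C)].sum()

def ncut(A, C):
    rest = set(range(len(A))) - set(C)
    return cut(A, C) / (volume(A, C) * volume(A, rest))

def conductance(A, C):
    rest = set(range(len(A))) - set(C)
    return cut(A, C) / min(volume(A, C), volume(A, rest))

A = np.array([[0,1,0,0,1,0],
              [1,0,1,0,1,0],
              [0,1,0,1,0,0],
              [0,0,1,0,1,1],
              [1,1,0,1,0,0],
              [0,0,0,1,0,0]], dtype=float)
C = {0, 1, 4}
print(cut(A, C), ncut(A, C), conductance(A, C))
```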
Patterns = principal component = vector
Finds major axis of variation in data
Each data point expressed as a linear combination of patterns
$Ax=\lambda x$
$\text{Matrix}\times\text{eigenvector}=\text{eigenvalue}\times\text{eigenvector}$
Eigenvectors capture the major directions that are inherent in the matrix
The larger the eigenvalue, the more important the vector
Covariance matrix contains terms for all possible pairs of features
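A sketch of PCA via the eigendecomposition of the covariance matrix, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # toy data: 200 points, 5 features

Xc = X - X.mean(axis=0)                # centre the data
cov = Xc.T @ Xc / (len(X) - 1)         # covariance matrix (all pairs of features)

vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
order = np.argsort(vals)[::-1]         # largest eigenvalue = most important direction
components = vecs[:, order[:2]]        # top 2 principal components

scores = Xc @ components               # each point expressed in terms of the patterns
print(vals[order], scores.shape)
```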
Examples: presenting news articles (based on relevance), a search engine presenting results (based on topic rather than popularity)
Mixed membership model: a model in which each document exhibits multiple topics
Using LDA (latent Dirichlet allocation) to analyse what MIT EECS professors are working on in their research
By Julia Lack
Steps:
1. Assemble abstracts from each professor’s published papers (over 900 abstracts)
2. Pre-processing: remove most common words / least common words
3. Choose $k=5$ for number of topics and run stochastic variational inference
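A sketch of steps 2–3 using scikit-learn, whose "online" learning method is a stochastic variational inference; the `max_df`/`min_df` thresholds are illustrative assumptions, and the real abstracts are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_topics(abstracts, k=5):
    """abstracts: list of raw abstract strings (one per published paper)."""
    # Pre-processing: drop the most common and least common words
    vectorizer = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
    X = vectorizer.fit_transform(abstracts)

    # k topics; 'online' learning = stochastic variational inference
    lda = LatentDirichletAllocation(n_components=k, learning_method="online",
                                    random_state=0)
    doc_topics = lda.fit_transform(X)          # per-abstract topic proportions

    # Print the top words for each topic
    words = vectorizer.get_feature_names_out()
    for t, comp in enumerate(lda.components_):
        print(f"topic {t}:", [words[i] for i in comp.argsort()[::-1][:10]])
    return doc_topics
```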
By Alexander Gorban and Andrei Zinovyev
Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?
Steps:
1. Gather some real DNA of ACGT (fragment of full DNA sequence of particular bacteria: about 300,000 letters)
2. Break full sequence into strings of 300 letters each and make sure they do not overlap
3. Each string = one data point (therefore, the data set has 1017 data points)
4. Let $m = \text{length of word}$
5. Divide each string into substrings (words) of length $m$
6. Count how many times each substring occurs
For $m=2$: $4^2=16$ possible words: AA, AC, AG,…, TT
For $m=3$: $4^3=64$ possible words
For $m=4$: $4^4=256$ possible words
7. Run PCA (principal component analysis)
8. Pick out top 2 principal components
9. Plot 2 principal components for each data set
Observation: $m=3$ shows clear structure and symmetry
10. Run k-means (after normalisation)
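A sketch of steps 2 and 4–8 under stated assumptions: the sequence here is a short random stand-in for the ~300,000-letter bacterial fragment (random DNA will not show the $m=3$ structure), and the normalisation and k-means step are omitted.

```python
import numpy as np
from itertools import product

def word_count_features(sequence, m=3, window=300):
    """Split a DNA sequence into non-overlapping 300-letter strings and
    count how often each possible word of length m appears in each string."""
    words = ["".join(p) for p in product("ACGT", repeat=m)]   # 4**m possible words
    index = {w: i for i, w in enumerate(words)}
    strings = [sequence[i:i + window]
               for i in range(0, len(sequence) - window + 1, window)]
    counts = np.zeros((len(strings), len(words)))
    for r, s in enumerate(strings):
        # read the string as consecutive, non-overlapping words of length m
        for j in range(0, len(s) - m + 1, m):
            counts[r, index[s[j:j + m]]] += 1
    return counts

# Toy random sequence standing in for the real bacterial fragment
rng = np.random.default_rng(0)
seq = "".join(rng.choice(list("ACGT"), size=3000))

X = word_count_features(seq, m=3)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # PCA via SVD
top2 = Xc @ Vt[:2].T                                # top 2 principal components
print(top2.shape)                                   # one 2-D point per 300-letter string
```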
Conclusion:
1. DNA is composed of words of 3 letters each as well as non-coding bits
2. The 3 letter words are known as codons
3. They encode amino acids
Problem with clustering: each data point can belong to only one group or cluster
Solution: feature allocation (mixed membership) instead of clustering
Examples:
1. corpus of documents may belong to multiple categories
2. individual’s DNA may belong to multiple ancestral groups
3. individual's votes may represent a number of different ideologies
4. individual interactions on a social network represent various different personal identities
Latent Dirichlet allocation (LDA): algorithm for large amounts of text data
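Toy numbers contrasting a hard clustering assignment with a mixed-membership (feature-allocation) output such as LDA's per-document topic proportions; the values are made up for illustration.

```python
import numpy as np

# Hard clustering: one label per data point
kmeans_labels = np.array([0, 2, 1])            # document i belongs to exactly one cluster

# Mixed membership: a distribution over topics per data point
doc_topics = np.array([[0.70, 0.25, 0.05],     # document 0: mostly topic 0, some topic 1
                       [0.10, 0.10, 0.80],
                       [0.33, 0.33, 0.34]])
print(doc_topics.sum(axis=1))                  # each row sums to 1
```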