Regression Analysis

Definition: discovering how the outcome y relates to the set of regressors x (features)

y: real random variable
x: vector of random variables X=(X_1,\dots,X_p)'

Example
For wages, suppose
y: hourly wage
x: regressors (experience, gender, education)

X=(D,W')'
D: target regressor
W: vector of controls (the remaining components of X)

Prediction: how can we use X to predict Y well?
Inference: how does the predicted value of Y change if we change a component of X holding the rest of the components of X fixed?

Cost Function

Linear regression: solve a minimisation problem

minimise the cost over the parameters: \min_{\theta_0,\, \theta_1} J(\theta_0, \theta_1)

cost function: J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2

Model Representation

m: number of training examples
x: input variables / features
y: output variables / target variables
(x,y): single training example
(x_i,y_i): i^{th} training example

Pipeline: training set → learning algorithm → hypothesis h; h maps an input (e.g. size of house) to an estimated output (e.g. price)

h_\theta (x) = \theta_0 + \theta_1 x
\theta_i: parameters

Linear regression in one variable = univariate linear regression
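
As a concrete illustration, a minimal sketch of univariate linear regression fitted by batch gradient descent; the synthetic data, the learning rate alpha and the iteration count are illustrative assumptions, not values from the notes.

import numpy as np

# Hypothetical training set (all constants here are assumptions)
rng = np.random.default_rng(0)
m = 50
x = rng.uniform(0, 10, size=m)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=m)

def h(theta0, theta1, x):
    # Hypothesis: h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def J(theta0, theta1):
    # Cost: J(theta_0, theta_1) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

# Batch gradient descent on J
theta0, theta1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    err = h(theta0, theta1, x) - y
    theta0 -= alpha * err.mean()           # dJ/dtheta_0 = mean(err)
    theta1 -= alpha * (err * x).mean()     # dJ/dtheta_1 = mean(err * x)

print(theta0, theta1, J(theta0, theta1))   # recovers roughly (2.0, 1.5)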

Modularity Clustering

Method: define modularity score that we aim to maximise

Characteristic idea: compare communities to a random baseline graph that shares some properties with the actual graph, such as the number of edges and the degree of the nodes

1. Compute the number of edges within a community, then subtract the expected number of edges as per the baseline model (A_{ij}-P_{ij})
2. A_{ij}=1: if there is an edge between i and j
A_{ij}=0: if there is no edge
P_{ij}: expected number of edges between i and j under the baseline model (e.g. P_{ij}=0.26)

Modularity score: \text{edges in community 1}-\text{baseline expected edges in community 1}+\text{edges in community 2}-\text{baseline expected edges in community 2}
If the score is large, the communities are internally dense relative to the baseline (a computation sketch follows the process list below)

Process:
1. Start with some partitioning, then move nodes between the groups to see if it improves the score
2. Use eigenvectors
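
A minimal sketch of the modularity score in Python. The degree-preserving baseline P_{ij} = k_i k_j / (2m) is the standard choice matching the description above (same number of edges, same degrees); the two-triangle example graph is an illustrative assumption.

import numpy as np

def modularity(A, labels):
    # Q = (1/2m) * sum_ij (A_ij - P_ij) * [i and j in the same community],
    # with the degree-preserving baseline P_ij = k_i * k_j / (2m)
    k = A.sum(axis=1)                 # node degrees
    two_m = A.sum()                   # 2 * number of edges
    P = np.outer(k, k) / two_m        # expected edges under the baseline
    same = labels[:, None] == labels[None, :]
    return ((A - P) * same).sum() / two_m

# Two triangles joined by a single edge: communities {0,1,2} and {3,4,5}
A = np.zeros((6, 6))
for i, j in [(0,1), (0,2), (1,2), (3,4), (3,5), (4,5), (2,3)]:
    A[i, j] = A[j, i] = 1.0

labels = np.array([0, 0, 0, 1, 1, 1])
print(modularity(A, labels))          # positive: denser than the baseline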

Spectral Clustering

1. Create neighbourhood graph
2. Compute the Laplacian
3. Compute the eigenvectors of the Laplacian; the ones with the smallest eigenvalues give the new features
4. Run k-means clustering on new features
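
A sketch of the four steps, assuming 2-D input points and a symmetrised k-nearest-neighbour graph (the data and the choice of k are illustrative assumptions):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

# Two hypothetical point clouds in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5.0])

# Step 1: neighbourhood graph (symmetrised k-nearest-neighbour graph)
W = kneighbors_graph(X, n_neighbors=10).toarray()
W = np.maximum(W, W.T)                 # make the graph undirected

# Step 2: (unnormalised) Laplacian L = D - W
L = np.diag(W.sum(axis=1)) - W

# Step 3: eigenvectors with the smallest eigenvalues are the new features
vals, vecs = np.linalg.eigh(L)         # eigh returns ascending eigenvalues
features = vecs[:, :2]

# Step 4: k-means on the spectral features
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)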

Eigenvectors

Eigenvectors are very useful for capturing the global connectivity structure of a graph. Clustering based on them is called spectral clustering.

Spectrum of matrix: set of eigenvalues
Matrix: Laplacian of graph

Labelled example graph with 6 nodes (figure: 6n-graf.svg)

Adjacency matrix: \left(\begin{array}{rrrrrr}  0 &  1 &  0 &  0 &  1 &  0\\  1 &  0 &  1 &  0 &  1 &  0\\  0 &  1 &  0 &  1 &  0 &  0\\  0 &  0 &  1 &  0 &  1 &  1\\  1 &  1 &  0 &  1 &  0 &  0\\  0 &  0 &  0 &  1 &  0 &  0\\ \end{array}\right)

Laplacian matrix: \left(\begin{array}{rrrrrr}  2 & -1 &  0 &  0 & -1 &  0\\ -1 &  3 & -1 &  0 & -1 &  0\\  0 & -1 &  2 & -1 &  0 &  0\\  0 &  0 & -1 &  3 & -1 & -1\\ -1 & -1 &  0 & -1 &  3 &  0\\  0 &  0 &  0 & -1 &  0 &  1\\ \end{array}\right)
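
The matrices above can be checked directly; a small sketch (node ordering follows the matrices above):

import numpy as np

# Adjacency matrix of the labelled 6-node graph above
A = np.array([[0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0]])

# Laplacian = degree matrix minus adjacency matrix
L = np.diag(A.sum(axis=1)) - A    # reproduces the Laplacian shown above

vals, vecs = np.linalg.eigh(L)    # ascending eigenvalues
print(vals)                       # smallest eigenvalue is 0 (constant
                                  # eigenvector); the next-smallest
                                  # eigenvectors give the cluster features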

Clusters

Cluster: points that are well-connected with each other (lots of edges)
Volume(C): number of edges within C
Density: volume per node
Cut value: measures how separated a cluster is from the rest
Cut(C): number of edges between C and the rest of the graph (e.g. Cut(C)=1 if exactly one edge crosses)

1st criterion: the density must be large
2nd criterion: there should not be too many edges between different clusters

Normalised Cut
\text{Ncut}(C)=\frac{\text{Cut}(C)}{\text{Volume}(C)\times \text{Volume}(V\setminus C)}

Conductance
\text{conductance}(C)=\frac{\text{Cut}(C)}{\min\{\text{Volume}(C),\,\text{Volume}(V\setminus C)\}}

Good clusters are not too small, are internally well-connected, and are separated from the rest of the nodes
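
The quantities above in code, following these notes' convention that Volume(C) counts the edges inside C (another common convention uses the sum of node degrees instead); the graph is the 6-node example from the spectral-clustering section, and the choice of C is illustrative.

import numpy as np

# The 6-node graph from the spectral-clustering section above
A = np.array([[0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0]])

def cut(A, C):
    # Edges with one endpoint inside C and one outside
    inside = np.isin(np.arange(len(A)), list(C))
    return int(A[inside][:, ~inside].sum())

def volume(A, C):
    # Volume as in these notes: edges with both endpoints inside C
    inside = np.isin(np.arange(len(A)), list(C))
    return int(A[inside][:, inside].sum()) // 2

def conductance(A, C):
    rest = [i for i in range(len(A)) if i not in C]
    return cut(A, C) / min(volume(A, C), volume(A, rest))

C = [0, 1, 4]                          # the triangle of nodes 1, 2, 5
print(cut(A, C), conductance(A, C))    # 2 crossing edges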

Principal Component Analysis

Patterns = principal components = vectors
Finds major axis of variation in data
Each data point expressed as a linear combination of patterns

Ax=\lambda x
\text{Matrix}\times\text{eigenvector}=\text{eigenvalue}\times\text{eigenvector}
Eigenvectors capture the major directions inherent in the matrix
The larger the eigenvalue, the more important the eigenvector
The covariance matrix contains a term for every possible pair of features
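
A sketch of PCA via the covariance eigendecomposition; the synthetic data is an illustrative assumption.

import numpy as np

# Hypothetical data: 3 features with very different variances
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# Centre the data and form the covariance matrix
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)       # one term for every pair of features

# Eigendecomposition: eigenvectors = principal components (patterns)
vals, vecs = np.linalg.eigh(C)     # ascending eigenvalues
order = np.argsort(vals)[::-1]     # larger eigenvalue = more important
components = vecs[:, order]

# Each data point expressed in terms of the principal components
scores = Xc @ components
print(vals[order])                 # variance captured by each component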

Data Analysis Application: Human-Generated Text Data

Examples: presenting news articles (ranked by relevance), a search engine presenting results (organised by topic rather than by popularity)

Mixed membership model: a model in which each document exhibits multiple topics

Using LDA (latent Dirichlet allocation) to analyse what MIT EECS professors are working on in their research
By Julia Lack

Steps:
1. Assemble abstracts from each professor’s published papers (over 900 abstracts)
2. Pre-processing: remove most common words / least common words
3. Choose k=5 for number of topics and run stochastic variational inference
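
A sketch of the same pipeline with scikit-learn; the three toy abstracts stand in for the ~900 real abstracts, and scikit-learn's online variational Bayes is used as a stand-in for the stochastic variational inference mentioned above.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the real abstracts
abstracts = [
    "deep learning models for image recognition",
    "quantum circuits and error correction codes",
    "convex optimisation methods for control systems",
]

# Pre-processing: drop very common and very rare words
vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=1)
counts = vec.fit_transform(abstracts)

# k = 5 topics, fitted with online (stochastic) variational inference
lda = LatentDirichletAllocation(n_components=5, learning_method="online",
                                random_state=0)
doc_topics = lda.fit_transform(counts)   # mixed membership: each abstract
                                         # gets a distribution over topics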

k-Means Clustering Application: Understanding The Genetic Code

By Alexander Gorban and Andrei Zinovyev

Question: Is it true that DNA breaks down into meaningful words of the same length? If so, how long are the words?

Steps:
1. Gather some real DNA text over the alphabet ACGT (a fragment of the full DNA sequence of a particular bacterium: about 300,000 letters)
2. Break the full sequence into non-overlapping strings of 300 letters each
3. Each string = one data point (therefore 1017 data points in total)
4. Let m = \text{length of word}
5. Divide each string into substrings of length m
6. Count how many times each substring occurs
 For m=2: 4^2=16 possible words: AA, AC, AG,…, TT
 For m=3: 4^3=64 possible words
 For m=4: 4^4=256 possible words
7. Run PCA (principal component analysis)
8. Pick out the top 2 principal components
9. Plot 2 principal components for each data set

Observation: m=3 shows clear structure and symmetry

10. Run k-means on the 2 principal components (after normalisation); see the sketch below
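
A sketch of steps 1-10; the randomly generated sequence is a stand-in for the real bacterial DNA (and so will not show the m=3 structure), and the number of clusters is an illustrative assumption.

import numpy as np
from itertools import product
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in sequence; a real bacterial genome fragment goes here
rng = np.random.default_rng(0)
dna = "".join(rng.choice(list("ACGT"), size=300_000))

m = 3                                           # candidate word length
words = ["".join(p) for p in product("ACGT", repeat=m)]   # 4^m words
index = {w: i for i, w in enumerate(words)}

# Non-overlapping 300-letter fragments, one frequency vector each
fragments = [dna[i:i + 300] for i in range(0, len(dna) - 299, 300)]
X = np.zeros((len(fragments), len(words)))
for r, frag in enumerate(fragments):
    for j in range(0, 300 - m + 1, m):          # step through in words of m
        X[r, index[frag[j:j + m]]] += 1

# Normalise, project onto the top 2 principal components, then k-means
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
Z = PCA(n_components=2).fit_transform(Xn)
clusters = KMeans(n_clusters=7, n_init=10).fit_predict(Z)   # 7 is illustrative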

Conclusion:
1. DNA is composed of words of 3 letters each as well as non-coding bits
2. The 3 letter words are known as codons
3. They encode amino acids

