CATEGORY / Learning

14Nov

2016

Eugene / Learning, Stanford Machine Learning / 0 comment

Matrix Inverse & Transpose

Matrix inverse: If $A$ is an $m\times m$ matrix, and if it has an inverse, then $A\times A^{-1}=A^{-1}\times A=I$
For a $2\times 2$ matrix with $ad-bc\neq 0$:
$A=\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix}$
$A^{-1}=\frac{1}{ad-bc}\begin{bmatrix} d & -b \newline -c & a \newline \end{bmatrix}$
Note: Matrices that do not have an inverse are called singular or degenerate. A $2\times 2$ matrix is singular exactly when $ad-bc=0$.
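A minimal Python sketch of the $2\times 2$ inverse formula, storing a matrix as a list of rows. The function name `inverse_2x2` and the sample numbers are illustrative assumptions, not part of the course notes.

```python
def inverse_2x2(A):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = A
    det = a * d - b * c  # the ad - bc term from the formula
    if det == 0:
        raise ValueError("matrix is singular (ad - bc = 0), so it has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[4, 7], [2, 6]]      # ad - bc = 24 - 14 = 10
A_inv = inverse_2x2(A)    # approximately [[0.6, -0.7], [-0.2, 0.4]]
```

Calling `inverse_2x2([[1, 2], [2, 4]])` raises `ValueError`, because $ad-bc=0$ for that matrix.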

Matrix transpose: Let $A$ be an $m\times n$ matrix, and let $B=A^T$. Then $B$ is an $n\times m$ matrix and $B_{ij}=A_{ji}$.
$A = \begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix}$
$A^T = \begin{bmatrix} a & c & e \newline b & d & f \newline \end{bmatrix}$
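As a quick check of the $3\times 2$ example, here is a small Python sketch of the transpose for a matrix stored as a list of rows. The helper name `transpose` and the numeric entries are assumptions chosen for illustration.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    # zip(*A) groups the first entries of every row, then the second entries, ...
    return [list(column) for column in zip(*A)]


A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
A_T = transpose(A)             # 2 x 3: [[1, 3, 5], [2, 4, 6]]
```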

09Nov

2016

Eugene / Learning, Stanford Machine Learning / 0 comment

Properties Of Matrix Multiplication

1. Not commutative. In general, $A\times B \neq B\times A$
2. Associative. $(A\times B)\times C = A\times (B\times C)$

e.g. If $A$ is an $m\times n$ matrix and $B$ is an $n\times m$ matrix, then
$A\times B$ is an $m\times m$ matrix,
$B\times A$ is an $n\times n$ matrix.
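Both properties can be checked numerically. The sketch below uses a compact pure-Python `matmul` helper and three made-up $2\times 2$ matrices; all names and values are assumptions for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

AB = matmul(A, B)   # [[2, 1], [4, 3]]
BA = matmul(B, A)   # [[3, 4], [1, 2]]
not_commutative = AB != BA

# Associativity: the grouping of the products does not change the result.
associative = matmul(matmul(A, B), C) == matmul(A, matmul(B, C))
```

Even for square matrices of the same size, `AB` and `BA` differ here, while the two groupings of the triple product agree.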

Identity matrix
Denoted as $I$ or $I_{n\times n}$
e.g.

$\begin{bmatrix} 1 & 0 & 0 \newline 0 & 1 & 0 \newline 0 & 0 & 1 \newline \end{bmatrix}$

For any $m\times n$ matrix $A$, $A\times I_{n\times n}=I_{m\times m}\times A=A$
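A short Python sketch of the identity property for a non-square matrix, where the identity on the right and the identity on the left have different sizes. The helpers `identity` and `matmul` and the sample matrix are illustrative assumptions.

```python
def identity(n):
    """Return the n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2, 3], [4, 5, 6]]          # 2 x 3
right = matmul(A, identity(3))      # A x I (3 x 3 identity)
left = matmul(identity(2), A)       # I x A (2 x 2 identity)
```

Both `right` and `left` equal `A`.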

09Nov

2016

Eugene / Learning, Stanford Machine Learning / 0 comment

Matrix Multiplication

$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} \times \begin{bmatrix} y \newline z \newline \end{bmatrix} = \begin{bmatrix} a\times y + b\times z \newline c\times y + d\times z \newline e\times y + f\times z \end{bmatrix}$

3 by 2 matrix $\times$ 2 by 1 matrix $=$ 3 by 1 matrix

$m$ by $n$ matrix $\times$ $n$ by $o$ matrix $=$ $m$ by $o$ matrix
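The row-by-column rule above can be sketched in plain Python with explicit loops. The function name `matmul` and the numbers are assumptions for illustration; the shape check mirrors the requirement that the inner dimensions match.

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x o matrix B (lists of rows)."""
    m, n, o = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    result = [[0] * o for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(o):      # each column of B
            for k in range(n):  # sum of element-wise products
                result[i][j] += A[i][k] * B[k][j]
    return result


A = [[1, 2], [3, 4], [5, 6]]   # 3 by 2
v = [[7], [8]]                 # 2 by 1
product = matmul(A, v)         # 3 by 1: [[23], [53], [83]]
```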

09Nov

2016

Eugene / Learning, Stanford Machine Learning / 0 comment

Addition & Scalar Multiplication Of Matrices

Addition:

$\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix} + \begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} = \begin{bmatrix} a+w & b+x \newline c+y & d+z \newline \end{bmatrix}$

Scalar multiplication:

$\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix} \times x = \begin{bmatrix} a\times x & b\times x \newline c\times x & d\times x \newline \end{bmatrix}$
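Both operations work entry by entry, which a short Python sketch makes concrete. The helper names `add` and `scale` and the sample values are assumptions for illustration.

```python
def add(A, B):
    """Add two matrices of the same dimension, entry by entry."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same dimension")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(A, x):
    """Multiply every entry of A by the scalar x."""
    return [[a * x for a in row] for row in A]


total = add([[1, 2], [3, 4]], [[10, 20], [30, 40]])   # [[11, 22], [33, 44]]
scaled = scale([[1, 2], [3, 4]], 3)                   # [[3, 6], [9, 12]]
```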

09Nov

2016

Eugene / Learning, Stanford Machine Learning / 0 comment

Matrices & Vectors

Matrix
Matrix: rectangular array of numbers
Dimension of matrix: number of rows $\times$ number of columns
$A_{ij}$ : the entry in the $i^{th}$ row, $j^{th}$ column

e.g.

$\begin{bmatrix} a & b & c \newline d & e & f \newline g & h & i \newline j & k & l \end{bmatrix}$

dimension: $4\times3$ or $\mathbb{R}^{4\times3}$
$A_{11}=a$
$A_{32}=h$

Vector
Vector: $n\times1$ matrix
$v_{i}$ : $i^{th}$ element

e.g.

$\begin{bmatrix} a \newline b \newline c \end{bmatrix}$

dimension: 3-dimensional vector or $\mathbb{R}^{3}$
$v_{1}=a$
$v_{3}=c$

1-indexed vector:

$\begin{bmatrix} y_1 \newline y_2 \newline y_3 \end{bmatrix}$

0-indexed vector:

$\begin{bmatrix} y_0 \newline y_1 \newline y_2 \end{bmatrix}$
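The indexing convention matters when moving from the notation to code. Python lists are 0-indexed, so the 1-indexed entries $A_{ij}$ and $v_i$ need an offset of one. The helper names below are assumptions for illustration; the letters are the ones from the examples above.

```python
A = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
v = ["a", "b", "c"]


def matrix_entry(A, i, j):
    """Return A_ij for 1-indexed row i and column j."""
    return A[i - 1][j - 1]


def vector_entry(v, i):
    """Return v_i for a 1-indexed position i."""
    return v[i - 1]


dimension = (len(A), len(A[0]))   # (rows, columns) = (4, 3)
a_32 = matrix_entry(A, 3, 2)      # "h"
v_3 = vector_entry(v, 3)          # "c"
```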

08Nov

2016

Eugene / Learning, Stanford Machine Learning / 0 comment

Gradient Descent

Gradient descent algorithm
repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for $j=0$ and $j=1$ )
}

$\alpha$ : learning rate
$a:=b$ : assigning $b$ to $a$

Simultaneous update: compute both new values from the current $\theta_0$ and $\theta_1$ before assigning either
temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0$ := temp0
$\theta_1$ := temp1
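To see why the temporary variables matter, here is a Python sketch that compares one simultaneous step with one incorrect sequential step. The cost function $J(\theta_0,\theta_1)=(\theta_0-1)^2+(\theta_1+2)^2+\theta_0\theta_1$ is a made-up example, and all function names are assumptions.

```python
def grad(theta0, theta1):
    """Partial derivatives of the hypothetical cost J with respect to theta0 and theta1."""
    return 2 * (theta0 - 1) + theta1, 2 * (theta1 + 2) + theta0


def simultaneous_step(theta0, theta1, alpha):
    d0, d1 = grad(theta0, theta1)   # both partials use the OLD values
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    return temp0, temp1


def sequential_step(theta0, theta1, alpha):
    # Incorrect: theta1's update sees the already-updated theta0.
    theta0 = theta0 - alpha * grad(theta0, theta1)[0]
    theta1 = theta1 - alpha * grad(theta0, theta1)[1]
    return theta0, theta1


correct = simultaneous_step(0.0, 0.0, 0.1)   # approximately (0.2, -0.4)
wrong = sequential_step(0.0, 0.0, 0.1)       # approximately (0.2, -0.42)
```

The two results differ in $\theta_1$ because the sequential version evaluates the second partial derivative at the new $\theta_0$.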

Gradient descent for linear regression
repeat until convergence {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i})$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right)$

}
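A minimal Python sketch of these two update rules, assuming the usual single-feature hypothesis $h_\theta(x)=\theta_0+\theta_1 x$ (the notes do not define it). For simplicity it runs a fixed number of iterations instead of testing for convergence; the data, learning rate and iteration count are made-up values.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # h_theta(x_i) - y_i for every training example, using the current thetas
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Tuple assignment performs the simultaneous update.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                       # generated from y = 1 + 2x
theta0, theta1 = gradient_descent(xs, ys)
```

On this noise-free data the result approaches $\theta_0=1$ and $\theta_1=2$. A learning rate that is too large for the data would make the updates diverge instead.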

03Nov

2016

Eugene / Learning, MIT Data Science: Data To Insights / 0 comment

P-Value

p-value: probability of observing an outcome which is at least as hostile (or adversarial) to the null hypothesis as the one observed

Example
Null hypothesis: mean lifetime of a manufacturing device = 9.4 years
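Accepted: the sample mean is within 0.396 years of 9.4 ($1.96\times\frac{1.43}{\sqrt{50}}\approx 0.396$, the two-sided 5% cutoff)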

Sample: 50 observations with a sample mean of 8.96
If the null hypothesis is true, what is the probability that a different and independent sample of 50 observations gives a sample mean at least as far from 9.4 as 8.96 is?

At least as extreme as 8.96, which is $9.4-8.96=0.44$ away from the hypothesised mean:
1. Getting a number smaller than 8.96
2. Getting a number larger than 9.84

$P(Z\leq-\frac{0.44}{1.43/\sqrt{50}})+P(Z\geq\frac{0.44}{1.43/\sqrt{50}})=2\times P(Z\leq-2.175)\approx 3\%$

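Conclusion: the smaller the p-value, the stronger the evidence against the null hypothesis. A large p-value means the data are consistent with the null hypothesis, not that it has been proven.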
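The calculation above can be reproduced with Python's standard library. This sketch takes 1.43 to be the standard deviation used in the formula (the notes do not say where it comes from), and the function name `two_sided_p_value` is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sd, n):
    """Two-sided p-value for a z-test of the mean."""
    z = (sample_mean - null_mean) / (sd / sqrt(n))
    # Probability of a standard normal value at least as far from 0 as z, in either tail.
    return 2 * NormalDist().cdf(-abs(z))


p = two_sided_p_value(sample_mean=8.96, null_mean=9.4, sd=1.43, n=50)
```

Here `p` comes out at roughly 0.03, matching the 3% in the example.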

03Nov

2016

Eugene / Learning, MIT Data Science: Data To Insights / 0 comment

Validity Of Binomial Distribution

Binomial distribution: discrete probability distribution of the number of successes in a sequence of $n$ independent yes/no experiments, each of which yields success with probability $p$

Null hypothesis: there is no significant difference between specified populations, any observed difference being due to sampling or experimental error

$P(X = k) = \binom n k p^k(1-p)^{n-k}$

$P(X \le k) = \sum_{i=0}^{\lfloor k \rfloor} {n\choose i}p^i(1-p)^{n-i}$
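Both formulas translate directly into Python using `math.comb`. The function names and the example of 10 trials with $p=0.5$ are assumptions chosen for illustration.

```python
from math import comb, floor


def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k): sum of the pmf from 0 up to floor(k), capped at n."""
    upper = min(floor(k), n)
    return sum(binomial_pmf(i, n, p) for i in range(0, upper + 1))


exactly_five = binomial_pmf(5, 10, 0.5)   # 252 / 1024
at_most_five = binomial_cdf(5, 10, 0.5)   # 638 / 1024
```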

03Nov

2016

Eugene / Learning, MIT Data Science: Data To Insights / 0 comment

Hypothesis Testing

Hypothesis testing: using data observed from a distribution with unknown parameters, we hypothesise that the parameters of this distribution take particular values and test the validity of this hypothesis using statistical methods

Confidence intervals: provide probabilistic level of certainty regarding parameters of a distribution

Example:
1. $X_1, X_2,..., X_n$
2. unknown mean value $\mu$
3. known $\sigma$

normal distribution: $N(\mu, \sigma^2)$
estimate of $\mu$ : $\bar X=\frac{X_1+X_2+\dots+X_n}{n}$
distribution of $\bar X$ : $N(\mu, \frac{\sigma^2}{n})$

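Suppose we want the probability that $\bar X$ is at most 2 above $\mu$. Standardising turns it into a statement about a standard normal $Z$:
$P(\bar X\leq \mu+2)$
$=P(\bar X-\mu\leq 2)$
$=P(\frac{\bar X-\mu}{\sigma/\sqrt{n}}\leq \frac{2}{\sigma/\sqrt{n}})$
$=P(Z\leq \frac{2}{\sigma/\sqrt{n}})$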
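The standardisation step means the probability depends only on $\sigma$ and $n$, not on the unknown $\mu$. The Python sketch below illustrates this with made-up values ($\sigma=10$, $n=25$, and a stand-in $\mu=100$ used only for the cross-check); the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_at_most_delta_above(delta, sigma, n):
    """P(X-bar <= mu + delta) when X-bar ~ N(mu, sigma^2 / n)."""
    return NormalDist().cdf(delta / (sigma / sqrt(n)))


sigma, n = 10, 25                                       # sigma / sqrt(n) = 2
standardised = prob_mean_at_most_delta_above(2, sigma, n)

# Cross-check without standardising, using a stand-in value for the unknown mean.
mu = 100
direct = NormalDist(mu, sigma / sqrt(n)).cdf(mu + 2)
```

Both values are about 0.841, the standard normal probability $P(Z\leq 1)$.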

03Nov

2016

Eugene / Learning, MIT Data Science: Data To Insights / 0 comment

Types Of Errors

$\begin{array}{|l|l|l|l|} \hline & & \textbf{Predicted fraud?} & \\ \hline & & \textbf{Y} & \textbf{N} \\ \hline {\textbf{Is it actually fraud?}} & \textbf{Y} & +/+ \text{(true positive)} & -/+ \text{(false negative - type 2)} \\ \hline {\textbf{}} & \textbf{N} & +/- \text{(false positive - type 1)} & -/- \text{(true negative)} \\ \hline \end{array}$

Precision: how often a classifier is right when it says something is fraud $(\frac{\text{true positives}}{\text{true positives}+\text{false positives}})$
Recall: how much of the actual fraud we correctly detect $(\frac{\text{true positives}}{\text{true positives}+\text{false negatives}})$

$\begin{array}{|l|l|} \hline \textbf{Conservative (flag fewer transactions)} & \textbf{Aggressive (flag more transactions)}\\\hline \text{high precision (few false positives)} & \text{low precision (many false positives)}\\\hline \text{low recall (miss some fraud)} & \text{high recall (catch most fraud)}\\\hline \end{array}$

Harmonic mean of $x$ and $y$ $=$ $\frac{1}{\frac{1}{2}(\frac{1}{x}+\frac{1}{y})}$

$F_1$ $=$ $\frac{1}{\frac{1}{2}(\frac{1}{\text{precision}}+\frac{1}{\text{recall}})}=\frac{2\times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}$
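A small Python sketch that computes the three metrics from confusion-matrix counts. The counts are hypothetical, the function name is an assumption, and the code assumes at least one true positive so that no denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)   # how often a fraud flag is right
    recall = tp / (tp + fn)      # how much of the actual fraud is caught
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1


# Hypothetical conservative classifier: 80 correct flags, 20 wrong flags, 120 missed frauds.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=120)
```

This gives precision 0.8, recall 0.4 and $F_1\approx 0.533$. The harmonic mean sits closer to the lower of the two values, so a classifier cannot score well on $F_1$ by maximising only one of them.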
