Properties Of Matrix Multiplication

1. Not commutative: in general, $A\times B \neq B\times A$
2. Associative: $(A\times B)\times C = A\times (B\times C)$

e.g. For $A \times B$ where $A$ is an $m\times n$ matrix and $B$ is an $n\times m$ matrix,
$A\times B$ is an $m\times m$ matrix,
$B\times A$ is an $n\times n$ matrix.
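
These shape rules are easy to check in NumPy (the all-ones matrices below are just placeholders):

```python
import numpy as np

A = np.ones((3, 2))  # an m x n matrix with m=3, n=2
B = np.ones((2, 3))  # an n x m matrix

print((A @ B).shape)  # A x B is m x m -> (3, 3)
print((B @ A).shape)  # B x A is n x n -> (2, 2)
```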

Identity matrix
Denoted as $I$ or $I_{n\times n}$
e.g. $$\begin{bmatrix}
1 & 0 & 0 \newline
0 & 1 & 0 \newline
0 & 0 & 1 \newline
\end{bmatrix}$$
For any matrix $A$, $A\times I=I\times A=A$
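
A quick NumPy confirmation of the identity property (the matrix $A$ here is arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.eye(2)  # 2x2 identity matrix

# A x I = I x A = A
print(np.allclose(A @ I, A))  # True
print(np.allclose(I @ A, A))  # True
```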

Matrix Multiplication

$$\begin{bmatrix}
a & b \newline
c & d \newline
e & f
\end{bmatrix} \times
\begin{bmatrix}
y \newline
z \newline
\end{bmatrix} =
\begin{bmatrix}
a\times y + b\times z \newline
c\times y + d\times z \newline
e\times y + f\times z
\end{bmatrix}$$
3 by 2 matrix $\times$ 2 by 1 matrix $=$ 3 by 1 matrix

$m$ by $n$ matrix $\times$ $n$ by $o$ matrix $=$ $m$ by $o$ matrix
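
The worked product above maps directly to NumPy's `@` operator (concrete numbers substituted for $a,\dots,f$ and $y,z$):

```python
import numpy as np

# 3 by 2 matrix times 2 by 1 matrix gives a 3 by 1 matrix
M = np.array([[1, 2],
              [3, 4],
              [5, 6]])
v = np.array([[10],
              [20]])

# rows: [1*10+2*20, 3*10+4*20, 5*10+6*20] = [50, 110, 170]
print(M @ v)
print((M @ v).shape)  # (3, 1)
```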

Addition & Scalar Multiplication Of Matrices

Addition: $$\begin{bmatrix}
a & b \newline
c & d \newline
\end{bmatrix} +
\begin{bmatrix}
w & x \newline
y & z \newline
\end{bmatrix} =
\begin{bmatrix}
a+w & b+x \newline
c+y & d+z \newline
\end{bmatrix}$$

Scalar multiplication: $$\begin{bmatrix}
a & b \newline
c & d \newline
\end{bmatrix} \times x =
\begin{bmatrix}
a\times x & b\times x \newline
c\times x & d\times x \newline
\end{bmatrix}$$

Matrices & Vectors

Matrix
Matrix: rectangular array of numbers
Dimension of matrix: number of rows $\times$ number of columns
$A_{ij}$: the entry in the $i^{th}$ row, $j^{th}$ column

e.g. $$\begin{bmatrix}
a & b & c \newline
d & e & f \newline
g & h & i \newline
j & k & l
\end{bmatrix}$$
dimension: $4\times3$ or $\mathbb{R}^{4\times3}$
$A_{11}=a$
$A_{32}=h$

Vector
Vector: $n\times1$ matrix
$v_{i}$: $i^{th}$ element

e.g. $$\begin{bmatrix}
a \newline
b \newline
c
\end{bmatrix}$$
dimension: 3-dimensional vector or $\mathbb{R}^{3}$
$v_{1}=a$
$v_{3}=c$

1-indexed vector: $$\begin{bmatrix}
y_1 \newline
y_2 \newline
y_3
\end{bmatrix}$$

0-indexed vector: $$\begin{bmatrix}
y_0 \newline
y_1 \newline
y_2
\end{bmatrix}$$
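
Note that while mathematical notation is usually 1-indexed, NumPy arrays (like most programming languages) are 0-indexed:

```python
import numpy as np

v = np.array([4, 7, 9])  # a 3-dimensional vector
# 0-indexed: v[0] is the first element, v[2] the last
print(v[0])  # 4
print(v[2])  # 9
```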

Gradient Descent

Gradient descent algorithm
repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for $j=0$ and $j=1$)
}

$\alpha$: learning rate
$a:=b$: assigning $b$ to $a$

Simultaneous update
temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0$ := temp0
$\theta_1$ := temp1
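
The key point is to compute both new values from the old parameters before assigning either. A minimal sketch, where `dJ_dtheta0` and `dJ_dtheta1` are hypothetical callables standing in for the partial derivatives:

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Evaluate both partial derivatives at the OLD (theta0, theta1)
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # Only now overwrite the parameters (simultaneous update)
    return temp0, temp1

# e.g. J(t0, t1) = t0^2 + t1^2, so dJ/dt0 = 2*t0 and dJ/dt1 = 2*t1
print(gradient_step(1.0, 1.0, 0.1, lambda a, b: 2 * a, lambda a, b: 2 * b))
# (0.8, 0.8)
```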

Gradient descent for linear regression
repeat until convergence {
$$\begin{align*}
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \\
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right)
\end{align*}$$
}
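
A minimal NumPy implementation of these update rules; a fixed iteration count stands in for a proper convergence check:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        error = theta0 + theta1 * x - y   # h_theta(x_i) - y_i for all i
        grad0 = error.mean()              # (1/m) * sum(error)
        grad1 = (error * x).mean()        # (1/m) * sum(error * x)
        theta0 -= alpha * grad0           # simultaneous update:
        theta1 -= alpha * grad1           # both grads use old thetas
    return theta0, theta1

# Data generated from y = 1 + 2x, so it should recover (1, 2)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(gradient_descent(x, y))
```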

P-Value

p-value: the probability of observing an outcome at least as extreme (i.e. at least as unfavourable to the null hypothesis) as the one actually observed, assuming the null hypothesis is true

Example
Null hypothesis: mean lifetime of a manufacturing device = 9.4 years
Accepted: within 0.396 units

A sample of 50 elements has sample mean 8.96 (with standard deviation 1.43).
What is the probability that an independent sample of 50 observations yields a sample mean at least as far from 9.4 as 8.96 is, if the null hypothesis is true?

Outcomes at least as extreme as 8.96 (at least 0.44 away from the hypothesised mean 9.4):
1. a sample mean smaller than 8.96
2. a sample mean larger than 9.84

$P(Z\leq-\frac{0.44}{1.43/\sqrt{50}})+P(Z\geq\frac{0.44}{1.43/\sqrt{50}})=2\times P(Z\leq-2.176)\approx 3\%$
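
This two-sided p-value can be reproduced numerically with the standard normal CDF, using the same values ($\sigma = 1.43$, $n = 50$):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigma, n = 1.43, 50
z = 0.44 / (sigma / math.sqrt(n))  # observed deviation in standard errors
p_value = 2 * phi(-z)              # two-sided: both tails
print(round(z, 3), round(p_value, 3))  # 2.176 0.03
```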

Conclusion: the smaller the p-value, the stronger the evidence against the null hypothesis; a large p-value only means the data are consistent with it.

Validity Of Binomial Distribution

Binomial distribution: discrete probability distribution of the number of successes in a sequence of $n$ independent yes/no experiments, each of which yields success with probability $p$

Null hypothesis: there is no significant difference between specified populations, any observed difference being due to sampling or experimental error

$$P(X = k) = \binom n k p^k(1-p)^{n-k}$$
$$P(X \le k) = \sum_{i=0}^{\lfloor k \rfloor} {n\choose i}p^i(1-p)^{n-i}$$
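
Both formulas translate directly to Python using `math.comb` for the binomial coefficient:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k): sum the pmf from 0 up to floor(k)."""
    return sum(binom_pmf(i, n, p) for i in range(math.floor(k) + 1))

print(binom_pmf(2, 4, 0.5))  # C(4,2) * 0.5^4 = 6/16 = 0.375
print(binom_cdf(2, 4, 0.5))  # 1/16 + 4/16 + 6/16 = 0.6875
```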

Hypothesis Testing

Hypothesis testing: using data observed from a distribution with unknown parameters, we hypothesise that the parameters take particular values and test the validity of this hypothesis with statistical methods

Confidence intervals: provide a probabilistic level of certainty about the parameters of a distribution

Example:
1. $X_1, X_2,…, X_n$
2. unknown mean value $\mu$
3. known $\sigma$

normal distribution: $N(\mu, \sigma^2)$
estimate of $\mu$: $\bar X=\frac{X_1 + X_2 + \cdots + X_n}{n}$
distribution of $\bar X$: $N(\mu, \frac{\sigma^2}{n})$

Suppose we want to evaluate $P(\bar X\leq \mu+2)$. Standardising:
$$P(\bar X\leq \mu+2)=P(\bar X-\mu\leq 2)=P\left(\frac{\bar X-\mu}{\sigma/\sqrt{n}}\leq \frac{2}{\sigma/\sqrt{n}}\right)$$
where $\frac{\bar X-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$.
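
After standardising, the probability can be evaluated with the standard normal CDF. The values of $\sigma$ and $n$ below are made up purely for illustration:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigma, n = 2.0, 25  # illustrative (assumed) values: sigma and n are known
# P(Xbar <= mu + 2) = P(Z <= 2 / (sigma / sqrt(n)))
prob = phi(2 / (sigma / math.sqrt(n)))
print(prob)  # P(Z <= 5), which is essentially 1
```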

Types Of Errors

$$\begin{array}{|l|l|l|l|}
\hline
& & \textbf{Predicted fraud?} & \\ \hline
& & \textbf{Y} & \textbf{N} \\ \hline
{\textbf{Is it actually fraud?}} & \textbf{Y} & +/+ \text{(true positive)} & -/+ \text{(false negative, type II error)} \\ \hline
{\textbf{}} & \textbf{N} & +/- \text{(false positive, type I error)} & -/- \text{(true negative)} \\ \hline
\end{array}$$

Precision: how often a classifier is right when it says something is fraud $(\frac{\text{true positives}}{\text{true positives}+\text{false positives}})$
Recall: how much of the actual fraud that we correctly detect $(\frac{\text{true positives}}{\text{true positives}+\text{false negatives}})$

$$\begin{array}{|l|l|}
\hline
\textbf{Conservative (flag fewer transactions)} & \textbf{Aggressive (flag more transactions)}\\\hline
\text{high precision (few false positives)} & \text{low precision (many false positives)}\\\hline
\text{low recall (miss some fraud)} & \text{high recall (catch most fraud)}\\\hline
\end{array}$$

Harmonic mean of $x$ and $y$ $=$ $\frac{1}{\frac{1}{2}(\frac{1}{x}+\frac{1}{y})}$

$F_1$ $=$ $\frac{1}{\frac{1}{2}(\frac{1}{\text{precision}}+\frac{1}{\text{recall}})}=\frac{2\times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}$
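
Computing precision, recall, and $F_1$ from confusion-matrix counts (the counts below are illustrative):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # right when it says "fraud"
    recall = tp / (tp + fn)     # fraction of actual fraud caught
    return 2 * precision * recall / (precision + recall)

# e.g. 80 true positives, 20 false positives, 40 false negatives:
# precision = 0.8, recall = 80/120 = 2/3
print(round(f1_score(80, 20, 40), 4))  # 0.7273
```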

Regression Analysis

Definition: discovering correlations between the outcome $y$ and the set of regressors $x$ (features)

$y$: real random variable
$x$: vector of random variables $X=(X_1,…,X_p)'$

Example
For wages, suppose
$y$: hourly wage
$x$: regressors (experience, gender, education)

$X=(D,W')'$
$D$: target regressor
$W$: controls (the remaining components of $X$)

Prediction: how can we use $X$ to predict $Y$ well?
Inference: how does the predicted value of $Y$ change if we change a component of $X$ holding the rest of the components of $X$ fixed?

