Application Of Gradient Descent
Feature scaling: get every feature into approximately a $-1 \le x_i \le 1$ range
Mean normalisation: replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean (do not apply to $x_0=1$)
$x_i := \dfrac{x_i - \mu_i}{s_i}$
$\mu_i$: average value of $x_i$ in the training set
$s_i$: standard deviation of $x_i$ in the training set (the max-minus-min range also works)
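A minimal NumPy sketch of this scaling, assuming $s_i$ is the standard deviation; the function name `feature_scale` and the example values are illustrative, not from the course materials:

```python
import numpy as np

def feature_scale(X):
    """Mean-normalise and scale each column of X to roughly [-1, 1].

    X is assumed to be an (m, n) matrix of raw feature values,
    without the intercept column x_0 = 1, which must not be scaled.
    """
    mu = X.mean(axis=0)           # mu_i: average of feature i over the training set
    s = X.std(axis=0)             # s_i: standard deviation of feature i
    s[s == 0] = 1.0               # guard against constant features (avoid divide-by-zero)
    return (X - mu) / s, mu, s    # keep mu and s to scale future inputs the same way

# Example: two features on very different scales (house size and number of bedrooms)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_scaled, mu, s = feature_scale(X)
```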
Points to note:
1. If gradient descent is working correctly, $J(\theta)$ should decrease after each iteration.
2. If $\alpha$ is too small, we will have slow convergence.
3. If $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and may fail to converge (see the sketch after this list).
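As a rough illustration of these points, here is a minimal NumPy sketch of batch gradient descent for linear regression that records $J(\theta)$ on every iteration; the names `gradient_descent` and `J_history` and the toy data are illustrative, not from the course materials:

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for linear regression, recording J(theta).

    X is assumed to already contain the intercept column x_0 = 1 and
    feature-scaled inputs; y is the (m,) vector of targets.
    """
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(num_iters):
        error = X @ theta - y                        # h_theta(x) - y for every example
        J_history.append((error @ error) / (2 * m))  # J(theta) before this update
        theta -= (alpha / m) * (X.T @ error)         # simultaneous update of every theta_j
    return theta, J_history

# Tiny example: data generated from y = 1 + 2x, with the intercept column included in X.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta, J_history = gradient_descent(X, y, alpha=0.1, num_iters=500)
# J_history should decrease on every iteration; if it grows or oscillates,
# alpha is too large, and if it shrinks very slowly, alpha is too small.
```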
Advantages and disadvantages:
1. Need to choose $\alpha$ (disadvantage)
2. Needs many iterations (disadvantage)
3. Works well even when $n$, the number of features, is large (advantage)