Feature scaling: get every feature into approximately a $-1 \le x_i \le 1$ range

Mean normalisation: replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean (do not apply to $x_0=1$)

$x_i := \dfrac{x_i - \mu_i}{s_i}$
$\mu_i$: average value of $x_i$ in the training set
$s_i$: standard deviation of $x_i$ in the training set (the range, max minus min, can be used instead)
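
The update $x_i := (x_i - \mu_i)/s_i$ can be sketched in plain Python as follows. This is only an illustration: the helper name `mean_normalise` and the sample values are made up for the example, and $s_i$ is taken to be the population standard deviation (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def mean_normalise(columns):
    """Scale each feature column to zero mean and unit standard deviation.

    `columns` holds one list of values per feature. The constant feature
    x_0 = 1 must NOT be passed in, since it has zero spread.
    """
    scaled, params = [], []
    for col in columns:
        mu = mean(col)    # mu_i: average of the feature over the training set
        s = pstdev(col)   # s_i: standard deviation (max(col) - min(col) also works)
        if s == 0:
            raise ValueError("a constant feature cannot be scaled")
        scaled.append([(x - mu) / s for x in col])
        params.append((mu, s))  # keep these to scale new inputs the same way
    return scaled, params


# Illustrative values only: two features on very different scales.
sizes = [2104.0, 1416.0, 1534.0, 852.0]
rooms = [5.0, 3.0, 3.0, 2.0]
scaled, params = mean_normalise([sizes, rooms])
```

The returned `params` matter in practice: any input seen at prediction time should be scaled with the same $\mu_i$ and $s_i$ computed from the training set.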

Points to note:
1. If gradient descent is working correctly, $J(\theta)$ should decrease after each iteration.
2. If $\alpha$ is too small, we will have slow convergence.
3. If $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and may fail to converge (it can even diverge).
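
The three points above can be checked by recording $J(\theta)$ after every iteration. The sketch below assumes batch gradient descent on one-feature linear regression with the usual squared-error cost; the function names, the toy data, and the three values of $\alpha$ are chosen purely for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha, iterations):
    """Run batch gradient descent from theta = (0, 0), recording J each step."""
    theta0 = theta1 = 0.0
    m = len(xs)
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return (theta0, theta1), history


# Toy data lying exactly on y = 1 + 2x.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [-1.0, 1.0, 3.0, 5.0]

_, good = gradient_descent(xs, ys, alpha=0.1, iterations=50)    # J falls every step
_, slow = gradient_descent(xs, ys, alpha=0.001, iterations=50)  # J falls, but barely
_, bad = gradient_descent(xs, ys, alpha=1.5, iterations=50)     # J grows: divergence
```

Plotting each `history` against the iteration number gives the usual diagnostic curve: a steadily falling line for a suitable $\alpha$, a nearly flat one when $\alpha$ is too small, and a rising one when $\alpha$ is too large.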

1. Need to choose $\alpha$
3. Works well even when $n$ is large