Application Of Gradient Descent

Feature scaling: scale every feature into approximately the range -1 \le x_i \le 1

Mean normalisation: replace x_i with x_i-\mu_i to make features have approximately zero mean (do not apply to x_0=1)

x_i := \dfrac{x_i - \mu_i}{s_i}
\mu_i: average value of x_i in the training set
s_i: standard deviation of x_i (the range \max - \min is also commonly used)
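The scaling and mean-normalisation steps above can be sketched in a few lines of NumPy. The data below is hypothetical, chosen only to show two features with very different ranges:

```python
import numpy as np

# Toy design matrix: 4 training examples, 2 features with very
# different ranges (hypothetical data, for illustration only).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)      # mu_i: per-feature average
s = X.std(axis=0)        # s_i: per-feature standard deviation

X_scaled = (X - mu) / s  # x_i := (x_i - mu_i) / s_i

# Each column now has approximately zero mean and unit variance.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Note that this is applied only to the real features; the intercept term x_0 = 1 would be appended afterwards, unscaled.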

Points to note:
1. If gradient descent is working correctly, J(\theta) should decrease after each iteration.
2. If \alpha is too small, we will have slow convergence.
3. If \alpha is too large, J(\theta) may not decrease on every iteration and may fail to converge (it can overshoot the minimum and even diverge).
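These points can be checked directly by tracking J(\theta) across iterations. The following is a minimal sketch of batch gradient descent for linear regression on a tiny synthetic problem (the data, \alpha, and iteration count are assumptions for illustration, not values from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=200):
    """Batch gradient descent for linear regression.

    Assumes X already includes the intercept column x_0 = 1
    and that the features are scaled.
    """
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iters):
        error = X @ theta - y
        costs.append((error @ error) / (2 * m))  # J(theta) before the update
        theta -= (alpha / m) * (X.T @ error)     # simultaneous update of all theta_j
    return theta, costs

# Tiny synthetic problem: y = 1 + 2x, no noise.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])  # prepend x_0 = 1
y = 1 + 2 * x

theta, costs = gradient_descent(X, y)
print(theta)  # approaches [1, 2]
```

With this \alpha, J(\theta) decreases on every iteration; raising \alpha far enough makes the cost sequence oscillate or blow up, which is exactly the diagnostic described in point 3.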

Advantages and disadvantages (compared with the normal equation):
1. Disadvantage: need to choose \alpha
2. Disadvantage: needs many iterations
3. Advantage: works well even when the number of features n is large