An exponentially-decaying learning rate
Suppose that \(f:\mathbb{R}^n \rightarrow \mathbb{R}\) is differentiable. Starting with \(w_0 \in \mathbb{R}^n\), we can iteratively construct a sequence of points \(\{w_t\}_{t=0}^\infty\) by letting
\[w_{t+1} = w_t - \alpha \nabla(f)(w_t)\]
for a fixed \(\alpha \in \mathbb{R}.\) As we proved in class, if \(\alpha\) is suitably chosen, the value of \(f\) decreases at each point of this sequence. More precisely:
Theorem: Suppose that all second derivatives of \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) exist, that its gradient \(\nabla f :\mathbb{R}^n \rightarrow \mathbb{R}^n\) is Lipschitz continuous with constant \(L\), and that \(\alpha = \tfrac{1}{L}\). Then
\[ f(w_{t+1}) \leq f(w_t) - \tfrac{1}{2L} ||\nabla(f)(w_t)||^2.\]
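To make the update rule concrete, here is a minimal sketch of constant-step gradient descent on a small quadratic, where the Lipschitz constant of the gradient is simply the largest eigenvalue of the matrix. The particular \(A\), \(b\), starting point, and variable names are illustrative choices, not anything from class.

```python
import numpy as np

# A minimal sketch of constant-step gradient descent on the quadratic
# f(w) = 0.5 w^T A w - b^T w.  Its gradient A w - b is Lipschitz with
# constant L equal to the largest eigenvalue of A, so the theorem's
# step size is alpha = 1/L.  (A, b, and w_0 are illustrative choices.)

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite
b = np.array([1.0, -1.0])

def f(w):
    return 0.5 * w @ A @ w - b @ w

def grad_f(w):
    return A @ w - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
alpha = 1.0 / L                   # step size from the theorem

w = np.array([5.0, -5.0])         # w_0
for t in range(20):
    g = grad_f(w)
    w_next = w - alpha * g
    # guaranteed progress: f(w_{t+1}) <= f(w_t) - (1/(2L)) ||grad f(w_t)||^2
    assert f(w_next) <= f(w) - (0.5 / L) * (g @ g) + 1e-12
    w = w_next

print("final iterate:", w)
print("true minimizer:", np.linalg.solve(A, b))
```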
This is a perfectly reasonable result if we have an a priori estimate for \(L\). If we do not, the gradient descent algorithm can be tweaked in the following way. Consider \(\alpha \in (0,1)\) and define a sequence of points by \[w_{t+1} = w_t - \alpha^t \nabla(f)(w_t).\] Verify the following version of a guaranteed progress bound:
Homework exercise: With \(f\) as above and an arbitrary \(\alpha \in (0,1)\), show that there exists an \( N \in \mathbb{N}\) such that if \(t\geq N\), then \[ f(w_{t+1}) \leq f(w_t) - \frac{\alpha^t}{2} ||\nabla(f)(w_t)||^2.\]
You may use the fact that \(\lim_{t \rightarrow \infty} \alpha^t = 0\) without proof.
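As a sanity check (not a proof!), you can run the decayed-step iteration on the same kind of quadratic and watch the claimed inequality start to hold once \(\alpha^t\) is small enough. Everything below, including the choice \(\alpha = 0.8\) and the starting point, is an arbitrary choice for illustration.

```python
import numpy as np

# Empirical check of the decayed-step iteration
#     w_{t+1} = w_t - alpha^t * grad f(w_t)
# on the same quadratic as before.  This only illustrates the claimed
# bound on one example; it is not a proof.  alpha and w_0 are arbitrary.

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(w):
    return 0.5 * w @ A @ w - b @ w

def grad_f(w):
    return A @ w - b

alpha = 0.8
w = np.array([2.0, -1.0])
for t in range(25):
    g = grad_f(w)
    w_next = w - alpha**t * g
    # claimed bound: f(w_{t+1}) <= f(w_t) - (alpha^t / 2) ||grad f(w_t)||^2
    holds = f(w_next) <= f(w) - 0.5 * alpha**t * (g @ g) + 1e-12
    print(f"t={t:2d}  alpha^t={alpha**t:.4f}  bound holds: {holds}")
    w = w_next
```

On this toy example the inequality fails for the first few steps, while \(\alpha^t\) is still large, and holds from some point on. Changing \(\alpha\) or the starting point changes how soon the bound kicks in, but not that it eventually does, which is exactly what the exercise asks you to prove in general.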
With an additional technical restriction on \(f\), this result can be used to prove that the sequence \(\{w_t\}\) is Cauchy and consequently converges. But I won’t ask you to prove that just yet…
Note: There is another advantage to using an exponentially-decaying learning rate instead of a constant one. According to the theorem above, a sufficiently small constant learning rate guarantees that the value of \(f\) decreases monotonically along the iterates. But this may force the algorithm to seek only the nearest local minimum of the objective function \(f\). If the learning rate starts out large, sacrificing the guarantee of monotonicity allows gradient descent to explore the basins of attraction of other local minima. The exponential decay guarantees that the learning rate will eventually be small.