Machine learning: Gradient descent

Gradient descent #

In class, we derived an efficient way to compute the gradient of the square loss function commonly used in regression problems. Armed with this knowledge, we can implement the gradient descent optimization algorithm in silico.
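The core of the method can be sketched in a few lines. This is a minimal sketch, not the notebook's exact code: it assumes a design matrix `X` whose rows are feature vectors, and the function names, learning rate, and step count are illustrative.

```python
import numpy as np

def square_loss_gradient(X, y, w):
    """Gradient of the average square loss L(w) = (1/n) * ||X @ w - y||^2."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ w - y)

def gradient_descent(X, y, lr=0.1, steps=2000):
    """Plain gradient descent on the square loss, starting from the zero vector."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * square_loss_gradient(X, y, w)
    return w
```

Note that the learning rate must be small enough for the iteration to converge; too large a step makes the iterates diverge, which is exactly what the contour diagram above illustrates.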

Figure: Gradient descent paths using different learning rates on the contour diagram of the objective function.


Exercise:

Work through the following Python notebook. We will:

  • generate synthetic data from a function \(f\),
  • find a polynomial \(p\) that fits our data using gradient descent, and
  • compare \(f\) and \(p\).
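The three steps above can be sketched end to end. The target function \(f\), the noise level, and the polynomial degree here are assumptions for illustration; the notebook's own choices may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target function f and polynomial degree (the notebook may differ).
f = lambda x: np.sin(2 * np.pi * x)
degree = 3

# 1. Generate noisy synthetic data from f.
x = rng.uniform(0.0, 1.0, size=30)
y = f(x) + rng.normal(scale=0.1, size=x.shape)

# 2. Fit a polynomial p via gradient descent on the square loss.
X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
w = np.zeros(degree + 1)
lr, steps = 0.1, 20000
for _ in range(steps):
    w -= lr * (2.0 / len(y)) * X.T @ (X @ w - y)

# 3. Compare f and p on a fine grid (in the notebook, plot both curves).
grid = np.linspace(0.0, 1.0, 200)
p = np.vander(grid, degree + 1, increasing=True) @ w
max_gap = np.max(np.abs(p - f(grid)))
```

In the notebook you would plot `f(grid)` and `p` on the same axes rather than just computing the maximum gap.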

How does the polynomial obtained via gradient descent compare to the polynomial we found using the closed-form solution to the loss-minimization problem? Pay particular attention to the coefficients of the two models. Why do you think the disparity occurs? Do you expect the new model to generalize better?
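For reference, the closed-form minimizer of the square loss solves the normal equations. A sketch, assuming the same design matrix `X` as above:

```python
import numpy as np

def closed_form_fit(X, y):
    """Solve the normal equations X^T X w = X^T y for the least-squares w."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Comparing the coefficients returned by `closed_form_fit` with those produced by gradient descent makes the disparity in the question above concrete.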

In this exercise, you found a gradient update formula for regularized versions of the square loss function. Does regularization work, or is it just a good story?
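Writing \(w\) for the coefficient vector, \(\eta\) for the learning rate, and \(\lambda\) for the regularization coefficient (your own derivation may use different notation), the updates take the form

\[
w \leftarrow w - \eta\bigl(\nabla L(w) + 2\lambda w\bigr) \qquad \text{for the } L_2 \text{ penalty } \lambda \lVert w \rVert_2^2,
\]
\[
w \leftarrow w - \eta\bigl(\nabla L(w) + \lambda \operatorname{sign}(w)\bigr) \qquad \text{for the } L_1 \text{ penalty } \lambda \lVert w \rVert_1,
\]

where \(\nabla L\) is the gradient of the unregularized square loss and \(\operatorname{sign}\) is applied componentwise (a subgradient, since the \(L_1\) penalty is not differentiable at zero).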

Homework exercise:

Use the same notebook as above and repeat the exercise using ridge and lasso regression; that is, incorporate \(L_2\) and \(L_1\) regularization into your loss function. You will have to modify the code appropriately. Vary the regularization coefficient \(\lambda\) and gauge its impact on your model, paying particular attention to the coefficients obtained. Detail your results, addressing the following two questions:

  • Is there a qualitative difference in how each type of regression affects the model parameters, that is, the polynomial coefficients?
  • Is there an optimal amount of regularization? How do you find it?

Support your conclusions with evidence from the lab.
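The modification the homework asks for can be sketched as follows. This is one simple approach, not necessarily the notebook's: the names `penalty` and `lam` are illustrative, and the \(L_1\) term uses a plain subgradient (proximal methods are the more careful alternative).

```python
import numpy as np

def regularized_gd(X, y, lam, penalty=None, lr=0.05, steps=5000):
    """Gradient descent on the square loss plus an optional penalty.

    penalty: "l2" (ridge), "l1" (lasso), or None (plain least squares).
    The L1 penalty is non-differentiable at 0; sign(w) is a subgradient.
    """
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        if penalty == "l2":
            grad += 2.0 * lam * w
        elif penalty == "l1":
            grad += lam * np.sign(w)
        w -= lr * grad
    return w
```

To study the effect of \(\lambda\), sweep it over several orders of magnitude (e.g. `np.logspace(-4, 1, 10)`) and plot the resulting coefficients for each penalty side by side.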