Gradient update formula #
Background #
Suppose we would like to find a set of weights \(\vec{w}\) that minimizes a loss function \(f\). One approach is gradient descent, which creates a sequence of weight vectors \(\{\vec{w}_t\}_{t=0}^k\) using the gradient descent step: \[ \vec{w}_{t+1} = \vec{w}_t - \alpha \nabla(f)(\vec{w}_t).\] In many situations, this computation can be expressed in terms of linear algebra. For instance, suppose that our training data has the form \( \mathcal{D} = \{(\vec{x}_i, y_i)\}_{i=1}^n \), that we are performing regression so that our prediction for the input vector \(\vec{x}_i\) is \(h_{\vec{w}}(\vec{x}_i) = \vec{x}_i^t \vec{w} \) for some as-yet-undetermined weight vector \(\vec{w}\), and that we are using the squared-error loss function \(f(\vec{w}) = \sum_{i=1}^n (h_{\vec{w}}(\vec{x}_i) - y_i)^2\). If we form the matrix
\[ A= \begin{bmatrix} \text{–} & \vec{x}_1^t & \text{–} \\ \text{–} & \vec{x}_2^t & \text{–} \\ & \vdots & \\ \text{–} & \vec{x}_n^t & \text{–} \\ \end{bmatrix} \]
and write \(\vec{y}\) for the vector of outputs, then the gradient update step takes the following simple form:
\[ \vec{w}_{t+1} = \vec{w}_t - 2\alpha (A^t A \vec{w}_t - A^t\vec{y}).\]
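To make this concrete, here is a minimal NumPy sketch of the update above. The data, step size, and iteration count are placeholders chosen for illustration, not part of the assignment.

```python
import numpy as np

def gradient_descent_step(w, A, y, alpha):
    """One gradient descent step for the squared-error loss f(w) = ||A w - y||^2."""
    grad = 2 * (A.T @ A @ w - A.T @ y)   # gradient of the squared-error loss at w
    return w - alpha * grad

# Illustrative usage on random data (the sizes and step size are arbitrary).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))   # n = 100 rows, each row is x_i^t with 3 features
y = rng.normal(size=100)        # vector of outputs
w = np.zeros(3)
for _ in range(500):
    w = gradient_descent_step(w, A, y, alpha=1e-3)
```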
What if we wanted to perform gradient descent with a more complex loss function?
Regularized loss functions #
Recall the notation \(||\vec{w}||_2 = \sqrt{\vec{w}^t \vec{w}}\) and \(||\vec{w}||_1 = \sum_{i=1}^m |w_i|\), where \(m\) is the number of weights, and consider the following loss functions, both of which penalize large weights:
- \(L^2\)-regularized loss: \[f(\vec{w}) = \sum_{i=1}^n (\vec{x}_i^t \vec{w} - y_i)^2 + \lambda ||\vec{w}||^2_2 \]
- \(L^1\)-regularized loss: \[f(\vec{w}) = \sum_{i=1}^n (\vec{x}_i^t \vec{w} - y_i)^2 + \lambda ||\vec{w}||_1 \]
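For reference, both losses are straightforward to evaluate numerically. The sketch below assumes the same matrix \(A\) and output vector \(\vec{y}\) as in the background section, with `lam` standing in for \(\lambda\); the function names are placeholders, not course-provided code.

```python
import numpy as np

def l2_regularized_loss(w, A, y, lam):
    """Sum of squared errors plus lambda * ||w||_2^2."""
    residual = A @ w - y
    return residual @ residual + lam * (w @ w)

def l1_regularized_loss(w, A, y, lam):
    """Sum of squared errors plus lambda * ||w||_1."""
    residual = A @ w - y
    return residual @ residual + lam * np.sum(np.abs(w))
```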
We will use both later in the course to regularize our models, in the hope that they not only fit the training data well but also make good predictions on validation data.
Using the techniques introduced in class, derive a gradient update formula for each of these loss functions.
Hint: When computing the gradient of the \(L^1\)-regularized loss, you may assume that the entries of your weight vector \(\vec{w}\) are all non-zero. You may find the function \(\text{sign}(x) \) useful; it is equal to \(1\) when \(x\) is positive and \(-1\) when \(x\) is negative.
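Once you have derived your update formulas, one way to sanity-check them (not required for the assignment) is to compare your hand-computed gradient against a central-difference approximation. The helper below is a generic sketch, not part of any course-provided code.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w.

    f is a function mapping a weight vector to a scalar loss;
    w should be a float array.
    """
    grad = np.zeros(w.shape)
    for i in range(len(w)):
        e = np.zeros(w.shape)
        e[i] = eps
        grad[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return grad
```

Evaluating this helper on the regularized losses at a few random weight vectors and comparing the result with your derived expressions is usually enough to catch algebra mistakes.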