Empirical risk minimization #
As usual, assume that we have a data set of input-output pairs of the form: \[\mathcal{D} = \{(\vec{x}_i, y_i)\}_{i=1}^n\]
and that we would like to find a function \(h_\vec{w}:\mathbb{R}^d \rightarrow \mathbb{R} \) in a hypothesis class \(\mathcal{H}\) so that \(h_\vec{w}(\vec{x}_i) \approx y_i\) for all \(i\). Here \(\vec{w}\) is a collection of real-valued parameters or weights that we can use to specify the function. For instance, the weights for a polynomial function may be the coefficients of its monomials, or its coefficients when written in terms of some other family of polynomials. The quality of our work is often assessed in terms of a loss function \(\mathcal{L}(\vec{w},\mathcal{D})\) which encodes how well our model fits the data, how well it generalizes, and perhaps other criteria such as fairness. For instance, a common framework defines:
\[ \mathcal{L}(\vec{w},\mathcal{D}) = \tfrac{1}{n} \sum_i \ell(h_\vec{w}(\vec{x}_i), y_i) + \lambda r(\vec{w})\]
where \(\ell\) is a measure of the distance between \(h_\vec{w}(\vec{x}_i)\) and \( y_i\) and \(r\) is a regularization term that serves as a proxy for the model’s ability to generalize. Our task can now be enunciated as:
\[ \min_{\vec{w}} \; \mathcal{L}(\vec{w},\mathcal{D}). \]
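To make these ingredients concrete, here is a minimal Python sketch of the regularized empirical risk for a linear hypothesis class. It assumes the squared-error loss \(\ell(\hat{y}, y) = (\hat{y}-y)^2\) and the ridge penalty \(r(\vec{w}) = \lVert \vec{w}\rVert^2\); these choices, and all names in the code, are illustrative assumptions rather than part of the general framework.

```python
import numpy as np

def h(w, X):
    """Linear hypothesis h_w(x) = <w, x>, applied to every row of X."""
    return X @ w

def empirical_risk(w, X, y, lam):
    """Regularized empirical risk:
    (1/n) * sum_i (h_w(x_i) - y_i)^2 + lam * ||w||^2.
    Squared error for ell and the ridge penalty for r are
    illustrative assumptions, not the only possible choices.
    """
    residuals = h(w, X) - y             # h_w(x_i) - y_i for every i
    data_fit = np.mean(residuals ** 2)  # (1/n) * sum of squared errors
    penalty = lam * np.dot(w, w)        # lambda * ||w||^2
    return data_fit + penalty

# Tiny made-up example: evaluate the risk of the zero weight vector.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(empirical_risk(np.zeros(2), X, y, lam=0.1))
```

Evaluating this function at different weight vectors lets us compare candidate hypotheses; the minimization problem above asks for the weights at which this number is smallest.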
Wishful thinking suggests we may be able to use elementary multivariable calculus techniques to do this: set the gradient \(\nabla_{\vec{w}} \mathcal{L}(\vec{w},\mathcal{D})\) equal to zero and solve for the weights. But wishful thinking almost never works out: the resulting equations rarely admit a closed-form solution, and we are forced to develop more sophisticated optimization methods. This is the goal of the present section of the course.
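To see what the calculus approach amounts to, a critical point of the loss satisfies
\[ \nabla_{\vec{w}} \mathcal{L}(\vec{w},\mathcal{D}) = \vec{0}. \]
In the illustrative linear least-squares setting sketched above (squared-error loss, ridge penalty, \(h_\vec{w}(\vec{x}) = \vec{w}\cdot\vec{x}\), with \(X\) the matrix whose rows are the \(\vec{x}_i\) and \(\vec{y} = (y_1,\dots,y_n)\)), this condition can in fact be solved in closed form:
\[ \tfrac{2}{n} X^\top (X\vec{w} - \vec{y}) + 2\lambda \vec{w} = \vec{0} \quad\Longleftrightarrow\quad (X^\top X + n\lambda I)\,\vec{w} = X^\top \vec{y}. \]
But such closed forms are the exception rather than the rule: for most hypothesis classes the critical-point equations are nonlinear in \(\vec{w}\) and admit no explicit solution, which is exactly why we need the optimization methods developed in this section.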