Machine learning: The Hessian

The Hessian and the rate of change of the gradient #

Background #

Suppose \(f:\mathbb{R}^n \rightarrow \mathbb{R}\). The gradient of \(f\) is also a function \(\nabla(f) : \mathbb{R}^n \rightarrow \mathbb{R}^n\). If \(\vec{x} \in \mathbb{R}^n\), then we can write

\[ \nabla(f)(\vec{x}) = \Big( \tfrac{\partial f}{\partial x_1}(\vec{x}), \ldots , \tfrac{\partial f}{\partial x_n}(\vec{x}) \Big)^t\]

Each coordinate is the derivative of \(f\) in the direction of one of the coordinate axes of \(\mathbb{R}^n\). The gradient is itself a function: its value changes as the input \(\vec{x}\) changes. The purpose of this exercise is to describe that change.
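As a small running example (chosen purely for illustration), take \(f(x_1, x_2) = x_1^2 + 3 x_1 x_2\). Then

\[ \nabla(f)(\vec{x}) = \Big( 2x_1 + 3x_2, \; 3x_1 \Big)^t \]

so, for instance, the gradient at \((1,0)^t\) is \((2,3)^t\) while the gradient at \((1,1)^t\) is \((5,3)^t\): moving the input moves the gradient.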

A derivative of the gradient #

Suppose that we are standing at the point \(\vec{x} \in \mathbb{R}^n\), have just computed the gradient vector \(\nabla(f)(\vec{x})\), and would like to know how it changes if we start moving in the direction of the unit vector \(\vec{v}\). The limit of the following difference quotient captures the corresponding instantaneous rate of change:

\[ \lim_{h \rightarrow 0} \frac{\nabla(f)(\vec{x} + h \vec{v}) - \nabla(f)(\vec{x}) }{ h }\]

Note that the numerator is a vector, so the limit of this difference quotient is a vector as well: it records the instantaneous rate of change of each component of \(\nabla(f)\). You may think of it as a derivative of the gradient of \(f\). There is a straightforward way to compute it directly, but doing so requires a definition first.

Definition: The Hessian matrix of a function \(f:\mathbb{R}^n \rightarrow \mathbb{R}\) at a point \(\vec{x} \in \mathbb{R}^n\) is the following \(n \times n\) array of second partial derivatives of \(f\), all evaluated at \(\vec{x}\): \[ \text{Hess}(f)(\vec{x}) = \begin{bmatrix} \tfrac{\partial^2 f}{\partial x_1 \partial x_1} (\vec{x}) & \tfrac{\partial^2 f}{\partial x_1 \partial x_2} (\vec{x}) & \tfrac{\partial^2 f}{\partial x_1 \partial x_3} (\vec{x}) & \ldots & \tfrac{\partial^2 f}{\partial x_1 \partial x_n} (\vec{x})\\ \tfrac{\partial^2 f}{\partial x_2 \partial x_1} (\vec{x}) & \tfrac{\partial^2 f}{\partial x_2 \partial x_2} (\vec{x}) & \tfrac{\partial^2 f}{\partial x_2 \partial x_3} (\vec{x})& \ldots & \tfrac{\partial^2 f}{\partial x_2 \partial x_n} (\vec{x})\\ \vdots & \vdots & \vdots & & \vdots \\ \tfrac{\partial^2 f}{\partial x_n \partial x_1} (\vec{x}) & \tfrac{\partial^2 f}{\partial x_n \partial x_2} (\vec{x}) & \tfrac{\partial^2 f}{\partial x_n \partial x_3} (\vec{x})& \ldots & \tfrac{\partial^2 f}{\partial x_n \partial x_n} (\vec{x})\\ \end{bmatrix} \]
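For the running example \(f(x_1, x_2) = x_1^2 + 3 x_1 x_2\) above, the second partials are easy to read off:

\[ \text{Hess}(f)(\vec{x}) = \begin{bmatrix} 2 & 3 \\ 3 & 0 \end{bmatrix} \]

For this particular \(f\) the Hessian happens to be constant; in general its entries are themselves functions of \(\vec{x}\). Notice also that the matrix is symmetric, which foreshadows the first hint below.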

Homework exercise: Show that \[ \text{Hess}(f)(\vec{x}) \cdot \vec{v} = \lim_{h \rightarrow 0} \frac{\nabla(f)(\vec{x} + h \vec{v}) - \nabla(f)(\vec{x}) }{ h }\] That is, the derivative of the gradient vector in the direction of \(\vec{v}\) is the product of the Hessian matrix with \(\vec{v}\).
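Before attempting the proof, it can help to watch the identity hold numerically. Below is a minimal sanity check (not a proof) written against the running example; the helper names grad_f and hess_f, the base point, the direction, and the step size h = 1e-6 are all arbitrary choices made for this sketch, and NumPy is assumed to be available.

```python
import numpy as np

def grad_f(x):
    # Gradient of the running example f(x1, x2) = x1**2 + 3*x1*x2.
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

# Hessian of the running example; for this quadratic f it is constant in x.
hess_f = np.array([[2.0, 3.0],
                   [3.0, 0.0]])

x = np.array([1.0, 2.0])                 # arbitrary base point
v = np.array([1.0, 1.0]) / np.sqrt(2.0)  # arbitrary unit direction
h = 1e-6                                 # small step standing in for h -> 0

finite_diff = (grad_f(x + h * v) - grad_f(x)) / h  # the difference quotient
hvp = hess_f @ v                                   # Hessian-vector product

print(finite_diff)  # approximately [3.5355 2.1213]
print(hvp)          # [3.53553391 2.12132034]
```

Because this particular \(f\) is quadratic, the two vectors agree up to floating-point error even for a fixed \(h\); for a general \(f\) the agreement improves as \(h \rightarrow 0\).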

Hint: Take your time and set things up carefully; there is a lot of notation to digest. Here are facts that you may find useful.

  • For sufficiently smooth \(f\) (in particular, when the second partial derivatives are continuous), the order in which mixed partial derivatives are taken does not matter. That is: \[\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} \quad \text{for all } i, j \]
  • Computing limits of vector-valued functions can be done one component at a time. That is, if each component \(v_i(h)\) is a function of \(h\) and the componentwise limits exist, then \[ \lim_{h \rightarrow 0} \begin{pmatrix} v_1(h) \\ v_2(h) \\ \vdots \\ v_n(h) \\ \end{pmatrix} = \begin{pmatrix} \lim_{h \rightarrow 0} v_1(h) \\ \lim_{h \rightarrow 0} v_2(h) \\ \vdots \\ \lim_{h \rightarrow 0} v_n(h) \\ \end{pmatrix} \]
  • The directional derivative of a function \(g: \mathbb{R}^n \rightarrow \mathbb{R}\) in the direction of the unit vector \(\vec{v}\) can be computed as a dot product (a short worked instance follows this list):
    \[ g_{\vec{v}} (\vec{x}) = \nabla(g)(\vec{x}) \cdot \vec{v} = \tfrac{\partial g}{\partial x_1}(\vec{x})\, v_1 + \ldots +\tfrac{\partial g}{\partial x_n}(\vec{x})\, v_n \] where \(v_1, \ldots, v_n\) are the components of \(\vec{v}\).
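To illustrate the last fact with a (purely illustrative) example: if \(g(x_1, x_2) = x_1 x_2\), \(\vec{x} = (1, 2)^t\), and \(\vec{v} = \big( \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}} \big)^t\), then \(\nabla(g)(\vec{x}) = (2, 1)^t\) and

\[ g_{\vec{v}}(\vec{x}) = 2 \cdot \tfrac{1}{\sqrt{2}} + 1 \cdot \tfrac{1}{\sqrt{2}} = \tfrac{3}{\sqrt{2}}. \]

In the exercise, each component \(\tfrac{\partial f}{\partial x_i}\) of the gradient can play the role of such a \(g\).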