Multivariable calculus: Chain rule

Chain rule for neural networks #

written and developed by Thomas Pietraho

Derivatives with respect to input variables #

Consider the neural network in Figure 1 with weights and biases as indicated and assume that all of its neurons use the ReLU activation function. The neural network itself is a function \(\mathcal{N}(x)\) of the single variable \(x\):

Figure 1: Diagram of a single hidden layer neural network.

By carefully writing out the formulas, applying the multivariate chain rule, and noting that the derivative of ReLU at any point other than 0 is easy to compute (it equals 0 for negative inputs and 1 for positive inputs), it is possible to compute the rate of change of \(\mathcal{N}\) with respect to the input variable \(x\). Carry this out below:

Exercise: Compute \(\frac{d\mathcal{N}}{dx}\) when \(x=17\).
Exercise: Compute \(\frac{d\mathcal{N}}{dx}\) when \(x = -2\).
Exercise: Suppose a neural network \(\mathcal{N}(x_1,x_2)\) has been designed to predict the unemployment rate based on two variables \(x_1\) and \(x_2\). Their current values are \(x_1=10\) and \(x_2=3\). After much strife, you compute that \[\frac{\partial\mathcal{N}}{\partial x_1}(10,3) = 0.45 \; \; \text{ and } \; \; \frac{\partial\mathcal{N}}{\partial x_2}(10,3) = -0.2.\] As a policy maker, you can influence the values of \(x_1\) and \(x_2\). How can you act to reduce the unemployment rate?
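
The weights and biases in Figure 1 are not reproduced here, so the following Python sketch uses a small single-hidden-layer ReLU network with hypothetical parameters; the point is only to show how the chain-rule computation of \(\frac{d\mathcal{N}}{dx}\) is organized, together with a finite-difference check. Substituting the values from Figure 1 lets you verify your answers to the exercises above.

```python
# A minimal sketch of d(N)/dx for a single-hidden-layer ReLU network.
# The weights and biases below are hypothetical stand-ins, NOT the ones
# in Figure 1; substitute the values from the figure to check the exercises.

def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    # Derivative of ReLU away from 0 (it is undefined exactly at 0).
    return 1.0 if z > 0 else 0.0

# Hypothetical parameters: two hidden neurons feeding one output neuron.
w1, b1 = 0.8, -1.0            # hidden neuron 1
w2, b2 = -0.5, 2.0            # hidden neuron 2
v1, v2, c = 1.5, 0.7, 0.1     # output neuron

def network(x):
    h1 = relu(w1 * x + b1)
    h2 = relu(w2 * x + b2)
    return relu(v1 * h1 + v2 * h2 + c)

def network_dx(x):
    # Chain rule, layer by layer.
    z1, z2 = w1 * x + b1, w2 * x + b2
    h1, h2 = relu(z1), relu(z2)
    dh1_dx = relu_prime(z1) * w1
    dh2_dx = relu_prime(z2) * w2
    z_out = v1 * h1 + v2 * h2 + c
    return relu_prime(z_out) * (v1 * dh1_dx + v2 * dh2_dx)

x0 = 17.0
eps = 1e-6
finite_diff = (network(x0 + eps) - network(x0 - eps)) / (2 * eps)
print(network_dx(x0), finite_diff)   # the two values should agree closely
```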

Derivatives with respect to weights and biases #

The output of a neural network varies with both the values of the input vector \(\vec{x}\) and the weights and biases of its neurons: changing the value of a weight or a bias changes the output of the network. Let us first study this in the case of a single neuron:

Figure 2: A generic neuron.

Unraveling what this diagram encodes, we find that \(\mathcal{N}(x,y) = \sigma(w_1 x + w_2 y +b ).\) It is a function of \(x\) and \(y\) as well as of \(w_1\), \(w_2\), and \(b\). So by holding all other variables constant, we can compute partial derivatives. For instance:

\[\frac{\partial\mathcal{N}}{\partial w_1} = \frac{\partial\sigma}{\partial c} \cdot \frac{\partial c}{\partial w_1} = \sigma'(c) \cdot x, \] where \(c = w_1 x + w_2 y + b\) and we have used the fact that \(\frac{\partial c}{\partial w_1} = x\). We can similarly compute the partial derivatives of \(\mathcal{N}\) with respect to \(w_2\) and \(b\).
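
As a sanity check on this formula, here is a short Python sketch for a single ReLU neuron: it evaluates the chain-rule expressions \(\sigma'(c)\,x\), \(\sigma'(c)\,y\), and \(\sigma'(c)\) for the three partial derivatives and compares one of them against a finite-difference estimate. The numerical values of \((x, y, w_1, w_2, b)\) used are hypothetical, not those of the exercises below.

```python
# Numeric sketch of the chain-rule formula above for a single ReLU neuron.
# The point (x, y, w1, w2, b) is hypothetical, chosen only for illustration.

def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0   # undefined exactly at 0

def neuron(x, y, w1, w2, b):
    return relu(w1 * x + w2 * y + b)

# Analytic partials from the chain rule: dN/dw1 = sigma'(c)*x, dN/dw2 = sigma'(c)*y, dN/db = sigma'(c).
def partials(x, y, w1, w2, b):
    c = w1 * x + w2 * y + b
    return relu_prime(c) * x, relu_prime(c) * y, relu_prime(c)

x, y, w1, w2, b = 2.0, -1.0, 0.4, 0.3, 0.1    # hypothetical values
dN_dw1, dN_dw2, dN_db = partials(x, y, w1, w2, b)

# Finite-difference check on dN/dw1.
eps = 1e-6
fd = (neuron(x, y, w1 + eps, w2, b) - neuron(x, y, w1 - eps, w2, b)) / (2 * eps)
print(dN_dw1, fd)   # should agree closely
```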

Exercise: Compute the partial derivative \(\frac{\partial\mathcal{N}}{\partial w_1}\) for the neuron in Figure 2 assuming that the activation function is \(\sigma(x) = \text{ReLU}(x)\) and the values of the variables are \[(x,y,w_1,w_2,b) = (1.0, \; 5.0, \; 0.5, \;-0.1, \; 0.3).\]
Exercise: Repeat the exercise above, this time computing \(\frac{\partial\mathcal{N}}{\partial w_2}\) and \(\frac{\partial\mathcal{N}}{\partial b}\), again at \[(x,y,w_1,w_2,b) = (1.0, \; 5.0, \; 0.5, \;-0.1, \; 0.3).\] How should we change the weights and bias of this neuron to increase its value as quickly as possible? Hint: Use the derivatives you just computed.
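
The hint in the last exercise points at the idea behind gradient ascent: the vector of partial derivatives with respect to \(w_1\), \(w_2\), and \(b\) gives the direction in parameter space in which the neuron's output grows fastest. The sketch below illustrates this with hypothetical values, deliberately different from those in the exercise: taking a small step in the direction of the gradient increases the output.

```python
# Sketch (with hypothetical numbers) of the idea behind the hint:
# the partial derivatives (dN/dw1, dN/dw2, dN/db) form the gradient of N
# with respect to the parameters, and a small step in that direction
# increases the neuron's output fastest.

def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

x, y = 2.0, -1.0                 # hypothetical inputs, held fixed
w1, w2, b = 0.4, 0.3, 0.1        # hypothetical starting parameters
step = 0.01                      # small step size

c = w1 * x + w2 * y + b
grad = (relu_prime(c) * x, relu_prime(c) * y, relu_prime(c))  # gradient in (w1, w2, b)

before = relu(c)
w1, w2, b = w1 + step * grad[0], w2 + step * grad[1], b + step * grad[2]
after = relu(w1 * x + w2 * y + b)
print(before, after)   # the output should increase
```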