Machine learning: Multivariate output

\(\mathcal{N}: K \rightarrow \mathbb{R}^m\)

In class, we showed that neural networks in \(\Sigma^n(\sigma)\) are universal approximators for continuous functions \(f: \mathbb{R}^n \rightarrow \mathbb{R}\). In his original paper, Kurt Hornik extended the result to continuous functions \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) with multivariate output by using the time-tested proof technique of frantic handwaving. He offered:

Analogous results are valid for multi-output networks approximating functions from \(\mathbb{R}^n\) to \(\mathbb{R}^m\).

The goal of this homework exercise is to provide a formal proof of this result. We begin with a formal definition of the analogous class of neural networks:

Definition: Let \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) be a non-constant continuous function and write \(\Sigma^{n,m}(\sigma)\) for the family of functions \(s : \mathbb{R}^n \rightarrow \mathbb{R}^m\) that can be written in the following form: \[ s(\vec{x}) = \sum_{i=1}^{N} \vec{\beta}_i \, \sigma(\vec{w}_i^{\,t} \vec{x} + b_i)\] where \(N \in \mathbb{N}\), \(\vec{\beta}_i \in \mathbb{R}^m\), \(\vec{w}_i \in \mathbb{R}^n\), and \(b_i \in \mathbb{R}\).

Exercise: Reconcile the definition above with our usual notion of a neural network with one hidden layer that has \(n\) input and \(m\) output neurons.
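For concreteness, here is one way such a network can be evaluated numerically. This is only an illustrative sketch, not part of the assignment: the names `W`, `b`, `B`, `evaluate`, and the logistic choice of \(\sigma\) are ours, not part of the definition, but the matrix view may help with the exercise above.

```python
# Illustrative sketch (assumed names, not from the assignment): evaluating a
# member of Sigma^{n,m}(sigma).  The hidden weights w_i are stacked as the rows
# of W, the biases b_i form the vector b, and the output weights beta_i are
# stacked as the columns of B.
import numpy as np

def sigma(z):
    # Any non-constant continuous activation works; the logistic squasher is one choice.
    return 1.0 / (1.0 + np.exp(-z))

def evaluate(W, b, B, x):
    """Evaluate s(x) = sum_i beta_i * sigma(w_i . x + b_i).

    W : (N, n) hidden weights, b : (N,) hidden biases, B : (m, N) output weights.
    """
    hidden = sigma(W @ x + b)   # activations of the N hidden neurons
    return B @ hidden           # one entry per output neuron, i.e. a vector in R^m

# Example: n = 3 inputs, N = 5 hidden neurons, m = 2 outputs.
rng = np.random.default_rng(0)
W, b, B = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=(2, 5))
print(evaluate(W, b, B, np.ones(3)))   # a point in R^2
```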

There is one more thing we have to agree upon. We would like to find a neural network that approximates a given continuous function, or in other words, a neural network whose outputs are close to the outputs of this continuous function. Since the outputs are both vectors in \(\mathbb{R}^m\), we need to fix a metric. For the purposes of this problem, let us use \(\rho_1\), that is:

\[\rho_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|.\]
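In other words, \(\rho_1\) is just the \(\ell^1\) distance between the two output vectors. A one-line numerical version (purely illustrative, with names of our own choosing) is:

```python
import numpy as np

def rho_1(x, y):
    # The l^1 distance: sum of coordinate-wise absolute differences.
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

print(rho_1([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5 + 0.0 + 1.0 = 1.5
```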

Equipped with the notation above, we can state a formal version of the result:

Theorem: Suppose that \(K \) is a compact subset of \(\mathbb{R}^n\), \(\sigma: \mathbb{R} \rightarrow \mathbb{R} \) a continuous squashing function, and \(f : K \rightarrow \mathbb{R}^m\) an arbitrary continuous function. Then for every \(\epsilon > 0\), there is a neural network \( s \in \Sigma^{n,m}(\sigma)\) so that \[ \rho_1( f(\vec{x}), s(\vec{x})) < \epsilon \] for all \(\vec{x} \in K\).

Homework exercise: Verify the above theorem. Use the version of the Universal Approximation Theorem we proved in class to approximate the output of each of the \(m\) output neurons individually with a neural network in \(\Sigma^n(\sigma)\). From these, construct a neural network in \(\Sigma^{n,m}(\sigma) \).
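To make the suggested construction concrete (without carrying out the estimates you still need to supply), the sketch below, with names of our own choosing, stacks \(m\) scalar networks from \(\Sigma^n(\sigma)\) into a single network in \(\Sigma^{n,m}(\sigma)\): the hidden units of all \(m\) networks are concatenated, and the output weights of each hidden unit are placed in the coordinate of the output it came from.

```python
# Illustrative sketch (assumed names): stacking m scalar networks into one
# network in Sigma^{n,m}(sigma).
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def stack_scalar_networks(scalar_nets, m):
    """Combine m scalar networks s_j(x) = sum_i beta_{j,i} sigma(w_{j,i} . x + b_{j,i})
    into one multi-output network.

    scalar_nets is a list of m triples (W_j, b_j, beta_j) with W_j of shape (N_j, n),
    b_j of shape (N_j,), and beta_j of shape (N_j,).
    """
    W = np.vstack([Wj for Wj, _, _ in scalar_nets])          # all hidden units together
    b = np.concatenate([bj for _, bj, _ in scalar_nets])
    # Hidden units coming from the j-th scalar network only feed output coordinate j.
    B = np.zeros((m, len(b)))
    start = 0
    for j, (_, _, beta_j) in enumerate(scalar_nets):
        B[j, start:start + len(beta_j)] = beta_j
        start += len(beta_j)
    return W, b, B

def evaluate(W, b, B, x):
    return B @ sigma(W @ x + b)

# Quick check: coordinate j of the stacked network agrees with the j-th scalar network.
rng = np.random.default_rng(1)
n, m = 3, 2
nets = [(rng.normal(size=(4, n)), rng.normal(size=4), rng.normal(size=4)) for _ in range(m)]
W, b, B = stack_scalar_networks(nets, m)
x = rng.normal(size=n)
for j, (Wj, bj, betaj) in enumerate(nets):
    assert np.isclose(evaluate(W, b, B, x)[j], betaj @ sigma(Wj @ x + bj))
```

The check at the end only confirms that the construction reproduces each scalar network coordinate by coordinate; the approximation estimate in the \(\rho_1\) metric is still yours to establish.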