Machine learning: Stone-Weierstrass Theorem

The Stone-Weierstrass Theorem #

Background #

The original Weierstrass Theorem is often a capstone of a first course in real analysis:

Theorem: Polynomials are dense in the set of continuous functions on an interval. Or more precisely, if \(f \in \mathcal{C}[a,b]\), then there is a sequence of polynomials \(\{p_n\}_{n \in \mathbb{N}}\) so that \( \lim p_n = f\) in the uniform metric.

The standard proof is constructive; not only does it show that such a sequence of polynomials exists, but it explicitly constructs one that works. Each \(p_n\) is the convolution product \(f * l_n\) where \(l_n\) is a polynomial, the \(n\)th Landau kernel. A close inspection of the proof shows that convergence of this sequence relies not on the particular form of the kernels \(l_n\), but rather on three properties of this family of functions. This leads us to the following definition:
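The constructive flavor is easy to see numerically. The sketch below uses Bernstein polynomials rather than the Landau-kernel convolution (another classical explicit construction, chosen here only because it is simple to evaluate); the test function \(\sin(3t)\) and the evaluation grid are arbitrary illustrative choices.

```python
import math
from math import comb

def bernstein(f, n, x):
    """Evaluate the n-th Bernstein polynomial of f on [0, 1] at x."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda t: math.sin(3 * t)            # any continuous function on [0, 1]
grid = [i / 200 for i in range(201)]     # sample points for the sup-norm estimate

def err(n):
    """Sup-norm distance between f and its n-th Bernstein polynomial on the grid."""
    return max(abs(bernstein(f, n, x) - f(x)) for x in grid)

# the uniform error shrinks as the degree grows
print(err(10), err(100))
```

For smooth \(f\) the error decays like \(1/n\); for merely continuous \(f\) it still tends to zero, just more slowly.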

Definition:

A sequence of functions \(\{k_n\}_{n \in \mathbb{N}}\) with \(k_n : \mathbb{R} \rightarrow \mathbb{R}\) is a family of good kernels if:

  • for all \(n\), \(\int_\mathbb{R} k_n = 1\),
  • there is a constant \(M\) such that for all \(n\), \(\int_\mathbb{R} |k_n| \leq M\), and
  • for all \( \delta > 0 \), \( \lim \int_{|x| > \delta} |k_n| = 0.\)

The final technical condition simply ensures that the area under the tails of the \(k_n\) tends to zero as \(n\) increases, or in other words, the functions become concentrated around zero.
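The three conditions can be checked numerically for the Landau kernels \(l_n(x) = (1-x^2)^n / c_n\), extended by zero outside \([-1,1]\). Here is a minimal sketch using a simple trapezoid rule; the grid size and the choice \(\delta = 0.25\) are illustrative.

```python
def trap(g, a, b, m=2000):
    """Trapezoid-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / m
    return h * (g(a) / 2 + sum(g(a + i * h) for i in range(1, m)) + g(b) / 2)

def landau(n):
    """The n-th Landau kernel: (1 - x^2)^n / c_n on [-1, 1], zero outside."""
    c = trap(lambda x: (1 - x * x) ** n, -1.0, 1.0)   # normalizing constant c_n
    return lambda x: (1 - x * x) ** n / c if abs(x) <= 1 else 0.0

# condition 1: each kernel integrates to 1 (by the choice of c_n)
total = trap(landau(40), -1, 1)

# condition 2: the kernels are nonnegative, so the uniform bound holds with M = 1

# condition 3: the mass in the tails |x| > delta vanishes as n grows
def tail(n, delta):
    return 2 * trap(landau(n), delta, 1)   # the kernels are even, so double one side

print(total, tail(10, 0.25), tail(40, 0.25), tail(160, 0.25))
```

The tail mass drops rapidly with \(n\), which is exactly the "concentration around zero" the third condition encodes.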

The Landau and Fejér families of good kernels.

Any choice of a family of good kernels can be used instead of the \(l_n\) and the standard proof of the Weierstrass Theorem can be adapted to give a more general result:

Theorem: If \(f \in \mathcal{C}[a,b]\) and \(\{k_n\}_{n \in \mathbb{N}}\) is a family of good kernels, then \( \lim f * k_n = f\) in the uniform metric.

Video of proof for your entertainment:

Of course, the functions \(f * k_n \) may not be polynomials. But this is to our benefit. This exercise examines a different choice of a family of good kernels and shows that trigonometric polynomials are also dense in \(\mathcal{C}[a,b]\). In fact, we can use the above result to show that a variety of types of functions are dense in \(\mathcal{C}[a,b]\); the challenge is finding an appropriate family of good kernels.
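One can also watch the convergence \(f * k_n \to f\) happen numerically. The sketch below uses rescaled Gaussians \(k_n(x) = \tfrac{n}{\sqrt{\pi}} e^{-(nx)^2}\), one standard family of good kernels; the test function, grid, and truncation of the integral to \([-1,1]\) are illustrative choices.

```python
import math

def gauss_kernel(n):
    """k_n(x) = n/sqrt(pi) * exp(-(n x)^2); these rescaled Gaussians are good kernels."""
    return lambda x: n / math.sqrt(math.pi) * math.exp(-((n * x) ** 2))

def convolve(f, k, x, half=1.0, m=2000):
    """(f * k)(x), approximating the integral of f(x - t) k(t) over [-half, half]."""
    h = 2 * half / m
    vals = [f(x - (-half + i * h)) * k(-half + i * h) for i in range(m + 1)]
    return h * (sum(vals) - vals[0] / 2 - vals[-1] / 2)   # trapezoid rule

f = lambda x: math.exp(-x * x) * math.cos(2 * x)   # a continuous test function
grid = [i / 50 - 1 for i in range(101)]            # sample points in [-1, 1]

def sup_err(n):
    """Sup-norm distance between f and f * k_n on the grid."""
    k = gauss_kernel(n)
    return max(abs(convolve(f, k, x) - f(x)) for x in grid)

# the uniform error shrinks as n grows
print(sup_err(2), sup_err(8), sup_err(32))
```

As the kernels concentrate around zero, \(f * k_n\) averages \(f\) over a smaller and smaller window around each point, so the sup-norm error shrinks.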

Stone’s generalization of Weierstrass’s theorem #

While Weierstrass’s theorem is powerful, a hard-to-please mathematician can make the following complaint. Suppose that one wanted to show that a family of functions \(\mathcal{A}\) was dense in \(\mathcal{C}[a,b]\). To use the above approach, it is necessary to find a candidate family of functions \(k_n\) to serve as good kernels. This step can be quite difficult, and even if it succeeds, one still has to do some analysis to prove that the three requisite conditions of good kernels are in fact satisfied.

Working roughly fifty years later, Marshall Stone had a beautiful insight into how to avoid both of these difficulties, turning the problem of deciding whether a family of functions \(\mathcal{A}\) is dense in \(\mathcal{C}[a,b]\) from a problem in analysis into a problem in algebra. We will need two definitions.

Definition: A collection \(\mathcal{A}\) of real-valued continuous functions is an algebra if, in addition to being closed under addition and scalar multiplication, it satisfies: whenever \( f,g \in \mathcal{A} \), we have \(f \cdot g \in \mathcal{A}.\) That is, \(\mathcal{A}\) is also closed under multiplication.
Exercise: Show that the families of polynomials and trigonometric polynomials both form algebras of real-valued continuous functions.
Solution
It is not hard to check that the product of two polynomials is another polynomial. There is more work required for trigonometric polynomials, but it is not too bad if you use the product-to-sum trigonometric identities.
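Concretely, the relevant identities are the product-to-sum formulas, e.g. \(\sin(mx)\sin(nx) = \tfrac{1}{2}[\cos((m-n)x) - \cos((m+n)x)]\), which rewrite a product of trigonometric polynomials as another trigonometric polynomial. A quick numerical sanity check (the frequencies below are arbitrary):

```python
import math

# product-to-sum: sin(mx)·sin(nx) = [cos((m-n)x) - cos((m+n)x)] / 2,
# so products of trigonometric polynomials are again trigonometric polynomials
def lhs(x, m=2, n=3):
    return math.sin(m * x) * math.sin(n * x)

def rhs(x, m=2, n=3):
    return (math.cos((m - n) * x) - math.cos((m + n) * x)) / 2

samples = [0.1 * i for i in range(-30, 31)]
assert all(abs(lhs(x) - rhs(x)) < 1e-12 for x in samples)
print("product-to-sum identity verified at", len(samples), "sample points")
```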

There are more general definitions of algebras that one might see in a course on ring theory, but put simply, they are vector spaces where in addition to being able to add vectors and multiply them by scalars, one can also multiply the vectors themselves.

Definition: A collection \(\mathcal{A}\) of real-valued continuous functions defined on \([a,b]\) separates points if for all \(x \neq y \in [a,b] \), there is a function \(f \in \mathcal{A}\) so that \(f(x) \neq f(y)\).
Exercise: Show that \(\mathcal{P}[a,b]\) separates points. Then find an algebra of functions that does not separate points.
Solution

Consider \(x, y \in [a,b]\) with \(x \neq y\). The identity function \(f(t) = t\) belongs to \(\mathcal{P}[a,b]\), and \( f(x) = x \neq y = f(y) \).

For the sought-after counterexample, let \(\mathcal{A}\) consist of constant functions. Since a product of two constant functions is a third constant function, this is an algebra. But if \(f \in \mathcal{A}\), then for all \(x\) and \(y\), \(f(x)= f(y)\).

We are ready to state Stone’s generalization of Weierstrass’s theorem. It gives an easy-to-follow recipe for checking whether a family of functions is sufficiently rich to approximate all continuous functions. We state it in a slightly more general, multivariable form.

Theorem: Consider a compact subset \(X \subset \mathbb{R}^n\), write \(\mathcal{C}(X)\) for the family of continuous real-valued functions defined on \(X\), and let \(\mathcal{A}\) be a subset of \(\mathcal{C}(X)\). Then \(\mathcal{A}\) is dense in \(\mathcal{C}(X)\) if

  • \(\mathcal{A}\) is an algebra,
  • \(\mathcal{A}\) separates points, and
  • \(\mathcal{A}\) contains a non-zero constant function.

This is a fabulous result. Stone’s work allows us to answer a question in analysis by checking a little bit of simple algebra. We will use this to our advantage later in the course when we study neural networks. For the time being, prove the following multivariable version of the Weierstrass Theorem:

Homework exercise:

Consider \(\vec{a},\vec{b} \in \mathbb{R}^n\) with the property that the \(i\)th entry \(a_i\) of \(\vec{a}\) is smaller than the \(i\)th entry \(b_i\) of \(\vec{b}\). Then \(\vec{a}\) and \(\vec{b}\) define the multi-interval \( [\vec{a},\vec{b}]\) that consists of all vectors \(\vec{x}\) for which \(a_i \leq x_i \leq b_i\).

Show that every continuous function \(f: [\vec{a},\vec{b}] \rightarrow \mathbb{R}\) can be uniformly approximated by a sequence of multivariate polynomials.
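For intuition (not a proof, and independent of Stone's theorem), the phenomenon is easy to observe numerically on the special case \([0,1]^2\) using tensor-product Bernstein polynomials, a classical explicit construction; the test function and degrees below are arbitrary choices.

```python
import math
from math import comb

def bernstein2(f, n, x, y):
    """Degree-(n, n) tensor-product Bernstein polynomial of f on [0,1]^2 at (x, y)."""
    bx = [comb(n, j) * x**j * (1 - x)**(n - j) for j in range(n + 1)]
    by = [comb(n, k) * y**k * (1 - y)**(n - k) for k in range(n + 1)]
    return sum(f(j / n, k / n) * bx[j] * by[k]
               for j in range(n + 1) for k in range(n + 1))

f = lambda x, y: math.sin(2 * x + y * y)   # a continuous function on the unit square
pts = [(i / 10, j / 10) for i in range(11) for j in range(11)]

def sup_err(n):
    """Max error of the degree-(n, n) approximation over the sample grid."""
    return max(abs(bernstein2(f, n, x, y) - f(x, y)) for x, y in pts)

# the uniform error shrinks as n grows
print(sup_err(5), sup_err(50))
```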