Machine learning: Kolmogorov–Arnold networks

Kolmogorov–Arnold representation theorem

While working on a solution to Hilbert’s thirteenth problem (and, in particular, not thinking about artificial intelligence or machine learning of any kind), Andrey Kolmogorov and Vladimir Arnold proved the following remarkable theorem:

Theorem: Suppose that \(f \in \mathcal{C}([0,1]^n)\). Then \[f(x_1,x_2, \ldots, x_n) = \sum_{j=1}^{2n+1} \Phi_j\Big( \sum_{i=1}^n \phi_{i,j} (x_i)\Big),\] where \(\Phi_j : \mathbb{R}\rightarrow \mathbb{R}\) and \(\phi_{i,j}: [0,1] \rightarrow \mathbb{R}\) are continuous.

There are really two remarkable things about this statement. The first is that the theorem does not merely approximate a function; it represents the function exactly! The second is that the structure of the representation is not too different from that of a neural network. In fact, here is a graphical representation of the theorem when \(n=2\):

A Kolmogorov–Arnold network (image from Z. Liu et al.)

Each function \(f\) is exactly equal to a two-layer network in which all of the weights are equal to one and there is no bias term, but each activation function (that is, one of the \(\Phi_j\) or \(\phi_{i,j}\)) may be different and has to be learned individually. While this sounds extremely promising, you might be wondering at this point why we did not have an AI revolution in 1957, when this theorem was originally published. The answer is that these activation functions can, in practice, be wild and almost impossible to learn from data. Z. Liu et al. write:

Because of this pathological behavior, the Kolmogorov-Arnold representation theorem was basically sentenced to death in machine learning, regarded as theoretically sound but practically useless.
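Before moving on, it may help to see the shape of the computation the theorem promises. The Python sketch below simply evaluates the double sum for \(n=2\); the functions phi and Phi here are arbitrary placeholders chosen only so the code runs, since the theorem asserts that suitable continuous functions exist but does not hand us these particular ones.

```python
import numpy as np

# The *shape* of a Kolmogorov-Arnold representation for n = 2: two layers,
# unit weights, no biases, and a separate univariate function on every edge.
# The phi[i][j] and Phi[j] below are placeholders for illustration only; the
# theorem says *some* continuous choices make the sum equal f exactly, but it
# does not tell us which ones.

n = 2
m = 2 * n + 1  # number of outer terms

phi = [[lambda x, a=i, b=j: np.sin((a + 1) * x + b) for j in range(m)]
       for i in range(n)]
Phi = [lambda t, b=j: np.tanh(t + 0.1 * b) for j in range(m)]

def ka_representation(x):
    """Evaluate sum_{j} Phi_j( sum_{i} phi_{i,j}(x_i) )."""
    return sum(Phi[j](sum(phi[i][j](x[i]) for i in range(n))) for j in range(m))

print(ka_representation([0.3, 0.7]))
```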

Kolmogorov–Arnold networks

In 2024, an act of clemency for the theorem finally arrived. The article KAN: Kolmogorov–Arnold Networks suggested two minor modifications to the original construction: use splines to approximate the activation functions, and allow networks to be deeper than two layers. Amazingly, this seems to work: KANs learn from data remarkably well. There are several possible projects (a small code sketch of a spline-parameterized layer follows the list):

  • Begin by proving any of the several versions of the Kolmogorov–Arnold representation theorem. Then describe a KAN and reconcile the theory that inspired the construction with Liu’s implementation. You may have to learn about B-splines along the way, but that would be time well spent. You could choose to finish by showing how well a KAN learns from real data, perhaps by fitting a data set that no one else has studied yet.

  • Even though the article introducing KANs is very recent, there is already a decently sized body of work about this construction. Choose one of the articles from this repository and summarize it as part of your project. You may focus on theory or apply the work to some data; the risk in doing the latter is having to rely on janky GitHub code.
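As a starting point for either project, here is a minimal sketch of the kind of layer a KAN stacks: every edge carries its own learnable activation, written as a linear combination of fixed spline basis functions. To keep the code short it uses piecewise-linear (degree-1) hat functions rather than the B-splines of Liu et al., and the class name ToyKANLayer and all numerical choices are invented for illustration; this is a sketch of the idea, not the pykan implementation, and no training loop is shown.

```python
import numpy as np

# A toy KAN-style layer: each edge (input i -> output o) has its own learnable
# activation, parameterized by coefficients of fixed hat-function (degree-1
# spline) basis functions on a uniform grid.  Illustrative only.

class ToyKANLayer:
    def __init__(self, in_dim, out_dim, grid=np.linspace(-1.0, 1.0, 8), seed=0):
        rng = np.random.default_rng(seed)
        self.grid = grid                                   # knot positions
        # One coefficient vector per edge: shape (out_dim, in_dim, n_knots).
        self.coef = 0.1 * rng.standard_normal((out_dim, in_dim, len(grid)))

    def _basis(self, x):
        """Hat-function basis values at x, shape (batch, in_dim, n_knots)."""
        x = np.clip(np.asarray(x)[..., None], self.grid[0], self.grid[-1])
        width = self.grid[1] - self.grid[0]
        return np.maximum(0.0, 1.0 - np.abs(x - self.grid) / width)

    def __call__(self, x):
        """x: (batch, in_dim) -> (batch, out_dim); output o = sum_i spline_{o,i}(x_i)."""
        B = self._basis(x)                                 # (batch, in_dim, n_knots)
        return np.einsum("bik,oik->bo", B, self.coef)

# A two-layer toy KAN, 2 -> 5 -> 1, mirroring the n = 2 representation above.
layer1, layer2 = ToyKANLayer(2, 5), ToyKANLayer(5, 1, seed=1)
x = np.random.default_rng(2).uniform(-1, 1, size=(4, 2))
print(layer2(layer1(x)))
```

Training such a layer would amount to gradient descent on the coefficient arrays, which is where an automatic-differentiation framework and the smoother B-spline parameterization of Liu et al. become important.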