Machine learning: The data manifold

The data manifold

A fundamental question in machine learning lies in understanding the geometry of data. When working with high-dimensional data, this is especially difficult: we are inherently low-dimensional creatures and have little intuition for how things work in higher dimensions. Yet we routinely work with high-dimensional data: even humble MNIST images are already vectors in \(\mathbb{R}^{784}\); financial data has dimension in the thousands; natural language processing embeds text in vector spaces such as \(\mathbb{R}^{512}\) or \(\mathbb{R}^{768}\).

Fortunately, even though data tends to be high-dimensional at first glance, it also tends to have much more structure than just a high-dimensional blob of points. In the image below, the data consists of points in \(\mathbb{R}^3\) all of which lie near a particular plane.

A reasonable hypothesis is that the data points actually lie on this plane and that the deviations we observe are noise in our data set. That is, if we eliminate this noise, we should be able to write our data as vectors in \(\mathbb{R}^2\). Such a plane is called the data manifold, and the process that removes the purported noise is called dimension reduction. Dimension reduction can at times be spectacular: one of my students working on 5000-dimensional financial data was able to reduce it to a hyperplane of dimension 20.
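To make the plane example concrete, here is a minimal sketch using principal component analysis (PCA), one standard technique for linear dimension reduction. The synthetic data, the particular plane, the noise level, and all variable names are assumptions chosen for illustration; they are not taken from the figure or from the student's project mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 points on a plane in R^3 plus small Gaussian noise.
# The plane, noise level, and sample size are illustrative assumptions.
basis = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])       # two vectors spanning the plane
coords = rng.normal(size=(500, 2))        # true two-dimensional coordinates
X = coords @ basis + 0.05 * rng.normal(size=(500, 3))

# PCA via the singular value decomposition of the centered data matrix.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

# Fraction of the total variance captured by each principal direction.
explained = S**2 / np.sum(S**2)

# Keep the top two directions: each point becomes a vector in R^2.
X_reduced = (X - mean) @ Vt[:2].T

# Map back into R^3 to measure how much was discarded as "noise".
X_restored = X_reduced @ Vt[:2] + mean
rms_error = np.sqrt(np.mean(np.sum((X - X_restored) ** 2, axis=1)))

print("explained variance ratios:", np.round(explained, 4))
print("reduced shape:", X_reduced.shape)
print("RMS reconstruction error:", round(float(rms_error), 4))
```

In this setup the first two directions should account for nearly all of the variance, and the reconstruction error should be on the order of the injected noise. PCA only finds flat data manifolds, so it is a sketch of the simplest case rather than a general method.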

Data manifolds do not have to be as simple as planes or higher-dimensional hyperplanes. The following is an example of data that naturally lives on what is called a Swiss roll manifold.

This data set can be condensed to two coordinates, but it has to be unwound first. There are clear advantages to doing so. Algorithms such as k-means clustering rely on Euclidean distance, which treats points on adjacent layers of the roll as close even though they are far apart along the manifold, so such algorithms should really only be applied to the unwound version of this data. The following is a fundamental problem in machine learning:
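A small numerical sketch shows why unwinding matters. It assumes one common parametrisation of the Swiss roll, in which a point with intrinsic coordinates \((t, h)\) sits at \((t\cos t,\, h,\, t\sin t)\) in \(\mathbb{R}^3\); the function names and the two sample points are hypothetical choices for illustration. Unrolling replaces \(t\) by the arc length of the spiral, which flattens the sheet without stretching it.

```python
import numpy as np

def swiss_roll(t, h):
    """Embed intrinsic coordinates (t, h) into R^3 (illustrative parametrisation)."""
    return np.array([t * np.cos(t), h, t * np.sin(t)])

def arc_length(t):
    """Length of the spiral r = t from 0 to t, i.e. the integral of sqrt(1 + s^2)."""
    return 0.5 * (t * np.sqrt(1.0 + t**2) + np.arcsinh(t))

# Two points at the same height on adjacent layers, one full turn apart.
t_a, t_b = 1.5 * np.pi, 3.5 * np.pi
p_a, p_b = swiss_roll(t_a, 0.0), swiss_roll(t_b, 0.0)

# Straight-line distance through the ambient space R^3.
ambient = np.linalg.norm(p_a - p_b)

# Distance measured along the sheet, i.e. in the unrolled coordinates.
unrolled = arc_length(t_b) - arc_length(t_a)

print("ambient distance: ", round(float(ambient), 3))
print("unrolled distance:", round(float(unrolled), 3))
```

The two points are several times farther apart along the sheet than they appear in \(\mathbb{R}^3\), which is exactly the situation in which a Euclidean method applied to the raw coordinates would group points that do not belong together.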

Problem: Describe the manifold that underlies a data set, including an estimate of its dimension, and then replace the original data set with one of this smaller dimension.
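For the special case where the manifold is flat, one simple heuristic for the dimension estimate is to count how many principal directions are needed to explain most of the variance. The sketch below assumes this heuristic; the helper `estimate_linear_dimension`, the 99% threshold, and the synthetic data are all hypothetical choices, and the approach would overestimate the dimension of a curved manifold such as the Swiss roll.

```python
import numpy as np

def estimate_linear_dimension(X, variance_kept=0.99):
    """Smallest number of principal directions whose combined variance
    reaches `variance_kept` of the total. Hypothetical helper that is
    only meaningful when the data manifold is (close to) flat."""
    centered = X - X.mean(axis=0)
    S = np.linalg.svd(centered, compute_uv=False)
    cumulative = np.cumsum(S**2) / np.sum(S**2)
    # First index at which the cumulative variance reaches the threshold.
    return int(np.searchsorted(cumulative, variance_kept) + 1)

rng = np.random.default_rng(1)

# Synthetic data: 1000 points in R^50 lying near a random 5-dimensional subspace.
latent = rng.normal(size=(1000, 5))
embedding = rng.normal(size=(5, 50))
X = latent @ embedding + 0.01 * rng.normal(size=(1000, 50))

d = estimate_linear_dimension(X)

# Replace the data set by its coordinates in the top d principal directions.
centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
X_reduced = centered @ Vt[:d].T

print("estimated dimension:", d)
print("reduced shape:", X_reduced.shape)
```

The threshold is a judgment call: a lower value discards more of the data as noise, and a higher value keeps more directions.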

There have been a number of advances in this area, but there is a lot more work to be done. Topological data analysis is one recent stream of thought that looks very promising.