Does dimension reduction matter? #
In machine learning, the intuition that “more data is always better” often hits a wall when it comes to the number of features, or dimensions, used to train a classifier. According to this intuition, reducing the dimension loses information about the data and so clustering can only be more accurate on the original, full-dimensional, data.
But reality is more subtle.
The following notebook investigates what happens to the accuracy of k-means clustering when there are extraneous variables present in the data. As you work through it, keep the following question in mind:
Question: What is the effect of including extraneous, redundant, or irrelevant features on the performance of a classifier?
The Curse of Dimensionality #
In the above exercise, we just observed one of the ways in which working with high-dimensional data can lead to surprising results. Consider a fixed point \(\vec{x}\) and \(n\) data points \(\mathcal{D} = \{\vec{x}_i\} \) distributed in a \(d\)-dimensional space. Let \(D_{min}\) be the distance of \(\vec{x}\) to its nearest neighbor in \(\mathcal{D}\) and \(D_{max}\) be the distance to its farthest neighbor. Under a wide range of conditions, one can prove that
\[\lim_{d \rightarrow \infty} \mathbb{E} \Big( \tfrac{D_{max}-D_{min}}{D_{min}} \Big) = 0.\]
This means that the relative difference between the closest point and the most distant point becomes negligible as the dimension of a data set increases. In other words, all points in \( \mathcal{D} \) become roughly the same distance from \(\vec{x}\). In such a space, the concept of a “nearest neighbor” becomes semantically meaningless.
Average relative distance, or contrast, for 50000 random points in different dimensions. Note the log scales.
The plot above illustrates the limit \(\lim_{d \rightarrow \infty} \mathbb{E} \; \big( \tfrac{D_{max}-D_{min}}{D_{min}}\big) \). The curve drops sharply with \(d\). Once this relative difference is very small, the numerical precision of the computer and the inherent noise in the data make it impossible for an algorithm to distinguish between clusters or classes based on proximity. With some probability theory, we could have derived a precise formula for the curve above, but the empirical evidence is quite convincing.
Some of you might be unhappy with the imprecise statement above. The formal statement of the theorem is due to Beyer, Goldstein, Ramakrishnan, and Shafti (1999). I’m surprised at the recentness of this result, as it provides fundamental intuition within the geometry of high-dimensional data.
Practical Implications #
Low contrast has significant impacts on machine learning and data analysis:
-
Clustering: Algorithms like \(k\)-means struggle because the “tightness” of a cluster is no longer distinguishable from the distance between clusters.
-
Outlier Detection: It becomes difficult to identify outliers because they are no longer far away in a relative sense.
-
Dimensionality Reduction: This is why dimension reduction techniques are crucial; if done correctly, they project the data back into a lower-dimensional manifold where contrast still exists and meaningful patterns can be observed.