Machine learning: SVD

Linear dimension reduction via the SVD #

Consider a matrix \(X\) of rank \(r\) and take \(m \leq r\). In class we used the singular value decomposition to construct a matrix \(X_m\) that gives the best rank-\(m\) approximation to \(X\). The following outlines a dimension-reduction algorithm based on our construction. We are particularly interested in the setting where the columns of \(X\) are feature vectors in a high-dimensional vector space that we would like to replace by vectors in a lower-dimensional space.

Step 1: Given a matrix \(X \in \mathbb{R}^{n \times k} \) whose columns \(\vec{x}_i \in \mathbb{R}^n\) are feature vectors, begin by computing its singular value decomposition \(X = U \Sigma V^T.\)

Step 2: Form \(U_m\) and \(V_m\) from the first \(m\) columns of \(U\) and \(V\), respectively, let \(\Sigma_m\) be the \(m \times m\) diagonal matrix formed from the first \(m\) diagonal entries of \(\Sigma\) (the \(m\) largest singular values), and define \[X_m = U_m \Sigma_m V_m^T.\] By the Eckart-Young Theorem, this is the rank-\(m\) matrix that minimizes both the Frobenius and operator distances from \(X\).
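
To make Steps 1 and 2 concrete, here is a minimal NumPy sketch on a small made-up data matrix (the numbers and variable names are illustrative placeholders, not anything prescribed above):

```python
import numpy as np

# Toy data matrix: each column is a feature vector in R^4.
X = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])

# Step 1: NumPy returns U, the singular values as a 1-D array, and V^T (not V).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Step 2: keep the m largest singular values and the corresponding singular vectors.
m = 2
U_m = U[:, :m]               # first m columns of U
Sigma_m = np.diag(s[:m])     # m largest singular values on the diagonal
V_m_T = Vt[:m, :]            # first m rows of V^T, i.e. V_m^T

X_m = U_m @ Sigma_m @ V_m_T  # best rank-m approximation of X
```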

Step 3: Let \(\vec{y}_i\) be the \(i\)th column of \(X_m\). It is an approximation of \(\vec{x}_i\), the \(i\)th column of \(X\). Let’s spend a little more time thinking about it. We will need the following way of thinking about matrix multiplication:

Fact: If \(\vec{z} = M \vec{w}\), then the entries of \(\vec{w}\) are the coefficients of \(\vec{z}\) when it is written as a linear combination of the columns of \(M\). Further, if \(Z = M W\), then \(\vec{z}_i\), the \(i\)th column of \(Z\), can be written as a linear combination of the columns of \(M\), where \(\vec{w}_i\), the \(i\)th column of \(W\), gives the coefficients in this linear combination.
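For example, if \(M\) has two columns \(\vec{m}_1\) and \(\vec{m}_2\) and \(\vec{w} = (a, b)^T\), then \(M \vec{w} = a\,\vec{m}_1 + b\,\vec{m}_2\).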

Let \(C = \Sigma_m V_m^T\) so that \(X_m = U_m C\). The fact above tells us that each \(\vec{y}_i\) is a linear combination of the \(m\) columns of \(U_m\). The coefficients in this linear combination are given by the \(i\)th column \(\vec{c}_i\) of \(C\). The dimension-reduction technique is the map \[\vec{x}_i \in \mathbb{R}^n \longmapsto \vec{c}_i \in \mathbb{R}^m.\]
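
As a concrete illustration of the map \(\vec{x}_i \longmapsto \vec{c}_i\), here is a short NumPy sketch continuing the toy example above (again, the numbers are placeholders):

```python
import numpy as np

# Same toy matrix and truncation as in the sketch above.
X = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 2
U_m, Sigma_m, V_m_T = U[:, :m], np.diag(s[:m]), Vt[:m, :]

# C = Sigma_m V_m^T has shape (m, k); its i-th column c_i is the
# dimension-reduced version of the i-th feature vector x_i.
C = Sigma_m @ V_m_T

c_0 = C[:, 0]      # reduced representation of the first column of X, a vector in R^m
y_0 = U_m @ c_0    # the corresponding column of X_m = U_m C
print(np.allclose(y_0, (U_m @ Sigma_m @ V_m_T)[:, 0]))   # True
```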

The geometry behind this process is interesting and worth understanding. Let \(\mathcal{H}\) be the \(m\)-dimensional hyperplane spanned by the columns of \(U_m\), which form an orthonormal basis of \(\mathcal{H}\). The columns \(\vec{y}_i\) of \(X_m\) are the orthogonal projections of the columns \(\vec{x}_i\) of \(X\) onto \(\mathcal{H}\). Since \(\mathcal{H}\) is \(m\)-dimensional and each \(\vec{y}_i \in \mathcal{H}\), we can express \(\vec{y}_i\) using \(m\) coefficients in the basis given by the columns of \(U_m\). These coefficients are exactly the vectors \(\vec{c}_i\).
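
These geometric claims can be checked numerically in the toy example (a sketch, not a required computation): because the columns of \(U_m\) are orthonormal, the matrix of the orthogonal projection onto \(\mathcal{H}\) is \(U_m U_m^T\).

```python
import numpy as np

# Same toy setup as above.
X = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 2
U_m, Sigma_m, V_m_T = U[:, :m], np.diag(s[:m]), Vt[:m, :]
X_m = U_m @ Sigma_m @ V_m_T

# Orthogonal projection onto H = span of the (orthonormal) columns of U_m.
P = U_m @ U_m.T
print(np.allclose(P @ X, X_m))   # True: the columns of X_m are the projections

# The coordinates of each projected column in the basis {columns of U_m}
# agree with the columns of C = Sigma_m V_m^T.
print(np.allclose(U_m.T @ X_m, Sigma_m @ V_m_T))   # True
```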

Homework exercise: Consider the matrix: \[ X = \begin{bmatrix} 1 & 2 & 2 \\ 0 & 1 & 3 \\ 4 & 3 & 2 \\ \end{bmatrix} \] Use the above dimension-reduction technique to reduce each column of \(X\) to a vector in \(\mathbb{R}^2\). You may use computational tools in your work and can approximate the entries in your answer to two significant digits.

An extension of this technique #

It is also possible to use a variant of the technique above to reduce the dimension of vectors which are not columns of the matrix \(X\). But it will take a little work.

While the data matrix \(X\) in the above example is ridiculously small, in real settings there may be thousands of feature vectors, each with thousands of entries, and the computation of the SVD will be substantial. Suppose we have already spent time and energy computing the decomposition \(X = U \Sigma V^T\) and the approximation \(X_m = U_m \Sigma_m V_m^T\). Using the process outlined above, we can now reduce the dimension of the feature vectors that form the columns of \(X\) with reckless abandon. But now suppose that a few more data points become available to us and we would like to similarly reduce their dimension. We have two choices:

  1. Add the new data points as additional columns to \(X\) and recompute the singular value decomposition from scratch. This may be a substantial investment of resources and time. Or,

  2. Keep the already computed decomposition and do something mildly clever.

The next homework exercise suggests a way to proceed with the second approach.

Homework exercise:

Let \(m < n \) and consider \(\vec{x} \in \mathbb{R}^n\). Assume that it is not one of the columns of \(X\) above; in other words, it is a new data point. Suppose that we have already computed the singular value decomposition for \(X\), and hence the matrix \(U_m\) and the hyperplane \(\mathcal{H}\). To compute a dimension-reduced version of the new vector \(\vec{x}\), we can:

  • first project \(\vec{x}\) onto \(\mathcal{H}\) yielding a vector \(\vec{y}\), and then

  • compute the coordinates of \(\vec{y}\) in terms of the basis for \(\mathcal{H}\) yielding a vector \(\vec{c} \in \mathbb{R}^m \).

Write down a formula for \(\vec{c}\). It is possible to do so using only matrix multiplication. Referring to the previous exercise, what is the dimension-reduced version of the vector \(\vec{x} = (1,2,3)^T\)?

Hint
For the first part, feel free to use the formula for an orthogonal projection. You may have seen it in linear algebra. The formula simplifies if you know an orthonormal basis for the space you are projecting onto, which you do! The second part is not hard, but you have to think about things the right way. By way of a hint, note that if \(\vec{z} = M \vec{w}\), then \(\vec{w}\) gives the coefficients of \(\vec{z}\) when it is written as a linear combination of the columns of \(M\).

An upshot of this exercise is that it is not necessary to recompute the singular value decomposition for a larger matrix when new data becomes available. Once an SVD is computed, it can be used to reduce the dimension of additional feature vectors.
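
In code, this reuse of the already computed \(U_m\) might look roughly as follows. The sketch is only illustrative: the new data point is made up, and it carries out the two bullet steps with the general projection formula and a linear solve rather than the streamlined formula the exercise asks you to find.

```python
import numpy as np

# Truncated factor from the earlier toy sketch (computed once, then reused).
X = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 2
U_m = U[:, :m]

# A new feature vector that was not a column of X.
x_new = np.array([1.0, 2.0, 0.0, 1.0])

# First bullet: project x_new onto H = span of the columns of U_m,
# using the general projection formula  M (M^T M)^{-1} M^T x  with M = U_m.
M = U_m
y_new = M @ np.linalg.solve(M.T @ M, M.T @ x_new)

# Second bullet: the coordinates of y_new in the basis given by the columns
# of U_m, i.e. the solution c of U_m c = y_new.
c_new, *_ = np.linalg.lstsq(M, y_new, rcond=None)
print(c_new)   # the dimension-reduced version of x_new, a vector in R^m
```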