Machine learning: k-means

\(k\)-means and alternate metrics

The traditional \(k\)-means algorithm uses the standard Euclidean metric when assigning points to clusters. More precisely, define the loss function

\[\ell_2(C_1, \ldots, C_k, \mu_1, \ldots, \mu_k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert_2^2\]

where each \(C_i\) represents a cluster, \(\mu_i\) its center, and \(x_j \in \mathcal{D}\) represents a data point. We proved the following lemma:

Lemma: For a fixed choice of clusters \(C_1, \ldots ,C_k\), the loss function \(\ell_2\) is minimized when each \(\mu_i\) is the centroid of the points lying in \(C_i\). That is, \[\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j\]
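To make the two alternating steps concrete, here is a minimal NumPy sketch of the usual Lloyd iteration. The function name kmeans, the initialization, and the stopping rule are our own illustrative choices, not part of the lecture; the update step is exactly the lemma, recomputing each center as the centroid of its cluster.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=None):
    """Minimal Lloyd's algorithm for the squared-Euclidean loss above."""
    rng = np.random.default_rng(seed)
    # Initialize the centers mu_1, ..., mu_k at k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: put each x_j into the cluster of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: by the lemma, with the clusters held fixed, the loss
        # is minimized by moving each mu_i to the centroid of C_i.
        # (Assumes no cluster empties out mid-run.)
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):
            break  # assignments and centers are stable
        centers = new_centers
    return labels, centers
```

Note that each full pass can only decrease \(\ell_2\): the assignment step does so by construction, and the update step does so by the lemma, which is why the loop eventually stabilizes at a local minimum.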

But since the loss function is a sum of squared distances, a single far-flung outlier can drag a centroid a long way: the process is incredibly sensitive to outliers. We can fix this by deploying alternate metrics!
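A tiny numerical experiment (assuming NumPy) makes the sensitivity concrete: one outlier drags the mean far from the bulk of the data, while the median barely notices.

```python
import numpy as np

# One tightly grouped cluster plus a single far-away outlier.
points = np.array([1.0, 1.1, 0.9, 1.0, 100.0])

print(np.mean(points))    # 20.8 -- the centroid is dragged toward the outlier
print(np.median(points))  # 1.0  -- the median barely moves
```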

Homework exercise: Suppose that we used the taxicab metric \(\rho_1\) or the sup metric \(\rho_\infty\) in place of the Euclidean metric \(\rho_2\) when implementing the \(k\)-means algorithm. State and prove an appropriate version of the above lemma for each of these metrics. How would the new clusters compare geometrically to the ones obtained by the traditional \(k\)-means algorithm? Hint: Be careful in the case where \(C_i\) contains an even number of points.
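If you want to experiment numerically before writing the proofs, the two candidate metrics are one-liners; the sketch below (assuming NumPy, with the names rho_1 and rho_inf chosen to mirror the notation above) only swaps the distance used in the assignment step. Deriving the matching center-update rule is the point of the exercise.

```python
import numpy as np

# Candidate replacements for the Euclidean distance in the assignment step.
# Both broadcast over the last axis, so they work pointwise or in batches.
def rho_1(x, mu):
    return np.abs(x - mu).sum(axis=-1)   # taxicab / Manhattan metric

def rho_inf(x, mu):
    return np.abs(x - mu).max(axis=-1)   # sup / Chebyshev metric
```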