Digit classification

Classification of hand-written digits #

Background #

One of the oldest and most-studied problems in computer vision is the automated reading of human handwriting, and in particular, automated recognition of hand-written digits. The first implementations of machine ZIP code reading systems occurred in the 1980s, but the results were generally poor and suffered from high error rates. The late 1990s saw significant improvements and the error rates for the recognition of individual digits were reduced to around 1%. The short table below gives a few of the most recent results:

Year    Author              Error
2013    Goodfellow et al.   0.50%
2014    Graham              0.30%
2018    Kowsari et al.      0.18%
2019    Yours truly         0.20%

Modern results are sufficiently good that the problem of hand-written recognition is considered solved and current efforts concentrate on letter recognition, see for instance the EMINST data set, and other more complex vision tasks. This exercise examines the use of logistic regression in a simplified version of this task.

Data set #

The MNIST – short for Mixed National Institute of Standards and Technology– database was introduced by LeCun et al. in 1998. Since its introduction, this data set has been perhaps the most common testing ground for a variety of machine learning and pattern recognition algorithms. The data consists of 70,000 grey-scale images of handwritten digits, each represented as a 28x28 array of real numbers. Of these, 60,000 are in a training set and the remaining 10,000 are in a validation set. More than 250 writers were used to make the data set, drawn from high-school students and census bureau employees. The training and validation sets were selected so that an individual writer is only used in one of the two datasets.

Examples from the MNIST data set.
Images by Josef Steppan and Baldominos, Saez, and Isasi.

Central question #

We fill focus on a simple version of the digit classification question.

Project: Train a logistic regression classifier which differentiates between two digits of your choice.

The difficulty of this problem varies depending on the choice of two digits. Differentiating between a 4 and a 9 is the most difficult. Follow the outline below as your proceed.

Technical setup #

First, let us frame the problem in terms that we are familiar with. Once you choose two digits, say a 4 and a 9, the data set \[ \mathcal{D} = \{ (\vec{x}_i, y_i) \} \] will consist of a 784-dimenional input vector \(\vec{x}_i\) and an output \(y_i \in \{0,1\}\). Each coordinate of \(\vec{x_i}\) represents a pixel with a grayscale value in the interval \( [0,1] \). The output is 0 if the underlying digit represents a 4 and 1 if the digit is a 9. In logistic regression, the prediction for the image \(\vec{x}_i\) will depend on a vector \(\vec{w}\) of trainable weights and produce the value:
\[ \hat{y}_i = h_{\vec{w}}(\vec{x_i}) = \sigma(\vec{x_i}^t \vec{w}) \in (0,1). \]
In this context, \(\sigma\) is the sigmoid function. To assess the quality of each prediction, we will follow industry standards and use cross-entropy loss which is sometimes simply referred to as log-loss: \[ \ell(\hat{y}_i, y_i) = - y_i \log (\hat{y}_i) - (1-y_i) \log (1-\hat{y}_i). \] The fit of the model on the entire data set is assessed simply by computing a sum of these losses: \[ f(\vec{w}) = \mathcal{L}(\vec{w}, \mathcal{D}) = \sum_i \ell(h_{\vec{w}}(\vec{x}_i), y_i). \] The goal of logistic regression is to find a vector of weights which minimizes \(f\) on a validation set. We will use gradient descent to do so. But first, you will have to compute the appropriate gradient descent update formula!

Homework exercise:

Let \(A\) be the regression matrix formed in the usual way by stacking the row vectors \(\vec{x}^t_i\). If the loss function \(f\) is defined as above, write down the gradient descent update formula relating \(\vec{w}_{t+1}\) and \(\vec{w}_t\) in terms of \(A\).

You will first have to write down a formula for \(\nabla(f)(\vec{w})\). You will need to use the chain rule and should also freely refer to the previous homework problem about logistic regression.
To simplify notation, you may write \(\sigma(\vec{v})\) to indicate that the function \(\sigma\) has been applied to each coordinate of \(\vec{v}\).

Training a logistic regression model #

Algorithmically, logistic regression is very similar to linear regression. Follow the general approach we have developed in class for linear regression. The Python notebook should be helpful, but you will have to fill in some details. Use the following outline. You will answer the questions at the end.

Start by downloading the MNIST data set and spend a minute or two getting familiar with the images and the original labels.
Choose a pair of digits for this project; rather than classifying all ten, your work will simply distinguish between these two. Also choose the number of images in your training set. Values of 100, 500, and 1,000 are all interesting. Smaller training sets will lead to a less accurate classifiers but the training itself may be more interesting. For validation, we will use all instances of your digits in the pre-defined MNIST validation set.
Once you have defined training and validation sets, you can begin training your model by gradient descent. Make sure that you are using the correct loss function and gradient update formula. Luckily, the loss function log_loss is built-in.
Explore learning hyperparameters and create a model will the smallest loss you can. If you are trying to control overfitting, two strategies are especially useful:
- Decrease the learning rate. You may need to increase the number of iterations to compensate for slower learning.
- Add a regularization term, such as an \(L^1\)- or \(L^2\)-penalty.
Validation loss is an opaque way to assess the quality of your classifier. What does loss of 0.138 really mean? Model accuracy is more useful: on what percentage of validation samples does your model make the correct prediction. To compute accuracy, you will have to make a decision. If your model predicts 0.54, should you round to 1.0 and say the digit is a 9? Or should you round down, and classify the digit as a 4? Choose a threshold and compute the accuracy of your model.
Finally, inspect examples of when your classifier was correct and when it was not.

Homework exercise:

Summarize your work in a few paragraphs. Address the following questions if relevant:

How did your selection of hyperparameters evolve and how did this affect loss and accuracy?
Which pixels were most important in your prediction?
What method of regularization did you use? If you used \(L^1\)-regularization, did your model become more sparse?
Was there a detectable pattern(s) in the images which were misclassified?
Using methods available to us, speculate how you could build a better classifier.

Threshold #

In a regression-based classifier, the model produces a real number \(\hat{y}_i\) which has to be rounded to make a prediction. And rounding requires a choice of threshold \(\tau\). That is, we will predict one class if \(\hat{y}_i > \tau \) and the other if \(\hat{y}_i \leq \tau \)

At first glance, the choice of threshold seems easy: we should choose the threshold that results in the highest model accuracy. In the image below, I computed the accuracy of my digit classifier for different values of \(\tau\):

A value of \(\tau = 0.6 \) very roughly maximizes the accuracy of my model. In many real-life classification problems, making this choice is much more nuanced.

Homework exercise:

Discuss two examples where choosing to maximize model accuracy when determining a threshold value many not be desirable from a public-policy perspective. Explain by detailing the effect of the trade-off between the two types of error implicit in a classifier:

false positive errors: predicting 1 when the true class is 0, and
false negative errors: predicting 0 when the true class is 1.