Machine learning: Lab: Boston housing prices

Background #

Housing prices in the Boston area are easy to predict. Today, they are essentially infinite. But it wasn’t always like this. In this lab, we examine a house price data set from the 1970s collected in Boston and its suburbs. The data set covers roughly 500 neighborhoods; we are given twelve features for each:

1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 
             25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds 
             river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per $10,000
11. PTRATIO  pupil-teacher ratio by town
12. LSTAT    % lower status of the population

Our primary task will be to build a multivariate polynomial model to estimate the median house price for each neighborhood based on these features. Our secondary task will be to evaluate \(L^1\)-regularization as a method of feature selection in regression models. This technique is often called lasso regression and is a fairly recent invention, dating back to the 1980s and 1990s.

Multivariate polynomial regression #

Suppose we have a data set \(\{\vec{x}_i, y_i\}_{i=1}^n \) where each input vector has the form \(\vec{x}_i = (x_{1i}, \ldots, x_{mi}) \). In linear regression, the task is to find weights \(\{c_k\}_{k=0}^m \) so that

\[ c_0 + c_1 x_{1i} + \ldots + c_m x_{mi} \approx y_i.\]

In polynomial regression, we also consider products and powers of the features in each input vector. For instance, in a degree two or quadratic model, our task is to estimate weights so that

\[ c_0 + \sum_k c_k x_{ki} + \sum_{k,l} c_{kl} x_{ki} x_{li} \approx y_i.\]

This is not an unreasonable thing to do. Often, the higher-order polynomial terms capture interactions between the features that may be useful in our task. For instance, in a disease model, both the numbers of infected and susceptible individuals are important, but the product infected * susceptible may be even more crucial. It captures the number of interactions between these two populations and may be better able to model the rate of disease spread.

Mathematically, multivariate polynomial regression is no more difficult than linear regression. For quadratic polynomials, form the matrix \(A\) whose \(i\)th row takes the form

\[ 1 \hspace{.2in} x_{1i} \hspace{.2in} x_{2i} \hspace{.2in} \ldots \hspace{.2in} x_{mi} \hspace{.2in} x_{1i}^2 \hspace{.2in} x_{1i} x_{2i} \hspace{.2in} \ldots \hspace{.2in} x_{mi} x_{m-1,i} \hspace{.2in} x_{mi}^2 \]

that is, a constant term and all first-order terms followed by all second-order terms. To compute the best-fit polynomial, find the least-squares solution to \( A \vec{c} \approx \vec{y} \), i.e., the coefficient vector \(\vec{c}\) minimizing \( \| A\vec{c} - \vec{y} \| \).
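As a concrete illustration, here is one way the quadratic design matrix might be assembled. This is a minimal sketch: the function name `quadratic_design_matrix` and the use of NumPy's `lstsq` are choices made for this example, not part of the lab's provided code.

```python
import numpy as np

def quadratic_design_matrix(X):
    """Build the design matrix for a degree-two polynomial model.

    X has shape (n, m): one row per neighborhood, one column per feature.
    Each output row is [1, x_1, ..., x_m, x_1^2, x_1 x_2, ..., x_m^2]:
    a constant term, all first-order terms, then all products x_k x_l
    with k <= l.
    """
    n, m = X.shape
    columns = [np.ones(n)]                      # constant term
    columns += [X[:, k] for k in range(m)]      # first-order terms
    for k in range(m):
        for l in range(k, m):
            columns.append(X[:, k] * X[:, l])   # second-order terms
    return np.column_stack(columns)

# The best-fit coefficients solve A c ~ y in the least-squares sense:
#   A = quadratic_design_matrix(X)
#   c, *_ = np.linalg.lstsq(A, y, rcond=None)
```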

Feature selection and model interpretability #

In regression, each coefficient has an appealing interpretation: it indicates the relative importance of the corresponding feature in the model. Unfortunately, models tend to use all of the features available to them, so their predictions are based not on the contributions of a few important factors but on a complicated symphony of all of them.

Problem: Find a way to select only the most important features from which to build your model. This will allow the user not only to make good predictions, but also to interpret the model and diagnose how its predictions are made.

Feature selection is an important question1 and many solutions have been proposed. See Section 11 of this article for some examples. In class, we discussed \(L^1\)-regularization. It tends to find solutions in which many weights are close to zero and is thus ideal as a method of feature selection: the features whose coefficients remain non-zero are presumably the ones crucial to the model.

Caveat: If we are to use the relative sizes of coefficients as measures of feature importance, all features must be similarly scaled. For instance, if we change units for our feature, say from feet to meters, this will change the corresponding regression coefficient by a factor of roughly 3.28. But the importance of this feature really should stay the same! It is common to normalize a feature to take values in the interval \([0,1]\) before performing feature selection. We will do so in our lab.
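For instance, a minimal normalization sketch might look like the following; the function name `minmax_scale` and the decision to pass constant columns (such as the all-ones intercept column) through unchanged are choices made for this illustration.

```python
import numpy as np

def minmax_scale(A):
    """Rescale each column of A to lie in the interval [0, 1].

    Constant columns (for example the all-ones intercept column) would
    cause a division by zero, so they are returned unchanged.
    """
    lo, hi = A.min(axis=0), A.max(axis=0)
    span = hi - lo
    safe_span = np.where(span > 0, span, 1.0)
    return np.where(span > 0, (A - lo) / safe_span, A)
```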

The lab #

The Boston housing data, along with some useful Python code, is contained in a Colab notebook. Follow these steps as you work.

  1. First, download the housing data, select a degree for the multivariate polynomial you would like to use for your model, form the appropriate regression matrix, and standardize your data.

  2. Use gradient descent to fit a polynomial to the housing data. You will have to find good values for training hyperparameters such as the learning rate and the number of gradient descent steps to take. Make sure your model is not overfit. (A sketch of one possible fitting loop appears after this list.)

  3. Examine the coefficients of the polynomial you found and consider what they say about which features are important for your predictions.

  4. Now repeat the experiment, this time using \(L^1\)-regularization. Change the gradient update step accordingly (the sketch after this list shows one way to include the penalty term). You will need to find a good value for \(\mu\), the penalty coefficient. Larger values promote sparse weight vectors, but may lead to larger prediction errors.

  5. Repeat the above using multivariate polynomials of different degrees.
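The following sketch shows one possible implementation of the fitting loop referenced in steps 2 and 4. The function name, the mean-squared-error scaling, and the subgradient treatment of the \(L^1\) penalty (via `np.sign`) are assumptions of this illustration; proximal (soft-thresholding) updates are a common alternative.

```python
import numpy as np

def fit_gradient_descent(A, y, lr=1e-3, steps=10_000, mu=0.0):
    """Fit weights c so that A @ c is close to y.

    Runs plain gradient descent on the mean squared error; if mu > 0,
    adds a subgradient step for the L^1 penalty mu * ||c||_1.
    lr, steps, and mu are the hyperparameters discussed in steps 2 and 4.
    """
    n, p = A.shape
    c = np.zeros(p)
    for _ in range(steps):
        residual = A @ c - y
        grad = (2.0 / n) * (A.T @ residual)   # gradient of the mean squared error
        if mu > 0:
            grad = grad + mu * np.sign(c)     # subgradient of the L^1 penalty
        c = c - lr * grad
    return c
```

With `mu = 0` this is the unregularized fit of step 2; increasing `mu` in step 4 pushes more coefficients toward zero, possibly at the cost of larger prediction errors.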

Questions: Please consider the following questions as you work.

  • What are the most important features in your model? Do they make sense as predictors of house prices? In the case of quadratic terms, is there an explanation for their importance? (One way to rank terms by coefficient size is sketched after these questions.)
  • Do higher-order polynomials form better models for house prices? When answering this question, keep in mind their performance on training versus validation data as well as interpretability of the results.
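For the first question, one simple way to surface the dominant terms (assuming the columns of the design matrix were scaled comparably) is to sort the coefficients by absolute value. The helper below, including the name `rank_terms` and the `term_names` labels, is only an illustration.

```python
import numpy as np

def rank_terms(c, term_names):
    """Return (term, coefficient) pairs ordered by |coefficient|, largest first.

    term_names should label the columns of the design matrix, e.g.
    ['1', 'CRIM', 'ZN', ..., 'CRIM*ZN', 'CRIM*INDUS', ...].
    """
    order = np.argsort(-np.abs(np.asarray(c)))
    return [(term_names[j], c[j]) for j in order]
```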

One more thing: There is a final subtlety. You used

  • training data to fit a polynomial, and
  • validation data to fine tune hyperparameters.

You probably tried many different hyperparameters, eventually settling on some good choices. In some sense, in doing so, you used validation data as training data for your hyperparameters. How do you know that your hyperparameters are not overfit? What could you do to check?

Homework exercise: Please write up the results of your work. First, summarize the process you used; your write-up should contain sufficient detail for a properly-motivated reader to be able to replicate your results. Include how you found reasonable values for the learning rate and regularization penalty. Then answer the questions posed above using results of your experiments as support.

  1. Public policy is one domain where simpler models with fewer features may be preferred to perhaps more accurate but more complicated models. A simple model is not only easier to explain to a politician or bureaucrat, but also makes it easier to study and predict the marginal effects of different interventions. ↩︎