Finetuning a pre-trained model #
Background #
Training a large model from scratch is time-consuming and expensive. Training the final Inception image classifier in 2017 cost roughly $100,000, while training GPT-5 has been estimated to cost between $1.7 and $2.5 billion. While the trained model may be excellent at the task for which it was originally designed, that may not be the use case you are interested in.
For instance, Inception achieved state-of-the-art performance distinguishing images from among 1000 pre-specified categories, such as dogs, cats, cars, and bicycles. However, you may be interested in developing a bird identification app that needs to distinguish between the few hundred bird species commonly seen in Maine such as loons, black-capped chickadees, turkey vultures, and the rufous-crowned, peg-legged, speckle-backed, cross-billed and cross-eyed towhee.
Similarly, a general large language model may need to be adapted to a specific domain, such as medicine, by expanding its vocabulary and recomputing prediction probabilities for certain words and tokens.
In both instances, we could simply begin again and retrain a model from scratch. But doing so is not only expensive; the amount of data available for the new task may also be limited. This suggests adapting the existing pre-trained model to the new task rather than starting over, a strategy known as fine-tuning.
Several approaches have been successful. We describe two. For simplicity, let’s assume that our model is a dense neural network of the form
\[\mathcal{N}(\vec{x}) = \sigma_k\big( W_k \, \sigma_{k-1}\big( W_{k-1} \cdots \sigma_1(W_1\vec{x} + \vec{b}_1) \cdots + \vec{b}_{k-1}\big) + \vec{b}_k\big)\] where each \(W_i \in \mathbb{R}^{m_i \times n_i}\) and \(\vec{b}_i \in \mathbb{R}^{m_i}\) have already been learned. The number of parameters of this model is the sum of the number of entries in all the weight matrices and bias vectors. For modern foundation models, this number is often in the hundreds of billions.
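To make this count concrete, here is a small Python sketch (the layer widths are made up for illustration) that tallies the entries of the weight matrices and bias vectors of a toy dense network.

```python
# Parameter count for a toy dense network with hypothetical layer widths
# 784 -> 512 -> 256 -> 10, i.e. x in R^784 and three weight/bias pairs.
layer_sizes = [784, 512, 256, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += n_out * n_in  # entries of the weight matrix W_i
    total += n_out         # entries of the bias vector b_i

print(total)  # 535,818 parameters for this toy network
```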
Relearning weights of the final layer #
In this approach, the original weights and biases for all but the final layer are held fixed and a new final layer is computed by gradient descent. That is, the new model takes the form:
\[\mathcal{N}'(\vec{x}) = \sigma_k\big( W_k' \, \sigma_{k-1}\big( W_{k-1} \cdots \sigma_1(W_1\vec{x} + \vec{b}_1) \cdots + \vec{b}_{k-1}\big) + \vec{b}_k'\big)\]
for another weight matrix \(W_k'\) and bias vector \(\vec{b}_k'\). There are two main advantages:
- This approach reuses the elementary features learned by layers \(1\) through \(k-1\) of the original model. The retrained classification layer learns to put these features together differently to make its predictions.
- Only one weight matrix \(W_k'\) and one bias vector \(\vec{b}_k'\) need to be learned, substantially reducing the required computation. In fact, the number of parameters that need to be learned is just the number of entries in the final weight matrix and bias vector.
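To illustrate this approach concretely, here is a minimal PyTorch sketch; the layer sizes, the original 1000 classes, and the 300 bird species are made-up numbers. All pre-trained parameters are frozen, and only a new final layer is handed to the optimizer.

```python
import torch
import torch.nn as nn

# A stand-in for a pre-trained dense network (the weights would normally be loaded).
pretrained = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 1000),          # original final layer: 1000 classes
)

# Freeze every pre-trained parameter so gradient descent leaves it unchanged.
for p in pretrained.parameters():
    p.requires_grad = False

# Replace the final layer with a fresh one for the new task (say, 300 bird species).
pretrained[-1] = nn.Linear(256, 300)   # new parameters require gradients by default

# Only the new W_k' and b_k' are handed to the optimizer.
trainable = [p for p in pretrained.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)

print(sum(p.numel() for p in trainable))  # 256*300 + 300 = 77,100 trainable parameters
```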
But there is a substantial disadvantage: because the earlier layers are frozen, there may be additional useful elementary features that a model fine-tuned in this way will never be able to discover. It would be great to be able to modify the other weight matrices and bias vectors as well, without adding too many parameters that need to be learned.
LoRA and the singular value decomposition #
Introduced in this paper, Low-Rank Adaptation (LoRA) suggests a clever way to learn new weight matrices for all layers of the neural network while adding only a small number of learnable parameters. The idea is to replace each weight matrix \(W_i\) in the pre-trained network \(\mathcal{N}(\vec{x})\) with \(W_i' = W_i + A_i\), where the rank of each \(A_i\) is small. The weights in \(W_i\) remain fixed as the model learns only the entries of the matrix \(A_i\). And as the following exercise shows, even though the matrices \(W_i\) and \(A_i\) have the same number of entries, the number of parameters necessary to describe the latter is much smaller if its rank is small.
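To see how few trainable parameters this adds, here is a minimal PyTorch sketch of the idea (not the original implementation); the low-rank update \(A_i\) is stored as a product of two thin factors, written \(B\) and \(C\) below to avoid clashing with the notation above, and the layer width is made up.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, pretrained: nn.Linear, rank: int = 1):
        super().__init__()
        m, n = pretrained.weight.shape        # W_i is m x n
        self.base = pretrained
        for p in self.base.parameters():      # W_i and b_i stay fixed
            p.requires_grad = False
        # A_i = B C has rank at most `rank`, yet only rank*(m + n) learnable entries.
        self.B = nn.Parameter(torch.zeros(m, rank))
        self.C = nn.Parameter(torch.randn(rank, n) * 0.01)

    def forward(self, x):
        # Computes (W_i + B C) x + b_i without ever forming the full m x n update.
        return self.base(x) + x @ self.C.T @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024), rank=1)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2,048 trainable entries instead of 1024*1024 = 1,048,576
```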
Show that any matrix \(M \in \mathbb{R}^{m \times n}\) of rank \(r\) can be written in the form:
\[ M = \sum_{i=1}^r \sigma_i \vec{u}_i \vec{v}_i^T\]
where each \(\vec{u}_i \in \mathbb{R}^m\), \(\vec{v}_i \in \mathbb{R}^n\), and \(\sigma_i \in \mathbb{R}\). What does this imply about the number of parameters that need to be learned when fine-tuning a model using LoRA by augmenting each layer, as is commonly done, with a rank-1 matrix?
The key to this exercise turns out to be the singular value decomposition for rectangular matrices. The following videos give a quick introduction and a longer refresher of the requisite linear algebra.
- SVD for rectangular matrices (video)
- Matrix rank and the SVD (video)
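For a hands-on check of the ideas in the exercise, the following NumPy sketch builds a random rank-2 matrix, uses the SVD to expose its rank, and reconstructs it from just two terms \(\sigma_i \vec{u}_i \vec{v}_i^T\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 6 x 4 matrix of rank 2 as a sum of two outer products u_i v_i^T.
u1, u2 = rng.standard_normal(6), rng.standard_normal(6)
v1, v2 = rng.standard_normal(4), rng.standard_normal(4)
M = 3.0 * np.outer(u1, v1) + 0.5 * np.outer(u2, v2)

# The SVD M = U diag(sigma) V^T exposes the rank: only two nonzero singular values.
U, sigma, Vt = np.linalg.svd(M)
print(np.round(sigma, 6))   # two nonzero values; the rest are (numerically) zero

# Reconstruct M from just the first two singular triples sigma_i u_i v_i^T.
M2 = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(2))
print(np.allclose(M, M2))   # True
```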