Machine learning: Model stealing

Been caught stealing #

In the words of Jane’s Addiction, “When I want something, man, I don’t wanna pay for it.” There are two goals to this exercise: to reconstruct part of a model-stealing attack introduced at ICML in 2024, and to see one way to estimate the intrinsic dimension of a data set.


Background #

Training large models is expensive, and for a variety of (noble and less noble) reasons it is desirable to be able to recover the weights and the structure of a neural network model simply from a collection \(\mathcal{D} = \{ (x_i, y_i)\}\) of input-output pairs. In a so-called black-box attack, the attacker selects a collection of inputs to a model, observes the corresponding outputs, does some magic which is probably based on math, and extracts some information about the model. This information is sometimes sufficient to reconstruct the model fully.

The Attack #

The original paper and presentation describing the attack are available. You can probably get away with just watching the presentation to get the main ideas. The last layer of a classification neural network is usually of the form \[\vec{y} = A\vec{x} + \vec{b}\] where \(A \in \mathbb{R}^{k \times n},\) \(\;\vec{x} \in \mathbb{R}^n\) is the input to this layer, \(\; k\) is the number of classes, and \(n\) is the number of neurons in the penultimate layer. Note that there is no activation on this layer: in practice it is possible to omit the softmax unless one really wants an output that is a probability vector, since softmax is monotone and does not change which class receives the largest score. The attack is able to reconstruct \(A\), but we focus on a sub-problem:
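
To see why the softmax can be safely dropped, here is a quick numerical check (a minimal sketch; the logits are made up):

```python
import numpy as np

def softmax(z):
    """Standard softmax; subtracting the max is for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 3.2])   # made-up logits
probs = softmax(logits)

# Softmax is monotone, so the winning class is unchanged.
assert np.argmax(logits) == np.argmax(probs)
print(probs)   # a probability vector with the same ordering as the logits
```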

Problem: Estimate \(n\), the number of neurons in the penultimate layer of a neural network.

The team wanted to reverse-engineer the classification layer of an LLM, that is, the layer which predicts the next token in text. In this setting, each token is a class and \(k \approx 2^{15}\). The attack was simple:

  • Input 100,000 tokens into the model. Each produces an output vector in \(\mathbb{R}^k\).
  • Aggregate the output vectors as the columns of a \(2^{15} \times 100,000\) matrix \(B\).
  • Compute the singular value decomposition: \(B = U \Sigma V^T.\)
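
In code, the pipeline might look like the minimal sketch below. Everything here is a toy stand-in: `query_logits` plays the role of the black-box API, and a small random layer simulates the model (in the paper, \(k = 2^{15}\) and there are 100,000 queries):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the model: a hidden classification layer y = Ax + b.
# The attacker never sees A_secret or b_secret; in the real attack,
# query_logits is an API call.
k, n = 512, 80                      # toy sizes; n is unknown to the attacker
A_secret = rng.normal(size=(k, n))
b_secret = rng.normal(size=k)

def query_logits(_token_id: int) -> np.ndarray:
    """Hypothetical black-box query returning one output vector in R^k."""
    x = rng.normal(size=n)          # unobserved penultimate activations
    return A_secret @ x + b_secret

# Stack the output vectors as the columns of B ...
num_queries = 1500
B = np.column_stack([query_logits(t) for t in range(num_queries)])

# ... and compute the singular values (U and V are not needed).
sigma = np.linalg.svd(B, compute_uv=False)

# Expect roughly n large singular values followed by a sharp drop
# (the bias can contribute one extra non-zero direction).
print(sigma[n - 3 : n + 3])
```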

The graph of the singular values of \(B\), which are just the diagonal entries of the matrix \(\Sigma\), was interesting. The image below focuses on the range between the 5100th and 5140th singular values.

Singular values derived from the Pythia-12B LLM (Carlini et al.).
Question: What is \(n\), the number of neurons in the penultimate layer?

Clearly the number ought to be \(n=5120\), but let’s actually work out the details. The team’s attack computed 100,000 vectors in \(\mathbb{R}^k\), but these vectors don’t really form a \(k\)-dimensional blob. In fact, they all lie in a lower-dimensional hyperplane \(\mathcal{H}\) whose dimension is determined by the matrix \(A\)!

If \(\vec{y} = A \vec{x} + \vec{b}\) for some \(\vec{x} \in \mathbb{R}^n\), then \(\vec{y} - \vec{b}\) must lie in the column space of \(A\), and if \(\operatorname{rank} A = m\), this means that all outputs of the network lie in an \(m\)-dimensional affine subspace of \(\mathbb{R}^k\) (this is the hyperplane \(\mathcal{H}\) above).

Two observations finish the argument:

  • a generic \(k \times n\) matrix has rank \(m = \min(n,k)\), and
  • the rank of a matrix is the number of non-zero singular values.

From this, the team concluded that the Pythia-12B LLM has a classification matrix \(A \in \mathbb{R}^{32768 \times 5120}\). The next step is to find its entries, but that’s a story for another time.
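
Both observations are easy to sanity-check numerically; here is a minimal sketch with random matrices of toy size (nothing below comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 300, 40                       # toy sizes with k > n

# Observation 1: a generic k x n matrix has rank min(n, k).
A = rng.normal(size=(k, n))
print(np.linalg.matrix_rank(A))      # prints 40

# Observation 2: the rank is the number of non-zero singular values.
sigma = np.linalg.svd(A, compute_uv=False)
print(int(np.sum(sigma > 1e-10)))    # also prints 40
```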

Scholium: There is a greater lesson to be learned from this. If we viewed the matrix \(B\) as just some data matrix of unknown origin, and if we are working under the assumption that the columns of \(B\) lie in an \(m\)-dimensional hyperplane in \(\mathbb{R}^k\), then \(m\) can be recovered as the number of non-zero singular values of \(B\). (Strictly speaking, if the hyperplane does not pass through the origin, \(B\) may have one extra non-zero singular value coming from the offset; centering the columns by subtracting their mean removes this off-by-one.) If the data is noisy, we may choose to count only singular values above a certain threshold.
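
Here is a minimal sketch of this intrinsic-dimension estimate on synthetic data (the sizes, noise level, and threshold are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, num_points = 200, 12, 1000     # ambient dim, intrinsic dim, sample size

# Synthetic data: points on an m-dimensional hyperplane in R^k, plus noise.
A = rng.normal(size=(k, m))
offset = rng.normal(size=(k, 1))     # the hyperplane need not pass through 0
B = A @ rng.normal(size=(m, num_points)) + offset
B += 0.01 * rng.normal(size=B.shape) # small measurement noise

# Center the columns to remove the offset, then count large singular values.
B_centered = B - B.mean(axis=1, keepdims=True)
sigma = np.linalg.svd(B_centered, compute_uv=False)
est_dim = int(np.sum(sigma > 0.05 * sigma[0]))  # threshold: a judgment call
print(est_dim)                       # expect 12
```
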
Warning: You may have thoughts about trying something like this on your own. Don’t, except with the explicit permission of the model’s owner. I would like to be able to continue offering this class in the future.