Transformers have revolutionized NLP by using multi‑head attention, where each layer contains multiple attention “heads” that learn to focus on different patterns in the input. The large Transformer configuration uses 16 heads per layer (Vaswani et al., 2017), which raises the question: are all of those heads necessary?
In “Are Sixteen Heads Really Better than One?”, Michel et al. (2019) introduce a principled way to answer this via the Head Importance Score (HIS). By measuring how sensitive the model’s loss is to each attention head, they identify and remove heads that contribute very little—or even negatively—to performance. Their surprising finding is that most heads can be pruned with negligible impact, and in some cases, pruning actually improves accuracy.
I decided to review this paper for my NLP class team project. Pruning techniques of this kind have relied on the Head Importance Score alone to decide which heads contribute the least and can be removed. We propose a two-fold strategy that leverages both the Head Importance Score and Attention Entropy.
Michel et al. run experiments on two settings to test generality: a large Transformer trained for machine translation (WMT) and a pre‑trained BERT model fine‑tuned on natural language inference (MultiNLI). For both settings, the procedure is the same: compute an importance score for every head, mask out the lowest‑scoring heads, and re‑evaluate the model on the task metric. Pruning is post hoc, meaning no re‑training is performed after heads are removed; a minimal sketch of this step is given below.
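As a concrete illustration of that post hoc step, here is a minimal Python sketch, assuming the head‑importance matrix has already been computed as described in the next section; the function name `prune_heads`, the pruning fraction, and the random scores are placeholders of ours, not values from the paper.

```python
import numpy as np

def prune_heads(importance: np.ndarray, prune_fraction: float) -> np.ndarray:
    """Given an (n_layers, n_heads) importance matrix, return a {0, 1} mask
    of the same shape that zeroes out the globally least-important heads."""
    n_prune = int(prune_fraction * importance.size)
    flat = importance.flatten()
    prune_idx = np.argsort(flat)[:n_prune]   # indices of the lowest scores
    mask = np.ones_like(flat)
    mask[prune_idx] = 0.0
    return mask.reshape(importance.shape)

# Toy usage: 6 layers x 16 heads of random scores, pruning 60% of all heads.
scores = np.random.rand(6, 16)
mask = prune_heads(scores, prune_fraction=0.6)
print(int(mask.sum()), "of", mask.size, "heads kept")   # -> 39 of 96 heads kept
```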
To decide which heads are least important, the authors introduce the Head Importance Score (HIS), a gradient‑based metric inspired by Taylor expansion methods. Each attention head $h$ is assigned a mask variable $\xi_h$, where $\xi_h = 1$ means the head is active and $\xi_h = 0$ means it is disabled. The HIS then measures how much the loss $\mathcal{L}(x)$ on an input $x$ changes when $\xi_h$ is perturbed.
With the mask applied, the output of head $h$ becomes:

\[\mathrm{Att}_h^{\text{masked}}(x) = \xi_h \cdot \mathrm{Att}_h(x)\]

The importance of a head is measured by how much the loss function $\mathcal{L}(x)$ is affected when that head is masked. Specifically, the Head Importance Score is the expected sensitivity of the loss with respect to the mask variable:
\[I_h = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right|\]

This gradient is computed via the chain rule:
\[\frac{\partial \mathcal{L}(x)}{\partial \xi_h} = \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \cdot \frac{\partial \mathrm{Att}_h^{\text{masked}}(x)}{\partial \xi_h} = \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \cdot \mathrm{Att}_h(x)\]

Therefore, the final expression for $I_h$ becomes:
\[I_h = \mathbb{E}_{x \sim X} \left| \mathrm{Att}_h(x)^T \cdot \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \right|\]

To make the scores comparable across layers, they are normalized by the $\ell_2$ norm within each layer:
\[\hat{I}_h = \frac{I_h}{\left( \sum_{j \in \text{layer}(h)} I_j^2 \right)^{1/2}}\]

This allows for consistent pruning decisions even across different layers.
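To make the estimate concrete, below is a small, self‑contained PyTorch sketch of this computation on toy tensors; `head_outputs`, the quadratic stand‑in loss, and the tensor shapes are illustrative assumptions of ours, not the authors' code.

```python
import torch

n_layers, n_heads = 6, 16
batch, seq_len, d_head = 2, 5, 64

# Stand-ins for the per-head outputs Att_h(x) in every layer (random toy
# tensors rather than a real Transformer forward pass).
head_outputs = torch.randn(n_layers, batch, seq_len, n_heads, d_head)

# One mask variable xi_h per head, all initialized to 1 (every head active).
xi = torch.ones(n_layers, n_heads, requires_grad=True)

# Att_h^masked(x) = xi_h * Att_h(x)
masked = head_outputs * xi.view(n_layers, 1, 1, n_heads, 1)

# Stand-in for the task loss L(x); in a real model `masked` would flow through
# the rest of the network before reaching the loss.
loss = masked.pow(2).mean()
loss.backward()

# I_h = E_x |dL/dxi_h|; with a single batch this is just the absolute gradient.
# Over a dataset one would accumulate these absolute gradients batch by batch.
importance = xi.grad.abs()                                  # (n_layers, n_heads)

# Layer-wise l2 normalization: I_hat_h = I_h / sqrt(sum_j I_j^2) over the layer.
importance_hat = importance / importance.norm(p=2, dim=1, keepdim=True)
print(importance_hat.shape)                                 # torch.Size([6, 16])
```

Feeding `importance_hat` into the pruning sketch shown earlier completes the pipeline: score, mask, evaluate.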
Michel et al.'s results demonstrate that the standard choice of 16 heads per layer is often over‑provisioned. Using the Head Importance Score, we can build smaller, faster models without sacrificing accuracy, and in some cases even improving it.
“Are Sixteen Heads Really Better than One?” provides a clear, gradient‑based methodology to quantify and prune attention heads. Michel et al.'s Head Importance Score uncovers redundancy in the Transformer’s multi‑head architecture and opens the door to more efficient, interpretable models. In our own work, we extend this idea by combining HIS with Attention Entropy, ensuring that retained heads are not only important but also maintain diverse, well‑distributed attention patterns.
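As a rough sketch of that second criterion, the snippet below computes the average entropy of each head's attention distribution over the keys; the function name `attention_entropy`, the tensor layout, and the toy weights are our own illustrative choices rather than a fixed recipe from the paper.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (batch, n_heads, query_len, key_len) softmax attention weights.
    Returns the mean entropy per head, shape (n_heads,)."""
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy at each query position
    return ent.mean(dim=(0, 2))                      # average over batch and queries

# Toy usage with random attention weights for 16 heads.
weights = torch.softmax(torch.randn(2, 16, 5, 5), dim=-1)
print(attention_entropy(weights))
```

A head with near‑zero entropy attends to essentially a single position, so keeping heads that score well on both normalized HIS and entropy is one simple way to operationalize the “important and well‑distributed” requirement.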
References

Michel, P., Levy, O., & Neubig, G. (2019). Are Sixteen Heads Really Better than One? In Advances in Neural Information Processing Systems (NeurIPS).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).