Transformers have revolutionized NLP by using multi‑head attention, where each layer contains multiple attention “heads” that learn to focus on different patterns in the input. The large Transformer configuration uses 16 heads per layer (Vaswani et al., 2017), which raises the question: are all of those heads necessary?
In “Are Sixteen Heads Really Better than One?”, Michel et al. (2019) introduce a principled way to answer this via the Head Importance Score (HIS). By measuring how sensitive the model’s loss is to each attention head, they identify and remove heads that contribute very little—or even negatively—to performance. Their surprising finding is that most heads can be pruned with negligible impact, and in some cases, pruning actually improves accuracy.
I decided to review this paper for my NLP class team project. Pruning techniques of this kind have relied on the Head Importance Score alone to decide which heads contribute the least and can be removed. We propose a two-fold strategy that leverages both the Head Importance Score and Attention Entropy.
Michel et al. run experiments on two settings to test generality: a large Transformer trained for machine translation (WMT) and a pre‑trained BERT model fine‑tuned on natural language inference (MultiNLI). For both settings, the procedure is the same: compute an importance score for every head, mask out the lowest‑scoring heads, and re‑evaluate the model on the task metric. Pruning is post hoc, meaning no re‑training is performed after heads are removed; a minimal sketch of this step is given below.
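As a concrete illustration of that post hoc step, here is a minimal Python sketch, assuming the head‑importance matrix has already been computed as described in the next section; the function name `prune_heads`, the pruning fraction, and the random scores are placeholders of ours, not values from the paper.

```python
import numpy as np

def prune_heads(importance: np.ndarray, prune_fraction: float) -> np.ndarray:
    """Given an (n_layers, n_heads) importance matrix, return a {0, 1} mask
    of the same shape that zeroes out the globally least-important heads."""
    n_prune = int(prune_fraction * importance.size)
    flat = importance.flatten()
    prune_idx = np.argsort(flat)[:n_prune]   # indices of the lowest scores
    mask = np.ones_like(flat)
    mask[prune_idx] = 0.0
    return mask.reshape(importance.shape)

# Toy usage: 6 layers x 16 heads of random scores, pruning 60% of all heads.
scores = np.random.rand(6, 16)
mask = prune_heads(scores, prune_fraction=0.6)
print(int(mask.sum()), "of", mask.size, "heads kept")   # -> 39 of 96 heads kept
```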
To decide which heads are least important, the authors introduce the Head Importance Score (HIS), a gradient‑based metric inspired by Taylor expansion methods. Each attention head $h$ is assigned a mask variable $\xi_h$, where $\xi_h = 1$ means the head is active and $\xi_h = 0$ means it is disabled. The HIS then measures how much the loss $\mathcal{L}(x)$ on an input $x$ changes when $\xi_h$ is perturbed.
With the mask applied, the output of head $h$ becomes:

\[\mathrm{Att}_h^{\text{masked}}(x) = \xi_h \cdot \mathrm{Att}_h(x)\]

The importance of a head is measured by how much the loss function $\mathcal{L}(x)$ is affected when that head is masked. Specifically, the Head Importance Score is the expected sensitivity of the loss with respect to the mask variable:
\[I_h = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right|\]

This gradient is computed via the chain rule:
\[\frac{\partial \mathcal{L}(x)}{\partial \xi_h} = \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \cdot \frac{\partial \mathrm{Att}_h^{\text{masked}}(x)}{\partial \xi_h} = \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \cdot \mathrm{Att}_h(x)\]

Therefore, the final expression for $I_h$ becomes:
\[I_h = \mathbb{E}_{x \sim X} \left| \mathrm{Att}_h(x)^T \cdot \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \right|\]

To make the scores comparable across layers, they are normalized by the $\ell_2$ norm within each layer:
\[\hat{I}_h = \frac{I_h}{\left( \sum_{j \in \text{layer}(h)} I_j^2 \right)^{1/2}}\]

This allows for consistent pruning decisions even across different layers.
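To make the estimate concrete, below is a small, self‑contained PyTorch sketch of this computation on toy tensors; `head_outputs`, the quadratic stand‑in loss, and the tensor shapes are illustrative assumptions of ours, not the authors' code.

```python
import torch

n_layers, n_heads = 6, 16
batch, seq_len, d_head = 2, 5, 64

# Stand-ins for the per-head outputs Att_h(x) in every layer (random toy
# tensors rather than a real Transformer forward pass).
head_outputs = torch.randn(n_layers, batch, seq_len, n_heads, d_head)

# One mask variable xi_h per head, all initialized to 1 (every head active).
xi = torch.ones(n_layers, n_heads, requires_grad=True)

# Att_h^masked(x) = xi_h * Att_h(x)
masked = head_outputs * xi.view(n_layers, 1, 1, n_heads, 1)

# Stand-in for the task loss L(x); in a real model `masked` would flow through
# the rest of the network before reaching the loss.
loss = masked.pow(2).mean()
loss.backward()

# I_h = E_x |dL/dxi_h|; with a single batch this is just the absolute gradient.
# Over a dataset one would accumulate these absolute gradients batch by batch.
importance = xi.grad.abs()                                  # (n_layers, n_heads)

# Layer-wise l2 normalization: I_hat_h = I_h / sqrt(sum_j I_j^2) over the layer.
importance_hat = importance / importance.norm(p=2, dim=1, keepdim=True)
print(importance_hat.shape)                                 # torch.Size([6, 16])
```

Feeding `importance_hat` into the pruning sketch shown earlier completes the pipeline: score, mask, evaluate.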
Michel et al.'s results demonstrate that the standard choice of 16 heads per layer is often over‑provisioned. Using the Head Importance Score, we can build smaller, faster models without sacrificing accuracy, and in some cases even improving it.
“Are Sixteen Heads Really Better than One?” provides a clear, gradient‑based methodology to quantify and prune attention heads. Michel et al.'s Head Importance Score uncovers redundancy in the Transformer’s multi‑head architecture and opens the door to more efficient, interpretable models. In our own work, we extend this idea by combining HIS with Attention Entropy, ensuring that retained heads are not only important but also maintain diverse, well‑distributed attention patterns.
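As a rough sketch of that second criterion, the snippet below computes the average entropy of each head's attention distribution over the keys; the function name `attention_entropy`, the tensor layout, and the toy weights are our own illustrative choices rather than a fixed recipe from the paper.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (batch, n_heads, query_len, key_len) softmax attention weights.
    Returns the mean entropy per head, shape (n_heads,)."""
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy at each query position
    return ent.mean(dim=(0, 2))                      # average over batch and queries

# Toy usage with random attention weights for 16 heads.
weights = torch.softmax(torch.randn(2, 16, 5, 5), dim=-1)
print(attention_entropy(weights))
```

A head with near‑zero entropy attends to essentially a single position, so keeping heads that score well on both normalized HIS and entropy is one simple way to operationalize the “important and well‑distributed” requirement.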
References

Michel, P., Levy, O., & Neubig, G. (2019). Are Sixteen Heads Really Better than One? In Advances in Neural Information Processing Systems (NeurIPS).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).