(ENG) Are Sixteen Heads Really Better than One?

Introduction

Transformers have revolutionized NLP by using multi‑head attention, where each layer contains multiple attention “heads” that learn to focus on different patterns in the input. The canonical Transformer‑large model uses 16 heads per layer (Vaswani et al., 2017), which raises the question: are all of those heads necessary?

In “Are Sixteen Heads Really Better than One?”, Michel et al. (2019) introduce a principled way to answer this via the Head Importance Score (HIS). By measuring how sensitive the model’s loss is to each attention head, they identify and remove heads that contribute very little—or even negatively—to performance. Their surprising finding is that most heads can be pruned with negligible impact, and in some cases, pruning actually improves accuracy.

I decided to review this paper for my NLP class team project. Pruning techniques have typically relied on the Head Importance Score alone to decide which heads contribute the least and can be removed. Our team proposes a two-fold strategy that leverages both the Head Importance Score and Attention Entropy.
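The paper itself defines only the importance score; Attention Entropy is our team's addition, and its exact form is a design choice rather than something taken from Michel et al. A minimal sketch, assuming we average the Shannon entropy of each head's attention distribution over query positions and examples (PyTorch, with a hypothetical tensor layout):

```python
import torch

def attention_entropy(attn_probs, eps=1e-12):
    """Average Shannon entropy of each head's attention distributions in one layer.

    attn_probs: (batch, num_heads, query_len, key_len) softmax attention weights.
    Returns a (num_heads,) tensor: low entropy = a head that focuses on few positions,
    high entropy = a head that spreads attention broadly.
    """
    entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # per query position
    return entropy.mean(dim=(0, 2))  # average over batch and query positions
```

In our two-fold strategy, this entropy is read off the attention weights of a normal forward pass and combined with the importance score defined below.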

Experiment Setup

Michel et al. run experiments on two settings to test generality:

  1. Machine Translation (MT):
    • Model: Transformer‑large (6 encoder + 6 decoder layers, 16 heads each).
    • Data: English→French corpus from WMT; evaluation on newstest2013.
    • Metric: BLEU score, computed on Moses‑tokenized output.
    • Significance: Bootstrap resampling ($p<0.01$) to flag statistically meaningful drops (a generic sketch of this test follows the list).
  2. BERT Fine‑tuning:
    • Model: BERT base‑uncased (12 layers, 12 heads each).
    • Tasks: Selected GLUE benchmark tasks (e.g., MNLI, SST‑2).
    • Metric: Task‑specific accuracy or F1; significance tested via paired $t$‑test.
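As a concrete illustration of the significance test mentioned above, here is a generic paired bootstrap resampling routine; `corpus_metric` is a placeholder for any corpus-level scorer (for example `sacrebleu.corpus_bleu(hyps, [refs]).score`), and this is a sketch of the standard test, not the authors' evaluation script.

```python
import random

def paired_bootstrap(baseline_hyps, pruned_hyps, refs, corpus_metric,
                     n_resamples=1000, seed=0):
    """Paired bootstrap resampling over test sentences for a corpus-level metric.

    Returns the fraction of resamples in which the pruned system scores at least
    as well as the baseline; a value below the threshold (e.g. 0.01) flags a
    statistically significant drop.
    """
    rng = random.Random(seed)
    n = len(refs)
    at_least_as_good = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample sentence indices with replacement
        base_score = corpus_metric([baseline_hyps[i] for i in idx], [refs[i] for i in idx])
        pruned_score = corpus_metric([pruned_hyps[i] for i in idx], [refs[i] for i in idx])
        if pruned_score >= base_score:
            at_least_as_good += 1
    return at_least_as_good / n_resamples
```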

For both settings, the procedure is:

  1. Fine‑tune the pre‑trained model to convergence on the task.
  2. Mask or disable attention heads according to a scoring criterion.
  3. Evaluate performance drop relative to the unpruned baseline.

Pruning is post hoc, meaning no re‑training is performed after heads are removed.
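A minimal sketch of steps 2–3 in the BERT setting, assuming a Hugging Face `transformers`-style classifier whose `forward` accepts a `head_mask` argument and a dataloader that yields dictionaries of tensors with a `labels` key; this illustrates the post hoc masking, not the authors' exact code:

```python
import torch

def evaluate_with_head_mask(model, dataloader, head_mask, device="cpu"):
    """Accuracy of a fine-tuned classifier with selected heads disabled post hoc.

    head_mask: (num_layers, num_heads) tensor, 1.0 = keep the head, 0.0 = prune it.
    The weights are untouched and no re-training is performed.
    """
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            labels = batch.pop("labels")
            logits = model(**batch, head_mask=head_mask.to(device)).logits
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
    return correct / total
```

For example, `head_mask = torch.ones(12, 12)` followed by `head_mask[3, 7] = 0.0` disables head 7 of layer 3, and the resulting accuracy is compared against the all-ones baseline.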

Head Importance Score

To decide which heads are least important, the authors introduce the Head Importance Score (HIS), a gradient-based metric inspired by Taylor expansion methods for pruning.

Each attention head $h$ is assigned a mask variable $\xi_h$, where $\xi_h = 1$ means the head is active and $\xi_h = 0$ means it is disabled. When a head is masked, its output becomes:

\[\mathrm{Att}_h^{\text{masked}}(x) = \xi_h \cdot \mathrm{Att}_h(x)\]
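In code, the mask is just a broadcast multiplication of each head's output by $\xi_h$ before the usual concatenation and output projection; the tensor layout below is an assumption for illustration:

```python
import torch

def apply_head_masks(per_head_outputs, xi):
    """Att_h^masked(x) = xi_h * Att_h(x), applied to all heads of one layer at once.

    per_head_outputs: (batch, num_heads, seq_len, head_dim), the stacked Att_h(x).
    xi: (num_heads,) mask variables, 1.0 = head active, 0.0 = head disabled.
    """
    return per_head_outputs * xi.view(1, -1, 1, 1)
```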

The importance of a head is measured by how much the loss function $\mathcal{L}(x)$ is affected when that head is masked. Specifically, the Head Importance Score is the expected sensitivity of the loss with respect to the mask variable:

\[I_h = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right|\]

This gradient is computed via the chain rule, evaluated at $\xi_h = 1$ (where $\mathrm{Att}_h^{\text{masked}}(x) = \mathrm{Att}_h(x)$):

\[\frac{\partial \mathcal{L}(x)}{\partial \xi_h} = \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h^{\text{masked}}(x)} \cdot \frac{\partial \mathrm{Att}_h^{\text{masked}}(x)}{\partial \xi_h} = \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \cdot \mathrm{Att}_h(x)\]

Therefore, the final expression for $I_h$ becomes:

\[I_h = \mathbb{E}_{x \sim X} \left| \mathrm{Att}_h(x)^T \cdot \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(x)} \right|\]
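In practice, $I_h$ can be estimated with a few forward/backward passes by taking the gradient of the loss with respect to an all-ones head mask. A sketch assuming a Hugging Face `transformers`-style model whose `forward` accepts a `(num_layers, num_heads)` `head_mask` and returns a `.loss` when labels are supplied:

```python
import torch

def head_importance_scores(model, dataloader, num_layers, num_heads, device="cpu"):
    """Estimate I_h = E_x |dL(x)/d(xi_h)| by accumulating gradients w.r.t. head masks."""
    importance = torch.zeros(num_layers, num_heads, device=device)
    model.eval()
    num_batches = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # xi_h = 1 for every head: we only need the gradient at the unpruned model.
        head_mask = torch.ones(num_layers, num_heads, device=device, requires_grad=True)
        loss = model(**batch, head_mask=head_mask).loss
        loss.backward()
        importance += head_mask.grad.abs()
        model.zero_grad(set_to_none=True)  # parameter gradients are never used for updates
        num_batches += 1
    return importance / num_batches
```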

To make the scores comparable across layers, they are normalized within each layer using the $\ell_2$ norm:

\[\hat{I}_h = \frac{I_h}{\left( \sum_{j \in \text{layer}(h)} I_j^2 \right)^{1/2}}\]

This normalization puts every layer's scores on the same scale, so heads can be ranked and pruned consistently across the whole model.
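A short sketch of this normalization for a `(num_layers, num_heads)` matrix of raw scores, continuing the hypothetical PyTorch setup above:

```python
import torch

def normalize_per_layer(importance, eps=1e-12):
    """Divide each layer's scores by that layer's l2 norm, giving the normalized I_h."""
    layer_norms = importance.norm(p=2, dim=1, keepdim=True)
    return importance / (layer_norms + eps)
```

Heads can then be ranked globally on the flattened matrix and pruned starting from the least important.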

Results of Pruning

Michel et al. report that a large fraction of heads can be removed at test time without a statistically significant drop in performance, and that some layers can even be reduced to a single head. These results demonstrate that the standard choice of 16 heads per layer is often over‑provisioned: using the Head Importance Score, one can build smaller, faster models without sacrificing accuracy, and in some cases even improve it.

Conclusion

“Are Sixteen Heads Really Better than One?” provides a clear, gradient‑based methodology to quantify and prune attention heads. The Head Importance Score uncovers redundancy in the Transformer’s multi‑head architecture and opens the door to more efficient, interpretable models. In our own work, we extend this idea by combining HIS with Attention Entropy, so that retained heads are not only important but also maintain diverse, well‑distributed attention patterns.

References

Michel, P., Levy, O., & Neubig, G. (2019). Are Sixteen Heads Really Better than One? Advances in Neural Information Processing Systems 32 (NeurIPS 2019).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).