Motifs in Attention Patterns of Large Language Models

This project investigates structure in the attention mechanisms of large language models. Interpretability work often relies on manually inspecting attention patterns once heads of interest have been identified. We aim to make this process systematic by embedding attention patterns in a meaningful latent space, which we then use to embed the heads that produce them. We also provide a suite of interactive tools for inspecting the patterns produced by different heads, finding heads with similar patterns, exploring the embedding spaces, and browsing known classes of heads.

Pipeline

1. Pattern Extraction

Generate attention patterns from multiple language models using diverse text prompts.
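To make concrete what a single "attention pattern" is, here is a minimal NumPy sketch of one causal head. It is a toy stand-in, not the project's extraction code: in the real pipeline the patterns come from actual language models run on text prompts, whereas here the queries and keys are random and the function name `causal_attention_pattern` is hypothetical.

```python
import numpy as np


def causal_attention_pattern(q, k):
    """Return the (seq, seq) causal attention pattern of one head.

    q, k: arrays of shape (seq_len, d_head). Toy inputs, assumed for illustration.
    """
    seq_len, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)
    # Mask future positions so each token attends only to itself and earlier tokens.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))  # random stand-in for real query vectors
k = rng.normal(size=(6, 8))  # random stand-in for real key vectors
pattern = causal_attention_pattern(q, k)
```

Each row of `pattern` is a probability distribution over the current and earlier positions; one such matrix per (prompt, head) pair is the raw material for the later steps.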

2. Feature Extraction

Compute handcrafted features from each attention pattern.
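As an illustration of what handcrafted features might look like, the sketch below computes four simple statistics from one pattern. The specific features (diagonal mass, previous-token mass, first-token mass, mean row entropy) and the function name `pattern_features` are assumptions chosen for this example; the project's actual feature set is not listed here.

```python
import numpy as np


def pattern_features(pattern):
    """Compute a few illustrative features of a (seq, seq) attention pattern.

    The feature choices are hypothetical examples, not the project's feature list.
    Requires seq_len >= 2.
    """
    seq_len = pattern.shape[0]
    eps = 1e-12  # avoids log(0) on masked entries
    return {
        # Average attention each token pays to itself.
        "diag_mass": np.trace(pattern) / seq_len,
        # Average attention paid to the immediately preceding token.
        "prev_token_mass": np.trace(pattern, offset=-1) / (seq_len - 1),
        # Average attention paid to the first token in the sequence.
        "first_token_mass": pattern[:, 0].mean(),
        # Mean Shannon entropy of the rows (low means sharply focused attention).
        "mean_row_entropy": -(pattern * np.log(pattern + eps)).sum(axis=1).mean(),
    }


# An idealized previous-token head: every token attends to its predecessor.
prev_token_pattern = np.eye(5, k=-1)
prev_token_pattern[0, 0] = 1.0  # the first token can only attend to itself
features = pattern_features(prev_token_pattern)
```

Applying such a function to every pattern yields one row of features per (prompt, head) pair, which is the table the next step operates on.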

3. Feature Analysis

Normalize the feature table and apply principal component analysis (PCA) to it.
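A minimal sketch of this step, assuming z-score normalization per feature column and PCA computed through a singular value decomposition. Both choices, and the helper names `standardize` and `pca`, are assumptions made for illustration; the feature table here is random data.

```python
import numpy as np


def standardize(X):
    """Z-score each column; constant columns are left at zero."""
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - X.mean(axis=0)) / std


def pca(X, n_components):
    """PCA via SVD. Returns (scores, components, explained_variance_ratio)."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]        # principal axes, one per row
    scores = Xc @ components.T            # coordinates of each row in PCA space
    explained = S**2 / (S**2).sum()
    return scores, components, explained[:n_components]


rng = np.random.default_rng(0)
# Toy feature table: 200 patterns x 4 features on very different scales.
feature_table = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
Z = standardize(feature_table)
scores, components, ratio = pca(Z, n_components=2)
```

Normalizing first keeps features with large numeric ranges from dominating the principal components.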

4. Distances Between Heads

Each head now has a point cloud in PCA space. We compute the distance between each pair of clouds to obtain a head-by-head distance matrix.
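The choice of cloud-to-cloud distance is not specified above. The sketch below assumes a symmetric Chamfer distance (average nearest-neighbour distance in both directions) purely as an example; other choices such as centroid or Wasserstein distances would slot into the same loop. The function names and the toy clouds are hypothetical.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (n, d) and b (m, d)."""
    # Pairwise Euclidean distances, shape (n, m).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())


def head_distance_matrix(clouds):
    """Build a symmetric (n_heads, n_heads) matrix from a list of point clouds."""
    n = len(clouds)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = chamfer_distance(clouds[i], clouds[j])
    return D


rng = np.random.default_rng(0)
# Three toy "heads": two with nearby clouds, one far away.
clouds = [rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in (0.0, 0.2, 5.0)]
D = head_distance_matrix(clouds)
```

The result has zeros on the diagonal and is symmetric, which is the form the embedding and clustering steps expect.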

5. Embeddings of Heads

Use the distance matrix to get embeddings of heads in a meaningful latent space.
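One standard way to turn a distance matrix into coordinates is classical multidimensional scaling (MDS). The sketch below assumes that method for illustration only; the embedding technique actually used is not named above, and `classical_mds` is a hypothetical helper.

```python
import numpy as np


def classical_mds(D, n_components=2):
    """Embed points from a symmetric distance matrix D using classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D**2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues, which arise for non-Euclidean distances.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Sanity check on points whose true layout is known (corners of a 3 x 4 rectangle).
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding = classical_mds(D, n_components=2)
D_recovered = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
```

For exactly Euclidean inputs like this rectangle, the embedding reproduces the original distances up to rotation and reflection; for head distances it gives a low-dimensional approximation.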

6. Clustering of Heads

Cluster heads using hierarchical, HDBSCAN, and Leiden methods on the head distance matrix. Analyze cluster composition across models and layers.
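As a sketch of the hierarchical variant only, the example below runs SciPy's agglomerative clustering on a precomputed distance matrix. The average-linkage method, the fixed number of clusters, the helper name `cluster_heads`, and the toy matrix are all assumptions; HDBSCAN and Leiden are not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_heads(D, n_clusters):
    """Cut an average-linkage dendrogram built from distance matrix D."""
    # linkage expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(D, checks=True)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy distance matrix for five heads forming two obvious groups.
D = np.array([
    [0.0, 0.1, 0.2, 5.0, 5.1],
    [0.1, 0.0, 0.15, 5.2, 5.0],
    [0.2, 0.15, 0.0, 5.1, 5.3],
    [5.0, 5.2, 5.1, 0.0, 0.1],
    [5.1, 5.0, 5.3, 0.1, 0.0],
])
labels = cluster_heads(D, n_clusters=2)
```

The returned labels can then be cross-tabulated against model and layer to study cluster composition.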

7. Ablation

Zero-ablate individual attention heads and measure the impact on model loss across diverse prompts.
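The idea of zero-ablation can be sketched on a toy model in which the residual stream is just the sum of per-head outputs. Everything below (the toy forward pass, the random tensors, and the names `toy_forward` and `ablation_effects`) is a hypothetical illustration; the real experiment ablates heads inside actual language models.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy of logits (seq, vocab) against integer targets (seq,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def toy_forward(head_outputs, w_unembed, ablate_head=None):
    """Sum head outputs (n_heads, seq, d_model), optionally zeroing one head."""
    keep = np.ones(head_outputs.shape[0])
    if ablate_head is not None:
        keep[ablate_head] = 0.0  # zero-ablation: replace the head's output with zeros
    residual = (head_outputs * keep[:, None, None]).sum(axis=0)
    return residual @ w_unembed


def ablation_effects(head_outputs, w_unembed, targets):
    """Loss increase caused by zero-ablating each head in turn."""
    base = cross_entropy(toy_forward(head_outputs, w_unembed), targets)
    return np.array([
        cross_entropy(toy_forward(head_outputs, w_unembed, h), targets) - base
        for h in range(head_outputs.shape[0])
    ])


rng = np.random.default_rng(0)
n_heads, seq_len, d_model, vocab = 4, 16, 8, 10
head_outputs = rng.normal(size=(n_heads, seq_len, d_model))
w_unembed = rng.normal(size=(d_model, vocab))
targets = rng.integers(0, vocab, size=seq_len)
effects = ablation_effects(head_outputs, w_unembed, targets)
```

A positive entry in `effects` means the loss rose when that head was removed; averaging such deltas over many prompts gives a per-head importance score.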