Motifs in Attention Patterns of Large Language Models

This project investigates structure in the attention mechanisms of large language models. Much interpretability work involves manually inspecting attention patterns once heads of interest have been identified. We aim to improve this process by systematically embedding attention patterns in a meaningful latent space, which we in turn use to embed the heads that produce them. We provide a suite of interactive tools for inspecting the patterns produced by different heads, finding heads with similar patterns, exploring the embedding spaces, and examining known classes of heads.

Interface Map

The diagram below shows how the analysis tools fit together. Click on any component to explore that interface.

[Interface relations diagram: components linked by existing and planned connections]

Pipeline

1. Pattern Extraction

Generate attention patterns from multiple language models using diverse text prompts.
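A minimal sketch of this step, assuming the Hugging Face transformers library; the model name and prompt below are placeholders, not the project's actual choices:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder; any model that can return attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

prompts = ["The quick brown fox jumps over the lazy dog."]  # placeholder prompt

with torch.no_grad():
    inputs = tokenizer(prompts, return_tensors="pt")
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, n_heads, seq, seq)
patterns = torch.stack(outputs.attentions)  # (n_layers, batch, n_heads, seq, seq)
```

Running this over many diverse prompts yields, for each head, a collection of attention patterns to analyze downstream.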

2. Feature Extraction

Compute handcrafted features from each attention pattern.
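The concrete feature set is a design choice not fixed here; the sketch below computes a few plausible examples (mean row entropy, diagonal mass, attention to the first token, previous-token mass) from a single pattern:

```python
import numpy as np

def pattern_features(A: np.ndarray) -> np.ndarray:
    """Illustrative handcrafted features of one (seq, seq) attention pattern.

    These four features are examples, not the project's actual feature set.
    """
    seq = A.shape[0]
    eps = 1e-12
    entropy = -(A * np.log(A + eps)).sum(axis=-1).mean()  # mean row entropy
    diag = np.trace(A) / seq                              # self-attention mass
    first = A[:, 0].mean()                                # attention to first token
    prev = np.diagonal(A, offset=-1).mean() if seq > 1 else 0.0  # previous-token mass
    return np.array([entropy, diag, first, prev])
```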

3. Feature Analysis

Normalize the feature table and apply principal component analysis (PCA).
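A minimal sketch with scikit-learn, assuming the feature table is a NumPy array with one row per attention pattern; the random data and component count are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# features: (n_patterns, n_features), one row per attention pattern
features = np.random.rand(1000, 4)  # placeholder for the real feature table

X = StandardScaler().fit_transform(features)  # zero mean, unit variance per feature
pca = PCA(n_components=3)                     # component count is a tunable choice
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)          # how much variance each PC captures
```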

4. Distances between heads

Each head now corresponds to a point cloud in PCA space; computing distances between pairs of clouds yields a distance matrix over heads, as sketched below.
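One simple choice of cloud-to-cloud distance, assumed in this sketch, is the Euclidean distance between cloud centroids; richer measures such as the Wasserstein distance drop in the same way:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def centroid_distance_matrix(clouds: dict) -> np.ndarray:
    """Distance matrix over heads from their PCA point clouds.

    clouds maps a head id to an (n_patterns, n_components) array.
    Centroid distance is an assumption, not the only reasonable metric.
    """
    heads = sorted(clouds)
    centroids = np.stack([clouds[h].mean(axis=0) for h in heads])
    return squareform(pdist(centroids))  # (n_heads, n_heads), symmetric
```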

5. Embeddings of heads

Use the distance matrix to embed the heads in a meaningful latent space.
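Any embedding method that accepts a precomputed distance matrix works here; the sketch below assumes metric MDS from scikit-learn, with a random symmetric matrix standing in for the real distances:

```python
import numpy as np
from sklearn.manifold import MDS

# D: (n_heads, n_heads) distance matrix from the previous step (placeholder here)
D = np.random.rand(12, 12)
D = (D + D.T) / 2          # symmetrize
np.fill_diagonal(D, 0.0)   # zero self-distance

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
head_embedding = mds.fit_transform(D)  # (n_heads, 2) coordinates for plotting
```

UMAP or t-SNE with a precomputed metric would serve the same role; MDS is just the simplest drop-in for a distance matrix.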