Extending "Towards Monosemanticity"

Apr 23, 2024

Feature distribution

Background

Building on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [1] by Anthropic and Language models can explain neurons in language models [2] by OpenAI, I attempted to generate natural language explanations for the neurons in Distill-GPT-2. The approach projects the output of the final MLP layer into a higher-dimensional space (similar to dictionary learning) and then uses a language model (gpt-4-turbo-2024-04-09) to describe each dimension of that space in natural language, based on its activation values. The theoretical foundation is the superposition hypothesis, which, simply put, states that each neuron in a language model learns a complicated mix of concepts. For instance, a single neuron may activate strongly on both Korean text and DNA sequences. By projecting MLP outputs to a higher dimension, we can attempt to create “features” that each represent a single explainable concept.
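The projection described above can be sketched as a small sparse autoencoder forward pass. This is a minimal illustration under assumptions, not the code used for these experiments: the function names, the initialization scale, the `l1_coeff` value, and the choice to subtract the decoder bias before encoding are all illustrative.

```python
import numpy as np

def init_autoencoder(d_in, expansion, seed=0):
    """Create parameters for a dictionary with d_in * expansion features.

    All names and the 0.02 init scale are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    d_dict = d_in * expansion
    return {
        "W_enc": rng.normal(0.0, 0.02, size=(d_in, d_dict)),
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.normal(0.0, 0.02, size=(d_dict, d_in)),
        "b_dec": np.zeros(d_in),
    }

def forward(params, x, l1_coeff=1e-3):
    """Project MLP outputs x (n_tokens, d_in) up to features and back.

    Returns the feature activations, the reconstruction, and a loss made of
    the squared reconstruction error plus an L1 sparsity penalty.
    """
    # Encode: centre on the decoder bias, project up, keep positive part.
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    features = np.maximum(0.0, pre)
    # Decode: map the sparse features back to the MLP output space.
    recon = features @ params["W_dec"] + params["b_dec"]
    mse = np.mean(np.sum((recon - x) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return features, recon, mse + l1_coeff * l1
```

Each column of the feature matrix is one candidate “feature”; its activations over a text corpus are what would be passed to the explainer model.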

Challenges

Reproducing Anthropic’s representation from the paper’s appendix
- Huge thanks to Neel Nanda for his blog post and repo [3, 4].
Tuning hyperparameters
Loss degradation over longer training runs
Automated interpretability using OpenAI’s implementation package
- API changes requiring code refactoring (due to data parsing changes) or rewriting (due to missing information).
- Poor responses from GPT-4, ultimately making automated interpretability impossible without adjusting the prompts.

Observations

Training Autoencoders for Reconstruction

The primary metric I used for MLP reconstruction efficacy was “reconstruction score”:

$$score = \frac{zero\_abl\_loss - recons\_loss}{zero\_abl\_loss - loss}$$
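As a sketch of how this metric can be computed, the helper below assumes the usual reading of the three terms: `loss` is the language model's original loss, `recons_loss` is its loss when the MLP output is replaced by the autoencoder's reconstruction, and `zero_abl_loss` is its loss when the MLP output is zero-ablated. The function name and the example numbers are hypothetical.

```python
def reconstruction_score(loss, recons_loss, zero_abl_loss):
    """Fraction of the loss gap (zero-ablation vs. original) that is recovered.

    1.0 means the reconstruction is as good as the original MLP output;
    0.0 means it is no better than zeroing the MLP output.
    """
    gap = zero_abl_loss - loss
    if gap == 0:
        raise ValueError("zero-ablation loss equals original loss; score is undefined")
    return (zero_abl_loss - recons_loss) / gap

# Hypothetical losses, chosen only to illustrate the arithmetic.
score = reconstruction_score(loss=3.0, recons_loss=3.3, zero_abl_loss=8.0)
print(f"{score:.2f}")  # about 0.94
```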

I was able to reproduce Anthropic’s autoencoder on a single-layer transformer with a GELU activation, achieving a reconstruction score of ~94% with 2 billion training tokens (fewer than Anthropic’s run). With Distill-GPT-2, I was only able to achieve a reconstruction score of ~77% (with a 32x dictionary size). I observed that (1) training with more tokens did not significantly improve reconstruction scores, and (2) performance would deteriorate over the rest of the training run after reaching an optimum. This suggests that scaling up this approach will require a more sophisticated training strategy.
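One simple guard against this kind of late-run deterioration is to track the best evaluation score and stop once it has not improved for a while. The sketch below is a generic early-stopping helper, not the strategy used in these runs; the class name, the `patience` parameter, and the example scores are hypothetical.

```python
class BestCheckpointTracker:
    """Remember the best evaluation score and signal when to stop training."""

    def __init__(self, patience):
        self.patience = patience          # evaluations to wait without improvement
        self.best_score = float("-inf")
        self.best_step = None
        self.evals_since_best = 0

    def update(self, step, score):
        """Record one evaluation; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_step = step
            self.evals_since_best = 0     # a checkpoint would be saved here
        else:
            self.evals_since_best += 1
        return self.evals_since_best >= self.patience

# Hypothetical reconstruction scores that peak and then decline.
tracker = BestCheckpointTracker(patience=2)
for step, score in enumerate([0.50, 0.70, 0.77, 0.76, 0.74]):
    if tracker.update(step, score):
        break
```

This only preserves the optimum; it does not explain or prevent the degradation itself.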

Dictionary Size

Anthropic tested many dictionary sizes from 1x to 256x but focussed on 8x for their primary findings. I hypothesized that a larger size would be optimal for a larger model since it is trained on more tokens and learns more complex representations. I first trained a 32x dictionary and later trained a 128x dictionary. Training a larger dictionary was naturally more computationally intensive, with cost growing roughly linearly, $O(n)$, in the dictionary size $n$.
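To make the scaling concrete, the sketch below counts autoencoder parameters as a function of the expansion factor. It assumes an input width of 768 (the hidden size of Distill-GPT-2) and an autoencoder with one encoder matrix, one decoder matrix, and two bias vectors; the post does not state the exact input width or layout used, so treat both as assumptions.

```python
def autoencoder_param_count(d_in, expansion):
    """Parameters in a one-hidden-layer autoencoder with d_in * expansion features."""
    d_dict = d_in * expansion
    # encoder weights + encoder bias + decoder weights + decoder bias
    return d_in * d_dict + d_dict + d_dict * d_in + d_in

D_IN = 768  # assumed input width, not confirmed by the post
for expansion in (8, 32, 128):
    print(f"{expansion}x: {autoencoder_param_count(D_IN, expansion):,} parameters")
```

Because the weight matrices dominate, going from 32x to 128x multiplies the parameter count, and with it the per-token compute, by close to four.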

Interpretability

The natural language explanations suggested that 32x is not expressive enough for Distill-GPT-2. Many features received the same explanation because they activated on a wide range of tokens. Since these ranges were not specific, the explanations reflected the high-frequency tokens in the evaluation text more than the model itself. Thus, I posit that larger models need much larger dictionaries, as they encode more features in each neuron of the MLP layer. There are ~15x more parameters in Distill-GPT-2 than in the single-layer transformer that Anthropic analyzed. Thus, I opted to test a 128x dictionary in addition to the 32x.
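One way to quantify "activates on a wide range of tokens" is to measure, for each feature, the fraction of tokens on which it fires. The helper below is an illustrative sketch, not the analysis behind the observations above; the function name, the `threshold` parameter, and the toy activations are hypothetical.

```python
def firing_frequency(activations, threshold=0.0):
    """Fraction of tokens on which each feature exceeds the threshold.

    activations: list of per-token lists, one activation value per feature.
    """
    if not activations:
        raise ValueError("need at least one token of activations")
    n_features = len(activations[0])
    counts = [0] * n_features
    for token_acts in activations:
        for j, value in enumerate(token_acts):
            if value > threshold:
                counts[j] += 1
    return [count / len(activations) for count in counts]

# Toy data: 4 tokens, 2 features. Feature 0 is selective, feature 1 fires everywhere.
toy = [[0.0, 1.0], [0.5, 2.0], [0.0, 3.0], [0.0, 0.1]]
print(firing_frequency(toy))  # [0.25, 1.0]
```

A feature that fires on most tokens, like the second one here, is a candidate for the unspecific explanations described above.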

Training a 128x dictionary was far more computationally intensive and did not reach a high enough reconstruction score to facilitate interpretability. After training for 2 billion tokens, the score was only ~48%. Additional training led to degradation from this optimum, emphasizing the need for more sophisticated autoencoder training as the model being interpreted becomes larger.

Related Works and Packages

[1] Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023. (Anthropic)
[2] S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, W. Saunders. Language models can explain neurons in language models. 2023. (OpenAI)
[3] https://github.com/neelnanda-io/1L-Sparse-Autoencoder
[4] https://github.com/neelnanda-io/TransformerLens
[5] https://github.com/HoagyC/sparse_coding

Kasra Lekan

Master’s Student
