Extending "Towards Monosemanticity"

Feature distribution


Building on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [1] by Anthropic and Language models can explain neurons in language models [2] by OpenAI, I attempted to generate natural language explanations for the neurons in Distill-GPT-2. The approach has two steps: first, project the outputs of the final MLP layer to a higher dimension (similar to dictionary learning); second, use a language model (gpt-4-turbo-2024-04-09) to generate a natural language description of each higher dimension from its activation values. The underlying theoretical foundation is the superposition hypothesis, which, simply put, states that each neuron in a language model learns a complicated mix of concepts. For instance, a single neuron may activate strongly on both Korean text and DNA sequences. By projecting MLP outputs to a higher dimension, we can attempt to create “features” that each represent one explainable concept.
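As a minimal sketch of the projection step (not the code used for these experiments), the forward pass of a sparse autoencoder can be written in a few lines of NumPy. The layout follows the general recipe in [1]: subtract the decoder bias, encode with a ReLU, decode linearly, and penalise the L1 norm of the features. The function names, the initialisation scheme, the `l1_coeff` value, and the toy dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def init_autoencoder(d_in, expansion, seed=0):
    """Create randomly initialised parameters for a d_in -> d_in * expansion dictionary."""
    rng = np.random.default_rng(seed)
    d_hidden = d_in * expansion
    W_enc = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, d_hidden))
    W_dec = rng.normal(0.0, 1.0 / np.sqrt(d_hidden), size=(d_hidden, d_in))
    # Scale each dictionary direction (decoder row) to unit norm.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_enc,
        "b_enc": np.zeros(d_hidden),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_in),
    }

def forward(params, x, l1_coeff=1e-3):
    """Return (features, reconstruction, loss) for a batch x of shape (batch, d_in)."""
    # Encode: centre the input on the decoder bias, project up, apply ReLU.
    pre_acts = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    features = np.maximum(0.0, pre_acts)
    # Decode: linear map back to the input space.
    recon = features @ params["W_dec"] + params["b_dec"]
    # Loss: squared reconstruction error plus an L1 sparsity penalty.
    mse = np.mean(np.sum((recon - x) ** 2, axis=-1))
    l1 = l1_coeff * np.mean(np.sum(np.abs(features), axis=-1))
    return features, recon, mse + l1

# Toy usage: 16-dimensional inputs (a stand-in for MLP outputs), 8x dictionary.
params = init_autoencoder(d_in=16, expansion=8)
x = np.random.default_rng(1).normal(size=(4, 16))
features, recon, loss = forward(params, x)
print(features.shape, recon.shape)
```

Each column of `features` is one candidate “feature”; training would adjust the parameters by gradient descent on `loss`, which this sketch leaves out.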


  • Reproducing Anthropic’s representation from the paper’s appendix
    • Huge thanks to Neel Nanda for his blog post and repo [3, 4].
  • Tuning hyperparameters
  • Loss degradation over longer training runs
  • Automated interpretability using OpenAI’s implementation package
    • API changes requiring code refactoring due to data parsing changes or rewriting due to missing information
    • Poor responses from GPT-4, ultimately making automated interpretability impossible without adjusting the prompts


Training Autoencoders for Reconstruction

The primary metric I used for MLP reconstruction efficacy was “reconstruction score”:

$$\text{score} = \frac{\text{zero\_abl\_loss} - \text{recons\_loss}}{\text{zero\_abl\_loss} - \text{loss}}$$
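The score can be computed directly from three language-modelling losses. The sketch below assumes the usual reading of the terms, as in the reference implementation [3]: `loss` is the unmodified model's loss, `recons_loss` is the loss when the MLP output is replaced by the autoencoder's reconstruction, and `zero_abl_loss` is the loss when the MLP output is zeroed out. The numbers in the usage line are made up for illustration and are not results from these experiments.

```python
def reconstruction_score(loss, recons_loss, zero_abl_loss):
    """Fraction of the loss gap (zero ablation vs. clean model) recovered by the autoencoder.

    1.0 means the reconstruction is as good as the original MLP output;
    0.0 means it is no better than zeroing the MLP output.
    """
    gap = zero_abl_loss - loss
    if gap == 0:
        raise ValueError("zero_abl_loss equals loss; the score is undefined")
    return (zero_abl_loss - recons_loss) / gap

# Illustrative values only: (5.0 - 3.5) / (5.0 - 3.0) = 0.75
example_score = reconstruction_score(loss=3.0, recons_loss=3.5, zero_abl_loss=5.0)
print(example_score)
```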

I was able to reproduce Anthropic’s autoencoder on a single-layer transformer with a GELU activation, achieving a reconstruction score of ~94% with 2 billion training tokens (fewer than Anthropic’s run). With Distill-GPT-2, I was only able to achieve a reconstruction score of ~77% (with a 32x dictionary size). I observed that training with more tokens did not significantly improve reconstruction scores. Moreover, performance deteriorated over the rest of the training run after reaching an optimum, suggesting that scaling up this approach would require a more sophisticated training strategy.
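One simple guard against this kind of late-run degradation, which I did not evaluate here, is to evaluate the reconstruction score periodically, keep the best checkpoint, and stop once the score has failed to improve for a fixed number of evaluations. The sketch below is an assumption-laden illustration: the `BestCheckpoint` class, the `patience` value, and the score trace are all hypothetical.

```python
import copy

class BestCheckpoint:
    """Track the best-scoring state seen so far and signal when to stop."""

    def __init__(self, patience):
        self.patience = patience          # evaluations to wait without improvement
        self.best_score = float("-inf")
        self.best_state = None
        self.evals_since_best = 0

    def update(self, score, state):
        """Record one evaluation; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_state = copy.deepcopy(state)  # snapshot, not a reference
            self.evals_since_best = 0
        else:
            self.evals_since_best += 1
        return self.evals_since_best >= self.patience

# Hypothetical score trace that peaks and then degrades.
tracker = BestCheckpoint(patience=2)
stopped_at = None
for step, score in enumerate([0.60, 0.72, 0.77, 0.75, 0.74, 0.70]):
    if tracker.update(score, {"step": step}):
        stopped_at = step
        break
print(tracker.best_score, tracker.best_state, stopped_at)
```

In a real run, `state` would be the autoencoder's parameters (and optimizer state) rather than a step counter.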

Dictionary Size

Anthropic tested many dictionary sizes from 1x to 256x but focussed on 8x for their primary findings. I hypothesized that a larger size would be optimal for a larger model since it is trained on more tokens and learns more complex representations. I first trained a 32x dictionary and later trained a 128x dictionary. Training a larger dictionary was naturally more computationally intensive: the autoencoder’s parameter count grows linearly with the dictionary size $n$, i.e. $O(n)$.
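To make the linear scaling concrete, the sketch below counts the parameters of a one-hidden-layer autoencoder (encoder and decoder weights plus both biases) for several expansion factors. The input width of 768 is an assumed value used only for illustration, and the helper name is hypothetical.

```python
def autoencoder_param_count(d_in, expansion):
    """Parameters in an autoencoder with hidden width d_in * expansion."""
    d_hidden = d_in * expansion
    weights = 2 * d_in * d_hidden   # encoder matrix + decoder matrix
    biases = d_hidden + d_in        # encoder bias + decoder bias
    return weights + biases

D_IN = 768  # assumed input width, for illustration only
for expansion in (8, 32, 128):
    print(f"{expansion:>3}x dictionary: {autoencoder_param_count(D_IN, expansion):,} parameters")

# Going from 32x to 128x multiplies the parameter count by roughly 4.
ratio_128_to_32 = autoencoder_param_count(D_IN, 128) / autoencoder_param_count(D_IN, 32)
```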


The natural language explanations suggested that 32x is not expressive enough for Distill-GPT-2. Many features received the same explanation because they activated on a wide range of tokens. Since these ranges were not specific, the explanations reflected the high-frequency tokens in the evaluation text more than anything the model had learned. I therefore posit that larger models need much larger dictionaries, as they encode more features for each neuron in the MLP layer. There are ~15x more parameters in Distill-GPT-2 than in the single-layer transformer that Anthropic analyzed, so I opted to test a 128x dictionary in addition to the 32x.
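One rough way to quantify this symptom is to measure how many features share an explanation with at least one other feature. The sketch below is illustrative only: the helper name is hypothetical and the explanation strings are placeholders, not actual GPT-4 outputs from these experiments.

```python
from collections import Counter

def explanation_duplication(explanations):
    """Return (fraction of features with a non-unique explanation, most common explanation and its count)."""
    if not explanations:
        raise ValueError("need at least one explanation")
    # Normalise case and surrounding whitespace so trivial variants match.
    normalized = [text.strip().lower() for text in explanations]
    counts = Counter(normalized)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(normalized), counts.most_common(1)[0]

# Placeholder explanations, one per feature.
placeholder_explanations = [
    "common function words",
    "Common function words ",
    "punctuation",
    "common function words",
    "numbers",
]
fraction, top = explanation_duplication(placeholder_explanations)
print(fraction, top)
```

Exact string matching undercounts paraphrased duplicates, so this gives a lower bound on how often explanations repeat.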

Training a 128x dictionary was far more computationally intensive and did not reach a high enough reconstruction score to facilitate interpretability. After training for 2 billion tokens, the score was only ~48%. Additional training led to degradation from this optimum, emphasizing the need for more sophisticated autoencoder training as the model being interpreted becomes larger.

Related Works and Packages

  1. Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023. (Anthropic)
  2. S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, W. Saunders. Language models can explain neurons in language models. 2023. (OpenAI)
  3. https://github.com/neelnanda-io/1L-Sparse-Autoencoder
  4. https://github.com/neelnanda-io/TransformerLens
  5. https://github.com/HoagyC/sparse_coding
Kasra Lekan
Master’s Student

My research interests include natural language processing, human-AI interaction, and modelling complex systems.