GenAI Risks and Benefits Blog

Image credit: Feng et al.

This seminar focused on understanding the potential risks and benefits of advances in Generative Artificial Intelligence and Large Language Models. It was a research-focused seminar in which each student group planned two week-long modules, selecting the required readings, writing discussion questions, and leading the resulting discussions.

Topics Included:

  1. Introduction to Foundational Models
  2. Alignment (planned by my group)
  3. Prompting and Bias
  4. Capabilities of LLMs
  5. Hallucination
  6. Machine Translation
  7. Interpretability (planned by my group)
  8. GANs and DeepFakes
  9. Data Selection for LLMs
  10. Watermarking on Generative Models
  11. Multi-modal Models
  12. Economic Impacts of AI

Example Syllabus from Interpretability:

Monday 23 October

Questions:

  1. Chaszczewicz highlights shared challenges in XAI development across different data types (e.g. image, text, graph data) and explanation units (e.g. saliency, attention, graph-type explainers). What are some potential similarities or differences in addressing these issues? (A gradient-saliency sketch follows this list.)

  2. In cases where models produce accurate results but lack transparency, should the lack of explainability be a significant concern? How should organizations/developers balance the tradeoffs between explainability and accuracy?

  3. How could XAI tools be used to improve adversarial attacks?

  4. In Attention is not not Explanation, the authors dispute a previous paper’s definition of explanation. Whose view do you find more convincing, and why?
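
To make the “saliency” explanation unit in question 1 concrete, here is a minimal sketch of a gradient-based saliency explainer for a toy text classifier. The model, token IDs, and dimensions are made up for illustration; this is not taken from any of the readings and assumes PyTorch.

```python
# Hypothetical toy text classifier with gradient-based token saliency.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, num_classes = 100, 16, 2

# Toy classifier: embed tokens, mean-pool, and apply a linear layer.
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, num_classes)

tokens = torch.tensor([[5, 17, 42, 8]])      # one toy "sentence"
embeds = embedding(tokens)                   # shape (1, seq_len, embed_dim)
embeds.retain_grad()                         # keep gradients on the embeddings

logits = classifier(embeds.mean(dim=1))      # shape (1, num_classes)
pred = logits.argmax(dim=-1).item()

# Saliency: gradient of the predicted-class score w.r.t. each token's
# embedding, summarized by its L2 norm (larger = more influential token).
logits[0, pred].backward()
saliency = embeds.grad.norm(dim=-1).squeeze(0)
print(saliency)
```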

Wednesday 25 October

Required Readings

Questions:

  1. (Softmax Linear Units) Elhage et al. present the Superposition Hypothesis, which argues that networks attempt to learn more features than they have neurons. When multiple features are delegated to a single node, interpreting the significance of that node becomes challenging. Do you believe this hypothesis based on their explanation, or do you suspect a separate obstacle is at play, such as the counter-argument that nodes could represent standalone features that are difficult to infer but often obvious once discovered? (A toy illustration of superposition follows this list.)

  2. (Softmax Linear Units) Do you see any difference between SoLU and ELU coupled with batch norm/layer norm? How does this relate to the reasons the LLM community shifted from ReLU (or variants like ELU) to GeLU? (An activation-comparison sketch follows this list.)

  3. (Towards Monosemanticity) Could the identification of these “interpretable” features enable training (via distillation or other means) smaller models that still preserve interpretability?

  4. (Towards Monosemanticity) Toying around with the visualization seems to show good identification of relevant positive tokens for concepts, but the negative concepts do not seem very insightful. Try the explorer out for a few concepts and see if these observations align with what you see. What do you think might be happening here? Could it be solved by changing the auto-encoder training pipeline, or by structural changes like SoLU? Are there other interesting elements or patterns you see? (A minimal sparse-autoencoder sketch follows this list.)
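
For question 1, here is a small numerical illustration of the intuition behind superposition. It is my own toy sketch (not Elhage et al.'s experimental setup) and assumes PyTorch: the point is simply that far more than n nearly-orthogonal directions fit in n dimensions, so sparse features can share neurons with limited interference.

```python
# Toy demonstration: many near-orthogonal feature directions in few dimensions.
import torch

torch.manual_seed(0)
n_neurons, n_features = 64, 512

# Random unit vectors standing in for learned feature directions.
features = torch.randn(n_features, n_neurons)
features = features / features.norm(dim=1, keepdim=True)

# Interference between distinct features = absolute cosine similarity.
cosine = features @ features.T
off_diag = cosine[~torch.eye(n_features, dtype=torch.bool)].abs()

print(f"{n_features} feature directions packed into {n_neurons} dimensions")
print(f"mean interference {off_diag.mean().item():.3f}, "
      f"max interference {off_diag.max().item():.3f}")
```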
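For question 2, this short sketch writes out the activations being compared. The SoLU definition (x · softmax(x), followed by a LayerNorm) comes from the Softmax Linear Units paper; the side-by-side printout itself is just an illustrative setup of my own, assuming PyTorch.

```python
# Compare the activation functions discussed in question 2 on a small grid.
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)

relu = F.relu(x)
elu = F.elu(x)
gelu = F.gelu(x)
# SoLU as defined in the paper: x * softmax(x) over the hidden dimension,
# followed by a LayerNorm.
solu = x * F.softmax(x, dim=-1)
solu_ln = F.layer_norm(solu, solu.shape)

for name, y in [("ReLU", relu), ("ELU", elu), ("GeLU", gelu), ("SoLU+LN", solu_ln)]:
    print(f"{name:8s}", y)
```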
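For question 4, here is a minimal sketch of the kind of sparse autoencoder used in Towards Monosemanticity: a single hidden layer wider than the model dimension, a ReLU, and an L1 sparsity penalty. The dimensions are arbitrary and the “MLP activations” are random stand-ins; the real pipeline trains on activations collected from a transformer.

```python
# Minimal sparse autoencoder sketch (hypothetical dimensions and data).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_features, l1_coeff = 128, 1024, 1e-3

encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(200):
    acts = torch.randn(256, d_model)        # stand-in for sampled MLP activations
    feats = torch.relu(encoder(acts))       # sparse, overcomplete feature code
    recon = decoder(feats)

    # Reconstruction error plus an L1 penalty that pushes feature activations
    # toward sparsity, which is what encourages monosemantic features.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Each column of decoder.weight is a candidate "interpretable" feature direction.
print(decoder.weight.shape)   # torch.Size([128, 1024])
```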

Kasra Lekan
Master’s Student

My research interests include natural language processing, human-AI interaction, and modelling complex systems.
