Finding Neurons in a Haystack: Case Studies with Sparse Probing

Finding Neurons in a Haystack: Case Studies with Sparse Probing

2 May 2023

Katherine Harvey

Dmitrii Troitskii

Dimitris Bertsimas

Papers citing "Finding Neurons in a Haystack: Case Studies with Sparse Probing"

11 / 11 papers shown

Title
Demystifying optimized prompts in language models Rimon Melamed Lucas H. McCabe H. H. Huang 22 31 0 04 May 2025
XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs Marco Arazzi Vignesh Kumar Kembu Antonino Nocera V. P. 68 42 0 30 Apr 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 59 0 0 29 Apr 2025
Sparks of Artificial General Intelligence: Early experiments with GPT-4 Sébastien Bubeck Varun Chandrasekaran Ronen Eldan J. Gehrke Eric Horvitz ... Scott M. Lundberg Harsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang ELM AI4MH AI4CE ALM 197 2,232 0 22 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 205 295 0 01 Nov 2022
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 105 17 0 04 Oct 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 229 326 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 91 183 0 21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 205 291 0 24 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 217 1,508 0 31 Dec 2020
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 192 824 0 03 May 2018