Do LLMs dream of elephants (when told not to)? Latent concept
association and associative memory in transformers

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

26 June 2024

Goutham Rajendran

Pradeep Ravikumar

Papers citing "Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers"

19 / 19 papers shown

Title
Taming Knowledge Conflicts in Language Models Gaotang Li Yuzhong Chen Hanghang Tong KELM 44 0 0 14 Mar 2025
On the Robustness of Transformers against Context Hijacking for Linear Classification Tianle Li Chenyang Zhang Xingwu Chen Yuan Cao Difan Zou 67 0 0 24 Feb 2025
Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory Shuo Wang Issei Sato 69 0 0 16 Dec 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective Meng Wang Yunzhi Yao Ziwen Xu Shuofei Qiao Shumin Deng ... Yong-jia Jiang Pengjun Xie Fei Huang Huajun Chen Ningyu Zhang 47 1 0 22 Jul 2024
Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell Taiming Lu Muhan Gao Kuai Yu Adam Byerly Daniel Khashabi 37 11 0 20 Jun 2024
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention Heejune Sheen Siyu Chen Tianhao Wang Harrison H. Zhou MLT 23 10 0 13 Mar 2024
Gemma: Open Models Based on Gemini Research and Technology Gemma Team Gemma Team Thomas Mesnard Cassidy Hardin Robert Dadashi Surya Bhupatiraju ... Armand Joulin Noah Fiedel Evan Senter Alek Andreev Kathleen Kenealy VLM LLMAG 123 415 0 13 Mar 2024
On the Origins of Linear Representations in Large Language Models Yibo Jiang Goutham Rajendran Pradeep Ravikumar Bryon Aragam Victor Veitch 59 24 0 06 Mar 2024
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models Arijit Ghosh Chowdhury Md. Mofijul Islam Vaibhav Kumar F. H. Shezan Vaibhav Kumar Vinija Jain Aman Chadha AAML PILM 21 4 0 03 Mar 2024
Learning Associative Memories with Gradient Descent Vivien A. Cabannes Berfin Simsek A. Bietti 30 3 0 28 Feb 2024
InstructEdit: Instruction-based Knowledge Editing for Large Language Models Ningyu Zhang Bo Tian Siyuan Cheng Xiaozhuan Liang Yi Hu Kouying Xue Yanjie Gou Xi Chen Huajun Chen KELM 40 4 0 25 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models Goutham Rajendran Simon Buchholz Bryon Aragam Bernhard Schölkopf Pradeep Ravikumar AI4CE 83 19 0 14 Feb 2024
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 189 260 0 28 Apr 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding Yuchen Li Yuan-Fang Li Andrej Risteski 98 61 0 07 Mar 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger Zhengxuan Wu Christopher Potts Thomas F. Icard Noah D. Goodman CML 73 98 0 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 486 0 01 Nov 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 240 453 0 24 Sep 2022
Fast Model Editing at Scale E. Mitchell Charles Lin Antoine Bosselut Chelsea Finn Christopher D. Manning KELM 219 341 0 21 Oct 2021
Sorting through the noise: Testing robustness of information processing in pre-trained language models Lalchand Pandia Allyson Ettinger 34 37 0 25 Sep 2021