Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

11 January 2024

Papers citing "Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models"

28 / 78 papers shown

Title
An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs Daking Rai Ziyu Yao LRM 21 7 0 18 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment Asma Ghandeharioun Ann Yuan Marius Guerard Emily Reif Michael A. Lepori Lucas Dixon LLMSV 36 7 0 17 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller Min Cai Yuchen Zhang Shichang Zhang Fan Yin Difan Zou Yisong Yue Ziniu Hu 21 0 0 04 Jun 2024
LoFiT: Localized Fine-tuning on LLM Representations Fangcong Yin Xi Ye Greg Durrett 28 12 0 03 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models Kiho Park Yo Joong Choe Yibo Jiang Victor Veitch 43 24 0 03 Jun 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 64 19 0 28 May 2024
Improved Generation of Adversarial Examples Against Safety-aligned LLMs Qizhang Li Yiwen Guo Wangmeng Zuo Hao Chen AAML SILM 21 5 0 28 May 2024
InversionView: A General-Purpose Method for Reading Information from Neural Activations Xinting Huang Madhur Panwar Navin Goyal Michael Hahn 26 3 0 27 May 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 38 111 0 22 Apr 2024
Does Transformer Interpretability Transfer to RNNs? Gonccalo Paulo Thomas Marshall Nora Belrose 41 6 0 09 Apr 2024
Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge Graph Marco Bronzini Carlo Nicolini Bruno Lepri Jacopo Staiano Andrea Passerini KELM 25 0 0 04 Apr 2024
Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations Lei Yu Meng Cao Jackie Chi Kit Cheung Yue Dong HILM 33 6 0 27 Mar 2024
SelfIE: Self-Interpretation of Large Language Model Embeddings Haozhe Chen Carl Vondrick Chengzhi Mao 19 17 0 16 Mar 2024
Rebuilding ROME : Resolving Model Collapse during Sequential Model Editing Akshat Gupta Sidharth Baskaran Gopala Anumanchipalli KELM 63 19 0 11 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Jing-ling Huang Zhengxuan Wu Christopher Potts Mor Geva Atticus Geiger 53 25 0 27 Feb 2024
Understanding and Patching Compositional Reasoning in LLMs Zhaoyi Li Gangwei Jiang Hong Xie Linqi Song Defu Lian Ying Wei LRM 46 20 0 22 Feb 2024
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space Shahar Katz Yonatan Belinkov Mor Geva Lior Wolf 47 9 1 20 Feb 2024
Representation Surgery: Theory and Practice of Affine Steering Shashwat Singh Shauli Ravfogel Jonathan Herzig Roee Aharoni Ryan Cotterell Ponnurangam Kumaraguru LLMSV 27 12 0 15 Feb 2024
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia Giovanni Monea Maxime Peyrard Martin Josifoski Vishrav Chaudhary Jason Eisner Emre Kiciman Hamid Palangi Barun Patra Robert West KELM 47 12 0 04 Dec 2023
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers Anna Langedijk Hosein Mohebbi Gabriele Sarti Willem H. Zuidema Jaap Jumelet 19 10 0 05 Oct 2023
Can LLMs facilitate interpretation of pre-trained language models? Basel Mousi Nadir Durrani Fahim Dalvi 36 12 0 22 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 184 116 0 30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 189 260 0 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 486 0 01 Nov 2022
Linearly Mapping from Image to Text Space Jack Merullo Louis Castricato Carsten Eickhoff Ellie Pavlick VLM 159 104 0 30 Sep 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 315 8,261 0 28 Jan 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 221 402 0 24 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 245 1,977 0 31 Dec 2020