ResearchTrend.AI
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks, Max Tegmark
arXiv:2310.06824 · 10 October 2023 · HILM

Papers citing "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets" (26 papers shown)
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
  Hannah Cyberey, David E. Evans · LLMSV · 23 Apr 2025

Do Large Language Models know who did what to whom?
  Joseph M. Denning, Xiaohan, Bryor Snefjella, Idan A. Blank · 23 Apr 2025

Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
  Erfan Shayegani, G M Shahariar, Sara Abdali, Lei Yu, Nael B. Abu-Ghazaleh, Yue Dong · AAML · 01 Apr 2025

Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
  Ziwei Ji, L. Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, Nicola Cancedda · 18 Mar 2025

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
  Thomas Winninger, Boussad Addad, Katarzyna Kapusta · AAML · 08 Mar 2025

Linear Representations of Political Perspective Emerge in Large Language Models
  Junsol Kim, James Evans, Aaron Schein · 03 Mar 2025

Understanding and Rectifying Safety Perception Distortion in VLMs
  Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin · 18 Feb 2025

Lines of Thought in Large Language Models
  Raphael Sarfati, Toni J. B. Liu, Nicolas Boullé, Christopher Earls · LRM, VLM, LM&Ro · 17 Feb 2025

Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
  Jannik Brinkmann, Chris Wendler, Christian Bartelt, Aaron Mueller · 10 Jan 2025

ICLR: In-Context Learning of Representations
  Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, Hidenori Tanaka · AIFin · 29 Dec 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
  Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen · 17 Nov 2024

Prompt-Guided Internal States for Hallucination Detection of Large Language Models
  Fujie Zhang, Peiqi Yu, Biao Yi, Baolei Zhang, Tong Li, Zheli Liu · HILM, LRM · 07 Nov 2024

All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
  Emanuele Marconato, Sébastien Lachapelle, Sebastian Weichwald, Luigi Gresele · 30 Oct 2024

Do LLMs "know" internally when they follow instructions?
  Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Shirley Ren, Udhay Nallasamy, Andy Miller, Kwan Ho Ryan Chan, Jaya Narain · 18 Oct 2024

Improving Instruction-Following in Language Models through Activation Steering
  Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi · LLMSV · 15 Oct 2024

The Geometry of Concepts: Sparse Autoencoder Feature Structure
  Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark · 10 Oct 2024

Provable Weak-to-Strong Generalization via Benign Overfitting
  David X. Wu, A. Sahai · 06 Oct 2024

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
  Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du · 30 Sep 2024

Robust LLM safeguarding via refusal feature adversarial training
  L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda · AAML · 30 Sep 2024

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
  Daoyang Li, Mingyu Jin, Qingcheng Zeng, Mengnan Du · 22 Sep 2024

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
  Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao · 02 Jul 2024

Feature contamination: Neural networks learn uncorrelated features and fail to generalize
  Tianren Zhang, Chujie Zhao, Guanyu Chen, Yizhou Jiang, Feng Chen · OOD, MLT, OODD · 05 Jun 2024

Finding Neurons in a Haystack: Case Studies with Sparse Probing
  Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas · MILM · 02 May 2023

The Internal State of an LLM Knows When It's Lying
  A. Azaria, Tom Michael Mitchell · HILM · 26 Apr 2023

Toy Models of Superposition
  Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah · AAML, MILM · 21 Sep 2022

Probing Classifiers: Promises, Shortcomings, and Advances
  Yonatan Belinkov · 24 Feb 2021