The Linear Representation Hypothesis and the Geometry of Large Language Models

7 November 2023

Papers citing "The Linear Representation Hypothesis and the Geometry of Large Language Models"

25 / 125 papers shown

Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity Rheeya Uppaal Apratim Dey Yiting He Yiqiao Zhong Junjie Hu 29 7 0 22 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward Raphael Milliere Cameron Buckner LRM 52 13 0 06 May 2024
Improving Dictionary Learning with Gated Sparse Autoencoders Senthooran Rajamanoharan Arthur Conmy Lewis Smith Tom Lieberum Vikrant Varma János Kramár Rohin Shah Neel Nanda RALM 20 78 0 24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 38 111 0 22 Apr 2024
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion Yu Li Zhihua Wei Han Jiang Chuanyang Gong LLMSV 23 2 0 16 Apr 2024
Finding Visual Task Vectors Alberto Hojel Yutong Bai Trevor Darrell Amir Globerson Amir Bar 60 6 0 08 Apr 2024
ReFT: Representation Finetuning for Language Models Zhengxuan Wu Aryaman Arora Zheng Wang Atticus Geiger Daniel Jurafsky Christopher D. Manning Christopher Potts OffRL 30 58 0 04 Apr 2024
Concept-based Analysis of Neural Networks via Vision-Language Models Ravi Mangal Nina Narodytska Divya Gopinath Boyue Caroline Hu Anirban Roy Susmit Jha Corina S. Pasareanu CoGe 18 3 0 28 Mar 2024
Monotonic Representation of Numeric Properties in Language Models Benjamin Heinzerling Kentaro Inui KELM MILM 40 9 0 15 Mar 2024
Towards a theory of model distillation Enric Boix-Adserà FedML VLM 44 6 0 14 Mar 2024
Gemma: Open Models Based on Gemini Research and Technology Gemma Team Thomas Mesnard Cassidy Hardin Robert Dadashi Surya Bhupatiraju ... Armand Joulin Noah Fiedel Evan Senter Alek Andreev Kathleen Kenealy VLM LLMAG 129 423 0 13 Mar 2024
Language Models Represent Beliefs of Self and Others Wentao Zhu Zhining Zhang Yizhou Wang MILM LRM 38 7 0 28 Feb 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) Usha Bhalla Alexander X. Oesterling Suraj Srinivas Flavio du Pin Calmon Himabindu Lakkaraju 34 35 0 16 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models Goutham Rajendran Simon Buchholz Bryon Aragam Bernhard Schölkopf Pradeep Ravikumar AI4CE 83 21 0 14 Feb 2024
Challenges in Mechanistically Interpreting Model Representations Satvik Golechha James Dao 35 3 0 06 Feb 2024
LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law Toni J. B. Liu Nicolas Boullé Raphael Sarfati Christopher Earls AI4TS 25 11 0 01 Feb 2024
Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering Yotam Wolf Noam Wies Dorin Shteyman Binyamin Rothberg Yoav Levine Amnon Shashua LLMSV 21 13 0 29 Jan 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments Zhengxuan Wu Atticus Geiger Jing-ling Huang Aryaman Arora Thomas F. Icard Christopher Potts Noah D. Goodman 28 6 0 23 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity Andrew Lee Xiaoyan Bai Itamar Pres Martin Wattenberg Jonathan K. Kummerfeld Rada Mihalcea 64 95 0 03 Jan 2024
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation Floris Holstege Bram Wouters Noud van Giersbergen C. Diks 21 1 0 18 Oct 2023
Towards Causal Foundation Model: on Duality between Causal Inference and Attention Jiaqi Zhang Joel Jennings Agrin Hilmkil Nick Pawlowski Cheng Zhang Chao Ma CML 41 13 0 01 Oct 2023
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 120 316 0 21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 224 404 0 24 Feb 2021
Contrastive Learning Inverts the Data Generating Process Roland S. Zimmermann Yash Sharma Steffen Schneider Matthias Bethge Wieland Brendel SSL 236 207 0 17 Feb 2021
Word Translation Without Parallel Data Alexis Conneau Guillaume Lample Marc'Aurelio Ranzato Ludovic Denoyer Hervé Jégou 165 1,634 0 11 Oct 2017