2311.03658
The Linear Representation Hypothesis and the Geometry of Large Language Models
7 November 2023
Kiho Park
Yo Joong Choe
Victor Veitch
LLMSV
MILM
Papers citing "The Linear Representation Hypothesis and the Geometry of Large Language Models"
25 / 125 papers shown
Title
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal
Apratim Dey
Yiting He
Yiqiao Zhong
Junjie Hu
29
7
0
22 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward
Raphaël Millière
Cameron Buckner
LRM
52
13
0
06 May 2024
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Tom Lieberum
Vikrant Varma
János Kramár
Rohin Shah
Neel Nanda
RALM
20
78
0
24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
38
111
0
22 Apr 2024
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Yu Li
Zhihua Wei
Han Jiang
Chuanyang Gong
LLMSV
23
2
0
16 Apr 2024
Finding Visual Task Vectors
Alberto Hojel
Yutong Bai
Trevor Darrell
Amir Globerson
Amir Bar
60
6
0
08 Apr 2024
ReFT: Representation Finetuning for Language Models
Zhengxuan Wu
Aryaman Arora
Zheng Wang
Atticus Geiger
Daniel Jurafsky
Christopher D. Manning
Christopher Potts
OffRL
30
58
0
04 Apr 2024
Concept-based Analysis of Neural Networks via Vision-Language Models
Ravi Mangal
Nina Narodytska
Divya Gopinath
Boyue Caroline Hu
Anirban Roy
Susmit Jha
Corina S. Pasareanu
CoGe
18
3
0
28 Mar 2024
Monotonic Representation of Numeric Properties in Language Models
Benjamin Heinzerling
Kentaro Inui
KELM
MILM
40
9
0
15 Mar 2024
Towards a theory of model distillation
Enric Boix-Adserà
FedML
VLM
44
6
0
14 Mar 2024
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team
Thomas Mesnard
Cassidy Hardin
Robert Dadashi
Surya Bhupatiraju
...
Armand Joulin
Noah Fiedel
Evan Senter
Alek Andreev
Kathleen Kenealy
VLM
LLMAG
129
423
0
13 Mar 2024
Language Models Represent Beliefs of Self and Others
Wentao Zhu
Zhining Zhang
Yizhou Wang
MILM
LRM
38
7
0
28 Feb 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla
Alexander X. Oesterling
Suraj Srinivas
Flavio du Pin Calmon
Himabindu Lakkaraju
34
35
0
16 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran
Simon Buchholz
Bryon Aragam
Bernhard Schölkopf
Pradeep Ravikumar
AI4CE
83
21
0
14 Feb 2024
Challenges in Mechanistically Interpreting Model Representations
Satvik Golechha
James Dao
35
3
0
06 Feb 2024
LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law
Toni J. B. Liu
Nicolas Boullé
Raphael Sarfati
Christopher Earls
AI4TS
25
11
0
01 Feb 2024
Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering
Yotam Wolf
Noam Wies
Dorin Shteyman
Binyamin Rothberg
Yoav Levine
Amnon Shashua
LLMSV
21
13
0
29 Jan 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu
Atticus Geiger
Jing-ling Huang
Aryaman Arora
Thomas F. Icard
Christopher Potts
Noah D. Goodman
28
6
0
23 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
64
95
0
03 Jan 2024
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation
Floris Holstege
Bram Wouters
Noud van Giersbergen
C. Diks
21
1
0
18 Oct 2023
Towards Causal Foundation Model: on Duality between Causal Inference and Attention
Jiaqi Zhang
Joel Jennings
Agrin Hilmkil
Nick Pawlowski
Cheng Zhang
Chao Ma
CML
41
13
0
01 Oct 2023
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
120
316
0
21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
224
404
0
24 Feb 2021
Contrastive Learning Inverts the Data Generating Process
Roland S. Zimmermann
Yash Sharma
Steffen Schneider
Matthias Bethge
Wieland Brendel
SSL
236
207
0
17 Feb 2021
Word Translation Without Parallel Data
Alexis Conneau
Guillaume Lample
Marc'Aurelio Ranzato
Ludovic Denoyer
Hervé Jégou
165
1,634
0
11 Oct 2017