Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2312.01037
Cited By
Eliciting Latent Knowledge from Quirky Language Models
2 December 2023
Alex Troy Mallen
Madeline Brumley
Julia Kharchenko
Nora Belrose
HILM
RALM
KELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Eliciting Latent Knowledge from Quirky Language Models"
23 / 23 papers shown
Title
Exploring How LLMs Capture and Represent Domain-Specific Knowledge
Mirian Hipolito Garcia
Camille Couturier
Daniel Madrigal Diaz
Ankur Mallick
Anastasios Kyrillidis
Robert Sim
Victor Rühle
Saravan Rajmohan
25
0
0
23 Apr 2025
Mechanistic Anomaly Detection for "Quirky" Language Models
David Johnston
Arkajyoti Chakraborty
Nora Belrose
24
0
0
09 Apr 2025
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker
Joost Huizinga
Leo Gao
Zehao Dou
M. Guan
Aleksander Mądry
Wojciech Zaremba
J. Pachocki
David Farhi
LRM
62
11
0
14 Mar 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim
Xiaoyuan Yi
Jing Yao
Muhua Huang
Jinyeong Bak
James Evans
Xing Xie
36
0
0
08 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
100
0
0
24 Feb 2025
Does Representation Matter? Exploring Intermediate Layers in Large Language Models
Oscar Skean
Md Rifat Arefin
Yann LeCun
Ravid Shwartz-Ziv
79
7
0
12 Dec 2024
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangde
LLMSV
56
10
0
18 Nov 2024
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Troy Mallen
Nora Belrose
25
1
0
17 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
Gintare Karolina Dziugaite
KELM
MU
31
5
0
16 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee
Shaona Ghosh
Traian Rebedea
Christopher Parisien
LLMSV
26
2
0
02 Oct 2024
Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito
Sharan Maiya
Grégoire Dhimoïla
Owen
Owen Yeung
Kaarel Hänni
27
2
0
26 Jul 2024
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign
Adrià Garriga-Alonso
Mamba
29
0
0
19 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
84
16
0
17 Jul 2024
Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models
Yuji Zhang
Sha Li
Jiateng Liu
Pengfei Yu
Yi Ren Fung
Jing Li
Manling Li
Heng Ji
29
10
0
10 Jul 2024
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Sheridan Feucht
David Atkinson
Byron C. Wallace
David Bau
43
5
0
28 Jun 2024
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng
Stuart Russell
Jacob Steinhardt
HILM
32
6
0
27 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
37
23
0
11 Jun 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
38
111
0
22 Apr 2024
Does Transformer Interpretability Transfer to RNNs?
Gonccalo Paulo
Thomas Marshall
Nora Belrose
41
6
0
09 Apr 2024
A Language Model's Guide Through Latent Space
Dimitri von Rutte
Sotiris Anagnostidis
Gregor Bachmann
Thomas Hofmann
32
21
0
22 Feb 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
91
167
0
10 Oct 2023
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,977
0
31 Dec 2020
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
199
199
0
02 May 2018
1