2311.15131
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
25 November 2023
James Campbell
Richard Ren
Phillip Guo
HILM
Papers citing "Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching"
3 / 3 papers shown
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren
Arunim Agarwal
Mantas Mazeika
Cristina Menghini
Robert Vacareanu
...
Matias Geralnik
Adam Khoja
Dean Lee
Summer Yue
Dan Hendrycks
HILM
ALM
88
0
0
05 Mar 2025
On the Role of Attention Heads in Large Language Model Safety
Z. Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Junfeng Fang
Yongbin Li
57
5
0
17 Oct 2024
Standards for Belief Representations in LLMs
Daniel A. Herrmann
B. Levinstein
34
6
0
31 May 2024