Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

25 November 2023

Papers citing "Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching"

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
[HILM, ALM]
05 Mar 2025

On the Role of Attention Heads in Large Language Model Safety
Z. Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
17 Oct 2024

Standards for Belief Representations in LLMs
Daniel A. Herrmann, B. Levinstein
31 May 2024