Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.09476
Cited By
Overthinking the Truth: Understanding how Language Models Process False Demonstrations
18 July 2023
Danny Halawi
Jean-Stanislas Denain
Jacob Steinhardt
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Overthinking the Truth: Understanding how Language Models Process False Demonstrations"
49 / 49 papers shown
Title
Empowering GraphRAG with Knowledge Filtering and Integration
Kai Guo
Harry Shomer
Shenglai Zeng
Haoyu Han
Yu Wang
Jiliang Tang
61
0
0
18 Mar 2025
HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models
Xinyan Jiang
Hang Ye
Yongxin Zhu
Xiaoying Zheng
Zikang Chen
Jun Gong
41
0
0
17 Mar 2025
CER: Confidence Enhanced Reasoning in LLMs
Ali Razghandi
Seyed Mohsen Hosseini
Mahdieh Soleymani Baghshah
LRM
98
2
0
21 Feb 2025
SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Zhining Liu
Rana Ali Amjad
Ravinarayana Adkathimar
Tianxin Wei
Hanghang Tong
RALM
HILM
95
0
0
12 Feb 2025
Learning Task Representations from In-Context Learning
Baturay Saglam
Zhuoran Yang
Dionysis Kalogerias
Amin Karbasi
55
0
0
08 Feb 2025
Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
Haoyu Wang
Zeyu Qin
Li Shen
Xueqian Wang
Minhao Cheng
Dacheng Tao
89
1
0
06 Feb 2025
Improve Decoding Factuality by Token-wise Cross Layer Entropy of Large Language Models
Jialiang Wu
Yi Shen
Sijia Liu
Yi Tang
Sen Song
Xiaoyi Wang
Longjun Cai
63
0
0
05 Feb 2025
Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Daniel Tamayo
Aitor Gonzalez-Agirre
Javier Hernando
Marta Villegas
KELM
85
3
0
04 Feb 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers
Jiajun Song
Zhuoyan Xu
Yiqiao Zhong
80
4
0
31 Dec 2024
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Zhangqi Jiang
Junkai Chen
Beier Zhu
Tingjin Luo
Yankun Shen
Xu Yang
98
4
0
23 Nov 2024
Controllable Context Sensitivity and the Knob Behind It
Julian Minder
Kevin Du
Niklas Stoehr
Giovanni Monea
Chris Wendler
Robert West
Ryan Cotterell
KELM
39
3
0
11 Nov 2024
Looking Beyond The Top-1: Transformers Determine Top Tokens In Order
Daria Lioubashevski
Tomer Schlank
Gabriel Stanovsky
Ariel Goldstein
29
1
0
26 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
65
1
0
03 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang
Anish Kachinthaya
Suzie Petryk
Yossi Gandelsman
VLM
32
14
0
03 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang-Yu He
Yi Zeng
AAML
45
0
0
03 Oct 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
44
2
0
16 Sep 2024
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He
Jascha Achterberg
Katie Collins
Kevin K. Nejad
Danyal Akarca
...
Chole Li
Kai J. Sandbrink
Stephen Casper
Anna Ivanova
Grace W. Lindsay
AI4CE
28
1
0
22 Aug 2024
Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference
Ke Shen
M. Kejriwal
29
0
0
04 Aug 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM
AAML
37
28
0
28 Jun 2024
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng
Stuart Russell
Jacob Steinhardt
HILM
32
6
0
27 Jun 2024
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
Core Francisco Park
Maya Okawa
Andrew Lee
Ekdeep Singh Lubana
Hidenori Tanaka
52
6
0
27 Jun 2024
SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer Shots
Weixing Wang
Haojin Yang
Christoph Meinel
24
0
0
20 Jun 2024
Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning
Hui Liu
Wenya Wang
Hao Sun
Chris Xing Tian
Chenqi Kong
Xin Dong
Haoliang Li
54
5
0
14 Jun 2024
Calibrating Reasoning in Language Models with Internal Consistency
Zhihui Xie
Jizhou Guo
Tong Yu
Shuai Li
LRM
43
8
0
29 May 2024
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
Ruizhe Li
Yanjun Gao
KELM
27
5
0
06 May 2024
Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language Models
Wei He
Shichun Liu
Jun Zhao
Yiwen Ding
Yi Lu
Zhiheng Xi
Tao Gui
Qi Zhang
Xuanjing Huang
37
1
0
01 Apr 2024
On Large Language Models' Hallucination with Regard to Known Facts
Che Jiang
Biqing Qi
Xiangyu Hong
Dayuan Fu
Yang Cheng
Fandong Meng
Mo Yu
Bowen Zhou
Jie Zhou
HILM
LRM
31
16
0
29 Mar 2024
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation
Shiqi Chen
Miao Xiong
Junteng Liu
Zhengxuan Wu
Teng Xiao
Siyang Gao
Junxian He
HILM
51
21
0
03 Mar 2024
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models
Zhuoran Jin
Pengfei Cao
Hongbang Yuan
Yubo Chen
Jiexin Xu
Huaijun Li
Xiaojian Jiang
Kang Liu
Jun Zhao
178
34
0
28 Feb 2024
From Understanding to Utilization: A Survey on Explainability for Large Language Models
Haoyan Luo
Lucia Specia
38
21
0
23 Jan 2024
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui
Yanling Wang
Chuanpu Fu
Yong Xiao
Sijia Li
...
Junwu Xiong
Xinyu Kong
Zujie Wen
Ke Xu
Qi Li
55
56
0
11 Jan 2024
LLM Factoscope: Uncovering LLMs' Factual Discernment through Inner States Analysis
Jinwen He
Yujia Gong
Kai-xiang Chen
Zijin Lin
Chengán Wei
Yue Zhao
19
3
0
27 Dec 2023
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
James Campbell
Richard Ren
Phillip Guo
HILM
14
15
0
25 Nov 2023
Function Vectors in Large Language Models
Eric Todd
Millicent Li
Arnab Sen Sharma
Aaron Mueller
Byron C. Wallace
David Bau
8
99
0
23 Oct 2023
Copy Suppression: Comprehensively Understanding an Attention Head
Callum McDougall
Arthur Conmy
Cody Rushing
Thomas McGrath
Neel Nanda
MILM
15
41
0
06 Oct 2023
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
Anna Langedijk
Hosein Mohebbi
Gabriele Sarti
Willem H. Zuidema
Jaap Jumelet
19
10
0
05 Oct 2023
A Formalism and Approach for Improving Robustness of Large Language Models Using Risk-Adjusted Confidence Scores
Ke Shen
M. Kejriwal
6
2
0
05 Oct 2023
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
Mert Yuksekgonul
Varun Chandrasekaran
Erik Jones
Suriya Gunasekar
Ranjita Naik
Hamid Palangi
Ece Kamar
Besmira Nushi
HILM
13
39
0
26 Sep 2023
Cognitive Mirage: A Review of Hallucinations in Large Language Models
Hongbin Ye
Tong Liu
Aijia Zhang
Wei Hua
Weiqiang Jia
HILM
37
76
0
13 Sep 2023
AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park
Simon Goldstein
Aidan O'Gara
Michael Chen
Dan Hendrycks
25
137
0
28 Aug 2023
Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence
John J. Nay
David Karamardian
Sarah Lawsky
Wenting Tao
Meghana Moorthy Bhat
Raghav Jain
Aaron Travis Lee
Jonathan H. Choi
Jungo Kasai
ELM
AILaw
16
56
0
12 Jun 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
20
192
0
14 Mar 2023
Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations
Xinxi Lyu
Sewon Min
Iz Beltagy
Luke Zettlemoyer
Hannaneh Hajishirzi
VLM
12
62
0
19 Dec 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
491
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
240
456
0
24 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
40
181
0
30 Aug 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao
Adam Fisch
Danqi Chen
241
1,913
0
31 Dec 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,943
0
20 Apr 2018
1