2212.03827
Discovering Latent Knowledge in Language Models Without Supervision
7 December 2022
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
Papers citing
"Discovering Latent Knowledge in Language Models Without Supervision"
50 / 267 papers shown
Title
Does Representation Matter? Exploring Intermediate Layers in Large Language Models
Oscar Skean
Md Rifat Arefin
Yann LeCun
Ravid Shwartz-Ziv
76
7
0
12 Dec 2024
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice
Philipp Alexander Kreer
Nathan Helm-Burger
Prithviraj Singh Shahani
Fedor Ryzhenkov
Jacob Haimes
Felix Hofstätter
Teun van der Weij
74
1
0
02 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Shusaku Sone
Masaya Taniguchi
Ana Brassard
Keisuke Sakaguchi
Kentaro Inui
ReLM
LRM
69
0
0
02 Dec 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
49
3
0
17 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch
Severin Field
Helen Yannakoudakis
Stephen Casper
26
1
0
02 Nov 2024
Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings
Yashvir S. Grewal
Edwin V. Bonilla
Thang D. Bui
UQCV
23
3
0
30 Oct 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
Emanuele Marconato
Sébastien Lachapelle
Sebastian Weichwald
Luigi Gresele
61
3
0
30 Oct 2024
Distinguishing Ignorance from Error in LLM Hallucinations
Adi Simhi
Jonathan Herzig
Idan Szpektor
Yonatan Belinkov
HILM
53
2
0
29 Oct 2024
Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models
Mohammad Beigi
Sijia Wang
Ying Shen
Zihao Lin
Adithya Kulkarni
...
Ming Jin
Jin-Hee Cho
Dawei Zhou
Chang-Tien Lu
Lifu Huang
21
1
0
26 Oct 2024
Leveraging the Domain Adaptation of Retrieval Augmented Generation Models for Question Answering and Reducing Hallucination
Salman Rakin
Md. A. R. Shibly
Zahin M. Hossain
Zeeshan Khan
Md. Mostofa Akbar
18
1
0
23 Oct 2024
DEAN: Deactivating the Coupled Neurons to Mitigate Fairness-Privacy Conflicts in Large Language Models
Chen Qian
Dongrui Liu
Jie Zhang
Yong Liu
Jing Shao
24
1
0
22 Oct 2024
Chatting with Bots: AI, Speech Acts, and the Edge of Assertion
Iwan Williams
Tim Bayne
24
1
0
22 Oct 2024
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Denitsa Saynova
Lovisa Hagström
Moa Johansson
Richard Johansson
Marco Kuhlmann
HILM
34
0
0
18 Oct 2024
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
44
3
0
18 Oct 2024
Enhancing Fact Retrieval in PLMs through Truthfulness
Paul Youssef
Jorg Schlotterer
C. Seifert
KELM
HILM
20
0
0
17 Oct 2024
Anchored Alignment for Self-Explanations Enhancement
Luis Felipe Villa-Arenas
Ata Nizamoglu
Qianli Wang
Sebastian Möller
Vera Schmitt
19
0
0
17 Oct 2024
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Troy Mallen
Nora Belrose
25
1
0
17 Oct 2024
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation
Yiming Wang
Pei Zhang
Baosong Yang
Derek F. Wong
Rui-cang Wang
LRM
40
4
0
17 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
49
13
0
15 Oct 2024
Safety-Aware Fine-Tuning of Large Language Models
Hyeong Kyu Choi
Xuefeng Du
Yixuan Li
35
10
0
13 Oct 2024
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models
Zheng Yi Ho
Siyuan Liang
Sen Zhang
Yibing Zhan
Dacheng Tao
26
1
0
11 Oct 2024
Chip-Tuning: Classify Before Language Models Say
Fangwei Zhu
Dian Li
Jiajun Huang
Gang Liu
Hui Wang
Zhifang Sui
23
0
0
09 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
39
12
0
08 Oct 2024
Intuitions of Compromise: Utilitarianism vs. Contractualism
Jared Moore
Yejin Choi
Sydney Levine
21
0
0
07 Oct 2024
Evaluating Language Model Character Traits
Francis Rhys Ward
Zejia Yang
Alex Jackson
Randy Brown
Chandler Smith
Grace Colverd
Louis Thomson
Raymond Douglas
Patrik Bartak
Andrew Rowan
32
0
0
05 Oct 2024
Understanding Reasoning in Chain-of-Thought from the Hopfieldian View
Lijie Hu
Liang Liu
Shu Yang
Xin Chen
Zhen Tan
Muhammad Asif Ali
Mengdi Li
Di Wang
LRM
39
1
0
04 Oct 2024
FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
Deema Alnuhait
Neeraja Kirtane
Muhammad Khalifa
Hao Peng
HILM
LRM
34
2
0
03 Oct 2024
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad
Michael Toker
Zorik Gekhman
Roi Reichart
Idan Szpektor
Hadas Kotek
Yonatan Belinkov
HILM
AIFin
46
24
0
03 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
63
1
0
03 Oct 2024
Integrative Decoding: Improve Factuality via Implicit Self-consistency
Yi Cheng
Xiao Liang
Yeyun Gong
Wen Xiao
Song Wang
...
Wenjie Li
Jian Jiao
Qi Chen
Peng Cheng
Wayne Xiong
HILM
50
1
0
02 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee
Shaona Ghosh
Traian Rebedea
Christopher Parisien
LLMSV
21
2
0
02 Oct 2024
Attention layers provably solve single-location regression
P. Marion
Raphael Berthier
Gérard Biau
Claire Boyer
49
2
0
02 Oct 2024
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du
Reshmi Ghosh
Robert Sim
Ahmed Salem
Vitor Carvalho
Emily Lawton
Yixuan Li
Jack W. Stokes
VLM
AAML
32
5
0
01 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
53
9
0
30 Sep 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
57
2
0
30 Sep 2024
A Survey on the Honesty of Large Language Models
Siheng Li
Cheng Yang
Taiqiang Wu
Chufan Shi
Yuji Zhang
...
Jie Zhou
Yujiu Yang
Ngai Wong
Xixin Wu
Wai Lam
HILM
27
4
0
27 Sep 2024
HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection
Xuefeng Du
Chaowei Xiao
Yixuan Li
HILM
27
16
0
26 Sep 2024
Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
Van-Cuong Pham
Thien Huu Nguyen
LLMSV
33
3
0
16 Sep 2024
HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making
Sumera Anjum
Hanzhi Zhang
Wenjun Zhou
Eun Jin Paek
Xiaopeng Zhao
Yunhe Feng
15
1
0
16 Sep 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
44
2
0
16 Sep 2024
On the Relationship between Truth and Political Bias in Language Models
S. Fulay
William Brannon
Shrestha Mohanty
Cassandra Overney
Elinor Poole-Dayan
Deb Roy
Jad Kabbara
HILM
22
1
0
09 Sep 2024
Modularity in Transformers: Investigating Neuron Separability & Specialization
Nicholas Pochinkov
Thomas Jones
Mohammed Rashidur Rahman
25
0
0
30 Aug 2024
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
ALM
50
5
0
21 Aug 2024
A Little Confidence Goes a Long Way
J. Scoville
Shang Gao
Devanshu Agrawal
Javed Qadrud-Din
19
0
0
20 Aug 2024
Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused
Dingwei Chen
Feiteng Fang
Shiwen Ni
Feng Liang
Ruifeng Xu
Min Yang
Chengming Li
HILM
14
1
0
16 Aug 2024
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shiping Liu
Kecheng Zheng
Wei Chen
MLLM
41
33
0
31 Jul 2024
Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito
Sharan Maiya
Grégoire Dhimoïla
Owen Yeung
Kaarel Hänni
27
2
0
26 Jul 2024
Internal Consistency and Self-Feedback in Large Language Models: A Survey
Xun Liang
Shichao Song
Zifan Zheng
Hanyu Wang
Qingchen Yu
...
Rong-Hua Li
Peng Cheng
Zhonghao Wang
Feiyu Xiong
Zhiyu Li
HILM
LRM
56
24
0
19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
34
1
0
18 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
84
16
0
17 Jul 2024