Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2212.03827
Cited By
Discovering Latent Knowledge in Language Models Without Supervision
7 December 2022
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Discovering Latent Knowledge in Language Models Without Supervision"
50 / 267 papers shown
Title
Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering
Jessica Y. Bo
Tianyu Xu
Ishan Chatterjee
Katrina Passarella-Ward
Achin Kulshrestha
D Shin
LLMSV
66
0
0
07 May 2025
Multi-agents based User Values Mining for Recommendation
L. Chen
Wei Yuan
Tong Chen
Xiangyu Zhao
Nguyen Quoc Viet Hung
Hongzhi Yin
OffRL
37
0
0
02 May 2025
Towards Long Context Hallucination Detection
Siyi Liu
Kishaloy Halder
Zheng Qi
Wei Xiao
Nikolaos Pappas
Phu Mon Htut
Neha Anna John
Yassine Benajiba
Dan Roth
HILM
70
0
0
28 Apr 2025
Exploring How LLMs Capture and Represent Domain-Specific Knowledge
Mirian Hipolito Garcia
Camille Couturier
Daniel Madrigal Diaz
Ankur Mallick
Anastasios Kyrillidis
Robert Sim
Victor Rühle
Saravan Rajmohan
25
0
0
23 Apr 2025
Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation
Yunpu Zhao
Rui Zhang
Junbin Xiao
Ruibo Hou
Jiaming Guo
Zihao Zhang
Yifan Hao
Yunji Chen
21
0
0
21 Apr 2025
Functional Abstraction of Knowledge Recall in Large Language Models
Zijian Wang
Chang Xu
KELM
32
0
0
20 Apr 2025
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Yejun Yoon
Jaeyoon Jung
Seunghyun Yoon
Kunwoo Park
27
0
0
19 Apr 2025
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Andrew Lee
Lihao Sun
Chris Wendler
Fernanda Viégas
Martin Wattenberg
LRM
29
0
0
19 Apr 2025
Alleviating the Fear of Losing Alignment in LLM Fine-tuning
Kang Yang
Guanhong Tao
X. Chen
Jun Xu
31
0
0
13 Apr 2025
HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs
Sharanya Dasgupta
Sujoy Nath
Arkaprabha Basu
Pourya Shamsolmoali
Swagatam Das
HILM
58
0
0
13 Apr 2025
Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection
MingShan Liu
Shi Bo
Jialing Fang
LRM
22
0
0
13 Apr 2025
Robust Hallucination Detection in LLMs via Adaptive Token Selection
Mengjia Niu
Hamed Haddadi
Guansong Pang
HILM
53
0
0
10 Apr 2025
ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning
Zijian Wang
Chang Xu
LRM
21
1
0
09 Apr 2025
Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification
Anqi Zhang
Yulin Chen
Jane Pan
Chen Zhao
Aurojit Panda
Jinyang Li
He He
ReLM
LRM
32
2
0
07 Apr 2025
Among Us: A Sandbox for Agentic Deception
Satvik Golechha
Adrià Garriga-Alonso
LLMAG
44
2
0
05 Apr 2025
The quasi-semantic competence of LLMs: a case study on the part-whole relation
Mattia Proietti
Alessandro Lenci
38
0
0
03 Apr 2025
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Hongzhe Du
Weikai Li
Min Cai
Karim Saraipour
Zimin Zhang
Himabindu Lakkaraju
Yizhou Sun
Shichang Zhang
KELM
51
0
0
03 Apr 2025
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
Erfan Shayegani
G M Shahariar
Sara Abdali
Lei Yu
Nael B. Abu-Ghazaleh
Yue Dong
AAML
46
0
0
01 Apr 2025
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction
Yihuai Hong
Dian Zhou
Meng Cao
Lei Yu
Zhijing Jin
LRM
41
0
0
29 Mar 2025
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya
Yinhong Liu
Ramit Debnath
Anna Korhonen
30
0
0
22 Mar 2025
KoGNER: A Novel Framework for Knowledge Graph Distillation on Biomedical Named Entity Recognition
Heming Zhang
Wenyu Li
Di Huang
Yinjie Tang
Yixin Chen
Philip R. O. Payne
Fuhai Li
39
0
0
19 Mar 2025
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
Ziwei Ji
L. Yu
Yeskendir Koishekenov
Yejin Bang
Anthony Hartshorn
Alan Schelten
Cheng Zhang
Pascale Fung
Nicola Cancedda
46
1
0
18 Mar 2025
Don't lie to your friends: Learning what you know from collaborative self-play
Jacob Eisenstein
Reza Aghajani
Adam Fisch
Dheeru Dua
Fantine Huot
Mirella Lapata
Vicky Zayats
Jonathan Berant
66
0
0
18 Mar 2025
C^2 ATTACK: Towards Representation Backdoor on CLIP via Concept Confusion
Lijie Hu
Junchi Liao
Weimin Lyu
Shaopeng Fu
Tianhao Huang
Shu Yang
Guimin Hu
Di Wang
AAML
65
0
0
12 Mar 2025
Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States
Xin Wei Chia
Jonathan Pan
AAML
39
0
0
12 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Mingming Gong
Anton van den Hengel
Javen Qinfeng Shi
J. Shi
67
0
0
12 Mar 2025
SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs
Samir Abdaljalil
Hasan Kurban
Parichit Sharma
Erchin Serpedin
Rachad Atat
HILM
48
0
0
07 Mar 2025
Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs
Zara Siddique
Irtaza Khalid
Liam D. Turner
Luis Espinosa-Anke
LLMSV
56
0
0
07 Mar 2025
Personalize Your LLM: Fake it then Align it
Yijing Zhang
Dyah Adila
Changho Shin
Frederic Sala
86
0
0
02 Mar 2025
How to Steer LLM Latents for Hallucination Detection?
Seongheon Park
Xuefeng Du
Min-Hsuan Yeh
Haobo Wang
Yixuan Li
LLMSV
44
1
0
01 Mar 2025
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Tianyi Lorena Yan
Robin Jia
KELM
MU
46
0
0
27 Feb 2025
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Qianli Ma
Dongrui Liu
Qian Chen
Linfeng Zhang
Jing Shao
MoMe
56
0
0
24 Feb 2025
Is Free Self-Alignment Possible?
Dyah Adila
Changho Shin
Yijing Zhang
Frederic Sala
MoMe
105
2
0
24 Feb 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
100
0
0
24 Feb 2025
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschlager
Jannes Elstner
Simon Geisler
Vincent Cohen-Addad
Stephan Günnemann
Johannes Gasteiger
LLMSV
57
0
0
24 Feb 2025
Activation Steering in Neural Theorem Provers
Shashank Kirtania
LLMSV
67
0
0
21 Feb 2025
Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL
Wichayaporn Wongkamjan
Yanze Wang
Feng Gu
Denis Peskoff
Jonathan K. Kummerfeld
Jonathan May
Jordan Boyd-Graber
44
0
0
18 Feb 2025
Exploring Representations and Interventions in Time Series Foundation Models
Michał Wiliński
Mononito Goswami
Nina Żukowska
Willa Potosnak
Artur Dubrawski
AI4TS
55
0
0
17 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
L. Zhang
Lijie Hu
Di Wang
LRM
83
0
0
17 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
X. Wang
Yan Hu
Wenyu Du
Reynold Cheng
Benyou Wang
Difan Zou
51
0
0
17 Feb 2025
Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models
Xin Zhou
Yiwen Guo
Ruotian Ma
Tao Gui
Qi Zhang
Xuanjing Huang
LRM
81
2
0
13 Feb 2025
Refine Knowledge of Large Language Models via Adaptive Contrastive Learning
Yinghui Li
Haojing Huang
Jiayi Kuang
Yangning Li
Shu Guo
C. Qu
Xiaoyu Tan
Hai-Tao Zheng
Ying Shen
Philip S. Yu
CLL
66
5
0
11 Feb 2025
Efficient Knowledge Feeding to Language Models: A Novel Integrated Encoder-Decoder Architecture
Sachin Kumar
Rishi Gottimukkala
Supriya Devidutta
K. Spindler
RALM
KELM
3DV
42
0
0
07 Feb 2025
Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators
Dingkang Yang
Dongling Xiao
Jinjie Wei
Mingcheng Li
Zhaoyu Chen
Ke Li
L. Zhang
HILM
90
3
0
28 Jan 2025
An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models
Jaturong Kongmanee
34
1
0
28 Jan 2025
Risk-Aware Distributional Intervention Policies for Language Models
Bao Nguyen
Binh Nguyen
Duy Nguyen
V. Nguyen
28
1
0
28 Jan 2025
Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
55
1
0
20 Jan 2025
Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension
Yanbo Fang
Ruixiang Tang
ELM
28
0
0
03 Jan 2025
ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
Weilong Dong
Xinwei Wu
Renren Jin
Shaoyang Xu
Deyi Xiong
47
6
0
31 Dec 2024
ICLR: In-Context Learning of Representations
Core Francisco Park
Andrew Lee
Ekdeep Singh Lubana
Yongyi Yang
Maya Okawa
Kento Nishi
Martin Wattenberg
Hidenori Tanaka
AIFin
111
3
0
29 Dec 2024
1
2
3
4
5
6
Next