Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt
7 December 2022

Papers citing "Discovering Latent Knowledge in Language Models Without Supervision"

Showing 50 of 267 citing papers.

States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly
Junhao Chen, Shengding Hu, Zhiyuan Liu, Maosong Sun
LRM · 16 Jul 2024

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, James R. Glass
HILM · 09 Jul 2024

Truth is Universal: Robust Detection of Lies in LLMs
Lennart Bürger, Fred Hamprecht, B. Nadler
HILM · 03 Jul 2024

Belief Revision: The Adaptability of Large Language Models Reasoning
Bryan Wilie, Samuel Cahyawijaya, Etsuko Ishii, Junxian He, Pascale Fung
KELM, LRM · 28 Jun 2024

Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng, Stuart Russell, Jacob Steinhardt
HILM · 27 Jun 2024

Confidence Regulation Neurons in Language Models
Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
24 Jun 2024

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth A. Malik, Yarin Gal
HILM · 22 Jun 2024

Understanding Finetuning for Factual Knowledge Extraction
Gaurav R. Ghosal, Tatsunori Hashimoto, Aditi Raghunathan
20 Jun 2024

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators
Matéo Mahaut, Laura Aina, Paula Czarnowska, Momchil Hardalov, Thomas Müller, Lluís Marquez
HILM · 19 Jun 2024

BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu
HILM, ALM · 19 Jun 2024

Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding
Xin Liu, Farima Fatahi Bayat, Lu Wang
19 Jun 2024

Locating and Extracting Relational Concepts in Large Language Models
Zijian Wang, Britney White, Chang Xu
KELM · 19 Jun 2024

A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning
Lijie Hu, Liang Liu, Shu Yang, Xin Chen, Hongru Xiao, Mengdi Li, Pan Zhou, Muhammad Asif Ali, Di Wang
LRM · 18 Jun 2024

Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
LLMSV · 17 Jun 2024

InternalInspector $I^2$: Robust Confidence Estimation in LLMs through Internal States
Mohammad Beigi, Ying Shen, Runing Yang, Zihao Lin, Qifan Wang, Ankith Mohan, Jianfeng He, Ming Jin, Chang-Tien Lu, Lifu Huang
HILM · 17 Jun 2024

Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
17 Jun 2024

Large Language Models Must Be Taught to Know What They Don't Know
Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, Andrew Gordon Wilson
12 Jun 2024

Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng-Wei Zhang, Wenqiang Lei
12 Jun 2024

REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy
Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung
HILM · 11 Jun 2024

Estimating the Hallucination Rate of Generative AI
Andrew Jesson, Nicolas Beltran-Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P. Cunningham, David M. Blei
11 Jun 2024

PaCE: Parsimonious Concept Engineering for Large Language Models
Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, D. Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
CVBM · 06 Jun 2024

Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
Yoann Poupart
06 Jun 2024

Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Dyah Adila, Shuai Zhang, Boran Han, Yuyang Wang
AAML, LLMSV · 05 Jun 2024

Cycles of Thought: Measuring LLM Confidence through Stable Explanations
Evan Becker, Stefano Soatto
05 Jun 2024

To Believe or Not to Believe Your LLM
Yasin Abbasi-Yadkori, Ilja Kuzborskij, András György, Csaba Szepesvári
UQCV · 04 Jun 2024

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept
Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, K. Johnson, Jiliang Tang, Rongrong Wang
LRM · 04 Jun 2024

Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig
03 Jun 2024

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Andis Draguns, Andrew Gritsevskiy, S. Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
03 Jun 2024

The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch
03 Jun 2024

Standards for Belief Representations in LLMs
Daniel A. Herrmann, B. Levinstein
31 May 2024

Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
Jialiang Xu, Michael Moor, J. Leskovec
29 May 2024

CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control
Huanshuo Liu, Hao Zhang, Zhijiang Guo, Kuicai Dong, Xiangyang Li, Yi Quan Lee, Cong Zhang, Yong-jin Liu
3DV · 29 May 2024

Efficient Model-agnostic Alignment via Bayesian Persuasion
Fengshuo Bai, Mingzhi Wang, Zhaowei Zhang, Boyuan Chen, Yinda Xu, Ying Wen, Yaodong Yang
29 May 2024

Calibrating Reasoning in Language Models with Internal Consistency
Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li
LRM · 29 May 2024

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
G. Yona, Roee Aharoni, Mor Geva
HILM · 27 May 2024

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
LLMSV · 26 May 2024

Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
Jiaxu Liu, Xiangyu Yin, Sihao Wu, Jianhong Wang, Meng Fang, Xinping Yi, Xiaowei Huang
21 May 2024

Spectral Editing of Activations for Large Language Model Alignment
Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, E. Ponti, Shay B. Cohen
KELM, LLMSV · 15 May 2024

Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn, Bilal Chughtai, Owain Evans
13 May 2024

An Assessment of Model-On-Model Deception
Julius Heitkoetter, Michael Gerovitch, Laker Newhouse
10 May 2024

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Zorik Gekhman, G. Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig
09 May 2024

Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Yeqi Gao, Yuzhou Gu, Zhao-quan Song
09 May 2024

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Joshua Clymer, Caden Juang, Severin Field
CVBM · 08 May 2024

A Causal Explainable Guardrails for Large Language Models
Zhixuan Chu, Yan Wang, Longfei Li, Zhibo Wang, Zhan Qin, Kui Ren
LLMSV · 07 May 2024

Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression
Farima Fatahi Bayat, Xin Liu, H. V. Jagadish, Lu Wang
HILM, KELM · 01 May 2024

Truth-value judgment in language models: belief directions are context sensitive
Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen
KELM · 29 Apr 2024

Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach
Linyu Liu, Yu Pan, Xiaocheng Li, Guanting Chen
24 Apr 2024

Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves
AI4CE · 22 Apr 2024

Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction
Qinyuan Wu, Mohammad Aflah Khan, Soumi Das, Vedant Nanda, Bishwamittra Ghosh, Camila Kolling, Till Speicher, Laurent Bindschaedler, Krishna P. Gummadi, Evimaria Terzi
KELM · 19 Apr 2024

Towards Logically Consistent Language Models via Probabilistic Reasoning
Diego Calanzone, Stefano Teso, Antonio Vergari
LRM, HILM · 19 Apr 2024