Discovering Latent Knowledge in Language Models Without Supervision

7 December 2022
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt

Papers citing "Discovering Latent Knowledge in Language Models Without Supervision"

50 / 267 papers shown
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
Adi Simhi
Jonathan Herzig
Idan Szpektor
Yonatan Belinkov
HILM
39
10
0
15 Apr 2024
GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications
Shishir G. Patil
Tianjun Zhang
Vivian Fang
Noppapon C.
Roy Huang
Aaron Hao
Martin Casado
Joseph E. Gonzalez
Raluca Ada Popa
Ion Stoica
ALM
24
9
0
10 Apr 2024
Does Transformer Interpretability Transfer to RNNs?
Gonçalo Paulo
Thomas Marshall
Nora Belrose
41
6
0
09 Apr 2024
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
Derui Zhu
Dingfan Chen
Qing Li
Zongxiong Chen
Lei Ma
Jens Grossklags
Mario Fritz
HILM
35
8
0
06 Apr 2024
The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
Noah Y. Siegel
Oana-Maria Camburu
N. Heess
Maria Perez-Ortiz
18
8
0
04 Apr 2024
Can multiple-choice questions really be useful in detecting the abilities of LLMs?
Wangyue Li
Liangzhi Li
Tong Xiang
Xiao Liu
Wei Deng
Noa Garcia
ELM
34
28
0
26 Mar 2024
Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Erik Miehling
Manish Nagireddy
P. Sattigeri
Elizabeth M. Daly
David Piorkowski
John T. Richards
ALM
19
11
0
22 Mar 2024
Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases
Jiarui Li
Ye Yuan
Zehua Zhang
RALM
20
43
0
15 Mar 2024
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Qinyu Zhao
Ming Xu
Kartik Gupta
Akshay Asthana
Liang Zheng
Stephen Gould
29
7
0
14 Mar 2024
From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification
Fei Wang
Chao Shang
Sarthak Jain
Shuai Wang
Qiang Ning
Bonan Min
Vittorio Castelli
Yassine Benajiba
Dan Roth
ALM
22
7
0
10 Mar 2024
On the Origins of Linear Representations in Large Language Models
Yibo Jiang
Goutham Rajendran
Pradeep Ravikumar
Bryon Aragam
Victor Veitch
59
24
0
06 Mar 2024
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Chao Qian
Jie M. Zhang
Wei Yao
Dongrui Liu
Zhen-fei Yin
Yu Qiao
Yong Liu
Jing Shao
LLMSV
LRM
42
13
0
29 Feb 2024
Language Models Represent Beliefs of Self and Others
Wentao Zhu
Zhining Zhang
Yizhou Wang
MILM
LRM
33
7
0
28 Feb 2024
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension
Fan Yin
Jayanth Srinivasa
Kai-Wei Chang
HILM
52
19
0
28 Feb 2024
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
Shaolei Zhang
Tian Yu
Yang Feng
HILM
KELM
21
39
0
27 Feb 2024
FairBelief -- Assessing Harmful Beliefs in Language Models
Mattia Setzu
Marta Marchiori Manerba
Pasquale Minervini
Debora Nozza
16
0
0
27 Feb 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch
Phillip Guo
Aidan Ewart
Stephen Casper
Dylan Hadfield-Menell
ELM
MU
35
55
0
26 Feb 2024
Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning
Hanqi Yan
Qinglin Zhu
Xinyu Wang
Lin Gui
Yulan He
LRM
LLMAG
24
4
0
22 Feb 2024
A Language Model's Guide Through Latent Space
Dimitri von Rütte
Sotiris Anagnostidis
Gregor Bachmann
Thomas Hofmann
27
21
0
22 Feb 2024
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
Zongru Wu
Zhuosheng Zhang
Pengzhou Cheng
Gongshen Liu
AAML
28
4
0
19 Feb 2024
Uncovering Latent Human Wellbeing in Language Model Embeddings
Pedro Freire
ChengCheng Tan
Adam Gleave
Dan Hendrycks
Scott Emmons
22
1
0
19 Feb 2024
Chain-of-Thought Reasoning Without Prompting
Xuezhi Wang
Denny Zhou
ReLM
LRM
144
97
0
15 Feb 2024
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation
Xiaoying Zhang
Baolin Peng
Ye Tian
Jingyan Zhou
Lifeng Jin
Linfeng Song
Haitao Mi
Helen Meng
HILM
28
42
0
14 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran
Simon Buchholz
Bryon Aragam
Bernhard Schölkopf
Pradeep Ravikumar
AI4CE
83
19
0
14 Feb 2024
TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
Hui Liu
Wenya Wang
Haoru Li
Haoliang Li
39
3
0
12 Feb 2024
Opening the AI black box: program synthesis via mechanistic interpretability
Eric J. Michaud
Isaac Liao
Vedang Lad
Ziming Liu
Anish Mudide
Chloe Loughridge
Zifan Carl Guo
Tara Rezaei Kheirkhah
Mateja Vukelić
Max Tegmark
18
12
0
07 Feb 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
55
78
0
07 Feb 2024
Challenges in Mechanistically Interpreting Model Representations
Satvik Golechha
James Dao
35
3
0
06 Feb 2024
Distinguishing the Knowable from the Unknowable with Language Models
Gustaf Ahdritz
Tian Qin
Nikhil Vyas
Boaz Barak
Benjamin L. Edelman
16
18
0
05 Feb 2024
Prospects for inconsistency detection using large language models and sheaves
Steve Huntsman
Michael Robinson
Ludmilla Huntsman
18
4
0
30 Jan 2024
Learning to Trust Your Feelings: Leveraging Self-awareness in LLMs for Hallucination Mitigation
Yuxin Liang
Zhuoyang Song
Hao Wang
Jiaxing Zhang
HILM
23
28
0
27 Jan 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper
Carson Ezell
Charlotte Siegmann
Noam Kolt
Taylor Lynn Curtis
...
Michael Gerovitch
David Bau
Max Tegmark
David M. Krueger
Dylan Hadfield-Menell
AAML
13
75
0
25 Jan 2024
Can AI Assistants Know What They Don't Know?
Qinyuan Cheng
Tianxiang Sun
Xiangyang Liu
Wenwei Zhang
Zhangyue Yin
Shimin Li
Linyang Li
Zhengfu He
Kai Chen
Xipeng Qiu
29
23
0
24 Jan 2024
GRATH: Gradual Self-Truthifying for Large Language Models
Weixin Chen
D. Song
Bo-wen Li
HILM
SyDa
16
5
0
22 Jan 2024
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
92
37
0
22 Jan 2024
Quantifying stability of non-power-seeking in artificial agents
Evan Ryan Gunter
Yevgeny Liokumovich
Victoria Krakovna
13
1
0
07 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
64
95
0
03 Jan 2024
Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning
Zhongzhi Chen
Xingwu Sun
Xianfeng Jiao
Fengzong Lian
Zhanhui Kang
Di Wang
Cheng-zhong Xu
HILM
24
27
0
29 Dec 2023
LLM Factoscope: Uncovering LLMs' Factual Discernment through Inner States Analysis
Jinwen He
Yujia Gong
Kai-xiang Chen
Zijin Lin
Cheng'an Wei
Yue Zhao
11
3
0
27 Dec 2023
Challenges with unsupervised LLM knowledge discovery
Sebastian Farquhar
Vikrant Varma
Zachary Kenton
Johannes Gasteiger
Vladimir Mikulik
Rohin Shah
24
23
0
15 Dec 2023
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns
Pavel Izmailov
Jan Hendrik Kirchner
Bowen Baker
Leo Gao
...
Adrien Ecoffet
Manas Joglekar
Jan Leike
Ilya Sutskever
Jeff Wu
ELM
23
254
0
14 Dec 2023
Alignment for Honesty
Yuqing Yang
Ethan Chern
Xipeng Qiu
Graham Neubig
Pengfei Liu
23
27
0
12 Dec 2023
Weakly Supervised Detection of Hallucinations in LLM Activations
Miriam Rateike
C. Cintas
John Wamburu
Tanya Akumu
Skyler Speakman
10
11
0
05 Dec 2023
Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
Randall Balestriero
Romain Cosentino
Sarath Shekkizhar
17
2
0
04 Dec 2023
Eliciting Latent Knowledge from Quirky Language Models
Alex Troy Mallen
Madeline Brumley
Julia Kharchenko
Nora Belrose
HILM
RALM
KELM
11
25
0
02 Dec 2023
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
P. Bricman
11
0
0
01 Dec 2023
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Kevin Liu
Stephen Casper
Dylan Hadfield-Menell
Jacob Andreas
HILM
52
35
0
27 Nov 2023
SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification
Difan Jiao
Yilun Liu
Zhenwei Tang
Daniel Matter
Jürgen Pfeffer
Ashton Anderson
17
1
0
27 Nov 2023
Rescue: Ranking LLM Responses with Partial Ordering to Improve Response Generation
Yikun Wang
Rui Zheng
Haoming Li
Qi Zhang
Tao Gui
Fei Liu
OffRL
19
3
0
15 Nov 2023
Towards Evaluating AI Systems for Moral Status Using Self-Reports
Ethan Perez
Robert Long
ELM
18
8
0
14 Nov 2023