
The Alignment Problem from a Deep Learning Perspective

Richard Ngo, Lawrence Chan, Sören Mindermann
30 August 2022

Papers citing "The Alignment Problem from a Deep Learning Perspective"

50 of 130 citing papers shown
An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
06 May 2025
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding, Cameron Domenico Kirk-Giannini
05 May 2025
Real-World Gaps in AI Governance Research
Ilan Strauss, Isobel Moure, Tim O'Reilly, Sruly Rosenblat
30 Apr 2025
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models
Thilo Hagendorff, Sarah Fabi
ReLM, ELM, LRM
14 Apr 2025
An Evaluation of Cultural Value Alignment in LLM
Nicholas Sukiennik, Chen Gao, Fengli Xu, Y. Li
11 Apr 2025
On the Robustness of GUI Grounding Models Against Image Attacks
Haoren Zhao, Tianyi Chen, Zhen Wang
AAML
07 Apr 2025
Representation Bending for Large Language Model Safety
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
AAML, ALM, KELM
02 Apr 2025
AI threats to national security can be countered through an incident regime
Alejandro Ortega
25 Mar 2025
Mixture of Experts Made Intrinsically Interpretable
Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, P. Dokania, Adel Bibi, Philip H. S. Torr
MoE
05 Mar 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
AAML
24 Feb 2025
Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
20 Jan 2025
Learning to Assist Humans without Inferring Rewards
Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan
17 Jan 2025
Measuring Error Alignment for Decision-Making Systems
Binxia Xu, Antonis Bikakis, Daniel Onah, A. Vlachidis, Luke Dickens
03 Jan 2025
Measuring Goal-Directedness
Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt
06 Dec 2024
The Two-Hop Curse: LLMs trained on A$\rightarrow$B, B$\rightarrow$C fail to learn A$\rightarrow$C
Mikita Balesni, Tomek Korbak, Owain Evans
ReLM, LRM
25 Nov 2024
Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
Frédéric Berdoz, Roger Wattenhofer
21 Nov 2024
Safety case template for frontier AI: A cyber inability argument
Arthur Goemans, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, Geoffrey Irving
12 Nov 2024
AI Ethics by Design: Implementing Customizable Guardrails for Responsible AI Development
Kristina Šekrst, Jeremy McHugh, Jonathan Rodriguez Cefalu
05 Nov 2024
Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge
Weihua Du, Qiushi Lyu, Jiaming Shan, Zhenting Qi, Hongxin Zhang, ..., Andi Peng, Tianmin Shu, Kwonjoon Lee, Behzad Dariush, Chuang Gan
04 Nov 2024
Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, ..., Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
ELM
29 Oct 2024
Fast Best-of-N Decoding via Speculative Rejection
Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter L. Bartlett, Andrea Zanette
BDL
26 Oct 2024
AI, Global Governance, and Digital Sovereignty
Swati Srivastava, Justin Bullock
23 Oct 2024
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
KELM, AIFin, LRM
17 Oct 2024
FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas
Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, G. Li, Philip H. S. Torr, Zhen Wu
14 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark
10 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan, Philip H. S. Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez
09 Oct 2024
Towards Measuring Goal-Directedness in AI Systems
Dylan Xu, Juan-Pablo Rivera
07 Oct 2024
Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
02 Oct 2024
TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer
07 Sep 2024
On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
06 Aug 2024
The Sociolinguistic Foundations of Language Modeling
Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, A. Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, Bodo Winter
12 Jul 2024
AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao
LM&MA
06 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, ..., Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
ELM
05 Jul 2024
ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions
Jingheng Ye, Yong Jiang, Xiaobin Wang, Yinghui Li, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang
01 Jul 2024
Towards shutdownable agents via stochastic choice
Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson
30 Jun 2024
Aligning Model Properties via Conformal Risk Control
William Overman, Jacqueline Jil Vallon, Mohsen Bayati
26 Jun 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
24 Jun 2024
It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao
12 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM
11 Jun 2024
Aligning Large Language Models with Representation Editing: A Control Perspective
Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
10 Jun 2024
Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning
Andreas Schlaginhaufen, Maryam Kamgarpour
OffRL
03 Jun 2024
Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David M. Krueger
29 May 2024
The Dual Imperative: Innovation and Regulation in the AI Era
Paulo Carvao
23 May 2024
LIRE: listwise reward enhancement for preference alignment
Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao
22 May 2024
Wav-KAN: Wavelet Kolmogorov-Arnold Networks
Zavareh Bozorgasl, Hao Chen
21 May 2024
Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn, Bilal Chughtai, Owain Evans
13 May 2024
People cannot distinguish GPT-4 from a human in a Turing test
Cameron R. Jones, Benjamin K. Bergen
ELM, DeLMO
09 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves
AI4CE
22 Apr 2024
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, A. Kalyan, Karthik Narasimhan, A. Deshpande, Bruno Castro da Silva
12 Apr 2024
The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
Noah Y. Siegel, Oana-Maria Camburu, N. Heess, Maria Perez-Ortiz
04 Apr 2024