Mitigating Deceptive Alignment via Self-Monitoring
arXiv 2505.18807 · 24 May 2025
Authors: Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang
Tags: LRM
Papers citing "Mitigating Deceptive Alignment via Self-Monitoring" (10 of 10 papers shown)
| Title | Authors | Tags | Date |
| --- | --- | --- | --- |
| SafeLawBench: Towards Safe Alignment of Large Language Models | Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Z. Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo | AILaw, ALM, ELM | 07 Jun 2025 |
| OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation | Yichen Wu, Xudong Pan, Geng Hong, Min Yang | LLMAG | 18 Apr 2025 |
| Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation | Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi | LRM | 14 Mar 2025 |
| Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models | Yihang Chen, Haikang Deng, Kaiqiao Han, Qingyue Zhao | LRM | 14 Mar 2025 |
| The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems | Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks | HILM, ALM | 05 Mar 2025 |
| SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Yue Liu, Bill Yuchen Lin, Radha Poovendran | KELM, ELM, LRM | 17 Feb 2025 |
| Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? | Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi | ALM | 16 Feb 2025 |
| International AI Safety Report | Yoshua Bengio, Sören Mindermann, Daniel Privitera, T. Besiroglu, Rishi Bommasani, ..., Ciarán Seoighe, Jerry Sheehan, Haroon Sheikh, Denise Wong, Yi Zeng | | 29 Jan 2025 |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, ..., Shiyu Wang, S. Yu, Shunfeng Zhou, Shuting Pan, S.S. Li | ReLM, VLM, OffRL, AI4TS, LRM | 22 Jan 2025 |
| AI Sandbagging: Language Models can Strategically Underperform on Evaluations | Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward | ELM | 11 Jun 2024 |