Mitigating Deceptive Alignment via Self-Monitoring

24 May 2025
Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang
LRM
ArXiv (abs) · PDF · HTML

Papers citing "Mitigating Deceptive Alignment via Self-Monitoring"

10 / 10 papers shown
SafeLawBench: Towards Safe Alignment of Large Language Models
Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Z. Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo
AILaw · ALM · ELM
22 · 0 · 0
07 Jun 2025

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Yichen Wu, Xudong Pan, Geng Hong, Min Yang
LLMAG
73 · 3 · 0
18 Apr 2025

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi
LRM
186 · 38 · 0
14 Mar 2025

Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models
Yihang Chen, Haikang Deng, Kaiqiao Han, Qingyue Zhao
LRM
143 · 1 · 0
14 Mar 2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
HILM · ALM
173 · 3 · 0
05 Mar 2025

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Yue Liu, Bill Yuchen Lin, Radha Poovendran
KELM · ELM · LRM
157 · 28 · 0
17 Feb 2025

Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi
ALM
151 · 8 · 0
16 Feb 2025

International AI Safety Report
Yoshua Bengio, Sören Mindermann, Daniel Privitera, T. Besiroglu, Rishi Bommasani, ..., Ciarán Seoighe, Jerry Sheehan, Haroon Sheikh, Denise Wong, Yi Zeng
107 · 27 · 0
29 Jan 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, ..., Shiyu Wang, S. Yu, Shunfeng Zhou, Shuting Pan, S.S. Li
ReLM · VLM · OffRL · AI4TS · LRM
392 · 2,024 · 0
22 Jan 2025

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM
89 · 31 · 0
11 Jun 2024