2503.11926
Cited By
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
14 March 2025
Bowen Baker
Joost Huizinga
Leo Gao
Zehao Dou
M. Guan
Aleksander Mądry
Wojciech Zaremba
J. Pachocki
David Farhi
LRM
Papers citing "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation"
9 / 9 papers shown
Title
Reasoning Models Don't Always Say What They Think
Yanda Chen
Joe Benton
Ansh Radhakrishnan
Jonathan Uesato
Carson E. Denison
...
Vlad Mikulik
Samuel R. Bowman
Jan Leike
Jared Kaplan
E. Perez
ReLM
LRM
62
7
1
08 May 2025
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Guanghao Zhou
Panjia Qiu
C. L. P. Chen
J. Wang
Zheming Yang
Jian Xu
Minghui Qiu
OffRL
LRM
53
0
0
30 Apr 2025
IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
Aniketh Garikaparthi
Manasi S. Patwardhan
L. Vig
Arman Cohan
VLM
LRM
52
0
0
23 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
26
0
0
10 Apr 2025
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak
Mikita Balesni
Buck Shlegeris
Geoffrey Irving
ELM
19
1
0
07 Apr 2025
Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Pedro Ferreira
Wilker Aziz
Ivan Titov
LRM
23
0
0
07 Apr 2025
Inference-Time Scaling for Generalist Reward Modeling
Zijun Liu
P. Wang
R. Xu
Shirong Ma
Chong Ruan
Peng Li
Yang Janet Liu
Y. Wu
OffRL
LRM
44
9
0
03 Apr 2025
Probabilistic Uncertain Reward Model
Wangtao Sun
Xiang Cheng
Xing Yu
Haotian Xu
Zhao Yang
Shizhu He
Jun Zhao
Kang Liu
56
0
0
28 Mar 2025
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
Iván Arcuschin
Jett Janiak
Robert Krzyzanowski
Senthooran Rajamanoharan
Neel Nanda
Arthur Conmy
LRM
ReLM
54
6
0
11 Mar 2025