Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak
Mikita Balesni
Elizabeth Barnes
Yoshua Bengio
Joe Benton
Joseph Bloom
Mark Chen
Alan Cooney
Allan Dafoe
Anca Dragan
Scott Emmons
Owain Evans
David Farhi
Ryan Greenblatt
Dan Hendrycks
Marius Hobbhahn
Evan Hubinger
Geoffrey Irving
Erik Jenner
Daniel Kokotajlo
Victoria Krakovna
Shane Legg
David Lindner
David Luan
Aleksander Mądry
Julian Michael
Neel Nanda
Dave Orr
Jakub Pachocki
Ethan Perez
Mary Phuong
Fabien Roger
Joshua Saxe
Buck Shlegeris
Martín Soto
Eric Steinberger
Jasmine Wang
Wojciech Zaremba
Bowen Baker
Rohin Shah
Vlad Mikulik
Abstract
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
