AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
12 December 2023

Papers citing "AI Control: Improving Safety Despite Intentional Subversion"

30 / 30 papers shown

The Steganographic Potentials of Language Models
Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero
06 May 2025

Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
Christian Schroeder de Witt
04 May 2025

Scaling Laws For Scalable Oversight
Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark
25 Apr 2025

Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris
14 Apr 2025

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
07 Apr 2025

Among Us: A Sandbox for Agentic Deception
Satvik Golechha, Adrià Garriga-Alonso
05 Apr 2025

A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management
Simeon Campos, Henry Papadatos, Fabien Roger, Chloé Touzet, Malcolm Murray, Otter Quarks
20 Feb 2025

A sketch of an AI control safety case
Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving
28 Jan 2025

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
22 Jan 2025

Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models
Cameron R. Jones, Benjamin Bergen
22 Dec 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, ..., Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
26 Nov 2024

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
Caspar Oesterheld, Emery Cooper, Miles Kodama, Linh Chi Nguyen, Ethan Perez
15 Nov 2024

Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, ..., Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
29 Oct 2024

Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
Tarun Raheja, Nilay Pochhi
09 Oct 2024

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan R. Cope, Nandi Schoots
02 Oct 2024

FlyAI -- The Next Level of Artificial Intelligence is Unpredictable! Injecting Responses of a Living Fly into Decision Making
Denys J. C. Matthies, Ruben Schlonsak, Hanzhi Zhuang, Rui Song
30 Sep 2024

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate
12 Sep 2024

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt
20 Jun 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
11 Jun 2024

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Joshua Clymer, Caden Juang, Severin Field
08 May 2024

Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder de Witt
12 Feb 2024

Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
30 Jan 2024

Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari
29 Jan 2024

Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, K. Wei, Lewis Hammond, ..., Nitarshan Rajkumar, David M. Krueger, Noam Kolt, Lennart Heim, Markus Anderljung
23 Jan 2024

Quantifying stability of non-power-seeking in artificial agents
Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna
07 Jan 2024

Scheming AIs: Will AIs fake alignment during training in order to get power?
Joe Carlsmith
14 Nov 2023

Preventing Language Models From Hiding Their Reasoning
Fabien Roger, Ryan Greenblatt
27 Oct 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022

Measuring Coding Challenge Competence With APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, ..., Collin Burns, Samir Puranik, Horace He, D. Song, Jacob Steinhardt
20 May 2021

AI safety via debate
G. Irving, Paul Christiano, Dario Amodei
02 May 2018