1606.06565
Cited By
v1
v2 (latest)
Concrete Problems in AI Safety
21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané
Papers citing
"Concrete Problems in AI Safety"
50 / 1,379 papers shown
Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges
Christian Bluethgen
Dave Van Veen
Daniel Truhn
Jakob Nikolas Kather
Michael Moor
...
Akshay S. Chaudhari
Thomas Frauenfelder
C. Langlotz
Michael Krauthammer
Farhad Nooralahzadeh
LM&MA
AI4CE
285
0
0
10 Oct 2025
Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B
Nisar Ahmed
Muhammad Imran Zaman
Gulshan Saleem
Ali Hassan
LRM
125
0
0
08 Oct 2025
Label Semantics for Robust Hyperspectral Image Classification
Rafin Hassan
Zarin Tasnim Roshni
Rafiqul Bari
Alimul Islam
Nabeel Mohammed
Moshiur Farazi
Shafin Rahman
VLM
118
1
0
08 Oct 2025
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
Siwei Han
Jiaqi Liu
Yaofeng Su
Wenbo Duan
Xinyuan Liu
Cihang Xie
Mohit Bansal
Mingyu Ding
Linjun Zhang
Huaxiu Yao
142
1
0
06 Oct 2025
HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model
Peter Van Katwyk
Karianne J. Bergen
196
0
0
06 Oct 2025
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Radha Gulhane
Sathish Reddy Indurthi
OffRL
LRM
92
0
0
06 Oct 2025
Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Yunghwei Lai
Kaiming Liu
Ziyue Wang
Weizhi Ma
Yang Liu
LM&MA
152
1
0
05 Oct 2025
Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
Santhosh Kumar Ravindran
161
0
0
05 Oct 2025
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Wenhao Deng
Long Wei
Chenglei Yu
Tailin Wu
OffRL
ReLM
LRM
272
2
0
04 Oct 2025
LegalSim: Multi-Agent Simulation of Legal Systems for Discovering Procedural Exploits
Sanket Badhe
AILaw
164
1
0
03 Oct 2025
Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
Antoine Maier
Aude Maier
Tom David
100
0
0
03 Oct 2025
Reward Models are Metrics in a Trench Coat
Sebastian Gehrmann
147
0
0
03 Oct 2025
Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu
Hao Xu
Xuhong Chen
Wei Chen
Yee Whye Teh
Ning Miao
ReLM
LRM
AI4CE
278
0
0
02 Oct 2025
Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
Isha Gupta
Rylan Schaeffer
Joshua Kazdan
Katja Filippova
Sanmi Koyejo
OOD
AAML
290
1
0
01 Oct 2025
Alignment-Aware Decoding
Frédéric Berdoz
Luca A. Lanzendörfer
René Caky
Roger Wattenhofer
164
0
0
30 Sep 2025
When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
Zeshi Dai
Zimo Peng
Zerui Cheng
Ryan Yihe Li
AAML
AIFin
ELM
101
1
0
30 Sep 2025
Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks
Peiran Xu
Ruoyao Xiao
Xiaoying Xing
Guannan Zhang
Debiao Li
Kunyu Shi
OffRL
LRM
117
2
0
29 Sep 2025
VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion
Kargi Chauhan
Leilani H. Gilpin
91
0
0
28 Sep 2025
On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization
Janvijay Singh
Austin Xu
Yilun Zhou
Yefan Zhou
Dilek Hakkani-Tur
Shafiq Joty
ELM
123
1
0
28 Sep 2025
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
Simon Schrodi
Elias Kempf
Fazl Barez
Thomas Brox
FedML
135
1
0
28 Sep 2025
Causally-Enhanced Reinforcement Policy Optimization
Xiangqi Wang
Yue Huang
Yujun Zhou
Xiaonan Luo
Kehan Guo
Xiangliang Zhang
OffRL
LRM
213
0
0
27 Sep 2025
Enhancing Blind Face Restoration through Online Reinforcement Learning
Bin Wu
Yahui Liu
Chi Zhang
Yao-Min Zhao
Wei Wang
CVBM
OffRL
CLL
OnRL
432
0
0
27 Sep 2025
Learnable Conformal Prediction with Context-Aware Nonconformity Functions for Robotic Planning and Perception
Divake Kumar
Sina Tayebati
Francesco Migliarba
Ranganath Krishnan
A. R. Trivedi
145
1
0
26 Sep 2025
MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
Yuki Ichihara
Yuu Jinnai
Tetsuro Morimura
Mitsuki Sakamoto
Ryota Mitsuhashi
Eiji Uchibe
167
2
0
26 Sep 2025
Limitations on Safe, Trusted, Artificial General Intelligence
Rina Panigrahy
Willie Neiswanger
109
0
0
25 Sep 2025
Failure Modes of Maximum Entropy RLHF
Ömer Veysel Çağatan
Barış Akgün
120
0
0
24 Sep 2025
Responsible AI Technical Report
Soonmin Bae
Wanjin Park
Jeongyeop Kim
Yunjin Park
Jungwon Yoon
...
Sujin Kim
Youngchol Kim
Somin Lee
Wonyoung Lee
Minsung Noh
195
0
0
24 Sep 2025
SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer
Yarden As
Chengrui Qu
Benjamin Unger
Dongho Kang
Max van der Hart
Laixi Shi
Stelian Coros
Adam Wierman
Andreas Krause
OffRL
330
0
0
23 Sep 2025
Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems
Birk Torpmann-Hagen
Pål Halvorsen
Michael A. Riegler
Dag Johansen
117
0
0
23 Sep 2025
The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind
Caleb DeLeeuw
Gaurav Chawla
Aniket Sharma
Vanessa Dietze
111
1
0
23 Sep 2025
FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Debarpan Bhattacharya
Apoorva Kulkarni
Sriram Ganapathy
276
0
0
20 Sep 2025
The Alignment Bottleneck
Wenjun Cao
224
0
0
19 Sep 2025
Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
Erblin Isaku
C. Gomes
Shaukat Ali
Beatriz Sanguino
Tongtong Wang
Guoyuan Li
Houxiang Zhang
Thomas Peyrucain
179
2
0
16 Sep 2025
Secure Human Oversight of AI: Exploring the Attack Surface of Human Oversight
Jonas C. Ditz
Veronika Lazar
Elmar Lichtmeß
Carola Plesch
Matthias Heck
Kevin Baum
Markus Langer
AAML
187
0
0
15 Sep 2025
CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI
Hasin Jawad Ali
Ilhamul Azam
Ajwad Abrar
Md. Kamrul Hasan
H. Mahmud
89
0
0
14 Sep 2025
Mutual Information Tracks Policy Coherence in Reinforcement Learning
Cameron Reid
Wael Hafez
Amirhossein Nazeri
127
0
0
12 Sep 2025
Interpretability as Alignment: Making Internal Understanding a Design Principle
Aadit Sengupta
Pratinav Seth
Vinay Kumar Sankarapu
AI4CE
AAML
142
0
0
10 Sep 2025
Symmetry-Guided Multi-Agent Inverse Reinforcement Learning
Yongkai Tian
Yirong Qi
Xin Yu
Wenjun Wu
Jie Luo
163
1
0
10 Sep 2025
ACE and Diverse Generalization via Selective Disagreement
Oliver Daniels
Stuart Armstrong
Alexandre Maranhao
Mahirah Fairuz Rahman
Benjamin M. Marlin
Rebecca Gorman
OODD
242
0
0
09 Sep 2025
Collaborate, Deliberate, Evaluate: How LLM Alignment Affects Coordinated Multi-Agent Outcomes
Abhijnan Nath
Carine Graff
Nikhil Krishnaswamy
LLMAG
161
3
0
07 Sep 2025
Murphys Laws of AI Alignment: Why the Gap Always Wins
Madhava Gaikwad
ALM
272
1
0
04 Sep 2025
What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
Ibne Farabi Shihab
Sanjeda Akter
Anuj Sharma
OffRL
145
0
0
04 Sep 2025
Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning
Xinyi Sheng
Dominik Baumann
155
1
0
29 Aug 2025
ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety
Luke Bates
Max Glockner
Preslav Nakov
Iryna Gurevych
109
0
0
28 Aug 2025
Embodied AI: Emerging Risks and Opportunities for Policy Action
Jared Perlo
Alexander Robey
Fazl Barez
Luciano Floridi
Jakob Mokander
293
2
0
28 Aug 2025
Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
David Noever
146
0
0
27 Aug 2025
Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities
Trisanth Srinivasan
Santosh Patapati
102
0
0
27 Aug 2025
Reliable Weak-to-Strong Monitoring of LLM Agents
Neil Kale
Chen Bo Calvin Zhang
Kevin Zhu
Ankit Aich
Paula Rodriguez
Scale Red Team
Christina Q. Knight
Zifan Wang
184
2
0
26 Aug 2025
A Defect Classification Framework for AI-Based Software Systems (AI-ODC)
Mohammed O. Alannsary
49
0
0
25 Aug 2025
ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts
Darpan Aswal
Céline Hudelot
171
0
0
22 Aug 2025
Page 2 of 28