Concrete Problems in AI Safety (arXiv:1606.06565)
21 June 2016
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dandelion Mané
Papers citing "Concrete Problems in AI Safety" (showing 50 of 1,374)
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
  Radha Gulhane, Sathish Reddy Indurthi — OffRL, LRM — 06 Oct 2025

Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
  Santhosh Kumar Ravindran — 05 Oct 2025

Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
  Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu — LM&MA — 05 Oct 2025

Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
  Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu — OffRL, ReLM, LRM — 04 Oct 2025

Reward Models are Metrics in a Trench Coat
  Sebastian Gehrmann — 03 Oct 2025

LegalSim: Multi-Agent Simulation of Legal Systems for Discovering Procedural Exploits
  Sanket Badhe — AILaw — 03 Oct 2025

Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
  Antoine Maier, Aude Maier, Tom David — 03 Oct 2025

Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
  Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao — ReLM, LRM, AI4CE — 02 Oct 2025

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
  Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Katja Filippova, Sanmi Koyejo — OOD, AAML — 01 Oct 2025

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
  Zeshi Dai, Zimo Peng, Zerui Cheng, Ryan Yihe Li — AAML, AIFin, ELM — 30 Sep 2025

Alignment-Aware Decoding
  Frédéric Berdoz, Luca A. Lanzendörfer, René Caky, Roger Wattenhofer — 30 Sep 2025

Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks
  Peiran Xu, Ruoyao Xiao, Xiaoying Xing, Guannan Zhang, Debiao Li, Kunyu Shi — OffRL, LRM — 29 Sep 2025

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
  Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox — FedML — 28 Sep 2025

VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion
  Kargi Chauhan, Leilani H. Gilpin — 28 Sep 2025

On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization
  Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty — ELM — 28 Sep 2025

Causally-Enhanced Reinforcement Policy Optimization
  Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo, Xiangliang Zhang — OffRL, LRM — 27 Sep 2025

Enhancing Blind Face Restoration through Online Reinforcement Learning
  Bin Wu, Yahui Liu, Chi Zhang, Yao-Min Zhao, Wei Wang — CVBM, OffRL, CLL, OnRL — 27 Sep 2025

MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
  Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe — 26 Sep 2025

Learnable Conformal Prediction with Context-Aware Nonconformity Functions for Robotic Planning and Perception
  Divake Kumar, Sina Tayebati, Francesco Migliarba, Ranganath Krishnan, A. R. Trivedi — 26 Sep 2025

Limitations on Safe, Trusted, Artificial General Intelligence
  Rina Panigrahy, Willie Neiswanger — 25 Sep 2025

Responsible AI Technical Report
  Soonmin Bae, Wanjin Park, Jeongyeop Kim, Yunjin Park, Jungwon Yoon, ..., Sujin Kim, Youngchol Kim, Somin Lee, Wonyoung Lee, Minsung Noh — 24 Sep 2025

Failure Modes of Maximum Entropy RLHF
  Ömer Veysel Çağatan, Barış Akgün — 24 Sep 2025

Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems
  Birk Torpmann-Hagen, Pål Halvorsen, Michael A. Riegler, Dag Johansen — 23 Sep 2025

SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer
  Yarden As, Chengrui Qu, Benjamin Unger, Dongho Kang, Max van der Hart, Laixi Shi, Stelian Coros, Adam Wierman, Andreas Krause — OffRL — 23 Sep 2025

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind
  Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, Vanessa Dietze — 23 Sep 2025

FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
  Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
  Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy — 20 Sep 2025

The Alignment Bottleneck
  Wenjun Cao — 19 Sep 2025

Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
  Erblin Isaku, C. Gomes, Shaukat Ali, Beatriz Sanguino, Tongtong Wang, Guoyuan Li, Houxiang Zhang, Thomas Peyrucain — 16 Sep 2025

Secure Human Oversight of AI: Exploring the Attack Surface of Human Oversight
  Jonas C. Ditz, Veronika Lazar, Elmar Lichtmeß, Carola Plesch, Matthias Heck, Kevin Baum, Markus Langer — AAML — 15 Sep 2025

CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI
  Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, H. Mahmud — 14 Sep 2025

Mutual Information Tracks Policy Coherence in Reinforcement Learning
  Cameron Reid, Wael Hafez, Amirhossein Nazeri — 12 Sep 2025

Symmetry-Guided Multi-Agent Inverse Reinforcement Learning
  Yongkai Tian, Yirong Qi, Xin Yu, Wenjun Wu, Jie Luo — 10 Sep 2025

Interpretability as Alignment: Making Internal Understanding a Design Principle
  Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu — AI4CE, AAML — 10 Sep 2025

ACE and Diverse Generalization via Selective Disagreement
  Oliver Daniels, Stuart Armstrong, Alexandre Maranhao, Mahirah Fairuz Rahman, Benjamin M. Marlin, Rebecca Gorman — OODD — 09 Sep 2025

Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues
  Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy — LLMAG — 07 Sep 2025

What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
  Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma — OffRL — 04 Sep 2025

Murphys Laws of AI Alignment: Why the Gap Always Wins
  Madhava Gaikwad — ALM — 04 Sep 2025

Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning
  Xinyi Sheng, Dominik Baumann — 29 Aug 2025

ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety
  Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych — 28 Aug 2025

Embodied AI: Emerging Risks and Opportunities for Policy Action
  Jared Perlo, Alexander Robey, Fazl Barez, Luciano Floridi, Jakob Mokander — 28 Aug 2025

Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities
  Trisanth Srinivasan, Santosh Patapati — 27 Aug 2025

Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
  David Noever — 27 Aug 2025

Reliable Weak-to-Strong Monitoring of LLM Agents
  Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang — 26 Aug 2025

A Defect Classification Framework for AI-Based Software Systems (AI-ODC)
  Mohammed O. Alannsary — 25 Aug 2025

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts
  Darpan Aswal, Céline Hudelot — 22 Aug 2025

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
  Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han — 21 Aug 2025

CIA+TA Risk Assessment for AI Reasoning Vulnerabilities
  Yuksel Aydin — 19 Aug 2025

Out-of-Distribution Detection using Counterfactual Distance
  Maria Stoica, Francesco Leofante, Alessio Lomuscio — OODD — 13 Aug 2025

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance
  Yuchu Jiang, Jian Zhao, Yuchen Yuan, Tianle Zhang, Yao Huang, ..., Ya Zhang, Shuicheng Yan, Chi Zhang, Z. He, Xuelong Li — SILM — 12 Aug 2025

Conformal Prediction and Trustworthy AI
  Anthony Bellotti, Xindi Zhao — AI4CE — 09 Aug 2025