Concrete Problems in AI Safety (arXiv:1606.06565)
21 June 2016
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dandelion Mané
Papers citing "Concrete Problems in AI Safety" (showing 50 of 1,374)
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
  Radha Gulhane, Sathish Reddy Indurthi — OffRL, LRM — 06 Oct 2025

Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
  Santhosh Kumar Ravindran — 05 Oct 2025

Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
  Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu — LM&MA — 05 Oct 2025

Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
  Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu — OffRL, ReLM, LRM — 04 Oct 2025

Reward Models are Metrics in a Trench Coat
  Sebastian Gehrmann — 03 Oct 2025

LegalSim: Multi-Agent Simulation of Legal Systems for Discovering Procedural Exploits
  Sanket Badhe — AILaw — 03 Oct 2025

Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
  Antoine Maier, Aude Maier, Tom David — 03 Oct 2025

Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
  Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao — ReLM, LRM, AI4CE — 02 Oct 2025

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
  Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Katja Filippova, Sanmi Koyejo — OOD, AAML — 01 Oct 2025

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
  Zeshi Dai, Zimo Peng, Zerui Cheng, Ryan Yihe Li — AAML, AIFin, ELM — 30 Sep 2025

Alignment-Aware Decoding
  Frédéric Berdoz, Luca A. Lanzendörfer, René Caky, Roger Wattenhofer — 30 Sep 2025

Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks
  Peiran Xu, Ruoyao Xiao, Xiaoying Xing, Guannan Zhang, Debiao Li, Kunyu Shi — OffRL, LRM — 29 Sep 2025

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
  Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox — FedML — 28 Sep 2025

VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion
  Kargi Chauhan, Leilani H. Gilpin — 28 Sep 2025

On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization
  Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty — ELM — 28 Sep 2025

Causally-Enhanced Reinforcement Policy Optimization
  Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo, Xiangliang Zhang — OffRL, LRM — 27 Sep 2025

Enhancing Blind Face Restoration through Online Reinforcement Learning
  Bin Wu, Yahui Liu, Chi Zhang, Yao-Min Zhao, Wei Wang — CVBM, OffRL, CLL, OnRL — 27 Sep 2025

MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
  Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe — 26 Sep 2025

Learnable Conformal Prediction with Context-Aware Nonconformity Functions for Robotic Planning and Perception
  Divake Kumar, Sina Tayebati, Francesco Migliarba, Ranganath Krishnan, A. R. Trivedi — 26 Sep 2025

Limitations on Safe, Trusted, Artificial General Intelligence
  Rina Panigrahy, Willie Neiswanger — 25 Sep 2025

Responsible AI Technical Report
  Soonmin Bae, Wanjin Park, Jeongyeop Kim, Yunjin Park, Jungwon Yoon, ..., Sujin Kim, Youngchol Kim, Somin Lee, Wonyoung Lee, Minsung Noh — 24 Sep 2025

Failure Modes of Maximum Entropy RLHF
  Ömer Veysel Çağatan, Barış Akgün — 24 Sep 2025

Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems
  Birk Torpmann-Hagen, Pål Halvorsen, Michael A. Riegler, Dag Johansen — 23 Sep 2025

SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer
  Yarden As, Chengrui Qu, Benjamin Unger, Dongho Kang, Max van der Hart, Laixi Shi, Stelian Coros, Adam Wierman, Andreas Krause — OffRL — 23 Sep 2025

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind
  Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, Vanessa Dietze — 23 Sep 2025

FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
  Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
  Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy — 20 Sep 2025

The Alignment Bottleneck
  Wenjun Cao — 19 Sep 2025

Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
  Erblin Isaku, C. Gomes, Shaukat Ali, Beatriz Sanguino, Tongtong Wang, Guoyuan Li, Houxiang Zhang, Thomas Peyrucain — 16 Sep 2025

Secure Human Oversight of AI: Exploring the Attack Surface of Human Oversight
  Jonas C. Ditz, Veronika Lazar, Elmar Lichtmeß, Carola Plesch, Matthias Heck, Kevin Baum, Markus Langer — AAML — 15 Sep 2025

CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI
  Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, H. Mahmud — 14 Sep 2025

Mutual Information Tracks Policy Coherence in Reinforcement Learning
  Cameron Reid, Wael Hafez, Amirhossein Nazeri — 12 Sep 2025

Symmetry-Guided Multi-Agent Inverse Reinforcement Learning
  Yongkai Tian, Yirong Qi, Xin Yu, Wenjun Wu, Jie Luo — 10 Sep 2025

Interpretability as Alignment: Making Internal Understanding a Design Principle
  Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu — AI4CE, AAML — 10 Sep 2025

ACE and Diverse Generalization via Selective Disagreement
  Oliver Daniels, Stuart Armstrong, Alexandre Maranhao, Mahirah Fairuz Rahman, Benjamin M. Marlin, Rebecca Gorman — OODD — 09 Sep 2025

Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues
  Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy — LLMAG — 07 Sep 2025

What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
  Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma — OffRL — 04 Sep 2025

Murphys Laws of AI Alignment: Why the Gap Always Wins
  Madhava Gaikwad — ALM — 04 Sep 2025

Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning
  Xinyi Sheng, Dominik Baumann — 29 Aug 2025

ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety
  Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych — 28 Aug 2025

Embodied AI: Emerging Risks and Opportunities for Policy Action
  Jared Perlo, Alexander Robey, Fazl Barez, Luciano Floridi, Jakob Mokander — 28 Aug 2025

Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities
  Trisanth Srinivasan, Santosh Patapati — 27 Aug 2025

Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
  David Noever — 27 Aug 2025

Reliable Weak-to-Strong Monitoring of LLM Agents
  Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang — 26 Aug 2025

A Defect Classification Framework for AI-Based Software Systems (AI-ODC)
  Mohammed O. Alannsary — 25 Aug 2025

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts
  Darpan Aswal, Céline Hudelot — 22 Aug 2025

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
  Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han — 21 Aug 2025

CIA+TA Risk Assessment for AI Reasoning Vulnerabilities
  Yuksel Aydin — 19 Aug 2025

Out-of-Distribution Detection using Counterfactual Distance
  Maria Stoica, Francesco Leofante, Alessio Lomuscio — OODD — 13 Aug 2025

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance
  Yuchu Jiang, Jian Zhao, Yuchen Yuan, Tianle Zhang, Yao Huang, ..., Ya Zhang, Shuicheng Yan, Chi Zhang, Z. He, Xuelong Li — SILM — 12 Aug 2025

Conformal Prediction and Trustworthy AI
  Anthony Bellotti, Xindi Zhao — AI4CE — 09 Aug 2025