ResearchTrend.AI
Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,371 papers shown
Dataset Poisoning Attacks on Behavioral Cloning Policies
Akansha Kalra
Soumil Datta
Ethan Gilmore
Duc La
Guanhong Tao
Daniel S. Brown
AAML, OffRL
159
0
0
26 Nov 2025
The Horcrux: Mechanistically Interpretable Task Decomposition for Detecting and Mitigating Reward Hacking in Embodied AI Systems
Subramanyam Sahoo
Jared Junkin
81
0
0
22 Nov 2025
Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning
Radman Rakhshandehroo
Daniel Coombs
32
0
0
22 Nov 2025
Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research
Ninell Oldenburg
Ruchira Dhar
Anders Søgaard
113
0
0
19 Nov 2025
From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho
Suman Muppavarapu
Afnan Shaik
Tyson Tsay
James Begin
Kevin Zhu
Archana Vaidheeswaran
Vasu Sharma
LLMAG
116
0
0
18 Nov 2025
Robust Experimental Design via Generalised Bayesian Inference
Yasir Zubayr Barlas
Sabina J. Sloman
Samuel Kaski
72
0
0
10 Nov 2025
Large Language Models Develop Novel Social Biases Through Adaptive Exploration
Addison J. Wu
Ryan Liu
Xuechunzi Bai
Thomas Griffiths
104
0
0
08 Nov 2025
Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains
Mohammed Musthafa Rafi
Adarsh Krishnamurthy
Aditya Balu
72
0
0
06 Nov 2025
Sparse, self-organizing ensembles of local kernels detect rare statistical anomalies
Gaia Grosso
Sai Sumedh R. Hindupur
Thomas Fel
Samuel Bright-Thonney
Philip Harris
Demba Ba
217
1
0
05 Nov 2025
Trustworthy Quantum Machine Learning: A Roadmap for Reliability, Robustness, and Security in the NISQ Era
Ferhat Ozgur Catak
Jungwon Seo
Umit Cali
56
0
0
04 Nov 2025
Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze
Hua Shen
Sai Avula
Eric Gilbert
Ceren Budak
VLM
239
0
0
03 Nov 2025
Lyapunov Stability Learning with Nonlinear Control via Inductive Biases
Yupu Lu
Shijie Lin
Hao Xu
Zeqing Zhang
Jia Pan
52
0
0
03 Nov 2025
Human-AI Complementarity: A Goal for Amplified Oversight
Rishub Jain
Sophie Bridgers
Lili Janzer
Rory Greig
Tian Huey Teh
Vladimir Mikulik
77
2
0
30 Oct 2025
Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters
Mustafa Fuad Rifet Ibrahim
Maurice Meijer
Alexander Schlaefer
Peer Stelldinger
122
0
0
30 Oct 2025
The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence
Christian Dittrich
Jennifer Flygare Kinne
CML
167
0
0
29 Oct 2025
Decision-Making Amid Information-Based Threats in Sociotechnical Systems: A Review
Aaron R. Allred
Erin E. Richardson
Sarah R. Bostrom
James Crum
Cara Spencer
Chad Tossell
Richard E. Niemeyer
Leanne Hirshfield
Allison P.A. Hayman
44
0
0
28 Oct 2025
Learning "Partner-Aware" Collaborators in Multi-Party Collaboration
Abhijnan Nath
Nikhil Krishnaswamy
80
0
0
26 Oct 2025
Scalable Oversight via Partitioned Human Supervision
Ren Yin
Takashi Ishida
Masashi Sugiyama
76
0
0
26 Oct 2025
Weak-to-Strong Generalization under Distribution Shifts
Myeongho Jeon
Jan Sobotka
Suhwan Choi
Maria Brbić
OOD
144
0
0
24 Oct 2025
Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection
Yongqiang Chen
Gang Niu
James Cheng
Bo Han
Masashi Sugiyama
76
0
0
23 Oct 2025
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases
Ziqian Zhong
Aditi Raghunathan
Nicholas Carlini
52
2
0
23 Oct 2025
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Zhenghao Xu
Qin Lu
Qingru Zhang
Liang Qiu
Ilgee Hong
...
Yao Liu
Haoming Jiang
Lihong Li
Hyokun Yun
Tuo Zhao
88
0
0
23 Oct 2025
The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems
Bentley DeVilling
ReLM, LRM
301
0
0
23 Oct 2025
Subliminal Corruption: Mechanisms, Thresholds, and Interpretability
Reya Vir
Sarvesh Bhatnagar
52
0
0
22 Oct 2025
Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation
Daniel Bethell
Simos Gerasimou
R. Calinescu
Calum Imrie
60
0
0
21 Oct 2025
Rectifying Shortcut Behaviors in Preference-based Reward Learning
Wenqian Ye
Guangtao Zheng
Aidong Zhang
76
0
0
21 Oct 2025
Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories
Achref Jaziri
Martin Rogmann
Martin Mundt
Visvanathan Ramesh
189
0
0
20 Oct 2025
Consistent Zero-Shot Imitation with Contrastive Goal Inference
Kathryn Wantlin
Chongyi Zheng
Benjamin Eysenbach
132
0
0
20 Oct 2025
RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation
Yuquan Xue
Guanxing Lu
Zhenyu Wu
Chuanrui Zhang
Bofang Jia
Zhengyi Gu
Yansong Tang
Ziwei Wang
138
0
0
20 Oct 2025
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Yu Ying Chiu
Michael S. Lee
Rachel Calcott
Brandon Handoko
Paul de Font-Reaulx
...
Mantas Mazeika
Bing Liu
Yejin Choi
Mitchell L. Gordon
Sydney Levine
ELM, LRM
93
0
0
18 Oct 2025
Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals
Andrejs Sorstkins
Omer Tariq
Muhammad Bilal
OffRL
88
0
0
16 Oct 2025
Restoring Noisy Demonstration for Imitation Learning With Diffusion Models
IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS), 2025
Shang-Fu Chen
Co Yong
Shao-Hua Sun
DiffM
88
0
0
16 Oct 2025
Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
Xiaoshu Chen
Sihang Zhou
Ke Liang
Duanyang Yuan
Haoyuan Chen
Xiaoyu Sun
Linyuan Meng
Xinwang Liu
ReLM, LRM
185
0
0
15 Oct 2025
AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
María Victoria Carro
Denise Alejandra Mester
Facundo Nieto
Oscar Agustín Stanchi
Guido Ernesto Bergman
...
Luca Nicolás Forziati Gangi
Francisca Gauna Selasco
Juan Gustavo Corvalán
Gerardo Simari
Maria Vanina Martinez
130
0
0
15 Oct 2025
Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
Stephane Hatgis-Kessell
Logan Mondal Bhamidipaty
Emma Brunskill
81
0
0
14 Oct 2025
SafeMT: Multi-turn Safety for Multimodal Language Models
Han Zhu
Juntao Dai
Jiaming Ji
Haoran Li
Chengkun Cai
...
Chi-Min Chan
Boyuan Chen
Yaodong Yang
Sirui Han
Yike Guo
102
0
0
14 Oct 2025
From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models
Imran Khan
LRM
72
1
0
14 Oct 2025
Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning
Zexu Sun
Yongcheng Zeng
Erxue Min
Heyang Gao
Bokai Ji
Xu Chen
OffRL, ReLM, LRM
155
0
0
13 Oct 2025
PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents
Shengjie Ma
Chenlong Deng
Jiaxin Mao
J. Huang
Teng Wang
Junjie Wu
Changwang Zhang
Jun Wang
72
1
0
13 Oct 2025
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Murad Dawood
Usama Ahmed Siddiquie
Shahram Khorshidi
Maren Bennewitz
108
0
0
13 Oct 2025
Source-Free Object Detection with Detection Transformer
IEEE Transactions on Image Processing (IEEE TIP), 2025
Huizai Yao
Sicheng Zhao
Shuo Lu
Hui Chen
Yangyang Li
Guoping Liu
Tengfei Xing
C. Yan
Jianhua Tao
Guiguang Ding
ViT
69
1
0
13 Oct 2025
Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges
Christian Bluethgen
Dave Van Veen
Daniel Truhn
Jakob Nikolas Kather
Michael Moor
...
Akshay S. Chaudhari
Thomas Frauenfelder
C. Langlotz
Michael Krauthammer
Farhad Nooralahzadeh
LM&MA, AI4CE
237
0
0
10 Oct 2025
Token Is All You Price
Weijie Zhong
41
0
0
10 Oct 2025
Label Semantics for Robust Hyperspectral Image Classification
Rafin Hassan
Zarin Tasnim Roshni
Rafiqul Bari
Alimul Islam
Nabeel Mohammed
Moshiur Farazi
Shafin Rahman
VLM
62
1
0
08 Oct 2025
Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B
Nisar Ahmed
Muhammad Imran Zaman
Gulshan Saleem
Ali Hassan
LRM
91
0
0
08 Oct 2025
HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model
Peter Van Katwyk
Karianne J. Bergen
149
0
0
06 Oct 2025
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
Siwei Han
Jiaqi Liu
Yaofeng Su
Wenbo Duan
Xinyuan Liu
Cihang Xie
Mohit Bansal
Mingyu Ding
Linjun Zhang
Huaxiu Yao
104
1
0
06 Oct 2025
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Radha Gulhane
Sathish Reddy Indurthi
OffRL, LRM
52
0
0
06 Oct 2025
Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Yunghwei Lai
Kaiming Liu
Ziyue Wang
Weizhi Ma
Yang Liu
LM&MA
119
0
0
05 Oct 2025
Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
Santhosh Kumar Ravindran
88
0
0
05 Oct 2025