Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,371 papers shown
Title
Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking
Haoyu Wang
Chris M. Poskitt
Jun Sun
Jiali Wei
168
1
0
01 Aug 2025
Magentic-UI: Towards Human-in-the-loop Agentic Systems
Hussein Mozannar
Gagan Bansal
Cheng Tan
Adam Fourney
Victor C. Dibia
...
Friederike Niedtner
Ece Kamar
Maya Murad
Rafah Hosn
Saleema Amershi
LLMAG, LM&Ro
138
12
0
30 Jul 2025
Safe Deployment of Offline Reinforcement Learning via Input Convex Action Correction
Alex Durkin
Jasper Stolte
Matthew Jones
Raghuraman Pitchumani
Bei Li
Christian Michler
Mehmet Mercangöz
OffRL, OnRL
124
1
0
30 Jul 2025
Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Daniil Gurgurov
Katharina Trinley
Yusser Al Ghussin
Tanja Baeumel
Josef van Genabith
Simon Ostermann
MILM
133
1
0
30 Jul 2025
Goal Alignment in LLM-Based User Simulators for Conversational AI
Shuhaib Mehri
Xiaocheng Yang
Takyoung Kim
Gokhan Tur
Shikib Mehri
Dilek Hakkani-Tur
LLMAG
111
2
0
27 Jul 2025
Technological folie à deux: Feedback Loops Between AI Chatbots and Mental Illness
Sebastian Dohnány
Zeb Kurth-Nelson
Eleanor Spens
Lennart Luettgau
Alastair Reid
Iason Gabriel
Christopher Summerfield
Murray Shanahan
Matthew M Nour
AI4MH
168
0
0
25 Jul 2025
Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
Víctor Gallego
LRM
86
0
0
24 Jul 2025
AlphaGo Moment for Model Architecture Discovery
Yixiu Liu
Yang Nan
Weixian Xu
Xiangkun Hu
Lyumanshan Ye
Zhen Qin
Pengfei Liu
AI4CE
206
2
0
24 Jul 2025
NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback
Madhava Gaikwad
Ashwini Ramchandra Doke
181
0
0
22 Jul 2025
Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems
Andrii Balashov
Olena Ponomarova
Xiaohua Zhai
AAML, SILM
101
0
0
21 Jul 2025
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Ori Press
Brandon Amos
Haoyu Zhao
Yikai Wu
Samuel K. Ainsworth
...
K. Lieret
Hanlin Zhang
Shirley Huang
Matthias Bethge
Ofir Press
ALM, ELM, LM&MA
214
2
0
19 Jul 2025
Using cognitive models to reveal value trade-offs in language models
Sonia K. Murthy
Rosie Zhao
Jennifer Hu
Sham Kakade
Markus Wulfmeier
Peng Qian
Tomer Ullman
173
1
0
25 Jun 2025
Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf
C. M. Verdun
Alex Oesterling
Himabindu Lakkaraju
Flavio du Pin Calmon
161
1
0
24 Jun 2025
HiPreNets: High-Precision Neural Networks through Progressive Training
Ethan Mulle
W. Kang
Q. Gong
142
0
0
18 Jun 2025
Uncertainty-Aware Graph Neural Networks: A Multi-Hop Evidence Fusion Approach
IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS), 2025
Qingfeng Chen
Shiyuan Li
Yixin Liu
Shirui Pan
Geoffrey I. Webb
Shichao Zhang
EDL
246
7
0
16 Jun 2025
CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making
Songtao Jiang
Yuan Wang
Ruizhe Chen
Yan Zhang
Ruilin Luo
...
Sibo Song
Yang Feng
Jimeng Sun
Jian Wu
Zuozhu Liu
OffRL, LRM
150
4
0
15 Jun 2025
Reversing the Paradigm: Building AI-First Systems with Human Guidance
Cosimo Spera
Garima Agrawal
88
1
0
13 Jun 2025
The Alignment Trap: Complexity Barriers
Jasper Yao
229
0
0
12 Jun 2025
Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design
Andreas Schlaginhaufen
Reda Ouhamma
Maryam Kamgarpour
180
1
0
11 Jun 2025
Policy Search, Retrieval, and Composition via Task Similarity in Collaborative Agentic Systems
Saptarshi Nath
Christos Peridis
Eseoghene Benjamin
Hengrong Du
Soheil Kolouri
Peter Kinnell
Zexin Li
Cong Liu
Shirin Dora
Andrea Soltoggio
198
0
0
05 Jun 2025
RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation
Tianjiao Li
Mengran Yu
Chenyu Shi
Yanjun Zhao
Xiaojing Liu
Qiang Zhang
Qi Zhang
Qi Zhang
Jiayin Wang
239
0
0
05 Jun 2025
Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Yinuo Wang
Robert E. Mercer
Frank Rudzicz
Sudipta Singha Roy
Pengjie Ren
Zhumin Chen
Xindi Wang
ELM
192
2
0
04 Jun 2025
Misalignment or misuse? The AGI alignment tradeoff
Philosophical Studies (Philos. Stud.), 2025
Max Hellrigel-Holderbaum
Leonard Dung
226
2
0
04 Jun 2025
The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?
Djallel Bouneffouf
Matthew D Riemer
Kush R. Varshney
217
0
0
02 Jun 2025
A Descriptive and Normative Theory of Human Beliefs in RLHF
Sylee Dandekar
Shripad Deshmukh
Frank Chiu
W. B. Knox
S. Niekum
154
0
0
02 Jun 2025
HADA: Human-AI Agent Decision Alignment Architecture
Tapio Pitkäranta
Leena Pitkäranta
117
1
0
01 Jun 2025
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
Alper Kamil Bozkurt
Calin Belta
Ming C. Lin
187
0
0
01 Jun 2025
Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
Douwe den Blanken
Charlotte Frenkel
164
0
0
30 May 2025
The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models
Adam Stein
Aaditya Naik
Neelay Velingker
Mayur Naik
Eric Wong
NAI, AI4CE
147
2
0
30 May 2025
Emergent Risk Awareness in Rational Agents under Resource Constraints
Daniel Jarne Ornia
Nicholas Bishop
Joel Dyer
Wei-Chen Lee
Ani Calinescu
Doyne Farmer
Michael Wooldridge
289
1
0
29 May 2025
Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures
Yu He
Yingxi Li
Colin White
Ellen Vitercik
ELM, LRM
150
1
0
29 May 2025
Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies
Chenruo Liu
Kenan Tang
Yao Qin
Qi Lei
202
1
0
28 May 2025
Enhancing Uncertainty Estimation and Interpretability via Bayesian Non-negative Decision Layer
International Conference on Learning Representations (ICLR), 2025
Xinyue Hu
Zhibin Duan
Bo Chen
Mingyuan Zhou
UQCV, BDL
316
1
0
28 May 2025
Apprenticeship learning with prior beliefs using inverse optimization
Mauricio Junca
Esteban Leiva
157
0
0
27 May 2025
Can Large Reasoning Models Self-Train?
Sheikh Shafayat
Fahim Tajwar
Ruslan Salakhutdinov
J. Schneider
Andrea Zanette
ReLM, OffRL, LRM
333
18
0
27 May 2025
The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
Jiaming Ji
Sitong Fang
Wenjing Cao
Jiahao Li
Xuyao Wang
Juntao Dai
Chi-Min Chan
Sirui Han
Wenhan Luo
Yaodong Yang
LRM
154
0
0
26 May 2025
WQLCP: Weighted Adaptive Conformal Prediction for Robust Uncertainty Quantification Under Distribution Shifts
Shadi Alijani
Homayoun Najjaran
329
0
0
26 May 2025
TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning
Yuhui Chen
Haoran Li
Zhennan Jiang
Haowei Wen
Dongbin Zhao
186
2
0
26 May 2025
Logic Gate Neural Networks are Good for Verification
Fabian Kresse
Emily Yu
Christoph H. Lampert
T. Henzinger
142
3
0
26 May 2025
Security Concerns for Large Language Models: A Survey
Miles Q. Li
Benjamin C. M. Fung
PILM, ELM
593
11
0
24 May 2025
Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
Haoyuan Sun
Jiaqi Wu
Bo Xia
Yifu Luo
Yifei Zhao
Kai Qin
Xufei Lv
Tiantian Zhang
Yongzhe Chang
Xueqian Wang
OffRL, LRM
380
7
0
24 May 2025
The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas
Ya Wu
Qiang Sheng
Danding Wang
Guang Yang
Yifan Sun
Zhengjia Wang
Yuyan Bu
Juan Cao
134
4
0
23 May 2025
Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning
Shicheng Xu
Liang Pang
Yunchang Zhu
Jia Gu
Zihao Wei
Jingcheng Deng
Feiyang Pan
Huawei Shen
Xueqi Cheng
OffRL, LRM
347
2
0
22 May 2025
Backdoors in DRL: Four Environments Focusing on In-distribution Triggers
C. Ashcraft
Ted Staley
Josh Carney
Cameron Hickert
Derek Juba
Kiran Karra
AAML
190
0
0
22 May 2025
A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety
Ankita Kushwaha
Kiran Ravish
Preeti Lamba
Pawan Kumar
132
4
0
22 May 2025
NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning
Wei Liu
Siya Qi
Xinyu Wang
Chen Qian
Yali Du
Petr Slovak
OffRL, LRM
244
3
0
21 May 2025
Counter-Inferential Behavior in Natural and Artificial Cognitive Systems
Serge Dolgikh
186
0
0
19 May 2025
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
Pedro M. P. Curvo
LLMAG
191
9
0
19 May 2025
"There Is No Such Thing as a Dumb Question," But There Are Good Ones
Minjung Shin
Donghyun Kim
Jeh-Kwang Ryu
ELM
157
0
0
15 May 2025
Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal-Moral Responsibility
Maurice Chiodo
Dennis Müller
Paul Siewert
Jean-Luc Wetherall
Zoya Yasmine
John Burden
178
7
0
15 May 2025