ResearchTrend.AI
Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 460 papers shown
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
Pedro M. P. Curvo
LLMAG
7
0
0
19 May 2025
Counter-Inferential Behavior in Natural and Artificial Cognitive Systems
Serge Dolgikh
7
0
0
19 May 2025
"There Is No Such Thing as a Dumb Question," But There Are Good Ones
Minjung Shin
Donghyun Kim
Jeh-Kwang Ryu
ELM
31
0
0
15 May 2025
Belief Injection for Epistemic Control in Linguistic State Space
Sebastian Dumbrava
16
0
0
12 May 2025
RefPentester: A Knowledge-Informed Self-Reflective Penetration Testing Framework Based on Large Language Models
Hanzheng Dai
Yuanliang Li
Zhibo Zhang
Jun Yan
28
0
0
11 May 2025
Beyond $\tilde{O}(\sqrt{T})$ Constraint Violation for Online Convex Optimization with Adversarial Constraints
Abhishek Sinha
Rahul Vaze
26
0
0
10 May 2025
Engineering Risk-Aware, Security-by-Design Frameworks for Assurance of Large-Scale Autonomous AI Models
Krti Tallam
31
0
0
09 May 2025
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
Drew Prinster
Xing Han
Anqi Liu
Suchi Saria
35
0
0
07 May 2025
Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models
Lars Malmqvist
24
0
0
07 May 2025
An alignment safety case sketch based on debate
Marie Davidsen Buhl
Jacob Pfau
Benjamin Hilton
Geoffrey Irving
38
0
0
06 May 2025
Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey
Da Zheng
Lun Du
Junwei Su
Yuchen Tian
Yuqi Zhu
Jintian Zhang
Lanning Wei
Ningyu Zhang
H. Chen
LRM
61
0
0
06 May 2025
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding
Cameron Domenico Kirk-Giannini
78
0
0
05 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
41
1
0
05 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
76
2
0
05 May 2025
Phi-4-reasoning Technical Report
Marah Abdin
Sahaj Agarwal
Ahmed Hassan Awadallah
Vidhisha Balachandran
Harkirat Singh Behl
...
Vaishnavi Shrivastava
Vibhav Vineet
Yue Wu
Safoora Yousefi
Guoqing Zheng
ReLM
LRM
90
3
0
30 Apr 2025
A Domain-Agnostic Scalable AI Safety Ensuring Framework
Beomjun Kim
Kangyeon Kim
Sunwoo Kim
Heejin Ahn
57
0
0
29 Apr 2025
A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning
Greg Gluch
Shafi Goldwasser
AAML
37
0
0
28 Apr 2025
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
Jiabin Fan
Guoqing Luo
Michael Bowling
Lili Mou
OffRL
68
0
0
26 Apr 2025
AI Awareness
Xianrui Li
Haoyuan Shi
Rongwu Xu
Wei Xu
59
0
0
25 Apr 2025
Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
Feifei Zhao
Yufei Wang
Enmeng Lu
Dongcheng Zhao
Bing Han
...
Chao Liu
Yaodong Yang
Yi Zeng
Boyuan Chen
Jinyu Fan
83
0
0
24 Apr 2025
AlphaGrad: Non-Linear Gradient Normalization Optimizer
Soham Sane
ODL
56
0
0
22 Apr 2025
Learning to Reason under Off-Policy Guidance
Jianhao Yan
Yafu Li
Zican Hu
Zhi Wang
Ganqu Cui
Xiaoye Qu
Yu Cheng
Yue Zhang
OffRL
LRM
44
0
0
21 Apr 2025
SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
X. Zhang
Jie Wang
Zifei Cheng
Wenhao Zhuang
Zheng Lin
...
Shouyu Yin
Chaohang Wen
Haotian Zhang
Bin Chen
Bing Yu
LRM
43
5
0
19 Apr 2025
Adversarial Training of Reward Models
Alexander Bukharin
Haifeng Qian
Shengyang Sun
Adithya Renduchintala
Soumye Singhal
Zihan Wang
Oleksii Kuchaiev
Olivier Delalleau
T. Zhao
AAML
32
0
0
08 Apr 2025
VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection
Bin Zhang
Xiaoyang Qu
Guokuan Li
Jiguang Wan
Jianzong Wang
VLM
59
0
0
28 Mar 2025
Probabilistic Uncertain Reward Model
Wangtao Sun
Xiang Cheng
Xing Yu
Haotian Xu
Zhao Yang
Shizhu He
Jun Zhao
Kang Liu
60
0
0
28 Mar 2025
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu
Zhe Zhang
Ruofei Zhu
Yufeng Yuan
Xiaochen Zuo
...
Ya-Qin Zhang
Lin Yan
Mu Qiao
Yonghui Wu
Mingxuan Wang
OffRL
LRM
69
54
0
18 Mar 2025
Superalignment with Dynamic Human Values
Florian Mai
David Kaczér
Nicholas Kluge Corrêa
Lucie Flek
60
0
0
17 Mar 2025
From Autonomous Agents to Integrated Systems, A New Paradigm: Orchestrated Distributed Intelligence
Krti Tallam
AI4CE
50
2
0
17 Mar 2025
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang
Zhangyi Jiang
Zhenqi He
Wenhan Yang
Yanan Zheng
Zeyu Li
Zifan He
Shenyang Tong
Hailei Gong
LRM
90
2
0
16 Mar 2025
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker
Joost Huizinga
Leo Gao
Zehao Dou
M. Guan
Aleksander Mądry
Wojciech Zaremba
J. Pachocki
David Farhi
LRM
77
13
0
14 Mar 2025
Fine-Tuning Diffusion Generative Models via Rich Preference Optimization
Hanyang Zhao
Haoxian Chen
Yucheng Guo
Genta Indra Winata
Tingting Ou
Ziyu Huang
D. Yao
Wenpin Tang
59
0
0
13 Mar 2025
Mitigating Preference Hacking in Policy Optimization with Pessimism
Dhawal Gupta
Adam Fisch
Christoph Dann
Alekh Agarwal
76
0
0
10 Mar 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stañczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
203
0
0
27 Feb 2025
HALO: Robust Out-of-Distribution Detection via Joint Optimisation
Hugo Lyons Keenan
S. Erfani
Christopher Leckie
OODD
212
0
0
27 Feb 2025
Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs
Yuhan Chen
Yihong Luo
Yifan Song
Pengwen Dai
Jing Tang
Xiaochun Cao
OODD
48
2
0
25 Feb 2025
Logit Disagreement: OoD Detection with Bayesian Neural Networks
Kevin Raina
UQCV
BDL
UD
PER
66
0
0
24 Feb 2025
Robust Concept Erasure Using Task Vectors
Minh Pham
Kelly O. Marshall
Chinmay Hegde
Niv Cohen
123
18
0
21 Feb 2025
A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics
Ting-Ruen Wei
Haowei Liu
Xuyang Wu
Yi Fang
LRM
AI4CE
ReLM
KELM
226
2
0
21 Feb 2025
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
94
5
0
16 Feb 2025
FairDropout: Using Example-Tied Dropout to Enhance Generalization of Minority Groups
Géraldin Nanfack
Eugene Belilovsky
71
0
0
10 Feb 2025
Barriers and Pathways to Human-AI Alignment: A Game-Theoretic Approach
Aran Nayebi
87
1
0
09 Feb 2025
Why human-AI relationships need socioaffective alignment
Hannah Rose Kirk
Iason Gabriel
Chris Summerfield
Bertie Vidgen
Scott A. Hale
46
6
0
04 Feb 2025
Process-Supervised Reinforcement Learning for Code Generation
Yufan Ye
Ting Zhang
Wenbin Jiang
Hua Huang
OffRL
LRM
SyDa
63
1
0
03 Feb 2025
A statistically consistent measure of Semantic Variability using Language Models
Yi Liu
76
0
0
01 Feb 2025
Temporal Logic Specification-Conditioned Decision Transformer for Offline Safe Reinforcement Learning
Zijian Guo
Weichao Zhou
Wenchao Li
OffRL
105
2
0
28 Jan 2025
Constrained Hybrid Metaheuristic Algorithm for Probabilistic Neural Networks Learning
Piotr A. Kowalski
Szymon Kucharczyk
Jacek Mańdziuk
31
0
0
28 Jan 2025
BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Yibin Wang
Haizhou Shi
Ligong Han
Dimitris N. Metaxas
Hao Wang
BDL
UQLM
116
8
0
28 Jan 2025
Evolution and The Knightian Blindspot of Machine Learning
Joel Lehman
Elliot Meyerson
Tarek El-Gaaly
Kenneth O. Stanley
Tarin Ziyaee
91
2
0
22 Jan 2025
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar
Vikrant Varma
David Lindner
David Elson
Caleb Biddulph
Ian Goodfellow
Rohin Shah
94
1
0
22 Jan 2025