ResearchTrend.AI
Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,379 papers shown
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu
Zheng Zhang
Ruofei Zhu
Yufeng Yuan
Xiaochen Zuo
...
Ya Zhang
Lin Yan
Mu Qiao
Yonghui Wu
Mingxuan Wang
OffRL, LRM
626
987
0
18 Mar 2025
Superalignment with Dynamic Human Values
Florian Mai
David Kaczér
Nicholas Kluge Corrêa
Lucie Flek
302
1
0
17 Mar 2025
From Autonomous Agents to Integrated Systems, A New Paradigm: Orchestrated Distributed Intelligence
Krti Tallam
AI4CE
349
9
0
17 Mar 2025
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang
Zhangyi Jiang
Zhenqi He
Shenyang Tong
Wenhan Yang
...
Zifan He
Hailei Gong
Zewen Ye
Shengjie Ma
Jianping Zhang
LRM
498
8
0
16 Mar 2025
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker
Joost Huizinga
Leo Gao
Zehao Dou
M. Guan
Aleksander Mądry
Wojciech Zaremba
J. Pachocki
David Farhi
LRM
441
126
0
14 Mar 2025
NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Mert Albaba
Chenhao Li
Markos Diomataris
Omid Taheri
Andreas Krause
M. Black
VGen
259
6
0
13 Mar 2025
Ensemble Learning for Large Language Models in Text and Code Generation: A Survey
Mari Ashiga
Wei Jie
Fan Wu
Vardan K. Voskanyan
Fateme Dinmohammadi
P. Brookes
Jingzhi Gong
Zheng Wang
332
8
0
13 Mar 2025
RPO: Fine-Tuning Visual Generative Models via Rich Vision-Language Preferences
Hanyang Zhao
Haoxian Chen
Yucheng Guo
Genta Indra Winata
Tingting Ou
Ziyu Huang
D. Yao
Wenpin Tang
573
4
0
13 Mar 2025
Generating Robot Constitutions & Benchmarks for Semantic Safety
P. Sermanet
Anirudha Majumdar
A. Irpan
Dmitry Kalashnikov
Vikas Sindhwani
LM&Ro
405
11
0
11 Mar 2025
Mitigating Preference Hacking in Policy Optimization with Pessimism
Dhawal Gupta
Adam Fisch
Christoph Dann
Alekh Agarwal
291
2
0
10 Mar 2025
RePO: Understanding Preference Learning Through ReLU-Based Optimization
Junkang Wu
Kexin Huang
Qingsong Wen
Jinyang Gao
Bolin Ding
Jiancan Wu
Xiangnan He
Xiang Wang
308
3
0
10 Mar 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim
Xiaoyuan Yi
Jing Yao
Muhua Huang
Jinyeong Bak
James Evans
Xing Xie
306
0
0
08 Mar 2025
Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners
Calarina Muslimani
Kerrick Johnstonbaugh
Suyog Chandramouli
Serena Booth
W. B. Knox
Matthew E. Taylor
186
3
0
08 Mar 2025
Blockchain As a Platform For Artificial Intelligence (AI) Transparency
Afroja Akther
Ayesha Arobee
Abdullah Al Adnan
Omum Auyon
ASM Johirul Islam
Farhad Akter
230
5
0
07 Mar 2025
ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making
Yitong Luo
Hou Hei Lam
Ziang Chen
Zhenliang Zhang
Xue Feng
308
0
0
06 Mar 2025
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Borong Zhang
Yuhao Zhang
Yalan Qin
Yingshan Lei
Josef Dai
Yuanpei Chen
Yaodong Yang
523
4
0
05 Mar 2025
Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models
Dilxat Muhtar
Enzhuo Zhang
Zhenshi Li
Feng-Xue Gu
Yanglangxing He
Pengfeng Xiao
Xueliang Zhang
294
8
0
02 Mar 2025
HALO: Robust Out-of-Distribution Detection via Joint Optimisation
Hugo Lyons Keenan
S. Erfani
Christopher Leckie
OODD
537
0
0
27 Feb 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
1.0K
3
0
27 Feb 2025
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
Shalev Lifshitz
Sheila A. McIlraith
Yilun Du
LRM
407
26
0
27 Feb 2025
RIZE: Adaptive Regularization for Imitation Learning
Adib Karimi
Mohammad Mehdi Ebadzadeh
OOD
273
1
0
27 Feb 2025
Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu
Xuandong Zhao
Chengyuan Yao
Han Wang
Qi Han
Yanghua Xiao
615
43
0
26 Feb 2025
Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs
International Conference on Learning Representations (ICLR), 2025
Yuhan Chen
Yihong Luo
Yifan Song
Pengwen Dai
Jing Tang
Xiaochun Cao
OODD
450
6
0
25 Feb 2025
Logit Disagreement: OoD Detection with Bayesian Neural Networks
Kevin Raina
UQCV, BDL, UD, PER
424
1
0
24 Feb 2025
A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics
Ting-Ruen Wei
Haowei Liu
Xuyang Wu
Yi Fang
LRM, AI4CE, ReLM, KELM
742
8
0
21 Feb 2025
Robust Concept Erasure Using Task Vectors
Minh Pham
Kelly O. Marshall
Chinmay Hegde
Niv Cohen
450
25
0
21 Feb 2025
Alignment, Agency and Autonomy in Frontier AI: A Systems Engineering Perspective
Krti Tallam
195
7
0
20 Feb 2025
Leveraging Intermediate Representations for Better Out-of-Distribution Detection
Gianluca Guglielmo
Marc Masana
OODD
274
1
0
18 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Jesseba Fernando
Grigori Guitchounts
AI4CE
237
5
0
17 Feb 2025
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
523
17
0
16 Feb 2025
FairDropout: Using Example-Tied Dropout to Enhance Generalization of Minority Groups
Géraldin Nanfack
Eugene Belilovsky
291
1
0
10 Feb 2025
Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis
Aran Nayebi
571
1
0
09 Feb 2025
Why human-AI relationships need socioaffective alignment
Humanities and Social Sciences Communications (HSSC), 2025
Hannah Rose Kirk
Iason Gabriel
Chris Summerfield
Bertie Vidgen
Scott A. Hale
236
56
0
04 Feb 2025
Process-Supervised Reinforcement Learning for Code Generation
Yufan Ye
Ting Zhang
Wenbin Jiang
Hua Huang
OffRL, LRM, SyDa
354
15
0
03 Feb 2025
A statistically consistent measure of semantic uncertainty using Language Models
Yi Liu
332
0
0
01 Feb 2025
Constrained Hybrid Metaheuristic Algorithm for Probabilistic Neural Networks Learning
Information Sciences (Inf. Sci.), 2025
Piotr A. Kowalski
Szymon Kucharczyk
Jacek Mańdziuk
274
7
0
28 Jan 2025
The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems
Scott T Steinmetz
Asmeret Naugle
Paul Schutte
Matt Sweitzer
Alex Washburne
Lisa Linville
Daniel Krofcheck
Michal Kucer
Samuel Myren
265
2
0
28 Jan 2025
Temporal Logic Specification-Conditioned Decision Transformer for Offline Safe Reinforcement Learning
International Conference on Machine Learning (ICML), 2024
Zijian Guo
Weichao Zhou
Wenchao Li
OffRL
293
3
0
28 Jan 2025
BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Neural Information Processing Systems (NeurIPS), 2024
Yibin Wang
Haizhou Shi
Ligong Han
Dimitris N. Metaxas
Hao Wang
BDL, UQLM
730
23
0
28 Jan 2025
Evolution and The Knightian Blindspot of Machine Learning
Joel Lehman
Elliot Meyerson
Tarek El-Gaaly
Kenneth O. Stanley
Tarin Ziyaee
338
7
0
22 Jan 2025
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar
Vikrant Varma
David Lindner
David Elson
Caleb Biddulph
Ian Goodfellow
Rohin Shah
429
10
0
22 Jan 2025
Topology of Out-of-Distribution Examples in Deep Neural Networks
Esha Datta
Johanna Hennig
Eva Domschot
Connor Mattes
Michael R. Smith
251
1
0
21 Jan 2025
A margin-based replacement for cross-entropy loss
Michael W. Spratling
Heiko H. Schütt
318
0
0
21 Jan 2025
Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
457
5
0
20 Jan 2025
Two Types of AI Existential Risk: Decisive and Accumulative
Philosophical Studies (Philos. Stud.), 2024
Atoosa Kasirzadeh
490
41
0
20 Jan 2025
Learning to Assist Humans without Inferring Rewards
Neural Information Processing Systems (NeurIPS), 2024
Vivek Myers
Evan Ellis
Sergey Levine
Benjamin Eysenbach
Anca Dragan
571
10
0
17 Jan 2025
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Simon Mahns
Zhuokai Zhao
Yibo Jiang
Zhaorun Chen
Chen Zhu
...
Jiayi Liu
Lizhu Zhang
Xiangjun Fan
Hao Ma
Sinong Wang
460
23
0
16 Jan 2025
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
International Conference on Learning Representations (ICLR), 2025
Yaowen Ye
Cassidy Laidlaw
Jacob Steinhardt
ALM
239
3
0
14 Jan 2025
Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
Advanced Video and Signal Based Surveillance (AVSS), 2025
Y. Ranasinghe
Vibashan Vs
James Uplinger
C. D. Melo
Vishal M. Patel
268
2
0
13 Jan 2025
Large Language Models for Bioinformatics
Quantitative Biology (QB), 2025
Wei Ruan
Yanjun Lyu
Jing Zhang
Jianfeng Cai
Peng Shu
...
Ping Ma
Hongtu Zhu
Yajun Yan
D. Zhu
Tianming Liu
AI4CE, LM&MA
177
10
0
10 Jan 2025