Concrete Problems in AI Safety
21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,374 papers shown
Fine-Tuning Diffusion Generative Models via Rich Preference Optimization
Hanyang Zhao
Haoxian Chen
Yucheng Guo
Genta Indra Winata
Tingting Ou
Ziyu Huang
D. Yao
Wenpin Tang
478
3
0
13 Mar 2025
NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Mert Albaba
Chenhao Li
Markos Diomataris
Omid Taheri
Andreas Krause
M. Black
VGen
234
6
0
13 Mar 2025
Ensemble Learning for Large Language Models in Text and Code Generation: A Survey
Mari Ashiga
Wei Jie
Fan Wu
Vardan K. Voskanyan
Fateme Dinmohammadi
P. Brookes
Jingzhi Gong
Zheng Wang
307
6
0
13 Mar 2025
Generating Robot Constitutions & Benchmarks for Semantic Safety
P. Sermanet
Anirudha Majumdar
A. Irpan
Dmitry Kalashnikov
Vikas Sindhwani
LM&Ro
355
10
0
11 Mar 2025
Mitigating Preference Hacking in Policy Optimization with Pessimism
Dhawal Gupta
Adam Fisch
Christoph Dann
Alekh Agarwal
253
2
0
10 Mar 2025
RePO: Understanding Preference Learning Through ReLU-Based Optimization
Junkang Wu
Kexin Huang
Qingsong Wen
Jinyang Gao
Bolin Ding
Jiancan Wu
Xiangnan He
Xiang Wang
261
3
0
10 Mar 2025
Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners
Calarina Muslimani
Kerrick Johnstonbaugh
Suyog Chandramouli
Serena Booth
W. B. Knox
Matthew E. Taylor
152
2
0
08 Mar 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim
Xiaoyuan Yi
Jing Yao
Muhua Huang
Jinyeong Bak
James Evans
Xing Xie
275
0
0
08 Mar 2025
Blockchain As a Platform For Artificial Intelligence (AI) Transparency
Afroja Akther
Ayesha Arobee
Abdullah Al Adnan
Omum Auyon
ASM Johirul Islam
Farhad Akter
195
4
0
07 Mar 2025
ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making
Yitong Luo
Hou Hei Lam
Ziang Chen
Zhenliang Zhang
Xue Feng
288
0
0
06 Mar 2025
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Borong Zhang
Yuhao Zhang
Yalan Qin
Yingshan Lei
Josef Dai
Yuanpei Chen
Yaodong Yang
427
4
0
05 Mar 2025
Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models
Dilxat Muhtar
Enzhuo Zhang
Zhenshi Li
Feng-Xue Gu
Yanglangxing He
Pengfeng Xiao
Xueliang Zhang
234
7
0
02 Mar 2025
HALO: Robust Out-of-Distribution Detection via Joint Optimisation
Hugo Lyons Keenan
S. Erfani
Christopher Leckie
OODD
496
0
0
27 Feb 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stañczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
992
3
0
27 Feb 2025
RIZE: Adaptive Regularization for Imitation Learning
Adib Karimi
Mohammad Mehdi Ebadzadeh
OOD
245
1
0
27 Feb 2025
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
Shalev Lifshitz
Sheila A. McIlraith
Yilun Du
LRM
350
24
0
27 Feb 2025
Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu
Xuandong Zhao
Chengyuan Yao
Han Wang
Qi Han
Yanghua Xiao
505
42
0
26 Feb 2025
Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs
International Conference on Learning Representations (ICLR), 2025
Yuhan Chen
Yihong Luo
Yifan Song
Pengwen Dai
Jing Tang
Xiaochun Cao
OODD
390
6
0
25 Feb 2025
Logit Disagreement: OoD Detection with Bayesian Neural Networks
Kevin Raina
UQCV, BDL, UD, PER
391
1
0
24 Feb 2025
A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics
Ting-Ruen Wei
Haowei Liu
Xuyang Wu
Yi Fang
LRM, AI4CE, ReLM, KELM
713
8
0
21 Feb 2025
Robust Concept Erasure Using Task Vectors
Minh Pham
Kelly O. Marshall
Chinmay Hegde
Niv Cohen
429
24
0
21 Feb 2025
Alignment, Agency and Autonomy in Frontier AI: A Systems Engineering Perspective
Krti Tallam
157
6
0
20 Feb 2025
Leveraging Intermediate Representations for Better Out-of-Distribution Detection
Gianluca Guglielmo
Marc Masana
OODD
246
1
0
18 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Jesseba Fernando
Grigori Guitchounts
AI4CE
225
4
0
17 Feb 2025
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
469
15
0
16 Feb 2025
FairDropout: Using Example-Tied Dropout to Enhance Generalization of Minority Groups
Géraldin Nanfack
Eugene Belilovsky
265
1
0
10 Feb 2025
Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis
Aran Nayebi
524
3
0
09 Feb 2025
Why human-AI relationships need socioaffective alignment
Humanities and Social Sciences Communications (HSSC), 2025
Hannah Rose Kirk
Iason Gabriel
Chris Summerfield
Bertie Vidgen
Scott A. Hale
228
49
0
04 Feb 2025
Process-Supervised Reinforcement Learning for Code Generation
Yufan Ye
Ting Zhang
Wenbin Jiang
Hua Huang
OffRL, LRM, SyDa
286
13
0
03 Feb 2025
A statistically consistent measure of semantic uncertainty using Language Models
Yi Liu
309
0
0
01 Feb 2025
Constrained Hybrid Metaheuristic Algorithm for Probabilistic Neural Networks Learning
Information Sciences (Inf. Sci.), 2025
Piotr A. Kowalski
Szymon Kucharczyk
Jacek Mańdziuk
249
6
0
28 Jan 2025
Temporal Logic Specification-Conditioned Decision Transformer for Offline Safe Reinforcement Learning
International Conference on Machine Learning (ICML), 2024
Zijian Guo
Weichao Zhou
Wenchao Li
OffRL
254
3
0
28 Jan 2025
BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Neural Information Processing Systems (NeurIPS), 2024
Yibin Wang
Haizhou Shi
Ligong Han
Dimitris N. Metaxas
Hao Wang
BDL, UQLM
678
20
0
28 Jan 2025
The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems
Scott T Steinmetz
Asmeret Naugle
Paul Schutte
Matt Sweitzer
Alex Washburne
Lisa Linville
Daniel Krofcheck
Michal Kucer
Samuel Myren
229
2
0
28 Jan 2025
Evolution and The Knightian Blindspot of Machine Learning
Joel Lehman
Elliot Meyerson
Tarek El-Gaaly
Kenneth O. Stanley
Tarin Ziyaee
316
7
0
22 Jan 2025
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar
Vikrant Varma
David Lindner
David Elson
Caleb Biddulph
Ian Goodfellow
Rohin Shah
384
10
0
22 Jan 2025
Topology of Out-of-Distribution Examples in Deep Neural Networks
Esha Datta
Johanna Hennig
Eva Domschot
Connor Mattes
Michael R. Smith
213
1
0
21 Jan 2025
A margin-based replacement for cross-entropy loss
Michael W. Spratling
Heiko H. Schütt
302
0
0
21 Jan 2025
Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
403
5
0
20 Jan 2025
Two Types of AI Existential Risk: Decisive and Accumulative
Philosophical Studies (Philos. Stud.), 2024
Atoosa Kasirzadeh
428
36
0
20 Jan 2025
Learning to Assist Humans without Inferring Rewards
Neural Information Processing Systems (NeurIPS), 2024
Vivek Myers
Evan Ellis
Sergey Levine
Benjamin Eysenbach
Anca Dragan
546
10
0
17 Jan 2025
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Simon Mahns
Zhuokai Zhao
Yibo Jiang
Zhaorun Chen
Chen Zhu
...
Jiayi Liu
Lizhu Zhang
Xiangjun Fan
Hao Ma
Sinong Wang
436
21
0
16 Jan 2025
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
International Conference on Learning Representations (ICLR), 2025
Yaowen Ye
Cassidy Laidlaw
Jacob Steinhardt
ALM
198
2
0
14 Jan 2025
Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
Advanced Video and Signal Based Surveillance (AVSS), 2025
Y. Ranasinghe
Vibashan Vs
James Uplinger
C. D. Melo
Vishal M. Patel
235
2
0
13 Jan 2025
Large Language Models for Bioinformatics
Quantitative Biology (QB), 2025
Wei Ruan
Yanjun Lyu
Jing Zhang
Jianfeng Cai
Peng Shu
...
Ping Ma
Hongtu Zhu
Yajun Yan
D. Zhu
Tianming Liu
AI4CE, LM&MA
141
10
0
10 Jan 2025
Predictable Artificial Intelligence
Lexin Zhou
Pablo Antonio Moreno Casares
Fernando Martínez-Plumed
John Burden
Ryan Burnell
...
Seán Ó hÉigeartaigh
Danaja Rutar
Wout Schellaert
Konstantinos Voudouris
José Hernández-Orallo
437
6
0
08 Jan 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Ruilin Luo
Zhuofan Zheng
Yifan Wang
Xinzhe Ni
Zicheng Lin
...
Yiyao Yu
C. Shi
Ruihang Chu
Jin Zeng
Yujiu Yang
LRM
698
34
0
08 Jan 2025
FairSense: Long-Term Fairness Analysis of ML-Enabled Systems
International Conference on Software Engineering (ICSE), 2025
Yining She
Sumon Biswas
Jane Hsieh
Eunsuk Kang
245
2
0
03 Jan 2025
Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications
IEEE Access (IEEE Access), 2024
Sinan Ibrahim
Mostafa Mostafa
Ali Jnadi
Hadi Salloum
Pavel Osinenko
OffRL
270
47
0
31 Dec 2024
Uncertainty quantification for improving radiomic-based models in radiation pneumonitis prediction
Chanon Puttanawarut
Romen Samuel Wabina
Nat Sirirutbunkajorn
AI4CE
358
0
0
27 Dec 2024