Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,371 papers shown
COOD: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification
L. E. Hogeweg
R. Gangireddy
D. Brunink
Vincent J. Kalkman
L. Cornelissen
J. W. Kamminga
OODD
160
5
0
11 Mar 2024
ALaRM: Align Language Models via Hierarchical Rewards Modeling
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhang Lai
Siyuan Wang
Shujun Liu
Xuanjing Huang
Zhongyu Wei
192
6
0
11 Mar 2024
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw
Shivam Singhal
Anca Dragan
AAML
278
11
0
05 Mar 2024
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Arijit Ghosh Chowdhury
Md. Mofijul Islam
Vaibhav Kumar
F. H. Shezan
Vaibhav Kumar
Vinija Jain
Vasu Sharma
AAML, PILM
224
46
0
03 Mar 2024
Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
Katherine Metcalf
Miguel Sarabia
Natalie Mackraz
B. Theobald
176
10
0
28 Feb 2024
Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials
Anna L. Trella
Kelly W. Zhang
Inbal Nahum-Shani
Vivek Shetty
Iris Yan
Finale Doshi-Velez
Susan A. Murphy
OffRL, OnRL
174
4
0
26 Feb 2024
Rethinking Software Engineering in the Foundation Model Era: A Curated Catalogue of Challenges in the Development of Trustworthy FMware
Ahmed E. Hassan
Dayi Lin
Gopi Krishnan Rajbahadur
Keheliya Gallaba
F. Côgo
...
Kishanthan Thangarajah
G. Oliva
Jiahuei Lin
Wali Mohammad Abdullah
Zhen Ming Jiang
173
8
0
25 Feb 2024
Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond
Zhiyuan Wang
Jinhao Duan
Chenxi Yuan
Qingyu Chen
Tianlong Chen
Huaxiu Yao
Yue Zhang
Ren Wang
Kaidi Xu
Xiaoshuang Shi
UQLM
318
22
0
22 Feb 2024
Roadmap on Incentive Compatibility for AI Alignment and Governance in Sociotechnical Systems
Zhaowei Zhang
Fengshuo Bai
Mingzhi Wang
Haoyang Ye
Chengdong Ma
Yaodong Yang
315
6
0
20 Feb 2024
Direct Preference Optimization with an Offset
Afra Amini
Tim Vieira
Robert Bamler
207
97
0
16 Feb 2024
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
Minsuk Kahng
Ian Tenney
Mahima Pushkarna
Michael Xieyang Liu
James Wexler
Emily Reif
Krystal Kallarackal
Minsuk Chang
Michael Terry
Lucas Dixon
239
32
0
16 Feb 2024
On Formally Undecidable Traits of Intelligent Machines
Matthew Fox
85
0
0
14 Feb 2024
Mapping the Ethics of Generative AI: A Comprehensive Scoping Review
Thilo Hagendorff
185
76
0
13 Feb 2024
In-Context Learning Can Re-learn Forbidden Tasks
Sophie Xhonneux
David Dobre
Jian Tang
Gauthier Gidel
Dhanya Sridhar
161
6
0
08 Feb 2024
Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation
Dennis Hoftijzer
Gertjan J. Burghouts
Luuk J. Spreeuwers
201
3
0
07 Feb 2024
Explaining Learned Reward Functions with Counterfactual Trajectories
Jan Wehner
Frans Oliehoek
Luciano Cavalcante Siebert
135
0
0
07 Feb 2024
Direct Language Model Alignment from Online AI Feedback
Shangmin Guo
Biao Zhang
Tianlin Liu
Tianqi Liu
Misha Khalman
...
Thomas Mesnard
Yao-Min Zhao
Bilal Piot
Johan Ferret
Mathieu Blondel
ALM
189
206
0
07 Feb 2024
Reinforcement Learning with Ensemble Model Predictive Safety Certification
Sven Gronauer
Tom Haider
Felippe Schmoeller da Roza
Klaus Diepold
130
3
0
06 Feb 2024
Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy
Nature Communications (Nat. Commun.), 2024
Xiangru Tang
Qiao Jin
Kunlun Zhu
Tongxin Yuan
Yichi Zhang
...
Jian Tang
Zhuosheng Zhang
Arman Cohan
Zhiyong Lu
Mark B. Gerstein
LLMAG, ELM
339
47
0
06 Feb 2024
Online Feature Updates Improve Online (Generalized) Label Shift Adaptation
Neural Information Processing Systems (NeurIPS), 2024
Ruihan Wu
Siddhartha Datta
Yi Su
Dheeraj Baby
Yu Wang
Kilian Q. Weinberger
138
4
0
05 Feb 2024
Decoding-time Realignment of Language Models
International Conference on Machine Learning (ICML), 2024
Tianlin Liu
Shangmin Guo
Leonardo Bianco
Daniele Calandriello
Quentin Berthet
Felipe Llinares-López
Jessica Hoffmann
Lucas Dixon
Michal Valko
Mathieu Blondel
AI4CE
223
55
0
05 Feb 2024
Aligner: Efficient Alignment by Learning to Correct
Jiaming Ji
Boyuan Chen
Hantao Lou
Chongye Guo
Borong Zhang
Xuehai Pan
Juntao Dai
Tianyi Qiu
Yaodong Yang
237
71
0
04 Feb 2024
A Survey of Constraint Formulations in Safe Reinforcement Learning
Akifumi Wachi
Xun Shen
Yanan Sui
294
30
0
03 Feb 2024
Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
D. Bhattacharjya
Junkyu Lee
Don Joven Agravante
Balaji Ganesan
Radu Marinescu
LLMAG
168
2
0
02 Feb 2024
Rethinking the Role of Proxy Rewards in Language Model Alignment
Sungdong Kim
Minjoon Seo
SyDa, ALM
208
5
0
02 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM, LM&MA
535
80
0
02 Feb 2024
Continuous Unsupervised Domain Adaptation Using Stabilized Representations and Experience Replay
Mohammad Rostami
CLL
255
3
0
31 Jan 2024
Rethinking Interpretability in the Era of Large Language Models
Chandan Singh
J. Inala
Michel Galley
Rich Caruana
Jianfeng Gao
LRM, AI4CE
236
101
0
30 Jan 2024
Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble
Shun Zhang
Zhenfang Chen
Sunli Chen
Yikang Shen
Zhiqing Sun
Chuang Gan
183
35
0
30 Jan 2024
Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods
Yotam Wolf
Noam Wies
Dorin Shteyman
Binyamin Rothberg
Yoav Levine
Amnon Shashua
LLMSV
469
18
0
29 Jan 2024
Off-Policy Primal-Dual Safe Reinforcement Learning
International Conference on Learning Representations (ICLR), 2024
Zifan Wu
Bo Tang
Qian Lin
Chao Yu
Shangqin Mao
Qianlong Xie
Xingxing Wang
Dong Wang
OffRL
250
7
0
26 Jan 2024
Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning
International Conference on Computational Linguistics (COLING), 2024
Yanda Chen
Chandan Singh
Xiaodong Liu
Simiao Zuo
Bin Yu
He He
Jianfeng Gao
LRM
136
22
0
25 Jan 2024
Towards Socially and Morally Aware RL agent: Reward Design With LLM
Zhaoyue Wang
199
4
0
23 Jan 2024
WARM: On the Benefits of Weight Averaged Reward Models
International Conference on Machine Learning (ICML), 2024
Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret
300
129
0
22 Jan 2024
Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling
International Conference on Human Factors in Computing Systems (CHI), 2024
Dongping Zhang
Angelos Chatzimparmpas
Negar Kamali
Jessica Hullman
467
12
0
16 Jan 2024
Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization
Houda Nait El Barj
Théophile Sautory
244
6
0
14 Jan 2024
Scalable and Efficient Methods for Uncertainty Estimation and Reduction in Deep Learning
Soyed Tuhin Ahmed
BDL
99
0
0
13 Jan 2024
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Peter Hase
Mohit Bansal
Peter Clark
Sarah Wiegreffe
247
41
0
12 Jan 2024
Long-term Safe Reinforcement Learning with Binary Feedback
AAAI Conference on Artificial Intelligence (AAAI), 2024
Akifumi Wachi
Wataru Hashimoto
Kazumune Hashimoto
OffRL
306
6
0
08 Jan 2024
A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation
Luca Valente
Alessandro Nadalini
Asif Veeran
Mattia Sinigaglia
Bruno Sá
...
Baker Mohammad
Sandro Pinto
Daniele Palossi
Luca Benini
Davide Rossi
144
14
0
07 Jan 2024
Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning
Ke Li
Han Guo
142
2
0
04 Jan 2024
Tractable Function-Space Variational Inference in Bayesian Neural Networks
Tim G. J. Rudner
Zonghao Chen
Yee Whye Teh
Y. Gal
210
52
0
28 Dec 2023
LLM-SAP: Large Language Models Situational Awareness Based Planning
Liman Wang
Hanyang Zhong
LLMAG
340
6
0
26 Dec 2023
Measuring Value Alignment
Fazl Barez
Juil Sock
83
5
0
23 Dec 2023
HyperMix: Out-of-Distribution Detection and Classification in Few-Shot Settings
Nikhil Mehta
Kevin J. Liang
Jing Huang
Fu-Jen Chu
Li Yin
Tal Hassner
OODD
142
3
0
22 Dec 2023
Toward Responsible AI Use: Considerations for Sustainability Impact Assessment
Eva Thelisson
Grzegorz Mika
Quentin Schneiter
Kirtan Padh
Himanshu Verma
90
0
0
19 Dec 2023
Concrete Problems in AI Safety, Revisited
Inioluwa Deborah Raji
Roel Dobbe
133
24
0
18 Dec 2023
On a Functional Definition of Intelligence
Warisa Sritriratanarak
Paulo Garcia
91
0
0
15 Dec 2023
CERN for AI: A Theoretical Framework for Autonomous Simulation-Based Artificial Intelligence Testing and Alignment
European Journal of Futures Research (EJFR), 2023
Ljubiša Bojić
Matteo Cinelli
D. Ćulibrk
Boris Delibasic
149
0
0
14 Dec 2023
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Jacob Eisenstein
Chirag Nagpal
Alekh Agarwal
Ahmad Beirami
Alex D'Amour
...
Katherine Heller
Stephen Pfohl
Deepak Ramachandran
Peter Shaw
Jonathan Berant
416
136
0
14 Dec 2023