
Concrete Problems in AI Safety

21 June 2016

Papers citing "Concrete Problems in AI Safety"

50 / 1,371 papers shown

Title
COOD: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification L. E. Hogeweg R. Gangireddy D. Brunink Vincent J. Kalkman L. Cornelissen J. W. Kamminga OODD 160 5 0 11 Mar 2024
ALaRM: Align Language Models via Hierarchical Rewards Modeling Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Yuhang Lai Siyuan Wang Shujun Liu Xuanjing Huang Zhongyu Wei 192 6 0 11 Mar 2024
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking Cassidy Laidlaw Shivam Singhal Anca Dragan AAML 278 11 0 05 Mar 2024
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models Arijit Ghosh Chowdhury Md. Mofijul Islam Vaibhav Kumar F. H. Shezan Vaibhav Kumar Vinija Jain Vasu Sharma AAML PILM 224 46 0 03 Mar 2024
Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards Katherine Metcalf Miguel Sarabia Natalie Mackraz B. Theobald 176 10 0 28 Feb 2024
Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials Anna L. Trella Kelly W. Zhang Inbal Nahum-Shani Vivek Shetty Iris Yan Finale Doshi-Velez Susan A. Murphy OffRL OnRL 174 4 0 26 Feb 2024
Rethinking Software Engineering in the Foundation Model Era: A Curated Catalogue of Challenges in the Development of Trustworthy FMware Ahmed E. Hassan Dayi Lin Gopi Krishnan Rajbahadur Keheliya Gallaba F. Côgo ... Kishanthan Thangarajah G. Oliva Jiahuei Lin Wali Mohammad Abdullah Zhen Ming Jiang 173 8 0 25 Feb 2024
Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond Zhiyuan Wang Jinhao Duan Chenxi Yuan Qingyu Chen Tianlong Chen Huaxiu Yao Yue Zhang Ren Wang Kaidi Xu Xiaoshuang Shi UQLM 318 22 0 22 Feb 2024
Roadmap on Incentive Compatibility for AI Alignment and Governance in Sociotechnical Systems Zhaowei Zhang Fengshuo Bai Mingzhi Wang Haoyang Ye Chengdong Ma Yaodong Yang 315 6 0 20 Feb 2024
Direct Preference Optimization with an Offset Afra Amini Tim Vieira Robert Bamler 207 97 0 16 Feb 2024
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models Minsuk Kahng Ian Tenney Mahima Pushkarna Michael Xieyang Liu James Wexler Emily Reif Krystal Kallarackal Minsuk Chang Michael Terry Lucas Dixon 239 32 0 16 Feb 2024
On Formally Undecidable Traits of Intelligent Machines Matthew Fox 85 0 0 14 Feb 2024
Mapping the Ethics of Generative AI: A Comprehensive Scoping Review Thilo Hagendorff 185 76 0 13 Feb 2024
In-Context Learning Can Re-learn Forbidden Tasks Sophie Xhonneux David Dobre Jian Tang Gauthier Gidel Dhanya Sridhar 161 6 0 08 Feb 2024
Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation Dennis Hoftijzer Gertjan J. Burghouts Luuk J. Spreeuwers 201 3 0 07 Feb 2024
Explaining Learned Reward Functions with Counterfactual Trajectories Jan Wehner Frans Oliehoek Luciano Cavalcante Siebert 135 0 0 07 Feb 2024
Direct Language Model Alignment from Online AI Feedback Shangmin Guo Biao Zhang Tianlin Liu Tianqi Liu Misha Khalman ... Thomas Mesnard Yao-Min Zhao Bilal Piot Johan Ferret Mathieu Blondel ALM 189 206 0 07 Feb 2024
Reinforcement Learning with Ensemble Model Predictive Safety Certification Sven Gronauer Tom Haider Felippe Schmoeller da Roza Klaus Diepold 130 3 0 06 Feb 2024
Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy Nature Communications (Nat. Commun.), 2024 Xiangru Tang Qiao Jin Kunlun Zhu Tongxin Yuan Yichi Zhang ... Jian Tang Zhuosheng Zhang Arman Cohan Zhiyong Lu Mark B. Gerstein LLMAG ELM 339 47 0 06 Feb 2024
Online Feature Updates Improve Online (Generalized) Label Shift Adaptation Neural Information Processing Systems (NeurIPS), 2024 Ruihan Wu Siddhartha Datta Yi Su Dheeraj Baby Yu Wang Kilian Q. Weinberger 138 4 0 05 Feb 2024
Decoding-time Realignment of Language Models International Conference on Machine Learning (ICML), 2024 Tianlin Liu Shangmin Guo Leonardo Bianco Daniele Calandriello Quentin Berthet Felipe Llinares-López Jessica Hoffmann Lucas Dixon Michal Valko Mathieu Blondel AI4CE 223 55 0 05 Feb 2024
Aligner: Efficient Alignment by Learning to Correct Jiaming Ji Boyuan Chen Hantao Lou Chongye Guo Borong Zhang Xuehai Pan Juntao Dai Tianyi Qiu Yaodong Yang 237 71 0 04 Feb 2024
A Survey of Constraint Formulations in Safe Reinforcement Learning Akifumi Wachi Xun Shen Yanan Sui 294 30 0 03 Feb 2024
Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning D. Bhattacharjya Junkyu Lee Don Joven Agravante Balaji Ganesan Radu Marinescu LLMAG 168 2 0 02 Feb 2024
Rethinking the Role of Proxy Rewards in Language Model Alignment Sungdong Kim Minjoon Seo SyDa ALM 208 5 0 02 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges Mingqi Gao Xinyu Hu Jie Ruan Xiao Pu Xiaojun Wan ELM LM&MA 535 80 0 02 Feb 2024
Continuous Unsupervised Domain Adaptation Using Stabilized Representations and Experience Replay Mohammad Rostami CLL 255 3 0 31 Jan 2024
Rethinking Interpretability in the Era of Large Language Models Chandan Singh J. Inala Michel Galley Rich Caruana Jianfeng Gao LRM AI4CE 236 101 0 30 Jan 2024
Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble Shun Zhang Zhenfang Chen Sunli Chen Yikang Shen Zhiqing Sun Chuang Gan 183 35 0 30 Jan 2024
Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods Yotam Wolf Noam Wies Dorin Shteyman Binyamin Rothberg Yoav Levine Amnon Shashua LLMSV 469 18 0 29 Jan 2024
Off-Policy Primal-Dual Safe Reinforcement Learning International Conference on Learning Representations (ICLR), 2024 Zifan Wu Bo Tang Qian Lin Chao Yu Shangqin Mao Qianlong Xie Xingxing Wang Dong Wang OffRL 250 7 0 26 Jan 2024
Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning International Conference on Computational Linguistics (COLING), 2024 Yanda Chen Chandan Singh Xiaodong Liu Simiao Zuo Bin Yu He He Jianfeng Gao LRM 136 22 0 25 Jan 2024
Towards Socially and Morally Aware RL agent: Reward Design With LLM Zhaoyue Wang 199 4 0 23 Jan 2024
WARM: On the Benefits of Weight Averaged Reward Models International Conference on Machine Learning (ICML), 2024 Alexandre Ramé Nino Vieillard Léonard Hussenot Robert Dadashi Geoffrey Cideron Olivier Bachem Johan Ferret 300 129 0 22 Jan 2024
Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling International Conference on Human Factors in Computing Systems (CHI), 2024 Dongping Zhang Angelos Chatzimparmpas Negar Kamali Jessica Hullman 467 12 0 16 Jan 2024
Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization Houda Nait El Barj Théophile Sautory 244 6 0 14 Jan 2024
Scalable and Efficient Methods for Uncertainty Estimation and Reduction in Deep Learning Soyed Tuhin Ahmed BDL 99 0 0 13 Jan 2024
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Peter Hase Mohit Bansal Peter Clark Sarah Wiegreffe 247 41 0 12 Jan 2024
Long-term Safe Reinforcement Learning with Binary Feedback AAAI Conference on Artificial Intelligence (AAAI), 2024 Akifumi Wachi Wataru Hashimoto Kazumune Hashimoto OffRL 306 6 0 08 Jan 2024
A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation Luca Valente Alessandro Nadalini Asif Veeran Mattia Sinigaglia Bruno Sá ... Baker Mohammad Sandro Pinto Daniele Palossi Luca Benini Davide Rossi 144 14 0 07 Jan 2024
Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning Ke Li Han Guo 142 2 0 04 Jan 2024
Tractable Function-Space Variational Inference in Bayesian Neural Networks Tim G. J. Rudner Zonghao Chen Yee Whye Teh Y. Gal 210 52 0 28 Dec 2023
LLM-SAP: Large Language Models Situational Awareness Based Planning Liman Wang Hanyang Zhong LLMAG 340 6 0 26 Dec 2023
Measuring Value Alignment Fazl Barez Juil Sock 83 5 0 23 Dec 2023
HyperMix: Out-of-Distribution Detection and Classification in Few-Shot Settings Nikhil Mehta Kevin J. Liang Jing Huang Fu-Jen Chu Li Yin Tal Hassner OODD 142 3 0 22 Dec 2023
Toward Responsible AI Use: Considerations for Sustainability Impact Assessment Eva Thelisson Grzegorz Mika Quentin Schneiter Kirtan Padh Himanshu Verma 90 0 0 19 Dec 2023
Concrete Problems in AI Safety, Revisited Inioluwa Deborah Raji Roel Dobbe 133 24 0 18 Dec 2023
On a Functional Definition of Intelligence Warisa Sritriratanarak Paulo Garcia 91 0 0 15 Dec 2023
CERN for AI: A Theoretical Framework for Autonomous Simulation-Based Artificial Intelligence Testing and Alignment European Journal of Futures Research (EJFR), 2023 Ljubiša Bojić Matteo Cinelli D. Ćulibrk Boris Delibasic 149 0 0 14 Dec 2023
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking Jacob Eisenstein Chirag Nagpal Alekh Agarwal Ahmad Beirami Alex D'Amour ... Katherine Heller Stephen Pfohl Deepak Ramachandran Peter Shaw Jonathan Berant 416 136 0 14 Dec 2023