Unsolved Problems in ML Safety

28 September 2021

Nicholas Carlini

Jacob Steinhardt

Papers citing "Unsolved Problems in ML Safety"

Title
What Is AI Safety? What Do We Want It to Be? | Jacqueline Harding, Cameron Domenico Kirk-Giannini | | 11 0 0 | 05 May 2025
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review | Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, Qingsong Wei | UQCV | 55 0 0 | 25 Apr 2025
Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models | Tri Nguyen, Lohith Srikanth Pentapalli, Magnus Sieverding, Laurah Turner, Seth Overla, ..., Michael Gharib, Matt Kelleher, Michael Shukis, Cameron Pawlik, Kelly Cohen | | 9 0 0 | 21 Apr 2025
Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment | Qizhang Feng, Siva Rajesh Kasa, Santhosh Kumar Kasa, Hyokun Yun, C. Teo, S. Bodapati | | 50 5 0 | 08 Jul 2024
Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results | Mohamad H. Danesh, Alan Fern | | 79 10 0 | 11 Jul 2021
Emerging Properties in Self-Supervised Vision Transformers | Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin | | 260 4,299 0 | 29 Apr 2021
Measuring and Improving Consistency in Pretrained Language Models | Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich Schütze, Yoav Goldberg | HILM | 226 273 0 | 01 Feb 2021
RobustBench: a standardized adversarial robustness benchmark | Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, M. Chiang, Prateek Mittal, Matthias Hein | VLM | 190 554 0 | 19 Oct 2020
AI safety via debate | G. Irving, Paul Christiano, Dario Amodei | | 183 148 0 | 02 May 2018
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles | Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell | UQCV, BDL | 251 4,940 0 | 05 Dec 2016