2109.13916
Unsolved Problems in ML Safety
28 September 2021
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
Papers citing "Unsolved Problems in ML Safety"
10 / 10 papers shown
Title
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding
Cameron Domenico Kirk-Giannini
11
0
0
05 May 2025
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Toghrul Abbasli
Kentaroh Toyoda
Yuan Wang
Leon Witt
Muhammad Asif Ali
Yukai Miao
Dan Li
Qingsong Wei
UQCV
55
0
0
25 Apr 2025
Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models
Tri Nguyen
Lohith Srikanth Pentapalli
Magnus Sieverding
Laurah Turner
Seth Overla
...
Michael Gharib
Matt Kelleher
Michael Shukis
Cameron Pawlik
Kelly Cohen
9
0
0
21 Apr 2025
Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment
Qizhang Feng
Siva Rajesh Kasa
Santhosh Kumar Kasa
Hyokun Yun
C. Teo
S. Bodapati
50
5
0
08 Jul 2024
Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results
Mohamad H. Danesh
Alan Fern
79
10
0
11 Jul 2021
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
260
4,299
0
29 Apr 2021
Measuring and Improving Consistency in Pretrained Language Models
Yanai Elazar
Nora Kassner
Shauli Ravfogel
Abhilasha Ravichander
Eduard H. Hovy
Hinrich Schütze
Yoav Goldberg
HILM
226
273
0
01 Feb 2021
RobustBench: a standardized adversarial robustness benchmark
Francesco Croce
Maksym Andriushchenko
Vikash Sehwag
Edoardo Debenedetti
Nicolas Flammarion
M. Chiang
Prateek Mittal
Matthias Hein
VLM
190
554
0
19 Oct 2020
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
183
148
0
02 May 2018
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan
Alexander Pritzel
Charles Blundell
UQCV
BDL
251
4,940
0
05 Dec 2016