Title
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems David Dalrymple Joar Skalse Yoshua Bengio Stuart J. Russell Max Tegmark ... Clark Barrett Ding Zhao Zhi-Xuan Tan Jeannette Wing Joshua Tenenbaum 44 51 0 10 May 2024
Watermark-based Detection and Attribution of AI-Generated Content Zhengyuan Jiang Moyang Guo Yuepeng Hu Neil Zhenqiang Gong 16 5 0 05 Apr 2024
Publicly-Detectable Watermarking for Language Models Jaiden Fairoze Sanjam Garg Somesh Jha Saeed Mahloujifar Mohammad Mahmoody Mingyuan Wang WaLM 139 45 0 27 Oct 2023
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks Erfan Shayegani Md Abdullah Al Mamun Yu Fu Pedram Zaree Yue Dong Nael B. Abu-Ghazaleh AAML 144 139 0 16 Oct 2023
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Deep Ganguli Liane Lovitt John Kernion Amanda Askell Yuntao Bai ... Nicholas Joseph Sam McCandlish C. Olah Jared Kaplan Jack Clark 218 441 0 23 Aug 2022
Diffusion Models for Adversarial Purification Weili Nie Brandon Guo Yujia Huang Chaowei Xiao Arash Vahdat Anima Anandkumar WIGM 184 410 0 16 May 2022
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 225 3,658 0 28 Feb 2017