Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
Jinhwa Kim, Ali Derakhshan, Ian G. Harris
arXiv:2311.00172 · 31 October 2023 · AAML
Papers citing "Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield" (2 of 2 shown)
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
PILM, AAML · 05 Sep 2024
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022