Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
arXiv:2311.00172 · 31 October 2023
Jinhwa Kim, Ali Derakhshan, Ian G. Harris
Papers citing "Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield"
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022