A generative approach to LLM harmfulness detection with special red flag tokens

22 February 2025

Papers citing "A generative approach to LLM harmfulness detection with special red flag tokens"


Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, J. Zico Kolter
23 Apr 2025