BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

3 June 2024

Papers citing "BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards"

Title
Output Scouting: Auditing Large Language Models for Catastrophic Responses | Andrew Bell, João Fonseca | KELM | 38 | 1 | 0 | 04 Oct 2024
Feedback Loops With Language Models Drive In-Context Reward Hacking | Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt | KELM | 42 | 25 | 0 | 09 Feb 2024