ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context

9 July 2024

Papers citing "ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context"


Title
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
42 27 0
16 Jul 2024

Dialect prejudice predicts AI decisions about people's character, employability, and criminality
Valentin Hofmann, Pratyusha Kalluri, Dan Jurafsky, Sharese King
75 38 0
01 Mar 2024

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM ALM
303 11,730 0
04 Mar 2022