Representation Bending for Large Language Model Safety. Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates. AAAI Conference on Artificial Intelligence (AAAI), 2024.
Safety Alignment Should Be Made More Than Just a Few Tokens Deep. International Conference on Learning Representations (ICLR), 2024.
Improving Alignment and Robustness with Circuit Breakers. Neural Information Processing Systems (NeurIPS), 2024.
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. International Conference on Learning Representations (ICLR), 2024.