CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations | Annual Meeting of the Association for Computational Linguistics (ACL), 2025 |
Spectral Insights into Data-Oblivious Critical Layers in Large Language Models | Annual Meeting of the Association for Computational Linguistics (ACL), 2025 |
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Annual Meeting of the Association for Computational Linguistics (ACL), 2025 |
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense | North American Chapter of the Association for Computational Linguistics (NAACL), 2025 |
On the Role of Attention Heads in Large Language Model Safety | International Conference on Learning Representations (ICLR), 2025 |
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel |