SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Representation Bending for Large Language Model Safety. Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach. Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution. International Conference on Learning Representations (ICLR), 2024
Programming Refusal with Conditional Activation Steering. International Conference on Learning Representations (ICLR), 2024
LEACE: Perfect linear concept erasure in closed form. Neural Information Processing Systems (NeurIPS), 2023