
| Title | Venue |
|---|---|
| FADE: Why Bad Descriptions Happen to Good Features | Annual Meeting of the Association for Computational Linguistics (ACL), 2025 |
| Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words | International Conference on Learning Representations (ICLR), 2025 |
| Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models | North American Chapter of the Association for Computational Linguistics (NAACL), 2024 |
| Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models | Annual Meeting of the Association for Computational Linguistics (ACL), 2024 |
| Analyzing (In)Abilities of SAEs via Formal Languages | North American Chapter of the Association for Computational Linguistics (NAACL), 2024 |
| Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations | Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024 |
| Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 | Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024 |
| Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | Neural Information Processing Systems (NeurIPS), 2024 |
| LEACE: Perfect linear concept erasure in closed form | Neural Information Processing Systems (NeurIPS), 2023 |