Title
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 59 0 0 29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video Sonia Joseph Praneet Suresh Lorenz Hufe Edward Stevinson Robert Graham Yash Vadi Danilo Bzdok Sebastian Lapuschkin Lee Sharkey Blake A. Richards 60 34 0 28 Apr 2025
Naturally Computed Scale Invariance in the Residual Stream of ResNet18 André Longon 46 22 0 22 Apr 2025
Towards Combinatorial Interpretability of Neural Computation Micah Adler Dan Alistarh Nir Shavit FAtt 67 1 0 10 Apr 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models Thomas Winninger Boussad Addad Katarzyna Kapusta AAML 59 0 0 08 Mar 2025
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 39 35 0 30 Sep 2024
Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence Frederik Pahde Maximilian Dreyer Leander Weber Moritz Weckbecker Christopher J. Anders Thomas Wiegand Wojciech Samek Sebastian Lapuschkin 32 7 0 07 Feb 2022