Generalizing Backpropagation for Gradient-Based Interpretability

6 July 2023

Papers citing "Generalizing Backpropagation for Gradient-Based Interpretability"

Title
The Gradient of Algebraic Model Counting | Jaron Maene, Luc de Raedt | - | 49 | 0 | 0 | 25 Feb 2025
Activation Scaling for Steering and Interpreting Language Models | Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein | LLMSV, LRM | 29 | 4 | 0 | 07 Oct 2024
Exploring the Trade-off Between Model Performance and Explanation Plausibility of Text Classifiers Using Human Rationales | Lucas Resck, Marcos M. Raimundo, Jorge Poco | - | 24 | 1 | 0 | 03 Apr 2024
Localizing Paragraph Memorization in Language Models | Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis | MU | 38 | 13 | 0 | 28 Mar 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild | Rhys Gould, Euan Ong, George Ogden, Arthur Conmy | LRM | 6 | 44 | 0 | 14 Dec 2023