A Multimodal Automated Interpretability Agent

22 April 2024

Tamar Rott Shaham

Sarah Schwettmann

Achyuta Rajaram

Antonio Torralba

Papers citing "A Multimodal Automated Interpretability Agent"

Title
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers Quentin Guimard Moreno D'Incà Massimiliano Mancini Elisa Ricci SSL 72 0 0 29 Apr 2025
AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models Mohammad Zarei Melanie A Jutras Eliana Evans Mike Tan Omid Aaramoon AAML DiffM 40 0 0 24 Apr 2025
Interpreting the Linear Structure of Vision-language Model Embedding Spaces Isabel Papadimitriou Huangyuan Su Thomas Fel Naomi Saphra Sham Kakade Stephanie Gil VLM 40 0 0 16 Apr 2025
Uncovering Branch specialization in InceptionV1 using k sparse autoencoders Matthew Bozoukov 14 0 0 14 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems Simon Lermen Mateusz Dziemian Natalia Pérez-Campanero Antolín 31 0 0 10 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel D'Oosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 57 0 0 13 Mar 2025
TinyEmo: Scaling down Emotional Reasoning via Metric Projection Cristian Gutierrez LRM 62 0 0 17 Feb 2025
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey Yunkai Dang Kaichen Huang Jiahao Huo Yibo Yan S. Huang ... Kun Wang Yong Liu Jing Shao Hui Xiong Xuming Hu LRM 96 14 0 03 Dec 2024
Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language Anthony Costarelli Mat Allen Severin Field 16 1 0 03 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations Nick Jiang Anish Kachinthaya Suzie Petryk Yossi Gandelsman VLM 32 14 0 03 Oct 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 40 18 0 02 Aug 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques Rohan Gupta Iván Arcuschin Thomas Kwa Adrià Garriga-Alonso 40 3 0 19 Jul 2024
Automatic benchmarking of large multimodal models via iterative experiment programming Alessandro Conti Enrico Fini Paolo Rota Yiming Wang Massimiliano Mancini Elisa Ricci 30 0 0 18 Jun 2024
Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents Yoann Poupart 23 0 0 06 Jun 2024
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models Qiji Zhou Ruochen Zhou Zike Hu Panzhong Lu Siyang Gao Yue Zhang LRM 38 12 0 22 May 2024
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 153 186 0 02 May 2023
Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases Aengus Lynch G. Dovonon Jean Kaddour Ricardo M. A. Silva 184 30 0 09 Mar 2023
Natural Language Descriptions of Deep Visual Features Evan Hernandez Sarah Schwettmann David Bau Teona Bagashvili Antonio Torralba Jacob Andreas MILM 194 116 0 26 Jan 2022
Emerging Properties in Self-Supervised Vision Transformers Mathilde Caron Hugo Touvron Ishan Misra Hervé Jégou Julien Mairal Piotr Bojanowski Armand Joulin 283 5,723 0 29 Apr 2021