On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs

27 July 2024

Papers citing "On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs"

21 / 21 papers shown

Title
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Anton Razzhigaev Matvey Mikhalchuk Temurbek Rahmatullaev Elizaveta Goncharova Polina Druzhinina Ivan V. Oseledets Andrey Kuznetsov 55 1 0 20 Feb 2025
LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study Van Bach Nguyen Paul Youssef Jorg Schlotterer Christin Seifert 37 14 0 26 Apr 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 41 42 0 01 Mar 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts Mikayel Samvelyan Sharath Chandra Raparthy Andrei Lupu Eric Hambro Aram H. Markosyan ... Minqi Jiang Jack Parker-Holder Jakob Foerster Tim Rocktaschel Roberta Raileanu SyDa 68 61 0 26 Feb 2024
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism Mansi Sakarvadia Arham Khan Aswathy Ajith Daniel Grzenda Nathaniel Hudson André Bauer Kyle Chard Ian T. Foster 172 9 0 25 Oct 2023
Attribution Patching Outperforms Automated Circuit Discovery Aaquib Syed Can Rager Arthur Conmy 55 53 0 16 Oct 2023
Can LLMs facilitate interpretation of pre-trained language models? Basel Mousi Nadir Durrani Fahim Dalvi 36 12 0 22 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 486 0 01 Nov 2022
CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation Tanay Dixit Bhargavi Paranjape Hannaneh Hajishirzi Luke Zettlemoyer SyDa 130 23 0 10 Oct 2022
Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu Karel DÓosterlinck Atticus Geiger Amir Zur Christopher Potts MILM 68 35 0 28 Sep 2022
Towards Faithful Model Explanation in NLP: A Survey Qing Lyu Marianna Apidianaki Chris Callison-Burch XAI 104 105 0 22 Sep 2022
Naturalistic Causal Probing for Morpho-Syntax Afra Amini Tiago Pimentel Clara Meister Ryan Cotterell MILM 98 18 0 14 May 2022
Tailor: Generating and Perturbing Text with Semantic Controls Alexis Ross Tongshuang Wu Hao Peng Matthew E. Peters Matt Gardner 124 77 0 15 Jul 2021
Gradient-based Adversarial Attacks against Text Transformers Chuan Guo Alexandre Sablayrolles Hervé Jégou Douwe Kiela SILM 98 225 0 15 Apr 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 221 402 0 24 Feb 2021
Robustness Gym: Unifying the NLP Evaluation Landscape Karan Goel Nazneen Rajani Jesse Vig Samson Tan Jason M. Wu Stephan Zheng Caiming Xiong Mohit Bansal Christopher Ré AAML OffRL OOD 138 136 0 13 Jan 2021
Topic Modeling with Contextualized Word Representation Clusters Laure Thompson David M. Mimno 94 82 0 23 Oct 2020
On Completeness-aware Concept-Based Explanations in Deep Neural Networks Chih-Kuan Yeh Been Kim Sercan Ö. Arik Chun-Liang Li Tomas Pfister Pradeep Ravikumar FAtt 120 293 0 17 Oct 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 199 876 0 03 May 2018
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Jinpeng Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 294 6,927 0 20 Apr 2018
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 225 3,658 0 28 Feb 2017