Title
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i Kola Ayonrinde Louis Jaburi MILM 60 1 0 01 May 2025
ReSi: A Comprehensive Benchmark for Representational Similarity Measures Max Klabunde Tassilo Wald Tobias Schumacher Klaus H. Maier-Hein Markus Strohmaier Adriana Iamnitchi AI4TS VLM 52 5 0 13 Mar 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment Harrish Thasarathan Julian Forsyth Thomas Fel M. Kowal Konstantinos G. Derpanis 69 7 0 06 Feb 2025
We're Different, We're the Same: Creative Homogeneity Across LLMs Emily Wenger Yoed Kenett 81 3 0 31 Jan 2025
Dimensions underlying the representational alignment of deep neural networks with humans F. Mahner Lukas Muttenthaler Umut Güçlü M. Hebart 28 4 0 28 Jan 2025
Measuring Error Alignment for Decision-Making Systems Binxia Xu Antonis Bikakis Daniel Onah A. Vlachidis Luke Dickens 29 0 0 03 Jan 2025
Quantifying Knowledge Distillation Using Partial Information Decomposition Pasan Dissanayake Faisal Hamman Barproda Halder Ilia Sucholutsky Qiuyi Zhang Sanghamitra Dutta 31 0 0 12 Nov 2024
Emergence of a High-Dimensional Abstraction Phase in Language Transformers Emily Cheng Diego Doimo Corentin Kervadec Iuri Macocco Jade Yu A. Laio Marco Baroni 98 11 0 24 May 2024
Learning with Language-Guided State Abstractions Andi Peng Ilia Sucholutsky Belinda Z. Li T. Sumers Thomas L. Griffiths Jacob Andreas Julie A. Shah LM&Ro 29 8 0 28 Feb 2024
Similarity of Neural Network Models: A Survey of Functional and Representational Measures Max Klabunde Tobias Schumacher M. Strohmaier Florian Lemmerich 38 63 0 10 May 2023
Human Uncertainty in Concept-Based AI Systems Katherine M. Collins Matthew Barker M. Zarlenga Naveen Raman Umang Bhatt M. Jamnik Ilia Sucholutsky Adrian Weller Krishnamurthy Dvijotham 52 39 0 22 Mar 2023
Analyzing Diffusion as Serial Reproduction Raja Marjieh Ilia Sucholutsky Thomas A. Langlois Nori Jacoby Thomas L. Griffiths DiffM 25 4 0 29 Sep 2022
Improving alignment of dialogue agents via targeted human judgements Amelia Glaese Nat McAleese Maja Trębacz John Aslanides Vlad Firoiu ... John F. J. Mellor Demis Hassabis Koray Kavukcuoglu Lisa Anne Hendricks G. Irving ALM AAML 217 495 0 28 Sep 2022
Concept Embedding Models: Beyond the Accuracy-Explainability Trade-Off M. Zarlenga Pietro Barbiero Gabriele Ciravegna G. Marra Francesco Giannini ... F. Precioso S. Melacci Adrian Weller Pietro Liò M. Jamnik 47 52 0 19 Sep 2022
The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks Lukas Huber Robert Geirhos Felix Wichmann 35 12 0 20 May 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022
Passive Attention in Artificial Neural Networks Predicts Human Visual Selectivity Thomas A. Langlois H. C. Zhao Erin Grant Ishita Dasgupta Thomas L. Griffiths Nori Jacoby 33 13 0 14 Jul 2021
Zero-Shot Text-to-Image Generation Aditya A. Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen Ilya Sutskever VLM 253 3,790 0 24 Feb 2021
On the surprising similarities between supervised and self-supervised models Robert Geirhos Kantharaju Narayanappa Benjamin Mitzkus Matthias Bethge Felix Wichmann Wieland Brendel OOD SSL DRL 45 45 0 16 Oct 2020
On Completeness-aware Concept-Based Explanations in Deep Neural Networks Chih-Kuan Yeh Been Kim Sercan Ö. Arik Chun-Liang Li Tomas Pfister Pradeep Ravikumar FAtt 115 293 0 17 Oct 2019
Fine-Tuning Language Models from Human Preferences Daniel M. Ziegler Nisan Stiennon Jeff Wu Tom B. Brown Alec Radford Dario Amodei Paul Christiano G. Irving ALM 273 1,561 0 18 Sep 2019
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks Chelsea Finn Pieter Abbeel Sergey Levine OOD 234 11,568 0 09 Mar 2017
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles Balaji Lakshminarayanan Alexander Pritzel Charles Blundell UQCV BDL 268 4,940 0 05 Dec 2016