Circumventing interpretability: How to defeat mind-readers

21 December 2022

Papers citing "Circumventing interpretability: How to defeat mind-readers"

Title | Authors | Tags | Counts | Date
An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models | Jaturong Kongmanee | - | 34 1 0 | 28 Jan 2025
Mechanistic Interpretability for AI Safety -- A Review | Leonard Bereska, E. Gavves | AI4CE | 40 111 0 | 22 Apr 2024
Don't trust your eyes: on the (un)reliability of feature visualizations | Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau, Wieland Brendel, Been Kim | FAtt, OOD | 27 25 0 | 07 Jun 2023
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML, MILM | 120 316 0 | 21 Sep 2022
Building machines that adapt and compute like brains | Brenden Lake, J. Tenenbaum | AI4CE, FedML, NAI, AILaw | 245 890 0 | 11 Nov 2017