Understanding intermediate layers using linear classifier probes

5 October 2016

Papers citing "Understanding intermediate layers using linear classifier probes"

50 / 158 papers shown

Title
Revealing economic facts: LLMs know more than they say Marcus Buckmann Quynh Anh Nguyen Edward Hill 28 0 0 13 May 2025
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation Volodymyr Havrylov Haiwen Huang Dan Zhang Andreas Geiger 111 0 0 04 May 2025
Demystifying optimized prompts in language models Rimon Melamed Lucas H. McCabe H. H. Huang 39 0 0 04 May 2025
Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality Pramook Khungurn Sukit Seripanitkarn Phonphrm Thawatdamrongkit Supasorn Suwajanakorn DiffM 70 0 0 30 Apr 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring Henk Tillman Dan Mossing LLMSV 45 0 0 28 Apr 2025
V $^2$ R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations Zhiyuan Fan Yumeng Wang Sandeep Polisetty Yi Ren Fung 50 0 0 23 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control Hannah Cyberey David E. Evans LLMSV 76 0 0 23 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens Ryota Takatsuki Sonia Joseph Ippei Fujisawa Ryota Kanai DiffM 30 0 0 18 Apr 2025
Interpreting the Linear Structure of Vision-language Model Embedding Spaces Isabel Papadimitriou Huangyuan Su Thomas Fel Naomi Saphra Sham Kakade Stephanie Gil VLM 42 0 0 16 Apr 2025
Among Us: A Sandbox for Measuring and Detecting Agentic Deception Satvik Golechha Adrià Garriga-Alonso LLMAG 52 2 0 05 Apr 2025
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models Zhanke Zhou Zhaocheng Zhu Xuan Li Mikhail Galkin Xiao Feng Sanmi Koyejo Jian Tang Bo Han LRM 56 0 0 28 Mar 2025
ASIDE: Architectural Separation of Instructions and Data in Language Models Egor Zverev Evgenii Kortukov Alexander Panfilov Soroush Tabesh Alexandra Volkova Sebastian Lapuschkin Wojciech Samek Christoph H. Lampert AAML 52 1 0 13 Mar 2025
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model Qiyuan Deng X. Bai Kehai Chen Yaowei Wang Liqiang Nie Min Zhang OffRL 59 0 0 13 Mar 2025
A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification Basudha Pal Siyuan Huang 51 0 0 09 Mar 2025
Statistical Deficiency for Task Inclusion Estimation Loïc Fosse Frédéric Béchet Benoit Favre Géraldine Damnati Gwénolé Lecorvé Maxime Darrin Philippe Formont Pablo Piantanida 130 0 0 07 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation Jonathan Jacobi Gal Niv LRM ReLM 60 0 0 03 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models Junsol Kim James Evans Aaron Schein 75 2 0 03 Mar 2025
Disentangling Visual Transformers: Patch-level Interpretability for Image Classification Guillaume Jeanneret Loïc Simon F. Jurie ViT 44 0 0 24 Feb 2025
Bayesian Comparisons Between Representations Heiko H. Schütt FAtt 152 0 0 20 Feb 2025
Towards Active Participant Centric Vertical Federated Learning: Some Representations May Be All You Need Jon Irureta Jon Imaz Aizea Lojo Javier Fernandez-Marques Marco González Iñigo Perona FedML 85 1 0 20 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis Ge Lei Samuel J. Cooper KELM 47 0 0 15 Feb 2025
Superpose Singular Features for Model Merging Haiquan Qiu You Wu Quanming Yao MoMe 43 0 0 15 Feb 2025
Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach J. Yang Dapeng Chen Yajing Sun Rongjun Li Zhiyong Feng Wei Peng 51 5 0 19 Jan 2025
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning Eitan Wagner Nitay Alon J. Barnby Omri Abend LRM 85 2 0 18 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael I. Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 119 1 0 16 Dec 2024
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations Huaizhi Ge Yiming Li Qifan Wang Yongfeng Zhang Ruixiang Tang AAML SILM 78 0 0 19 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 40 5 0 07 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks Nathalie Maria Kirch Constantin Weisser Severin Field Helen Yannakoudakis Stephen Casper 37 2 0 02 Nov 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling Emanuele Marconato Sébastien Lachapelle Sebastian Weichwald Luigi Gresele 66 3 0 30 Oct 2024
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation Jaechang Kim Jinmin Goh Inseok Hwang Jaewoong Cho Jungseul Ok ELM 28 1 0 28 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 57 9 0 18 Oct 2024
Do LLMs "know" internally when they follow instructions? Juyeon Heo Christina Heinze-Deml Oussama Elachqar Shirley Ren Udhay Nallasamy Andy Miller Kwan Ho Ryan Chan Jaya Narain 51 3 0 18 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Guo Aaquib Syed Abhay Sheshadri Aidan Ewart Gintare Karolina Dziugaite KELM MU 31 5 0 16 Oct 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models Kushal Tatariya Vladimir Araujo Thomas Bauwens Miryam de Lhoneux VLM 31 0 0 15 Oct 2024
Temporal Reasoning Transfer from Text to Video Lei Li Yuanxin Liu Linli Yao Peiyuan Zhang Chenxin An Lean Wang Xu Sun Lingpeng Kong Qi Liu LRM 40 7 0 08 Oct 2024
Provable Weak-to-Strong Generalization via Benign Overfitting David X. Wu A. Sahai 63 6 0 06 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 59 2 0 30 Sep 2024
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis Daoyang Li Mingyu Jin Qingcheng Zeng Mengnan Du 57 2 0 22 Sep 2024
Self-Contrastive Forward-Forward Algorithm Xing Chen Dongshu Liu Jérémie Laydevant Julie Grollier 34 2 0 17 Sep 2024
Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach Tong Nie Junlin He Yuewen Mei Guoyang Qin Guilong Li Jian Sun Wei Ma 32 3 0 30 Aug 2024
LLMs' morphological analyses of complex FST-generated Finnish words Anssi Moisio Mathias Creutz M. Kurimo 44 1 0 11 Jul 2024
Does ChatGPT Have a Mind? Simon Goldstein B. Levinstein AI4MH LRM 34 5 0 27 Jun 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs Jannik Kossen Jiatong Han Muhammed Razzak Lisa Schut Shreshth A. Malik Yarin Gal HILM 55 33 0 22 Jun 2024
Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers Manuel Mondal Ljiljana Dolamic Gérôme Bovet Philippe Cudré-Mauroux Julien Audiffren 38 2 0 21 Jun 2024
Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell Taiming Lu Muhan Gao Kuai Yu Adam Byerly Daniel Khashabi 46 11 0 20 Jun 2024
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models Sree Harsha Tanneru Dan Ley Chirag Agarwal Himabindu Lakkaraju LRM 31 4 0 15 Jun 2024
Designing a Dashboard for Transparency and Control of Conversational AI Yida Chen Aoyu Wu Trevor DePodesta Catherine Yeh Kenneth Li ... Jan Riecke Shivam Raval Olivia Seow Martin Wattenberg Fernanda Viégas 44 16 0 12 Jun 2024
Standards for Belief Representations in LLMs Daniel A. Herrmann B. Levinstein 34 6 0 31 May 2024
On Fairness of Low-Rank Adaptation of Large Models Zhoujie Ding Ken Ziyu Liu Pura Peetathawatchai Berivan Isik Sanmi Koyejo 40 4 0 27 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories Tianlong Wang Xianfeng Jiao Yifan He Zhongzhi Chen Yinghao Zhu Xu Chu Junyi Gao Yasha Wang Liantao Ma LLMSV 61 7 0 26 May 2024