© 2025 ResearchTrend.AI, All rights reserved.

Understanding intermediate layers using linear classifier probes

5 October 2016
Guillaume Alain
Yoshua Bengio
    FAtt
arXiv:1610.01644

Papers citing "Understanding intermediate layers using linear classifier probes"

50 / 158 papers shown
Revealing economic facts: LLMs know more than they say
Marcus Buckmann
Quynh Anh Nguyen
Edward Hill
28
0
0
13 May 2025
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation
Volodymyr Havrylov
Haiwen Huang
Dan Zhang
Andreas Geiger
111
0
0
04 May 2025
Demystifying optimized prompts in language models
Rimon Melamed
Lucas H. McCabe
H. H. Huang
39
0
0
04 May 2025
Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality
Pramook Khungurn
Sukit Seripanitkarn
Phonphrm Thawatdamrongkit
Supasorn Suwajanakorn
DiffM
70
0
0
30 Apr 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring
Henk Tillman
Dan Mossing
LLMSV
45
0
0
28 Apr 2025
V²R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
Zhiyuan Fan
Yumeng Wang
Sandeep Polisetty
Yi Ren Fung
50
0
0
23 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David E. Evans
LLMSV
76
0
0
23 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki
Sonia Joseph
Ippei Fujisawa
Ryota Kanai
DiffM
30
0
0
18 Apr 2025
Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou
Huangyuan Su
Thomas Fel
Naomi Saphra
Sham Kakade
Stephanie Gil
VLM
42
0
0
16 Apr 2025
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Satvik Golechha
Adrià Garriga-Alonso
LLMAG
52
2
0
05 Apr 2025
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Zhanke Zhou
Zhaocheng Zhu
Xuan Li
Mikhail Galkin
Xiao Feng
Sanmi Koyejo
Jian Tang
Bo Han
LRM
56
0
0
28 Mar 2025
ASIDE: Architectural Separation of Instructions and Data in Language Models
Egor Zverev
Evgenii Kortukov
Alexander Panfilov
Soroush Tabesh
Alexandra Volkova
Sebastian Lapuschkin
Wojciech Samek
Christoph H. Lampert
AAML
52
1
0
13 Mar 2025
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
Qiyuan Deng
X. Bai
Kehai Chen
Yaowei Wang
Liqiang Nie
Min Zhang
OffRL
59
0
0
13 Mar 2025
A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification
Basudha Pal
Siyuan Huang
51
0
0
09 Mar 2025
Statistical Deficiency for Task Inclusion Estimation
Loïc Fosse
Frédéric Béchet
Benoit Favre
Géraldine Damnati
Gwénolé Lecorvé
Maxime Darrin
Philippe Formont
Pablo Piantanida
130
0
0
07 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Jonathan Jacobi
Gal Niv
LRM
ReLM
60
0
0
03 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models
Junsol Kim
James Evans
Aaron Schein
75
2
0
03 Mar 2025
Disentangling Visual Transformers: Patch-level Interpretability for Image Classification
Guillaume Jeanneret
Loïc Simon
F. Jurie
ViT
44
0
0
24 Feb 2025
Bayesian Comparisons Between Representations
Heiko H. Schütt
FAtt
152
0
0
20 Feb 2025
Towards Active Participant Centric Vertical Federated Learning: Some Representations May Be All You Need
Jon Irureta
Jon Imaz
Aizea Lojo
Javier Fernandez-Marques
Marco González
Iñigo Perona
FedML
85
1
0
20 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
Ge Lei
Samuel J. Cooper
KELM
47
0
0
15 Feb 2025
Superpose Singular Features for Model Merging
Haiquan Qiu
You Wu
Quanming Yao
MoMe
43
0
0
15 Feb 2025
Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach
J. Yang
Dapeng Chen
Yajing Sun
Rongjun Li
Zhiyong Feng
Wei Peng
51
5
0
19 Jan 2025
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning
Eitan Wagner
Nitay Alon
J. Barnby
Omri Abend
LRM
85
2
0
18 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael I. Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
119
1
0
16 Dec 2024
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
Huaizhi Ge
Yiming Li
Qifan Wang
Yongfeng Zhang
Ruixiang Tang
AAML
SILM
78
0
0
19 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
40
5
0
07 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch
Constantin Weisser
Severin Field
Helen Yannakoudakis
Stephen Casper
37
2
0
02 Nov 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
Emanuele Marconato
Sébastien Lachapelle
Sebastian Weichwald
Luigi Gresele
66
3
0
30 Oct 2024
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
Jaechang Kim
Jinmin Goh
Inseok Hwang
Jaewoong Cho
Jungseul Ok
ELM
28
1
0
28 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
57
9
0
18 Oct 2024
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
51
3
0
18 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
Gintare Karolina Dziugaite
KELM
MU
31
5
0
16 Oct 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Kushal Tatariya
Vladimir Araujo
Thomas Bauwens
Miryam de Lhoneux
VLM
31
0
0
15 Oct 2024
Temporal Reasoning Transfer from Text to Video
Lei Li
Yuanxin Liu
Linli Yao
Peiyuan Zhang
Chenxin An
Lean Wang
Xu Sun
Lingpeng Kong
Qi Liu
LRM
40
7
0
08 Oct 2024
Provable Weak-to-Strong Generalization via Benign Overfitting
David X. Wu
A. Sahai
63
6
0
06 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
59
2
0
30 Sep 2024
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
Daoyang Li
Mingyu Jin
Qingcheng Zeng
Mengnan Du
57
2
0
22 Sep 2024
Self-Contrastive Forward-Forward Algorithm
Xing Chen
Dongshu Liu
Jérémie Laydevant
Julie Grollier
34
2
0
17 Sep 2024
Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach
Tong Nie
Junlin He
Yuewen Mei
Guoyang Qin
Guilong Li
Jian Sun
Wei Ma
32
3
0
30 Aug 2024
LLMs' morphological analyses of complex FST-generated Finnish words
Anssi Moisio
Mathias Creutz
M. Kurimo
44
1
0
11 Jul 2024
Does ChatGPT Have a Mind?
Simon Goldstein
B. Levinstein
AI4MH
LRM
34
5
0
27 Jun 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen
Jiatong Han
Muhammed Razzak
Lisa Schut
Shreshth A. Malik
Yarin Gal
HILM
55
33
0
22 Jun 2024
Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
Manuel Mondal
Ljiljana Dolamic
Gérôme Bovet
Philippe Cudré-Mauroux
Julien Audiffren
38
2
0
21 Jun 2024
Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell
Taiming Lu
Muhan Gao
Kuai Yu
Adam Byerly
Daniel Khashabi
46
11
0
20 Jun 2024
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
Sree Harsha Tanneru
Dan Ley
Chirag Agarwal
Himabindu Lakkaraju
LRM
31
4
0
15 Jun 2024
Designing a Dashboard for Transparency and Control of Conversational AI
Yida Chen
Aoyu Wu
Trevor DePodesta
Catherine Yeh
Kenneth Li
...
Jan Riecke
Shivam Raval
Olivia Seow
Martin Wattenberg
Fernanda Viégas
44
16
0
12 Jun 2024
Standards for Belief Representations in LLMs
Daniel A. Herrmann
B. Levinstein
34
6
0
31 May 2024
On Fairness of Low-Rank Adaptation of Large Models
Zhoujie Ding
Ken Ziyu Liu
Pura Peetathawatchai
Berivan Isik
Sanmi Koyejo
40
4
0
27 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
61
7
0
26 May 2024