ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.08112
  4. Cited By
Eliciting Latent Predictions from Transformers with the Tuned Lens

Eliciting Latent Predictions from Transformers with the Tuned Lens

14 March 2023
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
ArXivPDFHTML

Papers citing "Eliciting Latent Predictions from Transformers with the Tuned Lens"

37 / 37 papers shown
Title
Demystifying optimized prompts in language models
Demystifying optimized prompts in language models
Rimon Melamed
Lucas H. McCabe
H. H. Huang
39
0
0
04 May 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Tyler A. Chang
Benjamin Bergen
46
0
0
21 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens
Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki
Sonia Joseph
Ippei Fujisawa
Ryota Kanai
DiffM
30
0
0
18 Apr 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Jonathan Jacobi
Gal Niv
LRM
ReLM
55
0
0
03 Mar 2025
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev
Matvey Mikhalchuk
Temurbek Rahmatullaev
Elizaveta Goncharova
Polina Druzhinina
Ivan V. Oseledets
Andrey Kuznetsov
57
1
0
20 Feb 2025
ReLearn: Unlearning via Learning for Large Language Models
ReLearn: Unlearning via Learning for Large Language Models
Haoming Xu
Ningyuan Zhao
Liming Yang
Sendong Zhao
Shumin Deng
Mengru Wang
Bryan Hooi
Nay Oo
H. Chen
N. Zhang
KELM
CLL
MU
94
0
0
16 Feb 2025
An Analysis Framework for Understanding Deep Neural Networks Based on Network Dynamics
An Analysis Framework for Understanding Deep Neural Networks Based on Network Dynamics
Yuchen Lin
Yong Zhang
Sihan Feng
Hong Zhao
26
0
0
05 Jan 2025
Transformers Use Causal World Models in Maze-Solving Tasks
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael I. Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
99
1
0
16 Dec 2024
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
Huaizhi Ge
Yiming Li
Qifan Wang
Yongfeng Zhang
Ruixiang Tang
AAML
SILM
72
0
0
19 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
52
3
0
17 Nov 2024
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
73
0
0
12 Nov 2024
Controllable Context Sensitivity and the Knob Behind It
Controllable Context Sensitivity and the Knob Behind It
Julian Minder
Kevin Du
Niklas Stoehr
Giovanni Monea
Chris Wendler
Robert West
Ryan Cotterell
KELM
39
3
0
11 Nov 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Zhaofeng Wu
Xinyan Velocity Yu
Dani Yogatama
Jiasen Lu
Yoon Kim
AIFin
46
10
0
07 Nov 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
50
7
0
10 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
41
12
0
08 Oct 2024
Mitigating Copy Bias in In-Context Learning through Neuron Pruning
Mitigating Copy Bias in In-Context Learning through Neuron Pruning
Ameen Ali
Lior Wolf
Ivan Titov
27
2
0
02 Oct 2024
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Christopher Ackerman
Nina Panickssery
DeLMO
21
1
0
02 Oct 2024
Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment
Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment
Sangwon Yu
Jongyoon Song
Bongkyu Hwang
Hoyoung Kang
Sooah Cho
Junhwa Choi
Seongho Joe
Taehee Lee
Youngjune Gwon
Sungroh Yoon
93
4
0
31 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
75
18
0
02 Jul 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in
  LLMs
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen
Jiatong Han
Muhammed Razzak
Lisa Schut
Shreshth A. Malik
Yarin Gal
HILM
46
33
0
22 Jun 2024
A Concept-Based Explainability Framework for Large Multimodal Models
A Concept-Based Explainability Framework for Large Multimodal Models
Jayneel Parekh
Pegah Khayatan
Mustafa Shukor
A. Newson
Matthieu Cord
32
16
0
12 Jun 2024
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Andis Draguns
Andrew Gritsevskiy
S. Motwani
Charlie Rogers-Smith
Jeffrey Ladish
Christian Schroeder de Witt
40
2
0
03 Jun 2024
The Unreasonable Ineffectiveness of the Deeper Layers
The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov
Kushal Tirumala
Hassan Shapourian
Paolo Glorioso
Daniel A. Roberts
41
79
0
26 Mar 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations
  of Language Models
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun
Avi Caciularu
Adam Pearce
Lucas Dixon
Mor Geva
25
87
0
11 Jan 2024
FlexModel: A Framework for Interpretability of Distributed Large
  Language Models
FlexModel: A Framework for Interpretability of Distributed Large Language Models
Matthew Choi
Muhammad Adil Asif
John Willes
David Emerson
AI4CE
ALM
19
1
0
05 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
Giovanni Monea
Maxime Peyrard
Martin Josifoski
Vishrav Chaudhary
Jason Eisner
Emre Kiciman
Hamid Palangi
Barun Patra
Robert West
KELM
47
12
0
04 Dec 2023
HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning
  in Large Language Models
HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
Yinghui He
Yufan Wu
Yilin Jia
Rada Mihalcea
Yulong Chen
Naihao Deng
LRM
LLMAG
17
21
0
25 Oct 2023
LEACE: Perfect linear concept erasure in closed form
LEACE: Perfect linear concept erasure in closed form
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
41
102
0
06 Jun 2023
Explaining How Transformers Use Context to Build Predictions
Explaining How Transformers Use Context to Build Predictions
Javier Ferrando
Gerard I. Gállego
Ioannis Tsiamas
Marta R. Costa-jussá
18
31
0
21 May 2023
The MiniPile Challenge for Data-Efficient Language Models
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
24
41
0
17 Apr 2023
Localizing Model Behavior with Path Patching
Localizing Model Behavior with Path Patching
Nicholas W. Goldowsky-Dill
Chris MacLeod
L. Sato
Aryaman Arora
8
85
0
12 Apr 2023
Not what you've signed up for: Compromising Real-World LLM-Integrated
  Applications with Indirect Prompt Injection
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake
Sahar Abdelnabi
Shailesh Mishra
C. Endres
Thorsten Holz
Mario Fritz
SILM
26
430
0
23 Feb 2023
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
491
0
01 Nov 2022
Toy Models of Superposition
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
120
316
0
21 Sep 2022
All Bark and No Bite: Rogue Dimensions in Transformer Language Models
  Obscure Representational Quality
All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
William Timkey
Marten van Schijndel
213
110
0
09 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
224
402
0
24 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,986
0
31 Dec 2020
1