ResearchTrend.AI

Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
arXiv: 2505.20322
23 May 2025
Mengru Wang
Ziwen Xu
Shengyu Mao
Shumin Deng
Zhaopeng Tu
Ningyu Zhang
    LLMSV

Papers citing "Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms"

50 / 55 papers shown
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David Evans
LLMSV
157
3
0
23 Apr 2025
SEAL: Steerable Reasoning Calibration of Large Language Models for Free
Runjin Chen
Zhenyu Zhang
Junyuan Hong
Souvik Kundu
Zhangyang Wang
OffRLLRM
139
14
0
07 Apr 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
164
23
0
12 Mar 2025
Steering Large Language Model Activations in Sparse Spaces
Reza Bayat
Ali Rahimi-Kalahroudi
Mohammad Pezeshki
Sarath Chandar
Pascal Vincent
LLMSV
76
9
0
28 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
141
17
0
23 Feb 2025
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Alejandro Cuadron
Dacheng Li
Wenjie Ma
Xingyao Wang
Yichuan Wang
...
Aditya Desai
Ion Stoica
Ana Klimovic
Graham Neubig
Joseph E. Gonzalez
LRMAI4CE
304
54
0
12 Feb 2025
AnyEdit: Edit Any Knowledge Encoded in Language Models
Houcheng Jiang
Sihang Li
Ningyu Zhang
Guojun Ma
Mingyang Wan
Xiang Wang
Xiangnan He
Tat-Seng Chua
KELM
135
19
0
08 Feb 2025
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Patrick Leask
Bart Bussmann
Michael T. Pearce
Joseph Isaac Bloom
Curt Tigges
Noura Al Moubayed
Lee D. Sharkey
Neel Nanda
121
15
0
07 Feb 2025
Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba
Evgenia Nitishinskaya
Boaz Barak
Stephanie Lin
Sam Toyer
...
Rachel Dias
Eric Wallace
Kai Y. Xiao
Johannes Heidecke
Amelia Glaese
LRMAAML
167
26
0
31 Jan 2025
Closed-Form Feedback-Free Learning with Forward Projection
Robert O'Shea
Bipin Rajendran
102
28
0
27 Jan 2025
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Xingyu Chen
Jiahao Xu
Tian Liang
Zhiwei He
Jianhui Pang
...
Zizhuo Zhang
Rui Wang
Zhaopeng Tu
Haitao Mi
Dong Yu
LRMReLM
206
197
0
30 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
169
33
0
21 Nov 2024
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Sviatoslav Chalnev
Matthew Siu
Arthur Conmy
LLMSV
120
28
0
04 Nov 2024
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He
Wentao Shu
Xuyang Ge
Lingjie Chen
Junxuan Wang
...
Qipeng Guo
Xuanjing Huang
Zuxuan Wu
Yu-Gang Jiang
Xipeng Qiu
123
31
0
27 Oct 2024
RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
Tanqiu Jiang
Zian Wang
Jiacheng Liang
Changjiang Li
Yuhui Wang
Ting Wang
AAML
82
6
0
25 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao
Alessio Devoto
Giwon Hong
Xiaotang Du
Aryo Pradipta Gema
Hongru Wang
Xuanli He
Kam-Fai Wong
Pasquale Minervini
KELMLLMSV
132
28
0
21 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
112
21
0
10 Oct 2024
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Joris Postmus
Steven Abreu
LLMSV
359
3
0
09 Oct 2024
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin
James Wilken-Smith
Tomáš Dulka
Hardik Bhatnagar
Joseph Isaac Bloom
128
37
0
22 Sep 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary
Atticus Geiger
85
19
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
118
128
0
09 Aug 2024
Disentangling Dense Embeddings with Sparse Autoencoders
Charles O'Neill
Christine Ye
K. Iyer
John F. Wu
58
7
0
01 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
131
39
0
22 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
152
27
0
17 Jul 2024
Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Peter Hase
Thomas Hofweber
Xiang Zhou
Elias Stengel-Eskin
Joey Tianyi Zhou
KELMLRM
101
17
0
27 Jun 2024
Multi-property Steering of Large Language Models with Dynamic Activation Composition
Daniel Scalena
Gabriele Sarti
Malvina Nissim
KELMLLMSVAI4CE
89
15
0
25 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSVAAML
112
24
0
21 Jun 2024
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
Federico Errica
G. Siracusano
D. Sanvito
Roberto Bifulco
153
26
0
18 Jun 2024
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Rima Hazra
Sayan Layek
Somnath Banerjee
Soujanya Poria
KELMLLMSV
79
13
0
17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
146
16
0
13 Jun 2024
Controlling Large Language Model Agents with Entropic Activation Steering
Nate Rahn
P. D'Oro
Marc G. Bellemare
LLMSV
76
10
0
01 Jun 2024
Why Larger Language Models Do In-context Learning Differently?
Zhenmei Shi
Junyi Wei
Zhuoyan Xu
Yingyu Liang
81
26
0
30 May 2024
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Yuanpu Cao
Tianrong Zhang
Bochuan Cao
Ziyi Yin
Lu Lin
Fenglong Ma
Jinghui Chen
LLMSV
88
33
0
28 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
135
158
0
22 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
171
159
0
28 Mar 2024
Detoxifying Large Language Models via Knowledge Editing
Meng Wang
Ningyu Zhang
Ziwen Xu
Zekun Xi
Shumin Deng
Yunzhi Yao
Qishen Zhang
Linyi Yang
Jindong Wang
Huajun Chen
KELM
112
66
0
21 Mar 2024
Extending Activation Steering to Broad Skills and Multiple Behaviours
Teun van der Weij
Massimo Poesio
Nandi Schoots
LLMSV
90
18
0
09 Mar 2024
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Kenneth Li
Tianle Liu
Naomi Bashkansky
David Bau
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
96
12
0
13 Feb 2024
Style Vectors for Steering Generative Large Language Models
Kai Konen
Sophie Jentzsch
Diaoulé Diallo
Peer Schütt
Oliver Bensch
Roxanne El Baff
Dominik Opitz
Tobias Hecking
LLMSV
93
21
0
02 Feb 2024
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
59
226
0
09 Dec 2023
Knowledge Editing for Large Language Models: A Survey
Song Wang
Yaochen Zhu
Haochen Liu
Zaiyi Zheng
Chen Chen
Wenlin Yao
KELM
176
163
0
24 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
139
449
0
15 Sep 2023
PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Kaijie Zhu
Jindong Wang
Jiaheng Zhou
Zichen Wang
Hao Chen
...
Linyi Yang
Weirong Ye
Yue Zhang
Neil Zhenqiang Gong
Xingxu Xie
SILM
132
144
0
07 Jun 2023
Editing Large Language Models: Problems, Methods, and Opportunities
Yunzhi Yao
Peng Wang
Bo Tian
Shuyang Cheng
Zhoubo Li
Shumin Deng
Huajun Chen
Ningyu Zhang
KELM
120
314
0
22 May 2023
Word Embeddings Are Steers for Language Models
Chi Han
Jialiang Xu
Manling Li
Yi R. Fung
Chenkai Sun
Nan Jiang
Tarek Abdelzaher
Heng Ji
LLMSV
107
42
0
22 May 2023
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
...
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
258
2,630
0
12 Apr 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
259
1,391
0
10 Feb 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLMOffRLLRM
412
4,606
0
27 Oct 2021
Do Prompt-Based Models Really Understand the Meaning of their Prompts?
Albert Webson
Ellie Pavlick
LRM
134
374
0
02 Sep 2021
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu
Max Bartolo
Alastair Moore
Sebastian Riedel
Pontus Stenetorp
AILawLRM
433
1,200
0
18 Apr 2021