ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.05124
  4. Cited By
Extracting Latent Steering Vectors from Pretrained Language Models

Extracting Latent Steering Vectors from Pretrained Language Models

10 May 2022
Nishant Subramani
Nivedita Suresh
Matthew E. Peters
    LLMSV
ArXivPDFHTML

Papers citing "Extracting Latent Steering Vectors from Pretrained Language Models"

23 / 23 papers shown
Title
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang
Chin-Ting Hsu
Chan-Hung Yu
Saransh Agrawal
Shih-Cheng Huang
Shang-Tse Chen
Kuan-Hao Huang
Shao-Hua Sun
76
0
0
27 Apr 2025
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Chung-En Sun
Ge Yan
Tsui-Wei Weng
KELM
LRM
60
0
0
27 Mar 2025
Do Multilingual LLMs Think In English?
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
42
3
0
24 Feb 2025
Activation Steering in Neural Theorem Provers
Activation Steering in Neural Theorem Provers
Shashank Kirtania
LLMSV
157
0
0
21 Feb 2025
Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering
Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering
Rumi A. Allbert
James K. Wiles
Vlad Grankovsky
LLMSV
AI4CE
77
1
0
10 Dec 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Zhaofeng Wu
Xinyan Velocity Yu
Dani Yogatama
Jiasen Lu
Yoon Kim
AIFin
46
10
0
07 Nov 2024
Do LLMs "know" internally when they follow instructions?
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
51
3
0
18 Oct 2024
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Weixuan Wang
J. Yang
Wei Peng
LLMSV
26
2
0
16 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
62
14
0
15 Oct 2024
Uncovering Latent Chain of Thought Vectors in Language Models
Uncovering Latent Chain of Thought Vectors in Language Models
Jason Zhang
Scott Viteri
LLMSV
LRM
36
1
0
21 Sep 2024
Extracting Paragraphs from LLM Token Activations
Extracting Paragraphs from LLM Token Activations
Nicholas Pochinkov
Angelo Benoit
Lovkush Agarwal
Zainab Ali Majid
Lucile Ter-Minassian
30
1
0
10 Sep 2024
Residual Stream Analysis with Multi-Layer SAEs
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
26
3
0
06 Sep 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in
  LLMs
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen
Jiatong Han
Muhammed Razzak
Lisa Schut
Shreshth A. Malik
Yarin Gal
HILM
58
33
0
22 Jun 2024
Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Dyah Adila
Shuai Zhang
Boran Han
Yuyang Wang
AAML
LLMSV
34
6
0
05 Jun 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu
Yu-Lin Tsai
Chih-Hsun Lin
Pin-Yu Chen
Chia-Mu Yu
Chun-ying Huang
44
32
0
27 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
61
7
0
26 May 2024
Implicit In-context Learning
Implicit In-context Learning
Zhuowei Li
Zihao Xu
Ligong Han
Yunhe Gao
Song Wen
Di Liu
Hao Wang
Dimitris N. Metaxas
38
1
0
23 May 2024
Continuous Language Model Interpolation for Dynamic and Controllable
  Text Generation
Continuous Language Model Interpolation for Dynamic and Controllable Text Generation
Sara Kangaslahti
David Alvarez-Melis
KELM
29
0
0
10 Apr 2024
Test-Time Model Adaptation with Only Forward Passes
Test-Time Model Adaptation with Only Forward Passes
Shuaicheng Niu
Chunyan Miao
Guohao Chen
Pengcheng Wu
Peilin Zhao
TTA
38
18
0
02 Apr 2024
Fine-grained Text Style Transfer with Diffusion-Based Language Models
Fine-grained Text Style Transfer with Diffusion-Based Language Models
Yiwei Lyu
Tiange Luo
Jiacheng Shi
Todd C. Hollon
Ho Hin Lee
DiffM
32
3
0
31 May 2023
Editing Models with Task Arithmetic
Editing Models with Task Arithmetic
Gabriel Ilharco
Marco Tulio Ribeiro
Mitchell Wortsman
Suchin Gururangan
Ludwig Schmidt
Hannaneh Hajishirzi
Ali Farhadi
KELM
MoMe
MU
45
424
0
08 Dec 2022
Tailor: Generating and Perturbing Text with Semantic Controls
Tailor: Generating and Perturbing Text with Semantic Controls
Alexis Ross
Tongshuang Wu
Hao Peng
Matthew E. Peters
Matt Gardner
136
77
0
15 Jul 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
280
3,844
0
18 Apr 2021
1