The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation

IEEE Access, 2026
Diaoulé Diallo
Katharina Dworatzyk
Sophie Jentzsch
Peer Schütt
Sabine Theis
Tobias Hecking
Main: 11 pages
Appendix: 3 pages
Bibliography: 2 pages
8 figures
4 tables
Abstract

Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human values and safety requirements. Activation steering provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This research advances the literature in three directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering with respect to the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants recruited via Prolific (n = 190). These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean r = 0.776, range 0.157–0.985), indicating that automatic scoring can serve as a proxy for perceived quality. Moderate steering strengths (λ ≈ 0.15) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust (η_p² = 0.616) and fear (η_p² = 0.540) and minimal effects for surprise (η_p² = 0.042). Finally, upgrading from Alpaca to Llama-3 yielded more consistent steering, with significant effects across emotions and strengths (all p < 0.001). Inter-rater reliability was high (ICC = 0.71–0.87), underscoring the robustness of these results. Together, the findings support activation-based control as a scalable method for steering LLM behavior across affective dimensions.
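To make the underlying mechanism concrete, below is a minimal sketch of what activation steering of this kind typically looks like in code: a precomputed style vector is added, scaled by a strength λ, to the hidden states of one transformer layer during generation. The model name, layer index, and the file `disgust_vector.pt` are illustrative assumptions for this sketch, not the authors' exact setup.

```python
# Minimal activation-steering sketch (assumed setup, not the paper's code).
# Assumes a Hugging Face decoder-only model and a precomputed style vector,
# e.g., a mean activation difference between emotional and neutral prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layer_idx = 15                                   # hypothetical layer to steer
lam = 0.15                                       # steering strength, as in the abstract
style_vector = torch.load("disgust_vector.pt")   # hypothetical precomputed vector (hidden_dim,)

def steering_hook(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the hidden states;
    # add the lambda-scaled style vector at every token position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + lam * style_vector.to(hidden.device, hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

prompt = "Describe your day at the office."
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=80)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unsteered model
```

Because the intervention is a single vector addition per layer pass, varying λ (e.g., 0.05–0.3) trades off emotional intensity against text quality, which is exactly the trade-off the human ratings in this study quantify.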
