Activation Reward Models for Few-Shot Model Alignment

Tianning Chai
Chancharik Mitra
Brandon Huang
Gautam Rajendrakumar Gare
Zhiqiu Lin
Assaf Arbelle
Leonid Karlinsky
Rogerio Feris
Trevor Darrell
Deva Ramanan
Roei Herzig
Main: 8 pages, 3 figures, 6 tables; Appendix: 18 pages
Abstract

Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training with reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) -- a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel benchmark for this few-shot setting and the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RMs achieve state-of-the-art performance on this benchmark, surpassing even GPT-4o.
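
As a rough illustration of the idea, the sketch below shows one way an activation-based, few-shot reward signal could be constructed without finetuning: estimate a preference direction from the hidden activations of a handful of preferred and rejected responses, then score new responses by projecting their activations onto that direction. The function names, the mean-difference direction estimate, and the projection-based scoring are illustrative assumptions, not the authors' actual Activation RM implementation.

```python
# Minimal sketch (assumptions, not the paper's implementation): score responses by
# projecting a frozen model's hidden activations onto a "preference direction"
# estimated from a few preferred/rejected examples -- no finetuning involved.
import torch


def preference_direction(preferred_acts: torch.Tensor,
                         rejected_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a steering direction from few-shot activations.

    preferred_acts, rejected_acts: (num_examples, hidden_dim) activations taken
    from a fixed layer of the frozen LLM/LMM on preferred vs. rejected responses.
    """
    direction = preferred_acts.mean(dim=0) - rejected_acts.mean(dim=0)
    return direction / direction.norm()


def activation_reward(response_act: torch.Tensor, direction: torch.Tensor) -> float:
    """Reward = projection of a response's activation onto the preference direction."""
    return torch.dot(response_act, direction).item()


# Toy usage with random stand-ins for real hidden states.
torch.manual_seed(0)
hidden_dim = 4096
pref = torch.randn(8, hidden_dim) + 0.1   # activations of 8 preferred responses
rej = torch.randn(8, hidden_dim) - 0.1    # activations of 8 rejected responses
d = preference_direction(pref, rej)

candidate = torch.randn(hidden_dim) + 0.1
print("reward:", activation_reward(candidate, d))
```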

@article{chai2025_2507.01368,
  title={Activation Reward Models for Few-Shot Model Alignment},
  author={Tianning Chai and Chancharik Mitra and Brandon Huang and Gautam Rajendrakumar Gare and Zhiqiu Lin and Assaf Arbelle and Leonid Karlinsky and Rogerio Feris and Trevor Darrell and Deva Ramanan and Roei Herzig},
  journal={arXiv preprint arXiv:2507.01368},
  year={2025}
}