HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
16 November 2023
Zhilin Wang
Yi Dong
Jiaqi Zeng
Virginia Adams
Makesh Narsimhan Sreedhar
Daniel Egert
Olivier Delalleau
Jane Polak Scowcroft
Neel Kant
Aidan Swope
Oleksii Kuchaiev
Papers citing "HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM" (50 of 86 papers shown)
On Evaluating LLM Alignment by Evaluating LLMs as Judges
Yixin Liu
Pengfei Liu
Arman Cohan
ELM
25 Nov 2025
STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Y. Xu
Chaofan Fan
J. Hu
Yu Zhang
Zeng Xiaoyi
J. Zhang
24 Nov 2025
Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
Kartik Garg
Shourya Mishra
Kartikeya Sinha
Ojaswi Pratap Singh
Ayush Chopra
...
Ammar Sheikh
Raghav Maheshwari
Aman Chadha
Vinija Jain
Amitava Das
OffRL
22 Nov 2025
The PLLuM Instruction Corpus
Piotr Pęzik
Filip Żarnecki
Konrad Kaczyński
A. Cichosz
Zuzanna Deckert
...
Konrad Wojtasik
Arkadiusz Janz
P. Kazienko
Julia Moska
Jan Kocoń
21 Nov 2025
Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models
Chenglong Wang
Yifu Huo
Yang Gan
Yongyu Mu
Qiaozhi He
...
Tongran Liu
Anxiang Ma
Zhengtao Yu
Jingbo Zhu
Tong Xiao
16 Nov 2025
Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
Hefei Xu
Le Wu
Chen Cheng
Hao Liu
15 Nov 2025
RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
Mian Wu
Gavin Zhang
Sewon Min
Sergey Levine
Aviral Kumar
OffRL
03 Nov 2025
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Zhenghao Xu
Qin Lu
Qingru Zhang
Liang Qiu
Ilgee Hong
...
Yao Liu
Haoming Jiang
Lihong Li
Hyokun Yun
Tuo Zhao
23 Oct 2025
Rectifying Shortcut Behaviors in Preference-based Reward Learning
Wenqian Ye
Guangtao Zheng
Aidong Zhang
21 Oct 2025
Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Chenghao Zhu
Meiling Tao
Tiannan Wang
Dongyi Ding
Yuchen Eleanor Jiang
Wangchunshu Zhou
21 Oct 2025
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Austin Xu
Xuan-Phi Nguyen
Yilun Zhou
Chien-Sheng Wu
Caiming Xiong
Shafiq Joty
OffRL
ALM
LRM
ELM
20 Oct 2025
Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis
Wonduk Seo
Juhyeon Lee
Junseo Koh
Hyunjin An
Jian Park
Seunghyun Lee
Haihua Chen
Yi Bu
LLMAG
LRM
18 Oct 2025
Detecting Adversarial Fine-tuning with Auditing Agents
Sarah Egler
John Schulman
Nicholas Carlini
AAML
MLAU
17 Oct 2025
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
Yuchun Miao
Liang Ding
Sen Zhang
Rong Bao
L. Zhang
Dacheng Tao
15 Oct 2025
PIXEL: Adaptive Steering Via Position-wise Injection with eXact Estimated Levels under Subspace Calibration
Manjiang Yu
Hongji Li
Priyanka Singh
X. Li
Di Wang
Lijie Hu
LLMSV
11 Oct 2025
ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Jingbiao Mei
Mingsheng Sun
Jinghong Chen
Pengda Qin
Yuhong Li
Da Chen
Bill Byrne
08 Oct 2025
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
Rui Li
Zeyu Zhang
Xiaohe Bo
Zihang Tian
Xu Chen
Quanyu Dai
Zhenhua Dong
Ruiming Tang
RALM
07 Oct 2025
Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
Yiran Shen
Yu Xia
Jonathan D. Chang
Prithviraj Ammanabrolu
01 Oct 2025
Improving Large Vision and Language Models by Learning from a Panel of Peers
J. Hernandez
Jing Shi
Simon Jenni
Vicente Ordonez
Kushal Kafle
01 Sep 2025
The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors
Abdelrahman Sadallah
Tim Baumgärtner
Iryna Gurevych
Ted Briscoe
31 Aug 2025
Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks
Sheng Liu
Qiang Sheng
Danding Wang
Yang Li
Guang Yang
Juan Cao
27 Aug 2025
MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models
Xinyan Jiang
L. Zhang
Jiayi Zhang
Qingsong Yang
Guimin Hu
Di Wang
Lijie Hu
LLMSV
14 Aug 2025
Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models
Mason Nakamura
Saaduddin Mahmud
K. H. Wray
Hamed Zamani
S. Zilberstein
07 Aug 2025
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Peng Lai
Jianjie Zheng
Sijie Cheng
Yun-Nung Chen
Peng Li
Yang Liu
Guanhua Chen
05 Aug 2025
PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
Han Jiang
Dongyao Zhu
Zhihua Wei
Xiaoyuan Yi
Ziang Xiao
Xing Xie
22 Jul 2025
CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
Hang Lv
Sheng Liang
Hao Wang
Hongchao Gu
Yaxiong Wu
Wei Guo
Defu Lian
Yong Liu
Tong Xu
07 Jul 2025
PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization
Meiling Tao
Chenghao Zhu
Dongyi Ding
Tiannan Wang
Yuchen Eleanor Jiang
Wangchunshu Zhou
15 Jun 2025
Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory
Jiancong Xiao
Zhekun Shi
Kaizhao Liu
Q. Long
Weijie J. Su
14 Jun 2025
Control-R: Towards controllable test-time scaling
Di Zhang
Weida Wang
Junxian Li
Xunzhi Wang
Jiatong Li
...
Peng Ye
Shufei Zhang
Xuming He
Yuqiang Li
Dongzhan Zhou
LRM
30 May 2025
SGM: A Framework for Building Specification-Guided Moderation Filters
Linguistics Vanguard (LV), 2024
M. Fatehkia
Enes Altinisik
Mohamed Osman
Husrev Taha Sencar
26 May 2025
Incentivizing High-Quality Human Annotations with Golden Questions
Shang Liu
Zhongze Cai
Hanzhao Wang
Zhongyao Ma
Xiaocheng Li
25 May 2025
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Liang Luo
Jiaqi Zeng
Olivier Delalleau
Hoo-Chang Shin
Felipe Soares
Alexander Bukharin
Ellie Evans
Yi Dong
Oleksii Kuchaiev
16 May 2025
A Systematic Analysis of Base Model Choice for Reward Modeling
Kian Ahrabian
Pegah Jandaghi
Negar Mokhberian
Sai Praneeth Karimireddy
Jay Pujara
16 May 2025
RM-R1: Reward Modeling as Reasoning
Xiusi Chen
Gaotang Li
Xiping Hu
Sara Szymkuć
Cheng Qian
...
Yu Zhang
D. Zhang
Tong Zhang
Hanghang Tong
Heng Ji
ReLM
OffRL
LRM
05 May 2025
Learning a Canonical Basis of Human Preferences from Binary Ratings
Kailas Vodrahalli
Wei Wei
James Zou
31 Mar 2025
ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models
Xiaomin Li
Xupeng Chen
Jingxuan Fan
Eric Hanchen Jiang
Mingye Gao
AAML
26 Mar 2025
A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jian Guan
Jian Wu
Jia-Nan Li
Chuanqi Cheng
Wei Wu
LM&MA
21 Mar 2025
OASST-ETC Dataset: Alignment Signals from Eye-tracking Analysis of LLM Responses
Angela Lopez-Cardona
Sebastian Idesis
Miguel Barreda-Ángeles
Sergi Abadal
Ioannis Arapakis
13 Mar 2025
Green Prompting
Marta Adamska
Daria Smirnova
Hamid Nasiri
Zhengxin Yu
Peter Garraghan
09 Mar 2025
Effectively Steer LLM To Follow Preference via Building Confident Directions
Bingqing Song
Boran Han
Shuai Zhang
Hao Wang
Haoyang Fang
Bonan Min
Yuyang Wang
Mingyi Hong
LLMSV
04 Mar 2025
CoPL: Collaborative Preference Learning for Personalizing LLMs
Youngbin Choi
Seunghyuk Cho
M. Lee
Moonjeong Park
Yesong Ko
Jungseul Ok
Dongwoo Kim
03 Mar 2025
Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
International Conference on Learning Representations (ICLR), 2025
Zhaowei Zhang
Fengshuo Bai
Qizhi Chen
Chengdong Ma
Mingzhi Wang
Haoran Sun
Zilong Zheng
Wenbo Ding
26 Feb 2025
Rethinking Diverse Human Preference Learning through Principal Component Analysis
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Feng Luo
Rui Yang
Hao Sun
Chunyuan Deng
Jiarui Yao
Jingyan Shen
Huan Zhang
Hanjie Chen
18 Feb 2025
Multi-Attribute Steering of Language Models via Targeted Intervention
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Duy Nguyen
Archiki Prasad
Elias Stengel-Eskin
Joey Tianyi Zhou
LLMSV
18 Feb 2025
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Xinyi Yang
Liang Zeng
Heng Dong
Chao Yu
Xiaojun Wu
H. Yang
Yu Wang
Milind Tambe
Tonghan Wang
18 Feb 2025
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Yingshui Tan
Yilei Jiang
Yongbin Li
Qingbin Liu
Xingyuan Bu
Yuchi Xu
Xiangyu Yue
Xiaoyong Zhu
Bo Zheng
ALM
17 Feb 2025
Bone Soups: A Seek-and-Soup Model Merging Approach for Controllable Multi-Objective Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Guofu Xie
Xiao Zhang
Ting Yao
Yunsheng Shi
MoMe
15 Feb 2025
Typhoon T1: An Open Thai Reasoning Model
Pittawat Taveekitworachai
Potsawee Manakul
Kasima Tharnpipitchai
Kunat Pipatanakul
OffRL
LRM
13 Feb 2025
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Yueqin Yin
Shentao Yang
Yujia Xie
Ziyi Yang
Yuting Sun
Hany Awadalla
Weizhu Chen
Mingyuan Zhou
07 Jan 2025
Geometric-Averaged Preference Optimization for Soft Preference Labels
Neural Information Processing Systems (NeurIPS), 2024
Hiroki Furuta
Kuang-Huei Lee
Shixiang Shane Gu
Y. Matsuo
Aleksandra Faust
Heiga Zen
Izzeddin Gur
31 Dec 2024