HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
16 November 2023
Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, Oleksii Kuchaiev
Papers citing "HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM" (36 of 86 papers shown)
Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness
Avinash Amballa, Durga Sandeep Saluru, Gayathri Akkinapalli, Abhishek Sureddy, Akshay Kumar Sureddy
Tags: ALM
26 Nov 2024
Self-Generated Critiques Boost Reward Modeling for Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yue Yu, Zhengxing Chen, Aston Zhang, L Tan, Chenguang Zhu, ..., Suchin Gururangan, Chao-Yue Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Tags: LRM, ALM
25 Nov 2024
Interpreting Language Reward Models via Contrastive Explanations
International Conference on Learning Representations (ICLR), 2024
Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
25 Nov 2024
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang
Tags: OffRL
07 Nov 2024
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
Chris Yuhao Liu, Liang Zeng, Qingbin Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou
Tags: AI4TS
24 Oct 2024
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
24 Oct 2024
Cross-lingual Transfer of Reward Models in Multilingual Alignment
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, James Thorne
23 Oct 2024
Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment
International Conference on Learning Representations (ICLR), 2024
Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie Su, Wenbo Ding
22 Oct 2024
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Tingchen Fu, Mrinank Sharma, Juil Sock, Shay B. Cohen, David M. Krueger, Fazl Barez
Tags: AAML
11 Oct 2024
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Shenao Zhang, Zhihan Liu, Boyi Liu, Yanzhe Zhang, Yingxiang Yang, Yunxing Liu, Liyu Chen, Tao Sun, Ziyi Wang
10 Oct 2024
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Angela Lopez-Cardona, Carlos Segura, Alexandros Karatzoglou, Sergi Abadal, Ioannis Arapakis
Tags: ALM
02 Oct 2024
HelpSteer2-Preference: Complementing Ratings with Preferences
International Conference on Learning Representations (ICLR), 2024
Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
Tags: ALM
02 Oct 2024
The Perfect Blend: Redefining RLHF with Mixture of Judges
Tengyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, ..., Gabriel Cohen, Yuandong Tian, Hao Ma, Sinong Wang, Han Fang
30 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
Tags: ELM
23 Sep 2024
Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization
AAAI Conference on Artificial Intelligence (AAAI), 2024
Zhuangzhuang Ye, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan
17 Sep 2024
Semi-Supervised Reward Modeling via Iterative Self-Training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao
Tags: OffRL
10 Sep 2024
Critique-out-Loud Reward Models
Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu
Tags: ALM, LRM
21 Aug 2024
Learning Goal-Conditioned Representations for Language Reward Models
Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx
18 Jul 2024
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs
Md. Arid Hasan, Maram Hasanain, Fatema Ahmad, Sahinur Rahman Laskar, Sunaya Upadhyay, Vrunda N. Sukhadia, Mucahid Kutlu, Shammur A. Chowdhury, Firoj Alam
13 Jul 2024
DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
Tags: VLM, ALM
01 Jul 2024
Decoding-Time Language Model Alignment with Multiple Objectives
Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon Du
27 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li, Yifan Wang, Anamika Lochab, A. Grama, Ruqi Zhang
Tags: AI4TS
24 Jun 2024
Nemotron-4 340B Technical Report
Nvidia: Bo Adler, Niket Agarwal, Ashwath Aithal, ..., Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, Chen Zhu
17 Jun 2024
Distributional Preference Alignment of LLMs via Optimal Transport
Neural Information Processing Systems (NeurIPS), 2024
Igor Melnyk, Youssef Mroueh, Brian M. Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jirí Navrátil, Jerret Ross
09 Jun 2024
RLHF Workflow: From Reward Modeling to Online RLHF
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
Tags: OffRL
13 May 2024
Performance-Aligned LLMs for Generating Fast Code
Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, T. Gamblin, A. Bhatele
29 Apr 2024
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Makesh Narsimhan Sreedhar, Traian Rebedea, Shaona Ghosh, Jiaqi Zeng, Christopher Parisien
Tags: ALM
04 Apr 2024
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Delin Qu, Han Zhao, Tong Zhang
28 Feb 2024
Multi-modal preference alignment remedies regression of visual instruction tuning on language model
Shengzhi Li, Rongyu Lin, Shichao Pei
16 Feb 2024
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Rui Yang, Xiaoman Pan, Feng Luo, Delin Qu, Han Zhong, Dong Yu, Jianshu Chen
15 Feb 2024
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
14 Feb 2024
Suppressing Pink Elephants with Direct Principle Feedback
Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
12 Feb 2024
Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
Neural Information Processing Systems (NeurIPS), 2024
Chen Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang
Tags: OffRL
11 Feb 2024
Data Diversity Matters for Robust Instruction Tuning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Alexander Bukharin, Tuo Zhao
21 Nov 2023
Can LLMs Follow Simple Rules?
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner
Tags: ALM
06 Nov 2023
Deep Reinforcement Learning from Hierarchical Preference Design
Alexander Bukharin, Yixiao Li, Pengcheng He, Tuo Zhao
06 Sep 2023