Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
arXiv:2505.19743 · International Joint Conference on Artificial Intelligence (IJCAI), 2025 · 26 May 2025
Y. Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Papers citing "Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models" (26 papers shown)
DPO Meets PPO: Reinforced Token Optimization for RLHF. Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang. 29 Apr 2024.
Token-level Direct Preference Optimization. Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang. 18 Apr 2024.
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, Yongqiang Ma. 20 Mar 2024.
Aligner: Efficient Alignment by Learning to Correct. Jiaming Ji, Boyuan Chen, Hantao Lou, Chongye Guo, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu, Yaodong Yang. 04 Feb 2024.
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking. Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, ..., Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant. 14 Dec 2023.
Safe RLHF: Safe Reinforcement Learning from Human Feedback. Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang. 19 Oct 2023.
A General Theoretical Paradigm to Understand Learning from Human Preferences. M. G. Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos. International Conference on Artificial Intelligence and Statistics (AISTATS), 2023. 18 Oct 2023.
Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints. Simon Mahns, Yibo Jiang, Yuguang Yang, Han Liu, Yuxin Chen. International Conference on Learning Representations (ICLR), 2023. 28 Sep 2023.
RAIN: Your Language Models Can Align Themselves without Finetuning. Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang R. Zhang. International Conference on Learning Representations (ICLR), 2023. 13 Sep 2023.
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. Rishabh Bhardwaj, Soujanya Poria. 18 Aug 2023.
Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. 18 Jul 2023.
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang. Neural Information Processing Systems (NeurIPS), 2023. 10 Jul 2023.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Neural Information Processing Systems (NeurIPS), 2023. 29 May 2023.
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Boyao Wang, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang. 13 Apr 2023.
RRHF: Rank Responses to Align Language Models with Human Feedback without tears. Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Feiran Huang. Neural Information Processing Systems (NeurIPS), 2023. 11 Apr 2023.
Discovering Language Model Behaviors with Model-Written Evaluations. Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, ..., Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan. Annual Meeting of the Association for Computational Linguistics (ACL), 2022. 19 Dec 2022.
Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, ..., Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan. 15 Dec 2022.
Self-critiquing models for assisting human evaluators. William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike. 12 Jun 2022.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan. 12 Apr 2022.
Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. Neural Information Processing Systems (NeurIPS), 2022. 04 Mar 2022.
WebGPT: Browser-assisted question-answering with human feedback. Reiichiro Nakano, Jacob Hilton, S. Balaji, Jeff Wu, Long Ouyang, ..., Gretchen Krueger, Kevin Button, Matthew Knight, B. Chess, John Schulman. 17 Dec 2021.
Recursively Summarizing Books with Human Feedback. Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan J. Lowe, Jan Leike, Paul Christiano. 22 Sep 2021.
Dynamic Context Selection for Document-level Neural Machine Translation via Reinforcement Learning. Xiaomian Kang, Yang Zhao, Jiajun Zhang, Chengqing Zong. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. 09 Oct 2020.
Learning to summarize from human feedback. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. Neural Information Processing Systems (NeurIPS), 2020. 02 Sep 2020.
Fine-Tuning Language Models from Human Preferences. Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving. 18 Sep 2019.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. 04 Jan 2018.