2406.18629
Cited By
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
26 June 2024
Xin Lai
Zhuotao Tian
Yukang Chen
Senqiao Yang
Xiangru Peng
Jiaya Jia
LRM
Papers citing
"Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
50 / 71 papers shown
Title
Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
Qi Liu
Xinhao Zheng
Renqiu Xia
Xingzhi Qi
Qinxiang Cao
Junchi Yan
AIMat
40
0
0
07 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
60
0
0
05 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
31
0
0
05 May 2025
RM-R1: Reward Modeling as Reasoning
X. Chen
Gaotang Li
Z. Wang
Bowen Jin
Cheng Qian
...
Y. Zhang
D. Zhang
Tong Zhang
Hanghang Tong
Heng Ji
ReLM
OffRL
LRM
42
0
0
05 May 2025
DeepCritic: Deliberate Critique with Large Language Models
Wenkai Yang
Jingwen Chen
Yankai Lin
Ji-Rong Wen
ALM
LRM
23
0
0
01 May 2025
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Pengxiang Li
Zhi Gao
Bofei Zhang
Yapeng Mi
Xiaojian Ma
...
Tao Yuan
Yuwei Wu
Yunde Jia
Song-Chun Zhu
Qing Li
LLMAG
65
0
0
30 Apr 2025
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Hasan Hammoud
Hani Itani
Bernard Ghanem
ReLM
LRM
69
0
0
29 Apr 2025
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Kesen Zhao
B. Zhu
Qianru Sun
Hanwang Zhang
MLLM
LRM
81
0
0
25 Apr 2025
A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents
Yuting Huang
Leilei Ding
Zhipeng Tang
Tianfu Wang
Xinrui Lin
W. Zhang
Mingxiao Ma
Yanyong Zhang
LLMAG
30
0
0
20 Apr 2025
Exploring Expert Failures Improves LLM Agent Tuning
Li-Cheng Lan
Andrew Bai
Minhao Cheng
Ruochen Wang
Cho-Jui Hsieh
LRM
43
0
0
17 Apr 2025
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Jiazhan Feng
Shijue Huang
Xingwei Qu
Ge Zhang
Yujia Qin
Baoquan Zhong
Chengquan Jiang
Jinxin Chi
Wanjun Zhong
OffRL
ReLM
SyDa
KELM
LRM
54
4
0
15 Apr 2025
REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective
Zhihao Xu
Yongqi Tong
Xin Zhang
Jun Zhou
Xiting Wang
31
0
0
15 Apr 2025
FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos
Rui Chen
Lei Sun
Jing Tang
Geng Li
Xiangxiang Chu
LRM
18
0
0
14 Apr 2025
SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Yutao Mou
Yuxiao Luo
Shikun Zhang
Wei Ye
LLMSV
LRM
25
0
0
13 Apr 2025
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
X. Wang
Z. Yang
Chao Feng
Hongjin Lu
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Furong Huang
Lijuan Wang
OODD
ReLM
VLM
LRM
59
1
0
10 Apr 2025
Supervised Optimism Correction: Be Confident When LLMs Are Sure
J. Zhang
Rushuai Yang
Shunyu Liu
Ting-En Lin
Fei Huang
Yi Chen
Y. Li
Dacheng Tao
OffRL
20
0
0
10 Apr 2025
LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models
Weibin Liao
Xin Gao
Tianyu Jia
Rihong Qiu
Yifan Zhu
Yang Lin
Xu Chu
Junfeng Zhao
Yasha Wang
30
0
0
03 Apr 2025
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu
Tianyi Gui
Yu Liu
Linli Xu
VLM
AAML
64
1
0
02 Apr 2025
Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents
Shuo Ren
Pu Jian
Zhenjiang Ren
Chunlin Leng
Can Xie
Jiajun Zhang
LLMAG
AI4CE
53
0
0
31 Mar 2025
CoRanking: Collaborative Ranking with Small and Large Ranking Agents
Wenhan Liu
Xinyu Ma
Y. X. Zhu
Lixin Su
S. Wang
Dawei Yin
Zhicheng Dou
ALM
34
0
0
30 Mar 2025
Controlling Large Language Model with Latent Actions
Chengxing Jia
Ziniu Li
Pengyuan Wang
Yi-Chen Li
Zhenyu Hou
Yuxiao Dong
Y. Yu
46
0
0
27 Mar 2025
Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark
Bingchen Miao
Y. Wu
Minghe Gao
Qifan Yu
Wendong Bu
Wenqiao Zhang
Yunfei Li
Siliang Tang
Tat-Seng Chua
Juncheng Billy Li
LLMAG
LRM
56
0
0
24 Mar 2025
Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning
J. Li
Jie Zhou
Yutao Yang
Bihao Zhan
Qianjun Pan
Yuyang Ding
Qin Chen
Jiang Bo
Xin Lin
Liang He
LRM
57
0
0
24 Mar 2025
A Survey on Mathematical Reasoning and Optimization with Large Language Models
Ali Forootani
OffRL
LRM
AI4CE
40
0
0
22 Mar 2025
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow
Ziyue Wang
Junde Wu
Chang Han Low
Yueming Jin
LRM
55
1
0
21 Mar 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Weiyun Wang
Zhangwei Gao
L. Chen
Zhe Chen
Jinguo Zhu
...
Lewei Lu
Haodong Duan
Yu Qiao
Jifeng Dai
Wenhai Wang
LRM
56
9
0
13 Mar 2025
Thinking Machines: A Survey of LLM based Reasoning Strategies
Dibyanayan Bandyopadhyay
Soham Bhattacharjee
Asif Ekbal
LRM
ELM
36
4
0
13 Mar 2025
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
Sara Rajaee
Kumar Pratik
Gabriele Cesa
Arash Behboodi
OffRL
LRM
56
0
0
12 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
53
3
0
10 Mar 2025
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang
Bohan Jia
Zijie Zhai
Shaosheng Cao
Zheyu Ye
Fei Zhao
Zhe Xu
Yao Hu
Shaohui Lin
MU
OffRL
LRM
MLLM
ReLM
VLM
50
35
0
09 Mar 2025
Process-based Self-Rewarding Language Models
Shimao Zhang
Xiao Liu
Xin Zhang
Junxiao Liu
Zheheng Luo
Shujian Huang
Yeyun Gong
ReLM
SyDa
LRM
82
2
0
05 Mar 2025
IterPref: Focal Preference Learning for Code Generation via Iterative Debugging
Jie Wu
Haoling Li
Xin Zhang
Jianwen Luo
Yangyu Huang
Ruihang Chu
Y. Yang
Scarlett Li
67
0
0
04 Mar 2025
Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models
Joykirat Singh
Tanmoy Chakraborty
A. Nambi
AI4Cl
LRM
ReLM
50
1
0
04 Mar 2025
Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
K. Li
Tianhua Zhang
Yunxiang Li
Hongyin Luo
Abdalla Moustafa
Xixin Wu
James Glass
H. Meng
54
0
0
03 Mar 2025
Self-Training Elicits Concise Reasoning in Large Language Models
Tergel Munkhbat
Namgyu Ho
S. Kim
Yongjin Yang
Yujin Kim
Se-Young Yun
ReLM
LRM
54
10
0
27 Feb 2025
Self-rewarding correction for mathematical reasoning
Wei Xiong
Hanning Zhang
Chenlu Ye
Lichang Chen
Nan Jiang
Tong Zhang
ReLM
KELM
LRM
59
9
0
26 Feb 2025
Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Jiazheng Li
Yuxiang Zhou
Junru Lu
Gladys Tyen
Lin Gui
Cesare Aloisi
Yulan He
LRM
33
2
0
26 Feb 2025
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Wenkai Yang
Shuming Ma
Yankai Lin
Furu Wei
LRM
42
20
0
25 Feb 2025
CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
Shuming Shi
Ruobing Zuo
Gaolei He
Jianlin Wang
Chenyang Xu
Zhengfeng Yang
55
0
0
25 Feb 2025
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
Pei Fu
Tongkun Guan
Zining Wang
Zhentao Guo
Chen Duan
...
Boming Chen
Jiayao Ma
Qianyi Jiang
Kai Zhou
Junfeng Luo
VLM
48
0
0
23 Feb 2025
A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics
Ting-Ruen Wei
Haowei Liu
Xuyang Wu
Yi Fang
LRM
AI4CE
ReLM
KELM
101
1
0
21 Feb 2025
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Yongtao Wu
Luca Viano
Yihang Chen
Zhenyu Zhu
Kimon Antonakopoulos
Quanquan Gu
V. Cevher
43
0
0
18 Feb 2025
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao
Geyang Guo
Xingxing Zhang
Nancy F. Chen
Shafiq R. Joty
Furu Wei
LRM
95
8
0
17 Feb 2025
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
L. Yang
Zhaochen Yu
Bin Cui
Mengdi Wang
ReLM
LRM
AI4CE
90
10
0
10 Feb 2025
PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Junbo Li
Zhangyang Wang
Qiang Liu
OffRL
89
0
0
09 Feb 2025
Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization
Zishun Yu
Tengyu Xu
Di Jin
Karthik Abinav Sankararaman
Yun He
...
Eryk Helenowski
Chen Zhu
Sinong Wang
Hao Ma
Han Fang
LRM
49
4
0
29 Jan 2025
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
Ji Soo Lee
Jongha Kim
Jeehye Na
Jinyoung Park
H. Kim
VGen
34
0
0
12 Jan 2025
Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search
Shuangtao Li
Shuaihao Dong
Kexin Luan
Xinhan Di
Chaofan Ding
LRM
35
1
0
02 Jan 2025
Mars-PO: Multi-Agent Reasoning System Preference Optimization
Xiaoxuan Lou
Chaojie Wang
Bo An
LLMAG
LRM
64
0
0
28 Nov 2024
Patience Is The Key to Large Language Model Reasoning
Yijiong Yu
LRM
65
0
0
20 Nov 2024