Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

11 October 2024
Kaixuan Ji
Guanlin Liu
Ning Dai
Qingping Yang
Renjie Zheng
Zheng Wu
Chen Dun
Quanquan Gu
Lin Yan
Topics: OffRL, LRM
Abstract

Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks, such as math problem solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
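To make the MDP framing concrete, the sketch below illustrates one plausible reading of a SAC-style, token-level objective in which the Q-value is read off from the language model's log-probabilities and regressed against a soft Bellman target. The temperature `beta`, the parameterization `Q(s_t, a_t) = V(s_t) + beta * log pi(a_t | s_t)`, and the exact target construction are assumptions made for illustration, not the paper's verified loss.

```python
# Illustrative sketch only (PyTorch). Assumes a token-level MDP where each
# generated token is an action; beta, gamma, and the Q parameterization below
# are assumptions, not the exact objective from the DQO paper.
import torch

beta = 0.1   # assumed entropy/KL temperature
gamma = 1.0  # undiscounted episodic return, common for LLM fine-tuning

def dqo_style_loss(policy_logits, value_estimates, token_ids, rewards, mask):
    """
    policy_logits  : (B, T, V)   logits from the trainable language model
    value_estimates: (B, T + 1)  per-state values from a value head (V at EOS ~ 0)
    token_ids      : (B, T)      offline response tokens (the "actions")
    rewards        : (B, T)      per-step rewards (often nonzero only at the end)
    mask           : (B, T)      1.0 for valid response tokens, 0.0 for padding
    """
    logprobs = torch.log_softmax(policy_logits, dim=-1)
    # log pi(a_t | s_t) for the taken token at each step
    act_logprob = logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    # Assumed parameterization: Q(s_t, a_t) = V(s_t) + beta * log pi(a_t | s_t)
    q_values = value_estimates[:, :-1] + beta * act_logprob

    # Soft Bellman target: r_t + gamma * V(s_{t+1}), with no gradient through it
    target = rewards + gamma * value_estimates[:, 1:].detach()

    td_error = (q_values - target) * mask
    return (td_error ** 2).sum() / mask.sum().clamp(min=1.0)

# Toy usage with random tensors standing in for a real LM and value head.
B, T, V = 2, 5, 11
loss = dqo_style_loss(
    policy_logits=torch.randn(B, T, V, requires_grad=True),
    value_estimates=torch.randn(B, T + 1),
    token_ids=torch.randint(0, V, (B, T)),
    rewards=torch.zeros(B, T),
    mask=torch.ones(B, T),
)
loss.backward()
```

Because such an objective regresses Q-values against targets computed from logged responses, it can be trained entirely offline, which is the practical advantage the abstract contrasts with online methods like PPO.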

@article{ji2025_2410.09302,
  title={Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization},
  author={Kaixuan Ji and Guanlin Liu and Ning Dai and Qingping Yang and Renjie Zheng and Zheng Wu and Chen Dun and Quanquan Gu and Lin Yan},
  journal={arXiv preprint arXiv:2410.09302},
  year={2025}
}