ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
Long-form generation has become a critical and challenging application for Large Language Models (LLMs). Existing studies are limited by their reliance on scarce, high-quality long-form response data and by their focus on coarse-grained, general-purpose metrics (e.g., coherence and helpfulness), which overlook the nuanced, scenario-specific requirements of real-world tasks. To address these limitations, we propose a framework utilizing Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first decomposes each instruction into a set of fine-grained, adaptive constraint criteria spanning key dimensions of long-form generation tasks. Subsequently, we design a reward mechanism that quantifies response quality by how well each response satisfies its corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we leverage reinforcement learning to optimize LLMs using these fine-grained signals. Experimental results show that ACE-RL significantly outperforms existing SFT and RL baselines by 18.63% and 7.61% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 8.76%, providing a more effective training paradigm for long-form generation scenarios.
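The idea of turning quality evaluation into constraint verification can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify how constraints are represented, verified, or aggregated, so the `Constraint` class, the toy rule-based checks (standing in for whatever verifier the method actually uses), and the simple averaging in `constraint_reward` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A hypothetical fine-grained criterion derived from an instruction."""
    description: str
    # Returns a satisfaction score; values are clamped to [0, 1] below.
    check: Callable[[str], float]


def constraint_reward(response: str, constraints: List[Constraint]) -> float:
    """Average per-constraint satisfaction (assumed aggregation, not the paper's)."""
    if not constraints:
        return 0.0
    scores = [min(1.0, max(0.0, c.check(response))) for c in constraints]
    return sum(scores) / len(scores)


# Toy constraints; a real system would likely use a learned or LLM-based verifier.
constraints = [
    Constraint("mentions a deadline",
               lambda r: 1.0 if "deadline" in r.lower() else 0.0),
    Constraint("contains at least 20 words",
               lambda r: min(1.0, len(r.split()) / 20)),
]

reward = constraint_reward("The deadline is Friday.", constraints)
```

A scalar of this kind could then serve as the reward signal in a policy-optimization loop; the choice of an unweighted mean is only one option, and weighting constraints by importance would be an equally plausible design.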