Self-improvement, where models improve beyond their current performance without external supervision, remains a challenge. The core difficulty is sourcing a training signal stronger than what the model itself can currently produce. Majority voting has been shown to provide such a signal by aggregating over multiple samples, helping mitigate some of the inconsistencies in LM reasoning. In this work, we show that multi-agent debate--where models collaborate and exchange reasoning over multiple rounds--provides an even richer signal than single-round majority voting. We introduce Multi-Agent Consensus Alignment (MACA), which uses reinforcement learning (RL) to post-train models to effectively utilize multi-agent debate. We find that preference learning over full reasoning traces, learning to differentiate between majority and minority reasoning, is more effective than binary consensus rewards or SFT-based approaches for leveraging these debate signals. This produces three key improvements: models are (1) better at utilizing the multi-agent debate setting (+26.87% on MATH), (2) individually more accurate (+21.51% on MathQA), and (3) more self-consistent (+27.6% on GSM8K). We also see strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA).

View on arXiv

Comments on this paper