DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

10 April 2025
Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Communities: ReLM, ELM, LRM
Abstract

Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on the WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, except for certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.
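The abstract describes a meta-evaluation setup: LLM-produced quality scores are correlated with human judgments on MT and TS benchmarks, and reasoning-token usage is correlated with the resulting evaluation quality. As a rough illustration of that kind of analysis (not the authors' code, and with made-up placeholder numbers), a minimal sketch might look like this:

from scipy.stats import kendalltau, spearmanr

# Hypothetical segment-level scores for one system (illustrative values only).
human_scores = [72, 85, 64, 90, 55, 78]        # e.g., human quality ratings
llm_judge_scores = [70, 88, 60, 92, 50, 80]    # scores produced by an LLM judge

# Agreement between the LLM judge and human annotators (segment level).
tau, tau_p = kendalltau(human_scores, llm_judge_scores)
print(f"Kendall tau vs. humans: {tau:.3f} (p={tau_p:.3f})")

# Hypothetical per-setting statistics: reasoning tokens spent vs. the
# judge's correlation with humans, probing the "more reasoning helps" claim.
reasoning_tokens = [120, 350, 800, 1500]
judge_quality = [0.41, 0.48, 0.55, 0.57]       # e.g., Kendall tau per setting
rho, rho_p = spearmanr(reasoning_tokens, judge_quality)
print(f"Spearman rho (tokens vs. quality): {rho:.3f} (p={rho_p:.3f})")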

@article{larionov2025_2504.08120,
  title={DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?},
  author={Daniil Larionov and Sotaro Takeshita and Ran Zhang and Yanran Chen and Christoph Leiter and Zhipin Wang and Christian Greisinger and Steffen Eger},
  journal={arXiv preprint arXiv:2504.08120},
  year={2025}
}