
Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance

Abstract

Existing benchmarks are becoming saturated and struggle to separate the performance of different models, due to factors such as data contamination and advancing LLM capabilities. This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric that revitalizes benchmarks by enhancing model separation. EMDM integrates final-answer and Chain-of-Thought (CoT) reasoning correctness, assigning weights based on the complexity and reasoning depth required to solve each sample in the evaluation data. Using a baseline LLM in two setups, "Unguided," where the model has no prior exposure to the test samples, and "Guided," where the model has prior knowledge of the desired answer, EMDM distinguishes instances of varying difficulty. The CoT and answer correctness from these setups inform an optimization objective for weight assignment, resulting in a more nuanced evaluation of model performance. Compared to the exact match (EM) metric, which achieves 17% separation on ARC-Challenge, EMDM achieves 46%, demonstrating its effectiveness in differentiating models based on their reasoning and knowledge requirements.
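The weighting idea described above can be sketched in code. The following is a minimal illustration only: the bucket names, the 0.5/0.5 credit split between answer and CoT correctness, and the `emdm_sketch` function are assumptions for exposition, not the paper's actual optimization-based weight assignment.

```python
def emdm_sketch(samples, weights):
    """Illustrative weighted-metric sketch in the spirit of EMDM.

    samples: list of dicts with the evaluated model's 'answer_correct' and
    'cot_correct' flags, plus the baseline LLM's 'unguided_correct' and
    'guided_correct' flags for the same instance.
    weights: dict mapping a difficulty bucket to a weight.
    """
    total, norm = 0.0, 0.0
    for s in samples:
        # Difficulty bucket from the baseline LLM's two setups: instances the
        # baseline solves unguided are "easy"; instances it solves only when
        # guided (shown the desired answer) are "medium"; the rest are "hard".
        if s["unguided_correct"]:
            bucket = "easy"
        elif s["guided_correct"]:
            bucket = "medium"
        else:
            bucket = "hard"
        w = weights[bucket]
        # Credit both final-answer and chain-of-thought correctness
        # (equal split here is an arbitrary illustrative choice).
        score = 0.5 * s["answer_correct"] + 0.5 * s["cot_correct"]
        total += w * score
        norm += w
    return total / norm if norm else 0.0


samples = [
    {"unguided_correct": True,  "guided_correct": True,
     "answer_correct": 1, "cot_correct": 1},   # easy, fully correct
    {"unguided_correct": False, "guided_correct": True,
     "answer_correct": 1, "cot_correct": 0},   # medium, answer only
    {"unguided_correct": False, "guided_correct": False,
     "answer_correct": 0, "cot_correct": 0},   # hard, incorrect
]
weights = {"easy": 1.0, "medium": 2.0, "hard": 4.0}
print(emdm_sketch(samples, weights))
```

Because hard instances carry larger weights, two models with identical exact-match accuracy can receive different EMDM scores when one succeeds on more reasoning-intensive samples, which is what drives the improved separation reported above.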

@article{etzine2025_2503.05551,
  title={Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance},
  author={Bryan Etzine and Masoud Hashemi and Nishanth Madhusudhan and Sagar Davasam and Roshnee Sharma and Sathwik Tejaswi Madhusudhan and Vikas Yadav},
  journal={arXiv preprint arXiv:2503.05551},
  year={2025}
}