Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

24 February 2025
Rylan Schaeffer
Punit Singh Koura
Binh Tang
Ranjan Subramanian
Aaditya K. Singh
Todor Mihaylov
Prajjwal Bhargava
Lovish Madaan
Niladri S. Chatterji
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
Abstract

The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming, and noisy human evaluations, yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. However, three human evaluation categories, including adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, and two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction, pointing to how NLP benchmarks can be leveraged to meet the evaluation needs of our new era of conversational AI.
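
The abstract describes two analyses: per-benchmark correlations with human evaluations, and overparameterized linear regressions that predict human evaluations from NLP scores. Below is a minimal NumPy sketch of both; it is not the authors' code, and all data are synthetic stand-ins shaped like the study (four models, 160 benchmarks). With far more features than models, ordinary least squares is underdetermined, so the minimum-norm solution stands in for the overparameterized regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four Chat Llama 2 scales, 160 NLP benchmarks (shapes from the abstract).
n_models, n_benchmarks = 4, 160
benchmark_scores = rng.uniform(0.2, 0.9, size=(n_models, n_benchmarks))
# Hypothetical per-model human evaluation scores (e.g., average win rates),
# loosely tied to benchmark performance plus annotation noise.
human_scores = benchmark_scores.mean(axis=1) + rng.normal(0.0, 0.02, n_models)

# 1) Correlate each benchmark with human evaluations across models.
correlations = np.array([
    np.corrcoef(benchmark_scores[:, j], human_scores)[0, 1]
    for j in range(n_benchmarks)
])
print(f"Share of positively correlated benchmarks: {(correlations > 0).mean():.0%}")

# 2) Overparameterized linear regression: with more benchmarks (features)
# than models (samples), least squares is underdetermined, so take the
# minimum-norm solution via the Moore-Penrose pseudoinverse.
weights = np.linalg.pinv(benchmark_scores) @ human_scores
predictions = benchmark_scores @ weights
print("Predicted vs. actual human scores:")
print(np.round(predictions, 3), np.round(human_scores, 3))
```

In this toy setting the regression fits the four training models exactly; the paper's claim concerns generalization across model scales, which this sketch does not test.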

@article{schaeffer2025_2502.18339,
  title={Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks},
  author={Rylan Schaeffer and Punit Singh Koura and Binh Tang and Ranjan Subramanian and Aaditya K. Singh and Todor Mihaylov and Prajjwal Bhargava and Lovish Madaan and Niladri S. Chatterji and Vedanuj Goswami and Sergey Edunov and Dieuwke Hupkes and Sanmi Koyejo and Sharan Narang},
  journal={arXiv preprint arXiv:2502.18339},
  year={2025}
}