
arXiv:2102.01672
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

2 February 2021
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
Antoine Bosselut
Khyathi Raghavi Chandu
Miruna Clinciu
Dipanjan Das
Kaustubh D. Dhole
Wanyu Du
Esin Durmus
Ondrej Dusek
Chris C. Emezue
Varun Gangal
Cristina Garbacea
Tatsunori Hashimoto
Yufang Hou
Yacine Jernite
Harsh Jhamtani
Yangfeng Ji
Shailza Jolly
Mihir Kale
Dhruv Kumar
Faisal Ladhak
Aman Madaan
Mounica Maddela
Khyati Mahajan
Saad Mahamood
Bodhisattwa Prasad Majumder
Pedro Henrique Martins
Angelina McMillan-Major
Simon Mille
Emiel van Miltenburg
Moin Nadeem
Shashi Narayan
Vitaly Nikolaev
Andre Niyongabo Rubungo
Salomey Osei
Ankur P. Parikh
Laura Perez-Beltrachini
Niranjan Rao
Vikas Raunak
Juan Diego Rodriguez
Sashank Santhanam
João Sedoc
Thibault Sellam
Samira Shaikh
Anastasia Shimorina
Marco Antonio Sobrevilla Cabezudo
Hendrik Strobelt
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
Abstract

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.
