M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

24 May 2023
Yuxia Wang
Jonibek Mansurov
Petar Ivanov
Jinyan Su
Artem Shelmanov
Akim Tsvigun
Chenxi Whitehouse
Osama Mohammed Afzal
Tarek Mahmoud
Toru Sasaki
Thomas Arnold
Alham Fikri Aji
Nizar Habash
Iryna Gurevych
Preslav Nakov
Abstract

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark, M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4.
