We Need Improved Data Curation and Attribution in AI for Scientific Discovery

3 April 2025
Mara Graziani
Antonio Foncubierta
Dimitrios Christofidellis
Irina Espejo Morales
Malina Molnar
Marvin Alberts
Matteo Manica
Jannis Born
Abstract

As the interplay between human-generated and synthetic data evolves, new challenges arise in scientific discovery concerning the integrity of the data and the stability of the models. In this work, we examine the role of synthetic data as opposed to that of real experimental data for scientific research. Our analyses indicate that nearly three-quarters of experimental datasets available on open-access platforms have relatively low adoption rates, opening new opportunities to enhance their discoverability and usability by automated methods. Additionally, we observe an increasing difficulty in distinguishing synthetic from real experimental data. We propose supplementing ongoing efforts in automating synthetic data detection by increasing the focus on watermarking real experimental data, thereby strengthening data traceability and integrity. Our estimates suggest that watermarking even less than half of the real-world data generated annually could help sustain model robustness, while promoting a balanced integration of synthetic and human-generated content.

@article{graziani2025_2504.02486,
  title={We Need Improved Data Curation and Attribution in AI for Scientific Discovery},
  author={Mara Graziani and Antonio Foncubierta and Dimitrios Christofidellis and Irina Espejo-Morales and Malina Molnar and Marvin Alberts and Matteo Manica and Jannis Born},
  journal={arXiv preprint arXiv:2504.02486},
  year={2025}
}