
FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training?

Data Intelligence (DI), 2024
Abstract

Advancements in Large Language Models (LLMs) highlight the need for ethical practices and data integrity. We introduce a framework that embeds FAIR (Findable, Accessible, Interoperable, Reusable) data principles into LLM training, marking a shift towards FAIR-compliant practices. The framework provides guidelines for integrating these principles, including a checklist for researchers and developers. We demonstrate its practical application through a case study on identifying and mitigating bias in our FAIR-compliant dataset. This work contributes to AI ethics and data science by advocating balanced and ethical training methods for LLMs.
