A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

10 July 2024
Michał Perełkiewicz, Rafał Poświata
ArXiv · PDF · HTML

Papers citing "A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training"

2 / 2 papers shown

1. Deduplicating Training Data Makes Language Models Better
   Katherine Lee, Daphne Ippolito, A. Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
   SyDa · 14 Jul 2021

2. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
   Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
   AIMat · 31 Dec 2020