A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

10 July 2024
Michał Perełkiewicz, Rafał Poświata
ArXiv · PDF · HTML

Papers citing "A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training"

2 / 2 papers shown

1. Deduplicating Training Data Makes Language Models Better
   Katherine Lee, Daphne Ippolito, A. Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
   SyDa · 14 Jul 2021

2. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
   Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
   AIMat · 31 Dec 2020