ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.08758
  4. Cited By
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean
  Crawled Corpus

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

18 April 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
    AILaw
ArXivPDFHTML

Papers citing "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"

22 / 72 papers shown
Title
Self-Adaptive Named Entity Recognition by Retrieving Unstructured
  Knowledge
Self-Adaptive Named Entity Recognition by Retrieving Unstructured Knowledge
Kosuke Nishida
Naoki Yoshinaga
Kyosuke Nishida
22
2
0
14 Oct 2022
Noise-Robust De-Duplication at Scale
Noise-Robust De-Duplication at Scale
Emily Silcock
Luca DÁmico-Wong
Jinglin Yang
Melissa Dell
SyDa
31
20
0
09 Oct 2022
Language models show human-like content effects on reasoning tasks
Language models show human-like content effects on reasoning tasks
Ishita Dasgupta
Andrew Kyle Lampinen
Stephanie C. Y. Chan
Hannah R. Sheahan
Antonia Creswell
D. Kumaran
James L. McClelland
Felix Hill
ReLM
LRM
28
180
0
14 Jul 2022
Pile of Law: Learning Responsible Data Filtering from the Law and a
  256GB Open-Source Legal Dataset
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Peter Henderson
M. Krass
Lucia Zheng
Neel Guha
Christopher D. Manning
Dan Jurafsky
Daniel E. Ho
AILaw
ELM
129
97
0
01 Jul 2022
Towards WinoQueer: Developing a Benchmark for Anti-Queer Bias in Large
  Language Models
Towards WinoQueer: Developing a Benchmark for Anti-Queer Bias in Large Language Models
Virginia K. Felkner
Ho-Chun Herbert Chang
Eugene Jang
Jonathan May
OSLM
21
8
0
23 Jun 2022
Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender
  Bias
Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias
Yarden Tal
Inbal Magar
Roy Schwartz
17
33
0
20 Jun 2022
Characteristics of Harmful Text: Towards Rigorous Benchmarking of
  Language Models
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Maribeth Rauh
John F. J. Mellor
J. Uesato
Po-Sen Huang
Johannes Welbl
...
Amelia Glaese
G. Irving
Iason Gabriel
William S. Isaac
Lisa Anne Hendricks
25
49
0
16 Jun 2022
On Advances in Text Generation from Images Beyond Captioning: A Case
  Study in Self-Rationalization
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
18
4
0
24 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and
  Their Implications
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
94
26
0
13 May 2022
On the Limitations of Dataset Balancing: The Lost Battle Against
  Spurious Correlations
On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Roy Schwartz
Gabriel Stanovsky
27
24
0
27 Apr 2022
Language Contamination Helps Explain the Cross-lingual Capabilities of
  English Pretrained Models
Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
Terra Blevins
Luke Zettlemoyer
24
85
0
17 Apr 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
61
800
0
14 Apr 2022
Efficient Hierarchical Domain Adaptation for Pretrained Language Models
Efficient Hierarchical Domain Adaptation for Pretrained Language Models
Alexandra Chronopoulou
Matthew E. Peters
Jesse Dodge
23
42
0
16 Dec 2021
Recent Advances in Natural Language Processing via Large Pre-Trained
  Language Models: A Survey
Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey
Bonan Min
Hayley L Ross
Elior Sulem
Amir Pouran Ben Veyseh
Thien Huu Nguyen
Oscar Sainz
Eneko Agirre
Ilana Heinz
Dan Roth
LM&MA
VLM
AI4CE
71
1,029
0
01 Nov 2021
Risks of AI Foundation Models in Education
Risks of AI Foundation Models in Education
Su Lin Blodgett
Michael A. Madaio
UQCV
13
14
0
19 Oct 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
250
1,986
0
31 Dec 2020
Making Pre-trained Language Models Better Few-shot Learners
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao
Adam Fisch
Danqi Chen
241
1,918
0
31 Dec 2020
Extracting Training Data from Large Language Models
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
281
1,812
0
14 Dec 2020
When does MAML Work the Best? An Empirical Study on Model-Agnostic
  Meta-Learning in NLP Applications
When does MAML Work the Best? An Empirical Study on Model-Agnostic Meta-Learning in NLP Applications
Zequn Liu
Ruiyi Zhang
Yiping Song
Wei Ju
Ming Zhang
20
8
0
24 May 2020
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
228
4,460
0
23 Jan 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
208
616
0
03 Sep 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
  Understanding
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,950
0
20 Apr 2018
Previous
12