Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

18 April 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
    AILaw

Papers citing "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"

50 / 72 papers shown
Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
Shaurya Sharthak
Vinayak Pahalwan
Adithya Kamath
Adarsh Shirawalmath
CLL
VLM
43
0
0
14 May 2025
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu
Harish Yenala
Aditi Gulati
Shanu Kumar
Parag Agrawal
29
0
0
04 May 2025
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Zhenyu (Allen) Zhang
Zechun Liu
Yuandong Tian
Harshit Khaitan
Z. Wang
Steven Li
57
0
0
28 Apr 2025
Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa
Yang Zhao
H. Kanayama
Bishwaranjan Bhattacharjee
26
0
0
19 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
40
0
0
01 Apr 2025
Un-Straightening Generative AI: How Queer Artists Surface and Challenge the Normativity of Generative AI Models
Jordan Taylor
Joel Mire
Franchesca Spektor
Alicia DeVrio
Maarten Sap
Haiyi Zhu
Sarah E Fox
58
1
0
12 Mar 2025
Re-evaluating Theory of Mind evaluation in large language models
Jennifer Hu
Felix Sosa
T. Ullman
40
0
0
28 Feb 2025
Exploring and Controlling Diversity in LLM-Agent Conversation
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
LLMAG
42
1
0
24 Feb 2025
More for Keys, Less for Values: Adaptive KV Cache Quantization
Mohsen Hariri
Lam Nguyen
Sixu Chen
Shaochen Zhong
Qifan Wang
Xia Hu
Xiaotian Han
V. Chaudhary
MQ
38
0
0
24 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
57
1
0
17 Feb 2025
Privacy-Preserving Dataset Combination
Keren Fuentes
Mimee Xu
Irene Chen
36
0
0
09 Feb 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim
Juyoung Suk
Seungone Kim
Niklas Muennighoff
Dongkwan Kim
Alice H. Oh
ELM
83
1
0
31 Dec 2024
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
Yuzong Chen
Ahmed F. AbouElhamayed
Xilai Dai
Yang Wang
Marta Andronic
G. Constantinides
Mohamed S. Abdelfattah
MQ
103
1
0
18 Nov 2024
Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training
Elia Cunegatti
Leonardo Lucio Custode
Giovanni Iacca
47
0
0
11 Nov 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
59
3
0
24 Oct 2024
Self-calibration for Language Model Quantization and Pruning
Miles Williams
G. Chrysostomou
Nikolaos Aletras
MQ
125
0
0
22 Oct 2024
ToW: Thoughts of Words Improve Reasoning in Large Language Models
Zhikun Xu
Ming Shen
Jacob Dineen
Zhaonan Li
Xiao Ye
Shijie Lu
Aswin Rrv
Chitta Baral
Ben Zhou
LRM
130
1
0
21 Oct 2024
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
Furu Wei
49
2
0
14 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
32
2
0
12 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
36
3
0
30 Sep 2024
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
38
38
0
06 Jun 2024
DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Bamberger
Ofek Glick
Chaim Baskin
Yonatan Belinkov
59
0
0
13 May 2024
Recall Them All: Retrieval-Augmented Language Models for Long Object List Extraction from Long Documents
Sneha Singhania
Simon Razniewski
G. Weikum
RALM
34
1
0
04 May 2024
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
Shayne Longpre
Robert Mahari
Naana Obeng-Marnu
William Brannon
Tobin South
Katy Gero
Sandy Pentland
Jad Kabbara
56
5
0
19 Apr 2024
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Hyunbyung Park
Sukyung Lee
Gyoungjin Gim
Yungi Kim
Dahyun Kim
Chanjun Park
VLM
34
0
0
28 Mar 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu
Sewon Min
Luke Zettlemoyer
Yejin Choi
Hannaneh Hajishirzi
43
50
0
30 Jan 2024
Potential Societal Biases of ChatGPT in Higher Education: A Scoping Review
Ming Li
Ariunaa Enkhtur
B. Yamamoto
Fei Cheng
Lilan Chen
AI4CE
26
3
0
24 Nov 2023
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Shuo Yang
Wei-Lin Chiang
Lianmin Zheng
Joseph E. Gonzalez
Ion Stoica
ALM
27
110
0
08 Nov 2023
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Ziqiao Ma
Jacob Sansom
Run Peng
Joyce Chai
45
16
0
30 Oct 2023
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Alon Jacovi
Avi Caciularu
Jonathan Herzig
Roee Aharoni
Bernd Bohnet
Mor Geva
ELM
28
6
0
16 Oct 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers
A. Luccioni
45
19
0
14 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
90
10,977
0
18 Jul 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
58
301
0
17 Jul 2023
Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers
Benjamin Clavié
Guillaume Soulié
24
11
0
07 Jul 2023
Extensive Evaluation of Transformer-based Architectures for Adverse Drug Events Extraction
Simone Scaboro
Beatrice Portelli
Emmanuele Chersoni
Enrico Santus
Giuseppe Serra
16
8
0
08 Jun 2023
Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models
Tim Schott
Daniel Furman
Shreshta Bhat
ELM
32
4
0
23 May 2023
AIwriting: Relations Between Image Generation and Digital Writing
S. Rettberg
Talan Memmott
Jill Walker Rettberg
Jason Nelson
P. Lichty
DiffM
16
1
0
18 May 2023
PaLM 2 Technical Report
Rohan Anil
Andrew M. Dai
Orhan Firat
Melvin Johnson
Dmitry Lepikhin
...
Ce Zheng
Wei Zhou
Denny Zhou
Slav Petrov
Yonghui Wu
ReLM
LRM
74
1,142
0
17 May 2023
NorBench -- A Benchmark for Norwegian Language Models
David Samuel
Andrey Kutuzov
Samia Touileb
Erik Velldal
Lilja Øvrelid
Egil Rønningstad
Elina Sigdel
Anna Palatkina
21
23
0
06 May 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
24
40
0
17 Apr 2023
Sociocultural knowledge is needed for selection of shots in hate speech detection tasks
Antonis Maronikolakis
Abdullatif Köksal
Hinrich Schütze
37
0
0
04 Apr 2023
BloombergGPT: A Large Language Model for Finance
Shijie Wu
Ozan Irsoy
Steven Lu
Vadim Dabravolski
Mark Dredze
Sebastian Gehrmann
P. Kambadur
David S. Rosenberg
Gideon Mann
AIFin
71
785
0
30 Mar 2023
Class Cardinality Comparison as a Fermi Problem
Shrestha Ghosh
Simon Razniewski
G. Weikum
37
2
0
08 Mar 2023
Auditing large language models: a three-layered approach
Jakob Mökander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
42
194
0
16 Feb 2023
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
Alexandra Chronopoulou
Matthew E. Peters
Alexander M. Fraser
Jesse Dodge
MoMe
21
65
0
14 Feb 2023
Trustworthy Social Bias Measurement
Rishi Bommasani
Percy Liang
27
10
0
20 Dec 2022
Can Current Task-oriented Dialogue Models Automate Real-world Scenarios in the Wild?
Sang-Woo Lee
Sungdong Kim
Donghyeon Ko
Dong-hyun Ham
Youngki Hong
...
Wangkyo Jung
Kyunghyun Cho
Donghyun Kwak
H. Noh
W. Park
41
1
0
20 Dec 2022
Striving for data-model efficiency: Identifying data externalities on group performance
Esther Rolf
Ben Packer
Alex Beutel
Fernando Diaz
TDI
22
2
0
11 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
101
2,306
0
09 Nov 2022
Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs
Maarten Sap
Ronan Le Bras
Daniel Fried
Yejin Choi
25
205
0
24 Oct 2022