2104.08758
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
18 April 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
Papers citing
"Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"
50 / 72 papers shown
Title
Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
Shaurya Sharthak
Vinayak Pahalwan
Adithya Kamath
Adarsh Shirawalmath
CLL
VLM
43
0
0
14 May 2025
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu
Harish Yenala
Aditi Gulati
Shanu Kumar
Parag Agrawal
29
0
0
04 May 2025
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Zhenyu (Allen) Zhang
Zechun Liu
Yuandong Tian
Harshit Khaitan
Z. Wang
Steven Li
57
0
0
28 Apr 2025
Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa
Yang Zhao
H. Kanayama
Bishwaranjan Bhattacharjee
26
0
0
19 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
40
0
0
01 Apr 2025
Un-Straightening Generative AI: How Queer Artists Surface and Challenge the Normativity of Generative AI Models
Jordan Taylor
Joel Mire
Franchesca Spektor
Alicia DeVrio
Maarten Sap
Haiyi Zhu
Sarah E Fox
58
1
0
12 Mar 2025
Re-evaluating Theory of Mind evaluation in large language models
Jennifer Hu
Felix Sosa
T. Ullman
40
0
0
28 Feb 2025
Exploring and Controlling Diversity in LLM-Agent Conversation
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
LLMAG
42
1
0
24 Feb 2025
More for Keys, Less for Values: Adaptive KV Cache Quantization
Mohsen Hariri
Lam Nguyen
Sixu Chen
Shaochen Zhong
Qifan Wang
Xia Hu
Xiaotian Han
V. Chaudhary
MQ
38
0
0
24 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
57
1
0
17 Feb 2025
Privacy-Preserving Dataset Combination
Keren Fuentes
Mimee Xu
Irene Chen
36
0
0
09 Feb 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim
Juyoung Suk
Seungone Kim
Niklas Muennighoff
Dongkwan Kim
Alice H. Oh
ELM
83
1
0
31 Dec 2024
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
Yuzong Chen
Ahmed F. AbouElhamayed
Xilai Dai
Yang Wang
Marta Andronic
G. Constantinides
Mohamed S. Abdelfattah
MQ
103
1
0
18 Nov 2024
Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training
Elia Cunegatti
Leonardo Lucio Custode
Giovanni Iacca
47
0
0
11 Nov 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
59
3
0
24 Oct 2024
Self-calibration for Language Model Quantization and Pruning
Miles Williams
G. Chrysostomou
Nikolaos Aletras
MQ
125
0
0
22 Oct 2024
ToW: Thoughts of Words Improve Reasoning in Large Language Models
Zhikun Xu
Ming Shen
Jacob Dineen
Zhaonan Li
Xiao Ye
Shijie Lu
Aswin Rrv
Chitta Baral
Ben Zhou
LRM
130
1
0
21 Oct 2024
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
Furu Wei
49
2
0
14 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
32
2
0
12 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
36
3
0
30 Sep 2024
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
38
38
0
06 Jun 2024
DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Bamberger
Ofek Glick
Chaim Baskin
Yonatan Belinkov
59
0
0
13 May 2024
Recall Them All: Retrieval-Augmented Language Models for Long Object List Extraction from Long Documents
Sneha Singhania
Simon Razniewski
G. Weikum
RALM
34
1
0
04 May 2024
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
Shayne Longpre
Robert Mahari
Naana Obeng-Marnu
William Brannon
Tobin South
Katy Gero
Sandy Pentland
Jad Kabbara
56
5
0
19 Apr 2024
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Hyunbyung Park
Sukyung Lee
Gyoungjin Gim
Yungi Kim
Dahyun Kim
Chanjun Park
VLM
34
0
0
28 Mar 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu
Sewon Min
Luke Zettlemoyer
Yejin Choi
Hannaneh Hajishirzi
43
50
0
30 Jan 2024
Potential Societal Biases of ChatGPT in Higher Education: A Scoping Review
Ming Li
Ariunaa Enkhtur
B. Yamamoto
Fei Cheng
Lilan Chen
AI4CE
26
3
0
24 Nov 2023
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Shuo Yang
Wei-Lin Chiang
Lianmin Zheng
Joseph E. Gonzalez
Ion Stoica
ALM
27
110
0
08 Nov 2023
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Ziqiao Ma
Jacob Sansom
Run Peng
Joyce Chai
45
16
0
30 Oct 2023
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Alon Jacovi
Avi Caciularu
Jonathan Herzig
Roee Aharoni
Bernd Bohnet
Mor Geva
ELM
28
6
0
16 Oct 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers
A. Luccioni
45
19
0
14 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
90
10,977
0
18 Jul 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
58
301
0
17 Jul 2023
Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers
Benjamin Clavié
Guillaume Soulié
24
11
0
07 Jul 2023
Extensive Evaluation of Transformer-based Architectures for Adverse Drug Events Extraction
Simone Scaboro
Beatrice Portelli
Emmanuele Chersoni
Enrico Santus
Giuseppe Serra
16
8
0
08 Jun 2023
Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models
Tim Schott
Daniel Furman
Shreshta Bhat
ELM
32
4
0
23 May 2023
AIwriting: Relations Between Image Generation and Digital Writing
S. Rettberg
Talan Memmott
Jill Walker Rettberg
Jason Nelson
P. Lichty
DiffM
16
1
0
18 May 2023
PaLM 2 Technical Report
Rohan Anil
Andrew M. Dai
Orhan Firat
Melvin Johnson
Dmitry Lepikhin
...
Ce Zheng
Wei Zhou
Denny Zhou
Slav Petrov
Yonghui Wu
ReLM
LRM
74
1,142
0
17 May 2023
NorBench -- A Benchmark for Norwegian Language Models
David Samuel
Andrey Kutuzov
Samia Touileb
Erik Velldal
Lilja Ovrelid
Egil Rønningstad
Elina Sigdel
Anna Palatkina
21
23
0
06 May 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
24
40
0
17 Apr 2023
Sociocultural knowledge is needed for selection of shots in hate speech detection tasks
Antonis Maronikolakis
Abdullatif Köksal
Hinrich Schütze
37
0
0
04 Apr 2023
BloombergGPT: A Large Language Model for Finance
Shijie Wu
Ozan Irsoy
Steven Lu
Vadim Dabravolski
Mark Dredze
Sebastian Gehrmann
P. Kambadur
David S. Rosenberg
Gideon Mann
AIFin
71
785
0
30 Mar 2023
Class Cardinality Comparison as a Fermi Problem
Shrestha Ghosh
Simon Razniewski
G. Weikum
37
2
0
08 Mar 2023
Auditing large language models: a three-layered approach
Jakob Mokander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
42
194
0
16 Feb 2023
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
Alexandra Chronopoulou
Matthew E. Peters
Alexander M. Fraser
Jesse Dodge
MoMe
21
65
0
14 Feb 2023
Trustworthy Social Bias Measurement
Rishi Bommasani
Percy Liang
27
10
0
20 Dec 2022
Can Current Task-oriented Dialogue Models Automate Real-world Scenarios in the Wild?
Sang-Woo Lee
Sungdong Kim
Donghyeon Ko
Dong-hyun Ham
Youngki Hong
...
Wangkyo Jung
Kyunghyun Cho
Donghyun Kwak
H. Noh
W. Park
41
1
0
20 Dec 2022
Striving for data-model efficiency: Identifying data externalities on group performance
Esther Rolf
Ben Packer
Alex Beutel
Fernando Diaz
TDI
22
2
0
11 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
101
2,306
0
09 Nov 2022
Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs
Maarten Sap
Ronan Le Bras
Daniel Fried
Yejin Choi
25
205
0
24 Oct 2022