What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

6 May 2021
A. Luccioni
J. Viviano

Papers citing "What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus"

50 / 68 papers shown
Title
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu
Harish Yenala
Aditi Gulati
Shanu Kumar
Parag Agrawal
29
0
0
04 May 2025
Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa
Yang Zhao
H. Kanayama
Bishwaranjan Bhattacharjee
26
0
0
19 Apr 2025
Position: Ensuring mutual privacy is necessary for effective external evaluation of proprietary AI systems
Ben Bucknall
Robert F. Trager
Michael A. Osborne
80
0
0
03 Mar 2025
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
73
0
0
12 Nov 2024
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Catherine Arnett
Eliot Jones
Ivan P. Yamshchikov
Pierre-Carl Langlais
31
2
0
29 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
28
1
0
17 Oct 2024
Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions
Ilias Tachmazidis
Sotiris Batsakis
G. Antoniou
ReLM
LRM
16
0
0
16 Oct 2024
Creative Writers' Attitudes on Writing as Training Data for Large Language Models
Katy Ilonka Gero
Meera Desai
Carly Schnitzler
Nayun Eom
Jack Cushman
Elena L. Glassman
21
1
0
22 Sep 2024
A Study on Bias Detection and Classification in Natural Language Processing
Ana Sofia Evans
Helena Moniz
Luísa Coheur
25
0
0
14 Aug 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre
Robert Mahari
Ariel N. Lee
Campbell Lund
Hamidah Oderinwale
...
Hanlin Li
Daphne Ippolito
Sara Hooker
Jad Kabbara
Sandy Pentland
69
35
0
20 Jul 2024
A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training
Michał Perełkiewicz
Rafał Poświata
37
1
0
10 Jul 2024
RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs
John Dang
Arash Ahmadian
Kelly Marchisio
Julia Kreutzer
A. Ustun
Sara Hooker
37
21
0
02 Jul 2024
Fairness and Bias in Multimodal AI: A Survey
Tosin P. Adewumi
Lama Alkhaled
Namrata Gurung
G. V. Boven
Irene Pagliai
48
9
0
27 Jun 2024
GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
Tao Zhang
Ziqian Zeng
Yuxiang Xiao
Huiping Zhuang
Cen Chen
James R. Foulds
Shimei Pan
CVBM
41
3
0
20 Jun 2024
A Taxonomy of Challenges to Curating Fair Datasets
Dora Zhao
M. Scheuerman
Pooja Chitre
Jerone T. A. Andrews
Georgia Panagiotidou
Shawn Walker
Kathleen H. Pine
Alice Xiang
39
2
0
10 Jun 2024
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
David Ifeoluwa Adelani
Jessica Ojo
Israel Abebe Azime
Jian Yun Zhuang
Jesujoba Oluwadara Alabi
...
Salomey Osei
Sokhar Samb
Tadesse Kebede Guge
Pontus Stenetorp
ELM
55
7
0
05 Jun 2024
Toxicity Detection for Free
Zhanhao Hu
Julien Piet
Geng Zhao
Jiantao Jiao
David A. Wagner
32
3
0
29 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
47
36
0
26 May 2024
Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp
Rachel Hong
William Agnew
Tadayoshi Kohno
Jamie Morgenstern
27
9
0
13 May 2024
Time Machine GPT
Felix Drinkall
Eghbal Rahimikia
J. Pierrehumbert
Stefan Zohren
AI4TS
AI4CE
KELM
SyDa
29
3
0
29 Apr 2024
Investigating Gender Bias in Turkish Language Models
Orhun Caglidil
Malte Ostendorff
Georg Rehm
27
2
0
17 Apr 2024
Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
Juhwan Choi
Jungmin Yun
Kyohoon Jin
Youngbin Kim
30
4
0
15 Apr 2024
Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages
Rik van Noord
Taja Kuzman
Peter Rupnik
Nikola Ljubešić
Miquel Esplà-Gomis
Gema Ramírez-Sánchez
Antonio Toral
ALM
27
1
0
13 Mar 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
A. Ustun
Viraat Aryabumi
Zheng-Xin Yong
Wei-Yin Ko
Daniel D'souza
...
Shayne Longpre
Niklas Muennighoff
Marzieh Fadaee
Julia Kreutzer
Sara Hooker
ALM
ELM
SyDa
LRM
27
193
0
12 Feb 2024
Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani
Mikhail Baranchuk
Martin Strohmeier
Vijay Bolina
Philip H. S. Torr
Lewis Hammond
Christian Schroeder de Witt
40
4
0
12 Feb 2024
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Shivalika Singh
Freddie Vargus
Daniel D'souza
Börje F. Karlsson
Abinaya Mahendiran
...
Max Bartolo
Julia Kreutzer
A. Ustun
Marzieh Fadaee
Sara Hooker
117
115
0
09 Feb 2024
On Catastrophic Inheritance of Large Foundation Models
Hao Chen
Bhiksha Raj
Xing Xie
Jindong Wang
AI4CE
51
12
0
02 Feb 2024
LOCOST: State-Space Models for Long Document Abstractive Summarization
Florian Le Bronnec
Song Duong
Mathieu Ravaut
Alexandre Allauzen
Nancy F. Chen
Vincent Guigue
Alberto Lumbreras
Laure Soulier
Patrick Gallinari
40
7
0
31 Jan 2024
Continual Learning Under Language Shift
Evangelia Gogoulou
Timothée Lesort
Magnus Boman
Joakim Nivre
KELM
CLL
27
3
0
02 Nov 2023
ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model
Jianghao Chen
Pu Jian
Tengxiao Xi
Yidong Yi
Qianlong Du
Chenglin Ding
Guibo Zhu
Chengqing Zong
Jinqiao Wang
Jiajun Zhang
22
6
0
02 Nov 2023
SoK: Memorization in General-Purpose Large Language Models
Valentin Hartmann
Anshuman Suri
Vincent Bindschaedler
David E. Evans
Shruti Tople
Robert West
KELM
LLMAG
16
20
0
24 Oct 2023
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
Aleksandr Meister
Matvei Novikov
Nikolay Karpov
Evelina Bakhturina
Vitaly Lavrukhin
Boris Ginsburg
14
10
0
04 Oct 2023
Amplifying Limitations, Harms and Risks of Large Language Models
Michael O'Neill
M. Connor
10
7
0
06 Jul 2023
Scaling Laws Do Not Scale
Fernando Diaz
Michael A. Madaio
23
8
0
05 Jul 2023
GIO: Gradient Information Optimization for Training Dataset Selection
Dante Everaert
Christopher Potts
19
3
0
20 Jun 2023
Lost in Translation: Large Language Models in Non-English Content Analysis
Gabriel Nicholas
Aliya Bhatia
ELM
13
34
0
12 Jun 2023
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Aleksandra Piktus
Odunayo Ogundepo
Christopher Akiki
Akintunde Oladipo
Xinyu Crystina Zhang
Hailey Schoelkopf
Stella Biderman
Martin Potthast
Jimmy J. Lin
CVBM
36
10
0
02 Jun 2023
On the Risk of Misinformation Pollution with Large Language Models
Yikang Pan
Liangming Pan
Wenhu Chen
Preslav Nakov
Min-Yen Kan
W. Wang
DeLMO
190
110
0
23 May 2023
Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis
Seraphina Goldfarb-Tarrant
Bjorn Ross
Adam Lopez
27
7
0
22 May 2023
The Web Can Be Your Oyster for Improving Large Language Models
Junyi Li
Tianyi Tang
Wayne Xin Zhao
Jingyuan Wang
Jian-Yun Nie
Ji-Rong Wen
RALM
KELM
22
5
0
18 May 2023
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data
M. Turski
Tomasz Stanislawek
Karol Kaczmarek
Pawel Dyda
Filip Graliński
25
10
0
28 Apr 2023
Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense
Andrei Kucharavy
Z. Schillaci
Loic Maréchal
Maxime Wursch
Ljiljana Dolamic
Remi Sabonnadiere
Dimitri Percia David
Alain Mermoud
Vincent Lenders
ELM
AI4CE
22
31
0
21 Mar 2023
Data Portraits: Recording Foundation Model Training Data
Marc Marone
Benjamin Van Durme
135
30
0
06 Mar 2023
The ROOTS Search Tool: Data Transparency for LLMs
Aleksandra Piktus
Christopher Akiki
Paulo Villegas
Hugo Laurençon
Gérard Dupont
A. Luccioni
Yacine Jernite
Anna Rogers
VLM
28
29
0
27 Feb 2023
Poisoning Web-Scale Training Datasets is Practical
Nicholas Carlini
Matthew Jagielski
Christopher A. Choquette-Choo
Daniel Paleka
Will Pearce
Hyrum S. Anderson
Andreas Terzis
Kurt Thomas
Florian Tramèr
SILM
31
182
0
20 Feb 2023
Auditing large language models: a three-layered approach
Jakob Mökander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
34
194
0
16 Feb 2023
High-Resource Methodological Bias in Low-Resource Investigations
Maartje ter Hoeve
David Grangier
Natalie Schluter
33
2
0
14 Nov 2022
Language Detoxification with Attribute-Discriminative Latent Space
Jin Myung Kwak
Minseon Kim
Sung Ju Hwang
14
12
0
19 Oct 2022
The State of Profanity Obfuscation in Natural Language Processing
Debora Nozza
Dirk Hovy
34
7
0
14 Oct 2022
Training a T5 Using Lab-sized Resources
Manuel R. Ciosici
Leon Derczynski
VLM
20
8
0
25 Aug 2022