2212.10440
Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data
20 December 2022
Tim Jansen
Yangling Tong
V. Zevallos
Pedro Ortiz Suarez
Papers citing "Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data"
5 / 5 papers shown
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu
Harish Yenala
Aditi Gulati
Shanu Kumar
Parag Agrawal
29
0
0
04 May 2025
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Köhler
Johannes Leveling
VLM
79
2
0
11 Oct 2024
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Martin Courtois
Malte Ostendorff
Leonhard Hennig
Georg Rehm
31
2
0
10 Jun 2024
Deep Learning for Hate Speech Detection: A Comparative Study
Jitendra Malik
Hezhe Qiao
Guansong Pang
A. Hengel
35
43
0
19 Feb 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020