Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

20 December 2022

Papers citing "Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data"

5 / 5 papers shown

Title
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs Sai Krishna Mendu Harish Yenala Aditi Gulati Shanu Kumar Parag Agrawal 29 0 0 04 May 2025
Data Processing for the OpenGPT-X Model Family Nicolo' Brandizzi Hammam Abdelwahab Anirban Bhowmick Lennard Helmer Benny Jörg Stein ... Georg Rehm Dennis Wegener Nicolas Flores-Herr Joachim Kohler Johannes Leveling VLM 79 2 0 11 Oct 2024
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models Martin Courtois Malte Ostendorff Leonhard Hennig Georg Rehm 31 2 0 10 Jun 2024
Deep Learning for Hate Speech Detection: A Comparative Study Jitendra Malik Hezhe Qiao Guansong Pang A. Hengel 35 43 0 19 Feb 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 248 1,986 0 31 Dec 2020