
HateCheck: Functional Tests for Hate Speech Detection Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
31 December 2020
Paul Röttger, B. Vidgen, Dong Nguyen, Zeerak Talat, Helen Z. Margetts, J. Pierrehumbert

Papers citing "HateCheck: Functional Tests for Hate Speech Detection Models"

50 / 162 papers shown
DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses
Han Luo, Guy Laban
01 Dec 2025

Provably Safe Model Updates
Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker
01 Dec 2025

Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation
Pritish N. Desai, Tanay Kewalramani, Srimanta Mandal
01 Dec 2025

HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection
Irina Proskurina, Marc-Antoine Carpentier, Julien Velcin
09 Nov 2025

KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification
Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han
13 Oct 2025

Hierarchical Scheduling for Multi-Vector Image Retrieval
Maoliang Li, K. Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen
10 Oct 2025

Energy-Driven Steering: Reducing False Refusals in Large Language Models
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, ..., Wei Dong, Kai-Wei Chang, Xiaofeng Wang, Ying Nian Wu, Xinfeng Li
09 Oct 2025

Causality Guided Representation Learning for Cross-Style Hate Speech Detection
Chengshuai Zhao, Shu Wan, Paras Sheth, Karan Patwa, K. S. Candan, Huan Liu
09 Oct 2025

Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions (Expert Systems with Applications (ESWA), 2025)
Smita Khapre, Melkamu Mersha, Hassan Shakil, Jonali Baruah, Jugal Kalita
29 Sep 2025

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages
Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-wei Lee
18 Sep 2025

Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification
Samuel J. Bell, Eduardo Sánchez, David Dale, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
17 Sep 2025

Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities
Y. Kim, Himanshu Beniwal, Steven L. Johnson, Thomas Hartvigsen
03 Sep 2025

AI reasoning effort predicts human decision time in content moderation
Thomas Davidson
27 Aug 2025

Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach
Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin
09 Aug 2025

Web(er) of Hate: A Survey on How Hate Speech Is Typed
Luna Wang, Andrew Caines, Alice Hutchings
19 Jun 2025

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
14 Jun 2025

Hatevolution: What Static Benchmarks Don't Tell Us (Annual Meeting of the Association for Computational Linguistics (ACL), 2025)
Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela
13 Jun 2025

Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models
Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, Michael Färber
10 Jun 2025

LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
Shuzhou Yuan, Ercong Nie, Lukas Kouba, Ashish Yashwanth Kangen, Helmut Schmid, Hinrich Schütze, Michael Färber
02 Jun 2025

Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data
Faeze Ghorbanpour, Daryna Dementieva, Kangyang Luo
20 May 2025

Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Yinghui Zhang, Tailin Chen, Yuchen Zhang, Zeyu Fu
17 May 2025

System Prompt Optimization with Meta-Learning
Yumin Choi, Jinheon Baek, Sung Ju Hwang
14 May 2025

Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study
Faeze Ghorbanpour, Daryna Dementieva, Kangyang Luo
09 May 2025

SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
28 Apr 2025

Towards a comprehensive taxonomy of online abusive language informed by machine learning
Samaneh Hosseini Moghaddam, Kelly Lyons, Cheryl Regehr, Vivek Goel, Kaitlyn Regehr
24 Apr 2025

Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection (North American Chapter of the Association for Computational Linguistics (NAACL), 2025)
Myrthe Reuver, Indira Sen, Matteo Melis, Gabriella Lapesa
21 Apr 2025

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English
Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian A. Reuter
11 Apr 2025

AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models
Hengrui Xing, Cong Tian, Liang Zhao, Tianhao Shen, WenSheng Wang, N. Zhang, Chao Huang, Zhenhua Duan
07 Mar 2025

Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations (International Conference on Human Factors in Computing Systems (CHI), 2025)
David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer
03 Mar 2025

Evolving Hate Speech Online: An Adaptive Framework for Detection and Mitigation
Shiza Ali, Jeremy Blackburn, Gianluca Stringhini
24 Feb 2025

Echoes of Discord: Forecasting Hater Reactions to Counterspeech (North American Chapter of the Association for Computational Linguistics (NAACL), 2025)
Xiaoying Song, Sharon Lisseth Perez, Xinchen Yu, Eduardo Blanco, Lingzi Hong
17 Feb 2025

Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions (International Conference on Web and Social Media (ICWSM), 2025)
Ming Shan Hee, Roy Ka-wei Lee
16 Feb 2025

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
Xinyue Shen, Yixin Wu, Y. Qu, Michael Backes, Savvas Zannettou, Yang Zhang
28 Jan 2025

A Survey on Automatic Online Hate Speech Detection in Low-Resource Languages
Susmita Das, Arpita Dutta, Kingshuk Roy, Abir Mondal, Arnab Mukhopadhyay
28 Nov 2024

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter (Annual Meeting of the Association for Computational Linguistics (ACL), 2024)
Manuel Tonneau, Diyi Liu, Niyati Malhotra, Scott A. Hale, Samuel Fraiberger, Victor Orozco-Olvera, Paul Röttger
23 Nov 2024

DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition? (International Conference on Computational Linguistics (COLING), 2024)
Urja Khurana, Eric T. Nalisnick, Antske Fokkens
21 Oct 2024

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless, Nikolas Vitsakis, Zeerak Talat, James Garforth, Bjorn Ross, Arno Onken, Atoosa Kasirzadeh, Alexandra Birch
17 Oct 2024

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Anna Sokol, Elizabeth M. Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla
16 Oct 2024

Disentangling Hate Across Target Identities
Yiping Jin, Leo Wanner, Aneesh Moideen Koya
14 Oct 2024

A Target-Aware Analysis of Data Augmentation for Hate Speech Detection
Camilla Casula, Sara Tonelli
10 Oct 2024

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation (International Conference on Learning Representations (ICLR), 2024)
Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
04 Oct 2024

AggregHate: An Efficient Aggregative Approach for the Detection of Hatemongers on Social Platforms
Tom Marzea, Abraham Israeli, Oren Tsur
22 Sep 2024

What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing (International Conference on Automated Software Engineering (ASE), 2024)
Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Jane Hsieh
14 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
05 Sep 2024

SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao, Abdullatif Köksal, Yihong Liu, Leonie Weissweiler, Anna Korhonen, Hinrich Schütze
30 Aug 2024

Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?
Urja Khurana, Eric T. Nalisnick, Antske Fokkens, Swabha Swayamdipta
26 Aug 2024

Decoding Climate Disagreement: A Graph Neural Network-Based Approach to Understanding Social Media Dynamics
Ruiran Su, J. Pierrehumbert
09 Jul 2024

JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu
03 Jul 2024

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, N. Apostoloff, Luca Zappella, P. Rodríguez
02 Jul 2024

CELL your Model: Contrastive Explanations for Large Language Models
Ronny Luss, Erik Miehling, Amit Dhurandhar
17 Jun 2024