
HateCheck: Functional Tests for Hate Speech Detection Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
31 December 2020
Paul Röttger, B. Vidgen, Dong Nguyen, Zeerak Talat, Helen Z. Margetts, J. Pierrehumbert

Papers citing "HateCheck: Functional Tests for Hate Speech Detection Models"

50 / 162 papers shown
DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses
Han Luo, Guy Laban
01 Dec 2025

Provably Safe Model Updates
Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker
01 Dec 2025

Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation
Pritish N. Desai, Tanay Kewalramani, Srimanta Mandal
01 Dec 2025

HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection
Irina Proskurina, Marc-Antoine Carpentier, Julien Velcin
09 Nov 2025

KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification
Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han
13 Oct 2025

Hierarchical Scheduling for Multi-Vector Image Retrieval
Maoliang Li, K. Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen
10 Oct 2025

Energy-Driven Steering: Reducing False Refusals in Large Language Models
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, ..., Wei Dong, Kai-Wei Chang, Xiaofeng Wang, Ying Nian Wu, Xinfeng Li
09 Oct 2025

Causality Guided Representation Learning for Cross-Style Hate Speech Detection
Chengshuai Zhao, Shu Wan, Paras Sheth, Karan Patwa, K. S. Candan, Huan Liu
09 Oct 2025

Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions (Expert Systems with Applications (ESWA), 2025)
Smita Khapre, Melkamu Mersha, Hassan Shakil, Jonali Baruah, Jugal Kalita
29 Sep 2025

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages
Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-wei Lee
18 Sep 2025

Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification
Samuel J. Bell, Eduardo Sánchez, David Dale, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
17 Sep 2025

Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities
Y. Kim, Himanshu Beniwal, Steven L. Johnson, Thomas Hartvigsen
03 Sep 2025

AI reasoning effort predicts human decision time in content moderation
Thomas Davidson
27 Aug 2025

Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach
Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin
09 Aug 2025

Web(er) of Hate: A Survey on How Hate Speech Is Typed
Luna Wang, Andrew Caines, Alice Hutchings
19 Jun 2025

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
14 Jun 2025

Hatevolution: What Static Benchmarks Don't Tell Us (Annual Meeting of the Association for Computational Linguistics (ACL), 2025)
Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela
13 Jun 2025

Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models
Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, Michael Färber
10 Jun 2025

LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
Shuzhou Yuan, Ercong Nie, Lukas Kouba, Ashish Yashwanth Kangen, Helmut Schmid, Hinrich Schütze, Michael Färber
02 Jun 2025

Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data
Faeze Ghorbanpour, Daryna Dementieva, Kangyang Luo
20 May 2025

Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Yinghui Zhang, Tailin Chen, Yuchen Zhang, Zeyu Fu
17 May 2025

System Prompt Optimization with Meta-Learning
Yumin Choi, Jinheon Baek, Sung Ju Hwang
14 May 2025

Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study
Faeze Ghorbanpour, Daryna Dementieva, Kangyang Luo
09 May 2025

SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
28 Apr 2025

Towards a comprehensive taxonomy of online abusive language informed by machine learning
Samaneh Hosseini Moghaddam, Kelly Lyons, Cheryl Regehr, Vivek Goel, Kaitlyn Regehr
24 Apr 2025

Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection (North American Chapter of the Association for Computational Linguistics (NAACL), 2025)
Myrthe Reuver, Indira Sen, Matteo Melis, Gabriella Lapesa
21 Apr 2025

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English
Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian A. Reuter
11 Apr 2025

AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models
Hengrui Xing, Cong Tian, Liang Zhao, Tianhao Shen, WenSheng Wang, N. Zhang, Chao Huang, Zhenhua Duan
07 Mar 2025

Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations (International Conference on Human Factors in Computing Systems (CHI), 2025)
David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer
03 Mar 2025

Evolving Hate Speech Online: An Adaptive Framework for Detection and Mitigation
Shiza Ali, Jeremy Blackburn, Gianluca Stringhini
24 Feb 2025

Echoes of Discord: Forecasting Hater Reactions to Counterspeech (North American Chapter of the Association for Computational Linguistics (NAACL), 2025)
Xiaoying Song, Sharon Lisseth Perez, Xinchen Yu, Eduardo Blanco, Lingzi Hong
17 Feb 2025

Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions (International Conference on Web and Social Media (ICWSM), 2025)
Ming Shan Hee, Roy Ka-wei Lee
16 Feb 2025

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
Xinyue Shen, Yixin Wu, Y. Qu, Michael Backes, Savvas Zannettou, Yang Zhang
28 Jan 2025

A Survey on Automatic Online Hate Speech Detection in Low-Resource Languages
Susmita Das, Arpita Dutta, Kingshuk Roy, Abir Mondal, Arnab Mukhopadhyay
28 Nov 2024

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter (Annual Meeting of the Association for Computational Linguistics (ACL), 2024)
Manuel Tonneau, Diyi Liu, Niyati Malhotra, Scott A. Hale, Samuel Fraiberger, Victor Orozco-Olvera, Paul Röttger
23 Nov 2024

DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition? (International Conference on Computational Linguistics (COLING), 2024)
Urja Khurana, Eric T. Nalisnick, Antske Fokkens
21 Oct 2024

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless, Nikolas Vitsakis, Zeerak Talat, James Garforth, Bjorn Ross, Arno Onken, Atoosa Kasirzadeh, Alexandra Birch
17 Oct 2024

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Anna Sokol, Elizabeth M. Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla
16 Oct 2024

Disentangling Hate Across Target Identities
Yiping Jin, Leo Wanner, Aneesh Moideen Koya
14 Oct 2024

A Target-Aware Analysis of Data Augmentation for Hate Speech Detection
Camilla Casula, Sara Tonelli
10 Oct 2024

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation (International Conference on Learning Representations (ICLR), 2024)
Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
04 Oct 2024

AggregHate: An Efficient Aggregative Approach for the Detection of Hatemongers on Social Platforms
Tom Marzea, Abraham Israeli, Oren Tsur
22 Sep 2024

What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing (International Conference on Automated Software Engineering (ASE), 2024)
Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Jane Hsieh
14 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
05 Sep 2024

SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao, Abdullatif Köksal, Yihong Liu, Leonie Weissweiler, Anna Korhonen, Hinrich Schütze
30 Aug 2024

Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?
Urja Khurana, Eric T. Nalisnick, Antske Fokkens, Swabha Swayamdipta
26 Aug 2024

Decoding Climate Disagreement: A Graph Neural Network-Based Approach to Understanding Social Media Dynamics
Ruiran Su, J. Pierrehumbert
09 Jul 2024

JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu
03 Jul 2024

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, N. Apostoloff, Luca Zappella, P. Rodríguez
02 Jul 2024

CELL your Model: Contrastive Explanations for Large Language Models
Ronny Luss, Erik Miehling, Amit Dhurandhar
17 Jun 2024