HateCheck: Functional Tests for Hate Speech Detection Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2020

31 December 2020

Paul Röttger

Papers citing "HateCheck: Functional Tests for Hate Speech Detection Models"

50 / 162 papers shown

Sexism Detection on a Data Diet

Web Science Conference (WebSci), 2024

Rabiraj Bandyopadhyay

Dennis Assenmacher

J. Alonso-Moral

Claudia Wagner

199

07 Jun 2024

Prompt Exploration with Prompt Regression

150

17 May 2024

Mitigating Exaggerated Safety in Large Language Models

Ruchi Bhalani

Ruchira Ray

204

08 May 2024

SGHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Singapore

Ming Shan Hee

219

03 May 2024

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

396

27 Apr 2024

Analyzing Toxicity in Deep Conversations: A Reddit Case Study

Vigneshwaran Shankaran

Rajesh Sharma

164

11 Apr 2024

NLP for Counterspeech against Hate: A Survey and How-To Guide

303

29 Mar 2024

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset

Janis Goldzycher

Paul Röttger

Gerold Schneider

AAML

224

28 Mar 2024

NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data

Manuel Tonneau

Pedro Vitor Quinta de Castro

Karim Lasri

I. Farouq

Lakshminarayanan Subramanian

Victor Orozco-Olvera

Samuel Fraiberger

341

28 Mar 2024

HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

H. Nghiem

Hal Daumé

377

18 Mar 2024

Ethos: Rectifying Language Models in Orthogonal Parameter Space

Murali Annavaram

316

13 Mar 2024

Specification Overfitting in Artificial Intelligence

Artificial Intelligence Review (Artif Intell Rev), 2024

Benjamin Roth

Pedro Henrique Luz de Araujo

Yuxi Xia

Saskia Kaltenbrunner

Christoph Korab

593

13 Mar 2024

Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection

Tharindu Kumarage

Amrita Bhattacharjee

Joshua Garland

273

12 Mar 2024

GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?

Yiping Jin

Leo Wanner

A. Shvets

237

23 Feb 2024

Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon

236

03 Feb 2024

Red-Teaming for Generative AI: Silver Bullet or Security Theater?

AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2024

Hoda Heidari

441

115

29 Jan 2024

Towards a Non-Ideal Methodological Framework for Responsible ML

International Conference on Human Factors in Computing Systems (CHI), 2024

Ramaravind Kommiya Mothilal

Shion Guha

Syed Ishtiaque Ahmed

298

20 Jan 2024

Muted: Multilingual Targeted Offensive Speech Identification and Visualization

Bishwaranjan Bhattacharjee

155

18 Dec 2023

Causal ATE Mitigates Unintended Bias in Controlled Text Generation

Rahul Madhavan

Kahini Wadhawan

271

19 Nov 2023

Functionality learning through specification instructions

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Pedro Henrique Luz de Araujo

Benjamin Roth

ELM

216

14 Nov 2023

People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Wil M.P. van der Aalst

Claudia Wagner

459

02 Nov 2023

Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data

Neural Information Processing Systems (NeurIPS), 2023

217

25 Oct 2023

K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific Ratings

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

230

24 Oct 2023

Meta learning with language models: Challenges and opportunities in the classification of imbalanced text

Apostol T. Vassilev

Honglan Jin

Munawar Hasan

264

23 Oct 2023

Towards General Error Diagnosis via Behavioral Testing in Machine Translation

Junjie Wu

Lemao Liu

Dit-Yan Yeung

132

20 Oct 2023

Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs

Chenyang Yang

Rishabh Rustogi

Rachel A. Brower-Sinning

Grace A. Lewis

Jane Hsieh

Tongshuang Wu

KELM

209

14 Oct 2023

How toxic is antisemitism? Potentials and limitations of automated toxicity scoring for antisemitic online content

Helena Mihaljević

Elisabeth Steffen

117

05 Oct 2023

Can Language Models be Instructed to Protect Personal Information?

198

03 Oct 2023

No Offense Taken: Eliciting Offensiveness from Language Models

Anugya Srivastava

Rahul Ahuja

Rohith Mukku

212

02 Oct 2023

Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning

243

29 Sep 2023

Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content

Jack Miller

186

26 Aug 2023

An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software

International Conference on Automated Software Engineering (ASE), 2023

Michael R. Lyu

136

18 Aug 2023

You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

IEEE Symposium on Security and Privacy (IEEE S&P), 2023

172

10 Aug 2023

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Paul Röttger

389

259

02 Aug 2023

DoDo Learning: DOmain-DemOgraphic Transfer in Language Models for Detecting Abuse Targeted at Public Figures

Workshop on Trolling, Aggression and Cyberbullying (TRAC), 2023

291

31 Jul 2023

HateModerate: Testing Hate Speech Detectors against Content Moderation Policies

Xueqing Liu

Ravishka Rathnasuriya

Wei Yang

G. Budhrani

247

23 Jul 2023

Evaluating AI systems under uncertain ground truth: a case study in dermatology

...

Yossi Matias

Pushmeet Kohli

Yao Xiao

Arnaud Doucet

Alan Karthikesalingam

289

05 Jul 2023

Concept-Based Explanations to Test for False Causal Relationships Learned by Abusive Language Classifiers

221

04 Jul 2023

A Weakly Supervised Classifier and Dataset of White Supremacist Language

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

196

27 Jun 2023

Politeness Stereotypes and Attack Vectors: Gender Stereotypes in Japanese and Korean Language Models

Victor Steinborn

Antonis Maronikolakis

Hinrich Schütze

253

16 Jun 2023

Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data

195

06 Jun 2023

COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Xuhui Zhou

213

03 Jun 2023

Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Knowledge Discovery and Data Mining (KDD), 2023

Atharva Kulkarni

Sarah Masud

Vikram Goyal

Tanmoy Chakraborty

196

01 Jun 2023

CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

240

01 Jun 2023

KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

366

28 May 2023

Query-Efficient Black-Box Red Teaming via Bayesian Optimization

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

201

27 May 2023

From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Julia Mendelsohn

Ronan Le Bras

Yejin Choi

Maarten Sap

182

26 May 2023

Not wacky vs. definitely wacky: A study of scalar adverbs in pretrained language models

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

Isabelle Lorge

J. Pierrehumbert

234

25 May 2023

How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have

International Conference on Language Resources and Evaluation (LREC), 2023

Viktor Hangya

Kangyang Luo

195

23 May 2023

Validating Multimedia Content Moderation Software via Semantic Fusion

International Symposium on Software Testing and Analysis (ISSTA), 2023

Jianping Zhang

Michael Lyu

215

23 May 2023