arXiv:2012.15606
HateCheck: Functional Tests for Hate Speech Detection Models
31 December 2020
Paul Röttger
B. Vidgen
Dong Nguyen
Zeerak Talat
Helen Z. Margetts
J. Pierrehumbert
Papers citing
"HateCheck: Functional Tests for Hate Speech Detection Models"
50 / 143 papers shown
Title
Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data
B. V. Breugel
Nabeel Seedat
F. Imrie
M. Schaar
SyDa
24
19
0
25 Oct 2023
K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific Ratings
Chaewon Park
Soohwan Kim
Kyubyong Park
Kunwoo Park
19
4
0
24 Oct 2023
Meta learning with language models: Challenges and opportunities in the classification of imbalanced text
Apostol T. Vassilev
Honglan Jin
Munawar Hasan
6
0
0
23 Oct 2023
Towards General Error Diagnosis via Behavioral Testing in Machine Translation
Junjie Wu
Lemao Liu
Dit-Yan Yeung
24
2
0
20 Oct 2023
Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs
Chenyang Yang
Rishabh Rustogi
Rachel A. Brower-Sinning
Grace A. Lewis
Christian Kastner
Tongshuang Wu
KELM
30
11
0
14 Oct 2023
How toxic is antisemitism? Potentials and limitations of automated toxicity scoring for antisemitic online content
Helena Mihaljević
Elisabeth Steffen
9
2
0
05 Oct 2023
Can Language Models be Instructed to Protect Personal Information?
Yang Chen
Ethan Mendes
Sauvik Das
Wei-ping Xu
Alan Ritter
PILM
19
34
0
03 Oct 2023
No Offense Taken: Eliciting Offensiveness from Language Models
Anugya Srivastava
Rahul Ahuja
Rohith Mukku
14
3
0
02 Oct 2023
Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning
Ali Omrani
Alireza S. Ziabari
Preni Golazizian
Jeffery Sorensen
Morteza Dehghani
19
1
0
29 Sep 2023
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
Charles O'Neill
Jack Miller
I. Ciucă
Y. Ting 丁
Thang Bui
23
3
0
26 Aug 2023
An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software
Wenxuan Wang
Jingyuan Huang
Jen-tse Huang
Chang Chen
Jiazhen Gu
Pinjia He
Michael R. Lyu
VLM
28
6
0
18 Aug 2023
You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
Xinlei He
Savvas Zannettou
Yun Shen
Yang Zhang
CLL
13
37
0
10 Aug 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger
Hannah Rose Kirk
Bertie Vidgen
Giuseppe Attanasio
Federico Bianchi
Dirk Hovy
ALM
ELM
AILaw
21
122
0
02 Aug 2023
DoDo Learning: DOmain-DemOgraphic Transfer in Language Models for Detecting Abuse Targeted at Public Figures
Angus R. Williams
Hannah Rose Kirk
L. Burke
Yi-Ling Chung
Ivan Debono
Pica Johansson
Francesca Stevens
Jonathan Bright
Scott A. Hale
26
1
0
31 Jul 2023
HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
Jiangrui Zheng
Xueqing Liu
Guanqun Yang
Mirazul Haque
Xing Qian
Ravishka Rathnasuriya
Wei Yang
G. Budhrani
35
3
0
23 Jul 2023
Evaluating AI systems under uncertain ground truth: a case study in dermatology
David Stutz
A. Cemgil
Abhijit Guha Roy
Tatiana Matejovicova
Melih Barsbey
...
Yossi Matias
Pushmeet Kohli
Yun-hui Liu
Arnaud Doucet
Alan Karthikesalingam
25
4
0
05 Jul 2023
Concept-Based Explanations to Test for False Causal Relationships Learned by Abusive Language Classifiers
I. Nejadgholi
S. Kiritchenko
Kathleen C. Fraser
Esma Balkir
21
0
0
04 Jul 2023
A Weakly Supervised Classifier and Dataset of White Supremacist Language
Michael Miller Yoder
Ahmad Diab
D. W. Brown
Kathleen M. Carley
19
5
0
27 Jun 2023
Politeness Stereotypes and Attack Vectors: Gender Stereotypes in Japanese and Korean Language Models
Victor Steinborn
Antonis Maronikolakis
Hinrich Schütze
16
0
0
16 Jun 2023
Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data
Janis Goldzycher
Moritz Preisig
Chantal Amrhein
Gerold Schneider
21
3
0
06 Jun 2023
COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements
Xuhui Zhou
Haojie Zhu
Akhila Yerukola
Thomas Davidson
Jena D. Hwang
Swabha Swayamdipta
Maarten Sap
19
33
0
03 Jun 2023
Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
Atharva Kulkarni
Sarah Masud
Vikram Goyal
Tanmoy Chakraborty
18
9
0
01 Jun 2023
CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation
Rahul Madhavan
Rishabh Garg
Kahini Wadhawan
S. Mehta
15
5
0
01 Jun 2023
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application
Hwaran Lee
Seokhee Hong
Joonsuk Park
Takyoung Kim
Gunhee Kim
Jung-Woo Ha
30
28
0
28 May 2023
Query-Efficient Black-Box Red Teaming via Bayesian Optimization
Deokjae Lee
JunYeong Lee
Jung-Woo Ha
Jin-Hwa Kim
Sang-Woo Lee
Hwaran Lee
Hyun Oh Song
AAML
19
23
0
27 May 2023
From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models
Julia Mendelsohn
Ronan Le Bras
Yejin Choi
Maarten Sap
21
25
0
26 May 2023
Not wacky vs. definitely wacky: A study of scalar adverbs in pretrained language models
Isabelle Lorge
J. Pierrehumbert
31
0
0
25 May 2023
How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have
Viktor Hangya
Alexander M. Fraser
26
0
0
23 May 2023
Validating Multimedia Content Moderation Software via Semantic Fusion
Wenxuan Wang
Jingyuan Huang
Chang Chen
Jiazhen Gu
Jianping Zhang
Weibin Wu
Pinjia He
Michael Lyu
60
9
0
23 May 2023
Evaluating ChatGPT's Performance for Multilingual and Emoji-based Hate Speech Detection
Mithun Das
Saurabh Kumar Pandey
Animesh Mukherjee
41
10
0
22 May 2023
Cross-functional Analysis of Generalisation in Behavioural Learning
Pedro Henrique Luz de Araujo
Benjamin Roth
10
3
0
22 May 2023
Angler: Helping Machine Translation Practitioners Prioritize Model Improvements
Samantha Robertson
Zijie J. Wang
Dominik Moritz
Mary Beth Kery
Fred Hohman
25
15
0
12 Apr 2023
Interpretable Unified Language Checking
Tianhua Zhang
Hongyin Luo
Yung-Sung Chuang
Wei Fang
Luc Gaitskell
Thomas Hartvigsen
Xixin Wu
D. Fox
Helen M. Meng
James R. Glass
27
22
0
07 Apr 2023
Sociocultural knowledge is needed for selection of shots in hate speech detection tasks
Antonis Maronikolakis
Abdullatif Köksal
Hinrich Schütze
32
0
0
04 Apr 2023
Assessing Language Model Deployment with Risk Cards
Leon Derczynski
Hannah Rose Kirk
Vidhisha Balachandran
Sachin Kumar
Yulia Tsvetkov
M. Leiser
Saif Mohammad
20
42
0
31 Mar 2023
A Federated Approach for Hate Speech Detection
Jay Gala
Deep Gandhi
Jash Mehta
Zeerak Talat
13
4
0
18 Feb 2023
Auditing large language models: a three-layered approach
Jakob Mokander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
34
194
0
16 Feb 2023
Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection
Soumyajit Gupta
Sooyong Lee
Maria De-Arteaga
Matthew Lease
8
13
0
14 Feb 2023
BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models
Ali Borji
CoGe
10
1
0
28 Jan 2023
Can Large Language Models Change User Preference Adversarially?
Varshini Subhash
AAML
24
8
0
05 Jan 2023
Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI
Lorena Piedras
Lucas Rosenblatt
Julia Wilkins
26
9
0
05 Jan 2023
Evaluating Psychological Safety of Large Language Models
Xingxuan Li
Yutong Li
Linlin Liu
Shafiq R. Joty
Lidong Bing
LM&MA
23
21
0
20 Dec 2022
Manifestations of Xenophobia in AI Systems
Nenad Tomašev
J. L. Maynard
Iason Gabriel
24
9
0
15 Dec 2022
Human-in-the-Loop Hate Speech Classification in a Multilingual Context
Ana Kotarcic
Dominik Hangartner
Fabrizio Gilardi
Selina Kurer
K. Donnay
24
2
0
05 Dec 2022
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
Zhexin Zhang
Jiale Cheng
Hao-Lun Sun
Jiawen Deng
Fei Mi
Yasheng Wang
Lifeng Shang
Minlie Huang
SILM
18
8
0
04 Dec 2022
Cross-Platform and Cross-Domain Abusive Language Detection with Supervised Contrastive Learning
Md. Tawkat Islam Khondaker
Muhammad Abdul-Mageed
L. Lakshmanan
12
1
0
11 Nov 2022
CoRAL: a Context-aware Croatian Abusive Language Dataset
Ravi Shekhar
Mladen Karan
Matthew Purver
33
5
0
11 Nov 2022
NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?
Saadia Gabriel
Hamid Palangi
Yejin Choi
AAML
35
1
0
08 Nov 2022
System Demo: Tool and Infrastructure for Offensive Language Error Analysis (OLEA) in English
M. Grace
Xajavion Jay Seabrum
Dananjay Srinivas
Alexis Palmer
29
0
0
28 Oct 2022
"It's Not Just Hate": A Multi-Dimensional Perspective on Detecting Harmful Speech Online
Federico Bianchi
S. A. Hills
Patrícia G. C. Rossini
Dirk Hovy
Rebekah Tromble
N. Tintarev
28
14
0
28 Oct 2022