2012.15606
HateCheck: Functional Tests for Hate Speech Detection Models
31 December 2020
Paul Röttger
B. Vidgen
Dong Nguyen
Zeerak Talat
Helen Z. Margetts
J. Pierrehumbert
Papers citing
"HateCheck: Functional Tests for Hate Speech Detection Models"
50 / 143 papers shown
Title
System Prompt Optimization with Meta-Learning
Yumin Choi
Jinheon Baek
Sung Ju Hwang
LLMAG
48
0
0
14 May 2025
Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study
Faeze Ghorbanpour
Daryna Dementieva
Alexander M. Fraser
40
0
0
09 May 2025
SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
47
0
0
28 Apr 2025
Towards a comprehensive taxonomy of online abusive language informed by machine leaning
Samaneh Hosseini Moghaddam
Kelly Lyons
Cheryl Regehr
Vivek Goel
Kaitlyn Regehr
23
0
0
24 Apr 2025
Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection
Myrthe Reuver
Indira Sen
Matteo Melis
Gabriella Lapesa
20
0
0
21 Apr 2025
A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English
Julian Bäumler
Louis Blöcher
Lars-Joel Frey
Xian Chen
Markus Bayer
Christian A. Reuter
AILaw
44
0
0
11 Apr 2025
AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models
Hengrui Xing
Cong Tian
L. Zhao
Z. Ma
WenSheng Wang
N. Zhang
Chao Huang
Zhenhua Duan
47
0
0
07 Mar 2025
Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations
David Hartmann
Amin Oueslati
Dimitri Staufer
Lena Pohlmann
Simon Munzert
Hendrik Heuer
48
0
0
03 Mar 2025
Evolving Hate Speech Online: An Adaptive Framework for Detection and Mitigation
Shiza Ali
Jeremy Blackburn
Gianluca Stringhini
59
0
0
24 Feb 2025
Echoes of Discord: Forecasting Hater Reactions to Counterspeech
Xiaoying Song
Sharon Lisseth Perez
Xinchen Yu
Eduardo Blanco
Lingzi Hong
113
0
0
17 Feb 2025
Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions
Ming Shan Hee
Roy Ka-Wei Lee
VLM
75
0
0
16 Feb 2025
SubData: A Python Library to Collect and Combine Datasets for Evaluating LLM Alignment on Downstream Tasks
Leon Fröhling
Pietro Bernardelle
Gianluca Demartini
ALM
74
0
0
21 Dec 2024
A Survey on Automatic Online Hate Speech Detection in Low-Resource Languages
Susmita Das
Arpita Dutta
Kingshuk Roy
Abir Mondal
Arnab Mukhopadhyay
66
0
0
28 Nov 2024
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau
Diyi Liu
Niyati Malhotra
Scott A. Hale
Samuel Fraiberger
Victor Orozco-Olvera
Paul Röttger
71
0
0
23 Nov 2024
DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?
Urja Khurana
Eric T. Nalisnick
Antske Fokkens
44
1
0
21 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
28
1
0
17 Oct 2024
BenchmarkCards: Large Language Model and Risk Reporting
Anna Sokol
Nuno Moniz
Elizabeth M. Daly
Michael Hind
Nitesh V. Chawla
31
0
0
16 Oct 2024
Disentangling Hate Across Target Identities
Yiping Jin
Leo Wanner
Aneesh Moideen Koya
23
0
0
14 Oct 2024
A Target-Aware Analysis of Data Augmentation for Hate Speech Detection
Camilla Casula
Sara Tonelli
26
0
0
10 Oct 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
46
6
0
04 Oct 2024
AggregHate: An Efficient Aggregative Approach for the Detection of Hatemongers on Social Platforms
Tom Marzea
Abraham Israeli
Oren Tsur
23
0
0
22 Sep 2024
What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing
Chenyang Yang
Yining Hong
Grace A. Lewis
Tongshuang Wu
Christian Kastner
38
1
0
14 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM
AAML
52
1
0
05 Sep 2024
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao
Abdullatif Köksal
Yihong Liu
Leonie Weissweiler
Anna Korhonen
Hinrich Schütze
SyDa
36
1
0
30 Aug 2024
Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?
Urja Khurana
Eric T. Nalisnick
Antske Fokkens
Swabha Swayamdipta
35
3
0
26 Aug 2024
Decoding Climate Disagreement: A Graph Neural Network-Based Approach to Understanding Social Media Dynamics
Ruiran Su
J. Pierrehumbert
24
0
0
09 Jul 2024
JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin
Shiyi Liu
Haotian Li
Xun Zhao
Huamin Qu
34
3
0
03 Jul 2024
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavier Suau
Pieter Delobelle
Katherine Metcalf
Armand Joulin
N. Apostoloff
Luca Zappella
P. Rodríguez
MU
AAML
32
8
0
02 Jul 2024
CELL your Model: Contrastive Explanations for Large Language Models
Ronny Luss
Erik Miehling
Amit Dhurandhar
40
0
0
17 Jun 2024
Sexism Detection on a Data Diet
Rabiraj Bandyopadhyay
Dennis Assenmacher
J. Alonso-Moral
Claudia Wagner
41
0
0
07 Jun 2024
Prompt Exploration with Prompt Regression
Michael Feffer
Ronald Xu
Yuekai Sun
Mikhail Yurochkin
22
0
0
17 May 2024
Mitigating Exaggerated Safety in Large Language Models
Ruchi Bhalani
Ruchira Ray
21
1
0
08 May 2024
SGHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Singapore
Ri Chi Ng
Nirmalendu Prakash
Ming Shan Hee
K. T. W. Choo
Roy Ka-Wei Lee
35
4
0
03 May 2024
From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets
Manuel Tonneau
Diyi Liu
Samuel Fraiberger
Ralph Schroeder
Scott A. Hale
Paul Röttger
27
5
0
27 Apr 2024
Analyzing Toxicity in Deep Conversations: A Reddit Case Study
Vigneshwaran Shankaran
Rajesh Sharma
28
1
0
11 Apr 2024
NLP for Counterspeech against Hate: A Survey and How-To Guide
Helena Bonaldi
Yi-Ling Chung
Gavin Abercrombie
Marco Guerini
AAML
31
13
0
29 Mar 2024
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset
Janis Goldzycher
Paul Röttger
Gerold Schneider
AAML
29
8
0
28 Mar 2024
NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data
Manuel Tonneau
Pedro Vitor Quinta de Castro
Karim Lasri
I. Farouq
Lakshminarayanan Subramanian
Victor Orozco-Olvera
Samuel Fraiberger
36
9
0
28 Mar 2024
HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models
H. Nghiem
Hal Daumé
31
1
0
18 Mar 2024
Ethos: Rectifying Language Models in Orthogonal Parameter Space
Lei Gao
Yue Niu
Tingting Tang
A. Avestimehr
Murali Annavaram
MU
32
10
0
13 Mar 2024
Specification Overfitting in Artificial Intelligence
Benjamin Roth
Pedro Henrique Luz de Araujo
Yuxi Xia
Saskia Kaltenbrunner
Christoph Korab
56
0
0
13 Mar 2024
Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection
Tharindu Kumarage
Amrita Bhattacharjee
Joshua Garland
39
7
0
12 Mar 2024
GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?
Yiping Jin
Leo Wanner
A. Shvets
21
2
0
23 Feb 2024
Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon
Fajri Koto
Tilman Beck
Zeerak Talat
Iryna Gurevych
Timothy Baldwin
44
7
0
03 Feb 2024
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer
Anusha Sinha
Wesley Hanwen Deng
Zachary Chase Lipton
Hoda Heidari
AAML
30
66
0
29 Jan 2024
Towards a Non-Ideal Methodological Framework for Responsible ML
Ramaravind Kommiya Mothilal
Shion Guha
Syed Ishtiaque Ahmed
32
7
0
20 Jan 2024
Muted: Multilingual Targeted Offensive Speech Identification and Visualization
Christoph Tillmann
Aashka Trivedi
Sara Rosenthal
Santosh Borse
Rong Zhang
Avirup Sil
Bishwaranjan Bhattacharjee
8
2
0
18 Dec 2023
Causal ATE Mitigates Unintended Bias in Controlled Text Generation
Rahul Madhavan
Kahini Wadhawan
21
0
0
19 Nov 2023
Functionality learning through specification instructions
Pedro Henrique Luz de Araujo
Benjamin Roth
ELM
33
0
0
14 Nov 2023
People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection
Indira Sen
Dennis Assenmacher
Mattia Samory
Isabelle Augenstein
Wil M.P. van der Aalst
Claudia Wagner
17
19
0
02 Nov 2023