arXiv: 2202.11176

A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
22 February 2022
Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Scott Sorensen, Jai Gupta, Donald Metzler, Lucy Vasserman
Papers citing "A New Generation of Perspective API: Efficient Multilingual Character-level Transformers" (showing 50 of 102)
MisgenderMender: A Community-Informed Approach to Interventions for Misgendering
Tamanna Hossain, Sunipa Dev, Sameer Singh (23 Apr 2024)

Towards Human-centered Proactive Conversational Agents [LLMAG]
Yang Deng, Lizi Liao, Zhonghua Zheng, Grace Hui Yang, Tat-Seng Chua (19 Apr 2024)

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts [ELM]
Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien (09 Apr 2024)

Fairness in Large Language Models: A Taxonomic Survey [AILaw]
Zhibo Chu, Zichong Wang, Wenbin Zhang (31 Mar 2024)

A Review of Multi-Modal Large Language and Vision Models [VLM]
Kilian Carolan, Laura Fennelly, A. Smeaton (28 Mar 2024)

NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data
Manuel Tonneau, Pedro Vitor Quinta de Castro, Karim Lasri, I. Farouq, Lakshminarayanan Subramanian, Victor Orozco-Olvera, Samuel Fraiberger (28 Mar 2024)

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [AAML, KELM]
Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, D. Song, Bo-wen Li (19 Mar 2024)

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Luiza Amador Pozzobon, Patrick Lewis, Sara Hooker, B. Ermiş (06 Mar 2024)

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, …, Hao-Lun Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang (26 Feb 2024)

Feedback Loops With Language Models Drive In-Context Reward Hacking [KELM]
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt (09 Feb 2024)

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [ELM]
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao (07 Feb 2024)

Gradient-Based Language Model Red Teaming
Nevan Wichers, Carson E. Denison, Ahmad Beirami (30 Jan 2024)

A Group Fairness Lens for Large Language Models [ALM]
Guanqun Bi, Lei Shen, Yuqiang Xie, Yanan Cao, Tiangang Zhu, Xiao-feng He (24 Dec 2023)

The DSA Transparency Database: Auditing Self-reported Moderation Actions by Social Media
Amaury Trujillo, T. Fagni, S. Cresci (16 Dec 2023)

ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie (13 Dec 2023)

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [AI4MH]
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, …, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa (07 Dec 2023)

Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models
Sungjoo Byun, Dongjun Jang, Hyemi Jo, Hyopil Shin (30 Nov 2023)

Causal ATE Mitigates Unintended Bias in Controlled Text Generation
Rahul Madhavan, Kahini Wadhawan (19 Nov 2023)

Subtle Misogyny Detection and Mitigation: An Expert-Annotated Dataset
Brooklyn Sheppard, Anna Richter, Allison Cohen, Elizabeth Allyn Smith, Tamara Kneese, Carolyne Pelletier, Ioana Baldini, Yue Dong (15 Nov 2023)

Toxicity Detection is NOT all you Need: Measuring the Gaps to Supporting Volunteer Content Moderators
Yang Trista Cao, Lovely-Frances Domingo, Sarah Ann Gilbert, Michelle Mazurek, Katie Shilton, Hal Daumé (14 Nov 2023)

A Taxonomy of Rater Disagreements: Surveying Challenges & Opportunities from the Perspective of Annotating Online Toxicity
Wenbo Zhang, Hangzhi Guo, Ian D Kivlichan, Vinodkumar Prabhakaran, Davis Yadav, Amulya Yadav (07 Nov 2023)

Unveiling Safety Vulnerabilities of Large Language Models [AAML]
George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, E. Farchi (07 Nov 2023)

Measuring Adversarial Datasets
Yuanchen Bai, Raoyi Huang, Vijay Viswanathan, Tzu-Sheng Kuo, Tongshuang Wu (06 Nov 2023)

People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection
Indira Sen, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil M.P. van der Aalst, Claudia Wagner (02 Nov 2023)

What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations
Kavel Rao, Liwei Jiang, Valentina Pyatkin, Yuling Gu, Niket Tandon, Nouha Dziri, Faeze Brahman, Yejin Choi (24 Oct 2023)

Towards Detecting Contextual Real-Time Toxicity for In-Game Chat
Zachary Yang, Nicolas Grenon-Godbout, Reihaneh Rabbany (20 Oct 2023)

PaLI-3 Vision Language Models: Smaller, Faster, Stronger [MLLM, VLM]
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, …, Keran Rong, Tianli Yu, Daniel Keysers, Xiao-Qi Zhai, Radu Soricut (13 Oct 2023)

Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems
Yixin Wan, Jieyu Zhao, Aman Chadha, Nanyun Peng, Kai-Wei Chang (08 Oct 2023)

Auto-survey Challenge
Thanh Gia Hieu Khuong, Benedictus Kent Rachmat (06 Oct 2023)

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [SILM]
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson (05 Oct 2023)

CLEVA: Chinese Language Models EVAluation Platform [ALM, ELM]
Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, …, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu, Liwei Wang (09 Aug 2023)

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset [ALM]
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang (10 Jul 2023)

Your spouse needs professional help: Determining the Contextual Appropriateness of Messages through Modeling Social Relationships
David Jurgens, Agrima Seth, Jack E. Sargent, Athena Aghighi, Michael Geraci (06 Jul 2023)

Your Attack Is Too DUMB: Formalizing Attacker Scenarios for Adversarial Transferability [AAML]
Marco Alecci, Mauro Conti, Francesco Marchiori, L. Martinelli, Luca Pajola (27 Jun 2023)

Lost in Translation: Large Language Models in Non-English Content Analysis [ELM]
Gabriel Nicholas, Aliya Bhatia (12 Jun 2023)

CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation
Rahul Madhavan, Rishabh Garg, Kahini Wadhawan, S. Mehta (01 Jun 2023)

PaLI-X: On Scaling up a Multilingual Vision and Language Model [VLM]
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, …, Mojtaba Seyedhosseini, A. Angelova, Xiaohua Zhai, N. Houlsby, Radu Soricut (29 May 2023)

SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, M. Cha, …, Eun-Ju Lee, Yong Lim, Alice H. Oh, San-hee Park, Jung-Woo Ha (28 May 2023)

Healing Unsafe Dialogue Responses with Weak Supervision Signals
Zi Liang, Pinghui Wang, Ruofei Zhang, Shuo Zhang, Xiaofan Ye, Yi Huang, Junlan Feng (25 May 2023)

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, …, Denny Zhou, Jason W. Wei, Kevin Robinson, David M. Mimno, Daphne Ippolito (22 May 2023)

ToxBuster: In-game Chat Toxicity Buster with BERT
Zachary Yang, Yasmine Maricar, M. Davari, Nicolas Grenon-Godbout, Reihaneh Rabbany (21 May 2023)

Analyzing Norm Violations in Live-Stream Chat
Jihyung Moon, Dong-Ho Lee, Hyundong Justin Cho, Woojeong Jin, Chan Young Park, MinWoo Kim, Jonathan May, Jay Pujara, Sungjoon Park (18 May 2023)

Toxic comments reduce the activity of volunteer editors on Wikipedia [KELM]
Ivan Smirnov, Camelia Oprea, Markus Strohmaier (26 Apr 2023)

On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research
Luiza Amador Pozzobon, B. Ermiş, Patrick Lewis, Sara Hooker (24 Apr 2023)

Language Model Behavior: A Comprehensive Survey [VLM, LRM, LM&MA]
Tyler A. Chang, Benjamin Bergen (20 Mar 2023)

Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI
Lorena Piedras, Lucas Rosenblatt, Julia Wilkins (05 Jan 2023)

Hypothesis Engineering for Zero-Shot Hate Speech Detection
Janis Goldzycher, Gerold Schneider (03 Oct 2022)

A Holistic Approach to Undesired Content Detection in the Real World
Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, L. Weng (05 Aug 2022)

UL2: Unifying Language Learning Paradigms [AI4CE]
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason W. Wei, …, Tal Schuster, H. Zheng, Denny Zhou, N. Houlsby, Donald Metzler (10 May 2022)

Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages
Thomas Mandl, Sandip J Modha, Gautam Kishore Shahi, Prasenjit Majumder, Mohana Dave, Daksh Patel, Chintak Mandalia, Aditya Patel (12 Aug 2021)