ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv: 2202.11176
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers

22 February 2022
Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Scott Sorensen, Jai Gupta, Donald Metzler, Lucy Vasserman

Papers citing "A New Generation of Perspective API: Efficient Multilingual Character-level Transformers"

Showing 50 of 102 citing papers.
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng R. Li, Anil Vullikanti
11 May 2025 · AAML

Mapping the Italian Telegram Ecosystem: Communities, Toxicity, and Hate Speech
Lorenzo Alvisi, S. Tardelli, Maurizio Tesconi
28 Apr 2025

VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu, Tianke Zhang, Chang Meng, X. Wang, Jinpeng Wang, ..., Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, Kun Gai
21 Apr 2025 · OffRL

Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher Homan, S. Liyanage
02 Apr 2025

Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà, E. Peruzzo, Xingqian Xu, Humphrey Shi, N. Sebe, Massimiliano Mancini
14 Mar 2025 · MU

SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
Xingwei Tan, Chen Lyu, Hafiz Muhammad Umer, Sahrish Khan, Mahathi Parvatham, Lois Arthurs, Simon Cullen, Shelley Wilson, Arshad Jhumka, Gabriele Pergola
09 Mar 2025

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
17 Feb 2025

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, R. Passonneau
07 Feb 2025

GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun-Xiong Xia, Tianyi Wu, Zhiwei Xue, Y. Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
30 Jan 2025 · AI4TS, LRM

Dynamics of Toxicity in Political Podcasts
Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee
22 Jan 2025

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
08 Jan 2025 · SILM

Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?
Manuel Weber, Moritz Huber, Maximilian Auch, Alexander Döschl, Max-Emanuel Keller, P. Mandl
03 Jan 2025

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
LLM-jp, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, ..., Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino
31 Dec 2024

Towards Efficient and Explainable Hate Speech Detection via Model Distillation
Paloma Piot, Javier Parapar
18 Dec 2024

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau, Diyi Liu, Niyati Malhotra, Scott A. Hale, Samuel Fraiberger, Victor Orozco-Olvera, Paul Röttger
23 Nov 2024

Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
Aaron Zheng, Mansi Rana, Andreas Stolcke
21 Nov 2024

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, ..., Le Sun, Jie Lou, Bowen Yu, Y. Lu, Hongyu Lin
18 Nov 2024 · ALM

The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu
18 Nov 2024

Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Saketh Bachu, Erfan Shayegani, Trishna Chakraborty, Rohit Lal, Arindam Dutta, Chengyu Song, Yue Dong, Nael B. Abu-Ghazaleh, A. Roy-Chowdhury
06 Nov 2024

On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang
14 Oct 2024

JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles
Dom Nasrabadi
11 Oct 2024

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
09 Oct 2024

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, ..., René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr
30 Sep 2024

Alignment with Preference Optimization Is All You Need for LLM Safety
Réda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, M. Seddik, Mugariya Farooq, Hakim Hacid
12 Sep 2024

Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
21 Aug 2024

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen, Yi Liu, Dongxia Wang, Jiaying Chen, Wenhai Wang
18 Aug 2024

Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss
11 Aug 2024 · AAML

The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement
Jian Li, Bowen Xu, Sören Schwertfeger
01 Aug 2024

Towards Generalized Offensive Language Identification
A. Dmonte, Tejas Arya, Tharindu Ranasinghe, Marcos Zampieri
26 Jul 2024

SAFETY-J: Evaluating Safety with Critique
Yixiu Liu, Yuxiang Zheng, Shijie Xia, Jiajun Li, Yi Tu, Chaoling Song, Pengfei Liu
24 Jul 2024 · ELM

Tracking Patterns in Toxicity and Antisocial Behavior Over User Lifetimes on Large Social Media Platforms
Katy Blumer, Jon Kleinberg
12 Jul 2024

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
Jinseok Kim, Jaewon Jung, Sangyeop Kim, S. Park, Sungzoon Cho
09 Jul 2024

$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang, Bo-wen Li
08 Jul 2024 · LRM

Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
01 Jul 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
26 Jun 2024

FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun, Vassilina Nikoulina
25 Jun 2024

LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo, Shaun Khoo
24 Jun 2024

Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
23 Jun 2024 · CLL

Supporting Human Raters with the Detection of Harmful Content using Large Language Models
Kurt Thomas, Patrick Gage Kelley, David Tao, Sarah Meiklejohn, Owen Vallis, Shunwen Tan, Blaz Bratanic, Felipe Tiengo Ferreira, Vijay Eranti, Elie Bursztein
18 Jun 2024

TorchOpera: A Compound AI System for LLM Safety
Shanshan Han, Yuhang Yao, Zijian Hu, Dimitris Stripelis, Zhaozhuo Xu, Chaoyang He
16 Jun 2024 · LLMAG

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, ..., Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li
13 Jun 2024 · LLMAG

The Life Cycle of Large Language Models: A Review of Biases in Education
Jinsook Lee, Yann Hicke, Renzhe Yu, Christopher A. Brooks, René F. Kizilcec
03 Jun 2024 · AI4Ed

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Raphaël Ségerie, Vincent Corruble
03 Jun 2024

Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
31 May 2024

Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias
Rebecca Dorn, Lee Kezar, Fred Morstatter, Kristina Lerman
23 May 2024

Grounding Toxicity in Real-World Events across Languages
Wondimagegnhue Tufa, Ilia Markov, Piek Vossen
22 May 2024

Jill Watson: A Virtual Teaching Assistant powered by ChatGPT
Karan Taneja, Pratyusha Maiti, Sandeep Kakar, P. Guruprasad, Sanjeev Rao, Ashok K. Goel
17 May 2024

"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, Tanushree Mitra
08 May 2024

The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
Wondimagegnhue Tufa, Ilia Markov, Piek Vossen
29 Apr 2024