ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv: 2202.11176
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers

22 February 2022
Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Scott Sorensen, Jai Gupta, Donald Metzler, Lucy Vasserman

Papers citing "A New Generation of Perspective API: Efficient Multilingual Character-level Transformers"

Showing 50 of 102 citing papers.
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng R. Li, Anil Vullikanti
11 May 2025 · AAML

Mapping the Italian Telegram Ecosystem: Communities, Toxicity, and Hate Speech
Lorenzo Alvisi, S. Tardelli, Maurizio Tesconi
28 Apr 2025

VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu, Tianke Zhang, Chang Meng, X. Wang, Jinpeng Wang, ..., Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, Kun Gai
21 Apr 2025 · OffRL

Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher Homan, S. Liyanage
02 Apr 2025

Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà, E. Peruzzo, Xingqian Xu, Humphrey Shi, N. Sebe, Massimiliano Mancini
14 Mar 2025 · MU

SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
Xingwei Tan, Chen Lyu, Hafiz Muhammad Umer, Sahrish Khan, Mahathi Parvatham, Lois Arthurs, Simon Cullen, Shelley Wilson, Arshad Jhumka, Gabriele Pergola
09 Mar 2025

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
17 Feb 2025

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, R. Passonneau
07 Feb 2025

GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun-Xiong Xia, Tianyi Wu, Zhiwei Xue, Y. Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
30 Jan 2025 · AI4TS, LRM

Dynamics of Toxicity in Political Podcasts
Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee
22 Jan 2025

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
08 Jan 2025 · SILM

Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?
Manuel Weber, Moritz Huber, Maximilian Auch, Alexander Döschl, Max-Emanuel Keller, P. Mandl
03 Jan 2025

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
LLM-jp, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, ..., Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino
31 Dec 2024

Towards Efficient and Explainable Hate Speech Detection via Model Distillation
Paloma Piot, Javier Parapar
18 Dec 2024

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau, Diyi Liu, Niyati Malhotra, Scott A. Hale, Samuel Fraiberger, Victor Orozco-Olvera, Paul Röttger
23 Nov 2024

Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
Aaron Zheng, Mansi Rana, Andreas Stolcke
21 Nov 2024

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, ..., Le Sun, Jie Lou, Bowen Yu, Y. Lu, Hongyu Lin
18 Nov 2024 · ALM

The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu
18 Nov 2024

Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Saketh Bachu, Erfan Shayegani, Trishna Chakraborty, Rohit Lal, Arindam Dutta, Chengyu Song, Yue Dong, Nael B. Abu-Ghazaleh, A. Roy-Chowdhury
06 Nov 2024

On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang
14 Oct 2024

JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles
Dom Nasrabadi
11 Oct 2024

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
09 Oct 2024

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, ..., René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr
30 Sep 2024

Alignment with Preference Optimization Is All You Need for LLM Safety
Réda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, M. Seddik, Mugariya Farooq, Hakim Hacid
12 Sep 2024

Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
21 Aug 2024

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen, Yi Liu, Dongxia Wang, Jiaying Chen, Wenhai Wang
18 Aug 2024

Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss
11 Aug 2024 · AAML

The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement
Jian Li, Bowen Xu, Sören Schwertfeger
01 Aug 2024

Towards Generalized Offensive Language Identification
A. Dmonte, Tejas Arya, Tharindu Ranasinghe, Marcos Zampieri
26 Jul 2024

SAFETY-J: Evaluating Safety with Critique
Yixiu Liu, Yuxiang Zheng, Shijie Xia, Jiajun Li, Yi Tu, Chaoling Song, Pengfei Liu
24 Jul 2024 · ELM

Tracking Patterns in Toxicity and Antisocial Behavior Over User Lifetimes on Large Social Media Platforms
Katy Blumer, Jon Kleinberg
12 Jul 2024

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
Jinseok Kim, Jaewon Jung, Sangyeop Kim, S. Park, Sungzoon Cho
09 Jul 2024

$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang, Bo-wen Li
08 Jul 2024 · LRM

Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
01 Jul 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
26 Jun 2024

FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun, Vassilina Nikoulina
25 Jun 2024

LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo, Shaun Khoo
24 Jun 2024

Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
23 Jun 2024 · CLL

Supporting Human Raters with the Detection of Harmful Content using Large Language Models
Kurt Thomas, Patrick Gage Kelley, David Tao, Sarah Meiklejohn, Owen Vallis, Shunwen Tan, Blaz Bratanic, Felipe Tiengo Ferreira, Vijay Eranti, Elie Bursztein
18 Jun 2024

TorchOpera: A Compound AI System for LLM Safety
Shanshan Han, Yuhang Yao, Zijian Hu, Dimitris Stripelis, Zhaozhuo Xu, Chaoyang He
16 Jun 2024 · LLMAG

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, ..., Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li
13 Jun 2024 · LLMAG

The Life Cycle of Large Language Models: A Review of Biases in Education
Jinsook Lee, Yann Hicke, Renzhe Yu, Christopher A. Brooks, René F. Kizilcec
03 Jun 2024 · AI4Ed

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Raphaël Ségerie, Vincent Corruble
03 Jun 2024

Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
31 May 2024

Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias
Rebecca Dorn, Lee Kezar, Fred Morstatter, Kristina Lerman
23 May 2024

Grounding Toxicity in Real-World Events across Languages
Wondimagegnhue Tufa, Ilia Markov, Piek Vossen
22 May 2024

Jill Watson: A Virtual Teaching Assistant powered by ChatGPT
Karan Taneja, Pratyusha Maiti, Sandeep Kakar, P. Guruprasad, Sanjeev Rao, Ashok K. Goel
17 May 2024

"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, Tanushree Mitra
08 May 2024

The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
Wondimagegnhue Tufa, Ilia Markov, Piek Vossen
29 Apr 2024