Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

Conference on Computer and Communications Security (CCS), 2022

7 September 2022

Waiman Si

Michael Backes

Jeremy Blackburn

Emiliano De Cristofaro

Gianluca Stringhini

Savvas Zannettou

Yang Zhang

Papers citing "Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots"

37 / 37 papers shown

Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions

Expert Systems with Applications (ESWA), 2025

216

29 Sep 2025

"Abuse Risks are Often Inherent to Product Features": Exploring AI Vendors' Bug Bounty and Responsible Disclosure Policies

Yangheran Piao

Jingjie Li

Daniel W. Woods

274

07 Sep 2025

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

307

07 Aug 2025

LM-Scout: Analyzing the Security of Language Model Integration in Android Apps

382

13 May 2025

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

International Conference on Learning Representations (ICLR), 2025

475

03 Jan 2025

The Dark Side of AI Companionship: A Taxonomy of Harmful Algorithmic Behaviors in Human-AI Relationships

International Conference on Human Factors in Computing Systems (CHI), 2024

323

26 Oct 2024

Vision Language Models Can Parse Floor Plan Maps

405

19 Sep 2024

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

348

01 Sep 2024

GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models

Conference on Computer and Communications Security (CCS), 2024

Weiming Zhang

Tianwei Zhang

Nenghai Yu

512

22 Aug 2024

Efficient Detection of Toxic Prompts in Large Language Models

International Conference on Automated Software Engineering (ASE), 2024

549

21 Aug 2024

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

324

26 Jun 2024

A Map of Exploring Human Interaction patterns with LLM: Insights into Collaboration and Creativity

Jiayang Li

Jiale Li

392

06 Apr 2024

SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Graham Neubig

469

13 Mar 2024

Prompt Stealing Attacks Against Large Language Models

Zeyang Sha

Yang Zhang

SILM AAML

456

20 Feb 2024

Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning

Kailong Wang

Yang Liu

269

13 Feb 2024

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Yang Liu

390

08 Dec 2023

GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives

Vinodkumar Prabhakaran

Christopher Homan

Lora Aroyo

Aida Mostafazadeh Davani

316

09 Nov 2023

Comprehensive Assessment of Toxicity in ChatGPT

Michael Backes

342

03 Nov 2023

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Ning Liu

294

142

20 Oct 2023

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

Shuyu Jiang

Xingshu Chen

Rui Tang

370

16 Oct 2023

Low-Resource Languages Jailbreak GPT-4

556

305

03 Oct 2023

Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions

Asia-Pacific Computer Systems Architecture Conference (ACSA), 2023

Yufan Chen

Arjun Arunasalam

Z. Berkay Celik

234

03 Oct 2023

Bias and Fairness in Chatbots: An Overview

APSIPA Transactions on Signal and Information Processing (TASIP), 2023

386

16 Sep 2023

AI in the Gray: Exploring Moderation Policies in Dialogic Large Language Models vs. Human Answers in Controversial Topics

International Conference on Information and Knowledge Management (CIKM), 2023

Guillermo Suarez-Tangil

AI4MH

321

28 Aug 2023

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Conference on Computer and Communications Security (CCS), 2023

Michael Backes

595

558

07 Aug 2023

MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots

Network and Distributed System Security Symposium (NDSS), 2023

Yi Liu

Kailong Wang

Haoyu Wang

Yang Liu

496

216

16 Jul 2023

Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots

International Symposium on Recent Advances in Intrusion Detection (RAID), 2023

Bocheng Chen

Guangjing Wang

Hanqing Guo

Yuanda Wang

Qiben Yan

268

14 Jul 2023

Intersectionality in Conversational AI Safety: How Bayesian Multilevel Models Help Understand Diverse Perceptions of Safety

Lora Aroyo

Vinodkumar Prabhakaran

Alex S. Taylor

Ding Wang

267

20 Jun 2023

DICES Dataset: Diversity in Conversational AI Evaluation for Safety

Neural Information Processing Systems (NeurIPS), 2023

Lora Aroyo

Vinodkumar Prabhakaran

Ding Wang

313

20 Jun 2023

Prompt Injection attack against LLM-integrated Applications

Yi Liu

Kailong Wang

...

Haoyu Wang

Leo Yu Zhang

Yang Liu

SILM

612

677

08 Jun 2023

BiasAsker: Measuring the Bias in Conversational AI System

Michael Lyu

335

21 May 2023

Generating Phishing Attacks using ChatGPT

Sayak Saha Roy

Krishna Vamsi Naragam

Shirin Nilizadeh

278

09 May 2023

Safer Conversational AI as a Source of User Delight

329

18 Apr 2023

Talking Abortion (Mis)information with ChatGPT on TikTok

216

23 Feb 2023

Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements

Jiale Cheng

277

18 Feb 2023

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity

664

142

30 Jan 2023

Beam Search Strategies for Neural Machine Translation

Markus Freitag

Yaser Al-Onaizan

510

487

06 Feb 2017