arXiv: 2304.05335
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
11 April 2023
A. Deshpande
Vishvak Murahari
Tanmay Rajpurohit
A. Kalyan
Karthik Narasimhan
LM&MA
LLMAG
Papers citing
"Toxicity in ChatGPT: Analyzing Persona-assigned Language Models"
50 / 59 papers shown
A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias
Brandon Smith
Mohamed Reda Bouadjenek
Tahsin Alamgir Kheya
Phillip Dawson
S. Aryal
ALM
ELM
26
0
0
14 May 2025
Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
Yixin Cheng
Hongcheng Guo
Yangming Li
Leonid Sigal
AAML
WaLM
59
0
0
08 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
72
1
0
05 May 2025
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang
Ke Ma
Xiaojun Jia
Yingfei Sun
Qianqian Xu
Q. Huang
AAML
159
0
0
03 May 2025
SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
47
0
0
28 Apr 2025
Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju
Weijie Shi
Chengzhong Liu
Yalan Qin
Jipeng Zhang
...
Jia Zhu
Jiajie Xu
Yaodong Yang
Sirui Han
Yike Guo
134
0
0
17 Apr 2025
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming
Stefan Schoepf
Muhammad Zaid Hameed
Ambrish Rawat
Kieran Fraser
Giulio Zizzo
Giandomenico Cornacchia
Mark Purcell
37
0
0
08 Mar 2025
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
Yuchen Wen
Keping Bi
Wei Chen
J. Guo
Xueqi Cheng
89
1
0
20 Feb 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil
Vipul Gupta
Sarkar Snigdha Sarathi Das
R. Passonneau
178
0
0
07 Feb 2025
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol
Trilok Padhi
Agnik Saha
Ugur Kursuncu
Mehmet Emin Aktas
47
1
0
17 Jan 2025
Enhancing Patient-Centric Communication: Leveraging LLMs to Simulate Patient Perspectives
Xinyao Ma
Rui Zhu
Zihao Wang
Jingwei Xiong
Qingyu Chen
Haixu Tang
L. Jean Camp
Lucila Ohno-Machado
LM&MA
46
0
0
12 Jan 2025
UPCS: Unbiased Persona Construction for Dialogue Generation
Kuiyun Chen
Yanbin Wei
45
0
0
03 Jan 2025
Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play
Yifan Zeng
Liang Kairong
Fangzhou Dong
Peijia Zheng
53
0
0
26 Oct 2024
LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education
Iain Xie Weissburg
Sathvika Anand
Sharon Levy
Haewon Jeong
62
2
0
17 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Fan Zhang
Yongbin Li
59
5
0
17 Oct 2024
Bias Similarity Across Large Language Models
Hyejun Jeong
Shiqing Ma
Amir Houmansadr
54
0
0
15 Oct 2024
Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang
Junhong Wu
Chen Wang
Chengqing Zong
Junzhe Zhang
ALM
LRM
66
4
0
11 Oct 2024
LLMs generate structurally realistic social networks but overestimate political homophily
Serina Chang
Alicja Chaszczewicz
Emma Wang
Maya Josifovska
Emma Pierson
J. Leskovec
44
6
0
29 Aug 2024
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
36
1
0
25 Jun 2024
Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance
Manon Reusens
Philipp Borchert
Jochen De Weerdt
Bart Baesens
42
1
0
25 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li
Yifan Wang
A. Grama
Ruqi Zhang
AI4TS
49
9
0
24 Jun 2024
EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
Siyu Yuan
Kaitao Song
Jiangjie Chen
Xu Tan
Dongsheng Li
Deqing Yang
LLMAG
66
14
0
20 Jun 2024
Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini
Somnath Basu Roy Chowdhury
Snigdha Chaturvedi
51
6
0
17 Jun 2024
Merging Improves Self-Critique Against Jailbreak Attacks
Victor Gallego
AAML
MoMe
44
3
0
11 Jun 2024
Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models
Jisu Shin
Hoyun Song
Huije Lee
Soyeong Jeong
Jong C. Park
38
6
0
06 Jun 2024
Building Better AI Agents: A Provocation on the Utilisation of Persona in LLM-based Conversational Agents
Guangzhi Sun
Xiao Zhan
Jose Such
41
24
0
26 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
32
8
0
08 May 2024
MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
Fang Guo
Wenyu Li
Honglei Zhuang
Yun Luo
Yafu Li
Qi Zhu
Le Yan
Yue Zhang
ALM
73
6
0
18 Apr 2024
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang
Shitong Duan
Xiaoyuan Yi
Jing Yao
Shanlin Zhou
Zhihua Wei
Peng Zhang
Dongkuan Xu
Maosong Sun
Xing Xie
OffRL
41
16
0
07 Mar 2024
From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Luiza Amador Pozzobon
Patrick Lewis
Sara Hooker
B. Ermiş
38
7
0
06 Mar 2024
Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution
Flor Miriam Plaza del Arco
A. C. Curry
Alba Curry
Gavin Abercrombie
Dirk Hovy
31
24
0
05 Mar 2024
Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a Large Language Model Based on Group-Level Demographic Information
Seungjong Sun
Eungu Lee
Dongyan Nan
Xiangying Zhao
Wonbyung Lee
Bernard J. Jansen
Jang Hyun Kim
56
17
0
28 Feb 2024
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Kazuhiro Takemoto
34
21
0
18 Jan 2024
Canvil: Designerly Adaptation for LLM-Powered User Experiences
K. J. Kevin Feng
Q. V. Liao
Ziang Xiao
Jennifer Wortman Vaughan
Amy X. Zhang
David W. McDonald
39
16
0
17 Jan 2024
Authorship Obfuscation in Multilingual Machine-Generated Text Detection
Dominik Macko
Robert Moro
Adaku Uchendu
Ivan Srba
Jason Samuel Lucas
Michiharu Yamashita
Nafis Irtiza Tripto
Dongwon Lee
Jakub Simko
M. Bieliková
DeLMO
40
17
0
15 Jan 2024
Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images
Shicheng Xu
Danyang Hou
Liang Pang
Jingcheng Deng
Jun Xu
Huawei Shen
Xueqi Cheng
16
8
0
23 Nov 2023
Generative AI for Hate Speech Detection: Evaluation and Findings
Sagi Pendzel
Tomer Wullach
Amir Adler
Einat Minkov
25
11
0
16 Nov 2023
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities
Lingbo Mo
Boshi Wang
Muhao Chen
Huan Sun
29
27
0
15 Nov 2023
Does Pre-trained Language Model Actually Infer Unseen Links in Knowledge Graph Completion?
Yusuke Sakai
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
26
1
0
15 Nov 2023
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Zhexin Zhang
Junxiao Yang
Pei Ke
Fei Mi
Hongning Wang
Minlie Huang
AAML
28
113
0
15 Nov 2023
Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang
Yan Teng
Kexin Huang
Chengqi Lyu
Songyang Zhang
Wenwei Zhang
Xingjun Ma
Yu-Gang Jiang
Yu Qiao
Yingchun Wang
35
15
0
10 Nov 2023
Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models
Luiza Amador Pozzobon
B. Ermiş
Patrick Lewis
Sara Hooker
30
20
0
11 Oct 2023
GPT-who: An Information Density-based Machine-Generated Text Detector
Saranya Venkatraman
Adaku Uchendu
Dongwon Lee
DeLMO
29
33
0
09 Oct 2023
Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting
Tilman Beck
Hendrik Schuff
Anne Lauscher
Iryna Gurevych
37
32
0
13 Sep 2023
CMD: a framework for Context-aware Model self-Detoxification
Zecheng Tang
Keyan Zhou
Juntao Li
Yuyang Ding
Pinzheng Wang
Bowen Yan
Min Zhang
MU
23
5
0
16 Aug 2023
Collective Human Opinions in Semantic Textual Similarity
Yuxia Wang
Shimin Tao
Ning Xie
Hao Yang
Timothy Baldwin
Karin Verspoor
29
4
0
08 Aug 2023
Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy
Yu Fu
Deyi Xiong
Yue Dong
WaLM
45
30
0
25 Jul 2023
Large Language Models as Superpositions of Cultural Perspectives
Grgur Kovač
Masataka Sawayama
Rémy Portelas
Cédric Colas
Peter Ford Dominey
Pierre-Yves Oudeyer
LLMAG
35
33
0
15 Jul 2023
On the Exploitability of Instruction Tuning
Manli Shu
Jiong Wang
Chen Zhu
Jonas Geiping
Chaowei Xiao
Tom Goldstein
SILM
31
91
0
28 Jun 2023
In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Leonard Salewski
Stephan Alaniz
Isabel Rio-Torto
Eric Schulz
Zeynep Akata
44
149
0
24 May 2023