arXiv: 2304.05335
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
11 April 2023
A. Deshpande
Vishvak Murahari
Tanmay Rajpurohit
A. Kalyan
Karthik Narasimhan
LM&MA
LLMAG
Papers citing
"Toxicity in ChatGPT: Analyzing Persona-assigned Language Models"
50 / 59 papers shown
A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias
Brandon Smith
Mohamed Reda Bouadjenek
Tahsin Alamgir Kheya
Phillip Dawson
S. Aryal
ALM
ELM
26
0
0
14 May 2025
Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
Yixin Cheng
Hongcheng Guo
Yangming Li
Leonid Sigal
AAML
WaLM
59
0
0
08 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
72
1
0
05 May 2025
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang
Ke Ma
Xiaojun Jia
Yingfei Sun
Qianqian Xu
Q. Huang
AAML
159
0
0
03 May 2025
SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
47
0
0
28 Apr 2025
Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju
Weijie Shi
Chengzhong Liu
Yalan Qin
Jipeng Zhang
...
Jia Zhu
Jiajie Xu
Yaodong Yang
Sirui Han
Yike Guo
134
0
0
17 Apr 2025
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming
Stefan Schoepf
Muhammad Zaid Hameed
Ambrish Rawat
Kieran Fraser
Giulio Zizzo
Giandomenico Cornacchia
Mark Purcell
37
0
0
08 Mar 2025
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
Yuchen Wen
Keping Bi
Wei Chen
J. Guo
Xueqi Cheng
89
1
0
20 Feb 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil
Vipul Gupta
Sarkar Snigdha Sarathi Das
R. Passonneau
178
0
0
07 Feb 2025
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol
Trilok Padhi
Agnik Saha
Ugur Kursuncu
Mehmet Emin Aktas
47
1
0
17 Jan 2025
Enhancing Patient-Centric Communication: Leveraging LLMs to Simulate Patient Perspectives
Xinyao Ma
Rui Zhu
Zihao Wang
Jingwei Xiong
Qingyu Chen
Haixu Tang
L. Jean Camp
Lucila Ohno-Machado
LM&MA
46
0
0
12 Jan 2025
UPCS: Unbiased Persona Construction for Dialogue Generation
Kuiyun Chen
Yanbin Wei
45
0
0
03 Jan 2025
Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play
Yifan Zeng
Liang Kairong
Fangzhou Dong
Peijia Zheng
53
0
0
26 Oct 2024
LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education
Iain Xie Weissburg
Sathvika Anand
Sharon Levy
Haewon Jeong
62
2
0
17 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Fan Zhang
Yongbin Li
59
5
0
17 Oct 2024
Bias Similarity Across Large Language Models
Hyejun Jeong
Shiqing Ma
Amir Houmansadr
54
0
0
15 Oct 2024
Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang
Junhong Wu
Chen Wang
Chengqing Zong
Junzhe Zhang
ALM
LRM
66
4
0
11 Oct 2024
LLMs generate structurally realistic social networks but overestimate political homophily
Serina Chang
Alicja Chaszczewicz
Emma Wang
Maya Josifovska
Emma Pierson
J. Leskovec
44
6
0
29 Aug 2024
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
36
1
0
25 Jun 2024
Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance
Manon Reusens
Philipp Borchert
Jochen De Weerdt
Bart Baesens
42
1
0
25 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li
Yifan Wang
A. Grama
Ruqi Zhang
AI4TS
49
9
0
24 Jun 2024
EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
Siyu Yuan
Kaitao Song
Jiangjie Chen
Xu Tan
Dongsheng Li
Deqing Yang
LLMAG
66
14
0
20 Jun 2024
Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini
Somnath Basu Roy Chowdhury
Snigdha Chaturvedi
51
6
0
17 Jun 2024
Merging Improves Self-Critique Against Jailbreak Attacks
Victor Gallego
AAML
MoMe
44
3
0
11 Jun 2024
Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models
Jisu Shin
Hoyun Song
Huije Lee
Soyeong Jeong
Jong C. Park
38
6
0
06 Jun 2024
Building Better AI Agents: A Provocation on the Utilisation of Persona in LLM-based Conversational Agents
Guangzhi Sun
Xiao Zhan
Jose Such
41
24
0
26 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
32
8
0
08 May 2024
MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
Fang Guo
Wenyu Li
Honglei Zhuang
Yun Luo
Yafu Li
Qi Zhu
Le Yan
Yue Zhang
ALM
73
6
0
18 Apr 2024
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang
Shitong Duan
Xiaoyuan Yi
Jing Yao
Shanlin Zhou
Zhihua Wei
Peng Zhang
Dongkuan Xu
Maosong Sun
Xing Xie
OffRL
41
16
0
07 Mar 2024
From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Luiza Amador Pozzobon
Patrick Lewis
Sara Hooker
B. Ermiş
38
7
0
06 Mar 2024
Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution
Flor Miriam Plaza del Arco
A. C. Curry
Alba Curry
Gavin Abercrombie
Dirk Hovy
31
24
0
05 Mar 2024
Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a Large Language Model Based on Group-Level Demographic Information
Seungjong Sun
Eungu Lee
Dongyan Nan
Xiangying Zhao
Wonbyung Lee
Bernard J. Jansen
Jang Hyun Kim
56
17
0
28 Feb 2024
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Kazuhiro Takemoto
34
21
0
18 Jan 2024
Canvil: Designerly Adaptation for LLM-Powered User Experiences
K. J. Kevin Feng
Q. V. Liao
Ziang Xiao
Jennifer Wortman Vaughan
Amy X. Zhang
David W. McDonald
39
16
0
17 Jan 2024
Authorship Obfuscation in Multilingual Machine-Generated Text Detection
Dominik Macko
Robert Moro
Adaku Uchendu
Ivan Srba
Jason Samuel Lucas
Michiharu Yamashita
Nafis Irtiza Tripto
Dongwon Lee
Jakub Simko
M. Bieliková
DeLMO
40
17
0
15 Jan 2024
Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images
Shicheng Xu
Danyang Hou
Liang Pang
Jingcheng Deng
Jun Xu
Huawei Shen
Xueqi Cheng
16
8
0
23 Nov 2023
Generative AI for Hate Speech Detection: Evaluation and Findings
Sagi Pendzel
Tomer Wullach
Amir Adler
Einat Minkov
25
11
0
16 Nov 2023
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities
Lingbo Mo
Boshi Wang
Muhao Chen
Huan Sun
29
27
0
15 Nov 2023
Does Pre-trained Language Model Actually Infer Unseen Links in Knowledge Graph Completion?
Yusuke Sakai
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
26
1
0
15 Nov 2023
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Zhexin Zhang
Junxiao Yang
Pei Ke
Fei Mi
Hongning Wang
Minlie Huang
AAML
28
113
0
15 Nov 2023
Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang
Yan Teng
Kexin Huang
Chengqi Lyu
Songyang Zhang
Wenwei Zhang
Xingjun Ma
Yu-Gang Jiang
Yu Qiao
Yingchun Wang
35
15
0
10 Nov 2023
Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models
Luiza Amador Pozzobon
B. Ermiş
Patrick Lewis
Sara Hooker
30
20
0
11 Oct 2023
GPT-who: An Information Density-based Machine-Generated Text Detector
Saranya Venkatraman
Adaku Uchendu
Dongwon Lee
DeLMO
29
33
0
09 Oct 2023
Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting
Tilman Beck
Hendrik Schuff
Anne Lauscher
Iryna Gurevych
37
32
0
13 Sep 2023
CMD: a framework for Context-aware Model self-Detoxification
Zecheng Tang
Keyan Zhou
Juntao Li
Yuyang Ding
Pinzheng Wang
Bowen Yan
Min Zhang
MU
23
5
0
16 Aug 2023
Collective Human Opinions in Semantic Textual Similarity
Yuxia Wang
Shimin Tao
Ning Xie
Hao Yang
Timothy Baldwin
Karin Verspoor
29
4
0
08 Aug 2023
Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy
Yu Fu
Deyi Xiong
Yue Dong
WaLM
45
30
0
25 Jul 2023
Large Language Models as Superpositions of Cultural Perspectives
Grgur Kovač
Masataka Sawayama
Rémy Portelas
Cédric Colas
Peter Ford Dominey
Pierre-Yves Oudeyer
LLMAG
35
33
0
15 Jul 2023
On the Exploitability of Instruction Tuning
Manli Shu
Jiong Wang
Chen Zhu
Jonas Geiping
Chaowei Xiao
Tom Goldstein
SILM
31
91
0
28 Jun 2023
In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Leonard Salewski
Stephan Alaniz
Isabel Rio-Torto
Eric Schulz
Zeynep Akata
44
149
0
24 May 2023