Safeguarding Large Language Models: A Survey
arXiv:2406.02622 · 3 June 2024
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gao Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang
OffRL · KELM · AILaw
ArXiv · PDF · HTML
Cited By: Papers citing "Safeguarding Large Language Models: A Survey" (42 of 42 papers shown)
Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa, Yang Zhao, H. Kanayama, Bishwaranjan Bhattacharjee · 19 Apr 2025

Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails
William Hackett, Lewis Birch, Stefan Trawicki, N. Suri, Peter Garraghan · 15 Apr 2025

CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation
Runqi Sui · AAML · 10 Mar 2025

Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura, Sahil Wadhwa, Jesse Zymet, Akshay Gupta, Andy Luo, Melissa Kazemi Rad, Swapnil Shinde, Mohammad Sorower · AAML · 03 Mar 2025

Among Them: A game-based framework for assessing persuasion capabilities of LLMs
Mateusz Idziejczak, Vasyl Korzavatykh, Mateusz Stawicki, Andrii Chmutov, Marcin Korcz, Iwo Błądek, Dariusz Brzezinski · LLMAG · 27 Feb 2025

Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H. S. Torr, Adel Bibi · 26 Feb 2025

ARACNE: An LLM-Based Autonomous Shell Pentesting Agent
Tomas Nieponice, Veronica Valeros, Sebastian Garcia · LLMAG · 24 Feb 2025

Prompt Inject Detection with Generative Explanation as an Investigative Tool
Jonathan Pan, Swee Liang Wong, Yidi Yuan, Xin Wei Chia · SILM · 16 Feb 2025
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Gabriel Chua, Shing Yee Chan, Shaun Khoo · 20 Nov 2024

Standardization Trends on Safety and Trustworthiness Technology for Advanced AI
Jonghong Jeon · 29 Oct 2024

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey
Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Xu Guo, Dayong Ye, Wanlei Zhou, Philip S. Yu · PILM · 12 Jun 2024

Knowledge Return Oriented Prompting (KROP)
Jason Martin, Kenneth Yeung · 11 Jun 2024

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, ..., Lijun Li, Jing Shao, Tao Gui, Qi Zhang, Xuanjing Huang · 18 Mar 2024

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran · AAML · 14 Feb 2024

Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschlager, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann · AAML · SILM · 14 Feb 2024

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu · AAML · 13 Feb 2024

Can LLMs Recognize Toxicity? Definition-Based Toxicity Metric
Hyukhun Koh, Dohyung Kim, Minwoo Lee, Kyomin Jung · 10 Feb 2024
Building Guardrails for Large Language Models
Yizhen Dong, Ronghui Mu, Gao Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang · OffRL · 02 Feb 2024

Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models
Jiawei Zhao, Kejiang Chen, Xianjian Yuan, Yuang Qi, Weiming Zhang, Neng H. Yu · 15 Dec 2023

Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens · 16 Nov 2023

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael B. Abu-Ghazaleh · AAML · 16 Oct 2023

Large Language Models Can Be Good Privacy Protection Learners
Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, ..., Xujiang Zhao, Yanchi Liu, Haifeng Chen, Wei Wang, Wei Cheng · PILM · 03 Oct 2023

On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Di Wu · 02 Oct 2023

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing · SILM · 19 Sep 2023

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Potsawee Manakul, Adian Liusie, Mark J. F. Gales · HILM · LRM · 15 Mar 2023
Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI
Lorena Piedras, Lucas Rosenblatt, Julia Wilkins · 05 Jan 2023

Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar, Vidhisha Balachandran, Lucille Njoo, Antonios Anastasopoulos, Yulia Tsvetkov · ELM · 14 Oct 2022

Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis
Yuxin Xiao, Paul Pu Liang, Umang Bhatt, W. Neiswanger, Ruslan Salakhutdinov, Louis-Philippe Morency · 10 Oct 2022

Out-of-Distribution Detection and Selective Generation for Conditional Language Models
Jie Jessie Ren, Jiaming Luo, Yao-Min Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, Peter J. Liu · OODD · 30 Sep 2022

A Survey of Machine Unlearning
Thanh Tam Nguyen, T. T. Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, Quoc Viet Hung Nguyen · MU · 06 Sep 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark · 23 Aug 2022

Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason W. Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou · ReLM · BDL · LRM · AI4CE · 21 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe · OSLM · ALM · 04 Mar 2022

Improving deep neural network generalization and robustness to background bias via layer-wise relevance propagation optimization
P. R. Bassi, Sergio S J Dertkigil, Andrea Cavalli · AI4CE · 01 Feb 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou · LM&Ro · LRM · AI4CE · ReLM · 28 Jan 2022

Measure and Improve Robustness in NLP Models: A Survey
Xuezhi Wang, Haohan Wang, Diyi Yang · 15 Dec 2021

Differentially Private Fine-tuning of Language Models
Da Yu, Saurabh Naik, A. Backurs, Sivakanth Gopi, Huseyin A. Inan, ..., Y. Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang · 13 Oct 2021

Challenges in Detoxifying Language Models
Johannes Welbl, Amelia Glaese, J. Uesato, Sumanth Dathathri, John F. J. Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang · LM&MA · 15 Sep 2021

Types of Out-of-Distribution Texts and How to Detect Them
Udit Arora, William Huang, He He · OODD · 14 Sep 2021

Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, Douwe Kiela · SILM · 15 Apr 2021

Entity-level Factual Consistency of Abstractive Text Summarization
Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, Bing Xiang · HILM · 18 Feb 2021

Categorical Reparameterization with Gumbel-Softmax
Eric Jang, S. Gu, Ben Poole · BDL · 03 Nov 2016