Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2403.13031
Cited By
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
19 March 2024
Zhuowen Yuan
Zidi Xiong
Yi Zeng
Ning Yu
Ruoxi Jia
D. Song
Bo-wen Li
AAML
KELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content"
10 / 10 papers shown
Title
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift
Julien Piet
Xiao Huang
Dennis Jacob
Annabella Chow
Maha Alrashed
Geng Zhao
Zhanhao Hu
Chawin Sitawarin
Basel Alomair
David A. Wagner
AAML
63
0
0
28 Apr 2025
Frontier AI's Impact on the Cybersecurity Landscape
Wenbo Guo
Yujin Potter
Tianneng Shi
Zhun Wang
Andy Zhang
Dawn Song
50
1
0
07 Apr 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun-Xiong Xia
Tianyi Wu
Zhiwei Xue
Y. Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
127
13
0
30 Jan 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
105
9
0
28 Jan 2025
Smoothed Embeddings for Robust Language Models
Ryo Hase
Md. Rafi Ur Rashid
Ashley Lewis
Jing Liu
T. Koike-Akino
K. Parsons
Y. Wang
AAML
44
0
0
27 Jan 2025
On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu
Hengguan Huang
Hao Wang
Xiangming Gu
Ye Wang
51
2
0
14 Oct 2024
Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution
Ankita Sinha
Wendi Cui
Kamalika Das
Jiaxin Zhang
AAML
23
2
0
12 Oct 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
47
36
0
01 Aug 2024
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang
Linzhi Zheng
Yanjie Li
Junyuan Hong
Qinbin Li
...
Zidi Xiong
Chulin Xie
Carl Yang
Dawn Song
Bo Li
LLMAG
45
22
0
13 Jun 2024
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
1