2312.07876
Cited By
Causality Analysis for Evaluating the Security of Large Language Models
13 December 2023
Wei Zhao
Zhe Li
Junfeng Sun
Papers citing "Causality Analysis for Evaluating the Security of Large Language Models"
4 / 4 papers shown
Title
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
63
0
0
08 Mar 2025
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
52
3
0
17 Nov 2024
Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
Shuai Zhao
Jinming Wen
Anh Tuan Luu
J. Zhao
Jie Fu
SILM
57
89
0
02 May 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,730
0
04 Mar 2022