2404.13660
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
21 April 2024
Narek Maloyan
Ekansh Verma
Bulat Nutfullin
Bislan Ashinov
Papers citing "Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge"
3 / 3 papers shown
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
30
22
0
11 Jun 2024
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
213
327
0
23 Aug 2022
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo
Alexandre Sablayrolles
Hervé Jégou
Douwe Kiela
SILM
93
225
0
15 Apr 2021