ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2404.13660
  4. Cited By
Trojan Detection in Large Language Models: Insights from The Trojan
  Detection Challenge

Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge

21 April 2024
Narek Maloyan
Ekansh Verma
Bulat Nutfullin
Bislan Ashinov
ArXivPDFHTML

Papers citing "Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge"

3 / 3 papers shown
Title
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
30
22
0
11 Jun 2024
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors,
  and Lessons Learned
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
213
327
0
23 Aug 2022
Gradient-based Adversarial Attacks against Text Transformers
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo
Alexandre Sablayrolles
Hervé Jégou
Douwe Kiela
SILM
93
225
0
15 Apr 2021
1