ResearchTrend.AI

Why do universal adversarial attacks work on large language models?: Geometry might be the answer
arXiv:2309.00254

1 September 2023
Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
AAML

Papers citing "Why do universal adversarial attacks work on large language models?: Geometry might be the answer"

9 papers shown
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang, Xiaogeng Liu, Chaowei Xiao
AAML
11 Oct 2024
Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference
Ke Shen, Mayank Kejriwal
04 Aug 2024
Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun
SILM
23 Jul 2024
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen
14 Mar 2024
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
LLMAG, AAML
02 Mar 2024
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, S. Picek
PILM, AAML
21 Feb 2024
Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
Haz Sameen Shahgir, Xianghao Kong, Greg Ver Steeg, Yue Dong
22 Dec 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022
MINIMAL: Mining Models for Data Free Universal Adversarial Triggers
Swapnil Parekh, Yaman Kumar Singla, Somesh Singh, Changyou Chen, Balaji Krishnamurthy, R. Shah
AAML
25 Sep 2021