ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2308.14132
  4. Cited By
Detecting Language Model Attacks with Perplexity

Detecting Language Model Attacks with Perplexity

27 August 2023
Gabriel Alon
Michael Kamfonas
    AAML
ArXivPDFHTML

Papers citing "Detecting Language Model Attacks with Perplexity"

30 / 30 papers shown
Title
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures
Francisco Aguilera-Martínez
Fernando Berzal
PILM
48
0
0
02 May 2025
Attack and defense techniques in large language models: A survey and new perspectives
Attack and defense techniques in large language models: A survey and new perspectives
Zhiyu Liao
Kang Chen
Yuanguo Lin
Kangkang Li
Yunxuan Liu
Hefeng Chen
Xingwang Huang
Yuanhui Yu
AAML
54
0
0
02 May 2025
Prompt Injection Attack to Tool Selection in LLM Agents
Prompt Injection Attack to Tool Selection in LLM Agents
Jiawen Shi
Zenghui Yuan
Guiyao Tie
Pan Zhou
Neil Zhenqiang Gong
Lichao Sun
LLMAG
51
0
0
28 Apr 2025
A generative approach to LLM harmfulness detection with special red flag tokens
A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux
David Dobre
Mehrnaz Mohfakhami
Leo Schwinn
Gauthier Gidel
45
1
0
22 Feb 2025
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Yi Wang
Fenghua Weng
S. Yang
Zhan Qin
Minlie Huang
Wenjie Wang
KELM
AAML
43
0
0
17 Feb 2025
Smoothed Embeddings for Robust Language Models
Smoothed Embeddings for Robust Language Models
Ryo Hase
Md. Rafi Ur Rashid
Ashley Lewis
Jing Liu
T. Koike-Akino
K. Parsons
Y. Wang
AAML
44
0
0
27 Jan 2025
Time-Reversal Provides Unsupervised Feedback to LLMs
Time-Reversal Provides Unsupervised Feedback to LLMs
Yerram Varun
Rahul Madhavan
Sravanti Addepalli
A. Suggala
Karthikeyan Shanmugam
Prateek Jain
LRM
SyDa
64
0
0
03 Dec 2024
Attention Tracker: Detecting Prompt Injection Attacks in LLMs
Attention Tracker: Detecting Prompt Injection Attacks in LLMs
Kuo-Han Hung
Ching-Yun Ko
Ambrish Rawat
I-Hsin Chung
Winston H. Hsu
Pin-Yu Chen
46
7
0
01 Nov 2024
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao
Xiang Zheng
Lin Luo
Yige Li
Xingjun Ma
Yu-Gang Jiang
VLM
AAML
52
3
0
28 Oct 2024
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Honglin Mu
Han He
Yuxin Zhou
Yunlong Feng
Yang Xu
...
Zeming Liu
Xudong Han
Qi Shi
Qingfu Zhu
Wanxiang Che
AAML
33
1
0
28 Oct 2024
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization
Xiyue Peng
Hengquan Guo
Jiawei Zhang
Dongqing Zou
Ziyu Shao
Honghao Wei
Xin Liu
32
0
0
25 Oct 2024
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt
  Decomposition Process
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang
Xiaogeng Liu
Chaowei Xiao
AAML
24
3
0
11 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
51
2
0
09 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min-Bin Lin
33
8
0
09 Oct 2024
Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial
  Robustness of LLMs
Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial Robustness of LLMs
Tomas Bueno Momcilovic
Dian Balta
Beat Buesser
Giulio Zizzo
Mark Purcell
AAML
21
0
0
04 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang-Yu He
Yi Zeng
AAML
45
0
0
03 Oct 2024
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Hanrong Zhang
Jingyuan Huang
Kai Mei
Yifei Yao
Zhenting Wang
Chenlu Zhan
Hongwei Wang
Yongfeng Zhang
AAML
LLMAG
ELM
48
18
0
03 Oct 2024
Prompt Obfuscation for Large Language Models
Prompt Obfuscation for Large Language Models
David Pape
Thorsten Eisenhofer
Thorsten Eisenhofer
Lea Schönherr
AAML
31
2
0
17 Sep 2024
MindGuard: Towards Accessible and Sitgma-free Mental Health First Aid
  via Edge LLM
MindGuard: Towards Accessible and Sitgma-free Mental Health First Aid via Edge LLM
Sijie Ji
Xinzhe Zheng
Jiawei Sun
Renqi Chen
Wei Gao
Mani Srivastava
AI4MH
32
2
0
16 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language
  Models
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM
AAML
47
1
0
05 Sep 2024
HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes
HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes
Xuanyu Su
Yansong Li
Diana Inkpen
Nathalie Japkowicz
VLM
81
2
0
11 Aug 2024
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy
  Failure for Jailbreak Attacks
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou
Henry Peng Zou
Barbara Maria Di Eugenio
Yang Zhang
HILM
LRM
37
1
0
01 Jul 2024
Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
Avital Shafran
R. Schuster
Vitaly Shmatikov
37
27
0
09 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
68
8
0
08 Jun 2024
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Kazuhiro Takemoto
32
21
0
18 Jan 2024
Token-Level Adversarial Prompt Detection Based on Perplexity Measures
  and Contextual Information
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Zhengmian Hu
Gang Wu
Saayan Mitra
Ruiyi Zhang
Tong Sun
Heng-Chiao Huang
Vishy Swaminathan
14
23
0
20 Nov 2023
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong
Delong Ran
Jinyuan Liu
Conglei Wang
Tianshuo Cong
Anyu Wang
Sisi Duan
Xiaoyun Wang
MLLM
129
117
0
09 Nov 2023
Improving alignment of dialogue agents via targeted human judgements
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese
Nat McAleese
Maja Trkebacz
John Aslanides
Vlad Firoiu
...
John F. J. Mellor
Demis Hassabis
Koray Kavukcuoglu
Lisa Anne Hendricks
G. Irving
ALM
AAML
225
495
0
28 Sep 2022
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Globally-Robust Neural Networks
Globally-Robust Neural Networks
Klas Leino
Zifan Wang
Matt Fredrikson
AAML
OOD
80
125
0
16 Feb 2021
1