Rethinking How to Evaluate Language Model Jailbreak

arXiv:2404.06407
9 April 2024
Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik
ALM

Papers citing "Rethinking How to Evaluate Language Model Jailbreak"

8 / 8 papers shown

JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu
ELM
11 Oct 2024

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, Jing Gao
AAML, DeLMO, AILaw
18 Jun 2024

Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
Shangqing Tu, Zhuoran Pan, Wenxuan Wang, Zhexin Zhang, Yuliang Sun, Jifan Yu, Hongning Wang, Lei Hou, Juanzi Li
ALM
17 Jun 2024

AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou
AAML
06 Jun 2024

Don't Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou, Wenjie Wang
AAML
25 Apr 2024

Poisoning Language Models During Instruction Tuning
Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
SILM
01 May 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM
04 Mar 2022

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM
18 Sep 2019