ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2405.19358
  4. Cited By
Robustifying Safety-Aligned Large Language Models through Clean Data
  Curation

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

24 May 2024
Xiaoqun Liu
Jiacheng Liang
Muchao Ye
Zhaohan Xi
    AAML
ArXivPDFHTML

Papers citing "Robustifying Safety-Aligned Large Language Models through Clean Data Curation"

18 / 18 papers shown
Title
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan
Mengxuan Hu
Ronghang Zhu
Sheng R. Li
Anil Vullikanti
AAML
11
0
0
11 May 2025
Teaching Models to Understand (but not Generate) High-risk Data
Teaching Models to Understand (but not Generate) High-risk Data
Ryan Yixiang Wang
Matthew Finlayson
Luca Soldaini
Swabha Swayamdipta
Robin Jia
24
0
0
05 May 2025
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera
S. Kadhe
Farhan Ahmed
Syed Zawad
Holger Boche
MoMe
46
0
0
21 Mar 2025
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
Terry Tong
Fei-Yue Wang
Zhe Zhao
M. Chen
AAML
ELM
37
1
0
01 Mar 2025
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
Guozhi Liu
Weiwei Lin
Tiansheng Huang
Ruichao Mo
Qi Mu
Li Shen
AAML
54
9
0
13 Oct 2024
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A
  Survey
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
AAML
38
21
0
26 Sep 2024
Generative Pre-trained Ranking Model with Over-parameterization at
  Web-Scale (Extended Abstract)
Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract)
Yuchen Li
Haoyi Xiong
Linghe Kong
Jiang Bian
Shuaiqiang Wang
Guihai Chen
Dawei Yin
22
0
0
25 Sep 2024
Pre-trained Graphformer-based Ranking at Web-scale Search (Extended
  Abstract)
Pre-trained Graphformer-based Ranking at Web-scale Search (Extended Abstract)
Yuchen Li
Haoyi Xiong
Linghe Kong
Zeyi Sun
Hongyang Chen
Shuaiqiang Wang
Dawei Yin
29
0
0
25 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language
  Models
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM
AAML
47
1
0
05 Sep 2024
Content Moderation by LLM: From Accuracy to Legitimacy
Content Moderation by LLM: From Accuracy to Legitimacy
Tao Huang
AILaw
27
3
0
05 Sep 2024
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu
Fan Liu
Hao Liu
AAML
32
7
0
13 Jun 2024
Lazy Safety Alignment for Large Language Models against Harmful
  Fine-tuning
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
40
23
0
28 May 2024
Vaccine: Perturbation-aware Alignment for Large Language Model
Vaccine: Perturbation-aware Alignment for Large Language Model
Tiansheng Huang
Sihao Hu
Ling Liu
42
32
0
02 Feb 2024
On the Safety of Open-Sourced Large Language Models: Does Alignment
  Really Prevent Them From Being Misused?
On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang
Zhimeng Guo
Huaisheng Zhu
Bochuan Cao
Lu Lin
Jinyuan Jia
Jinghui Chen
Di Wu
63
23
0
02 Oct 2023
Conformal Nucleus Sampling
Conformal Nucleus Sampling
Shauli Ravfogel
Carlos Wert Carvajal
M.F. Eggl
UQLM
59
20
0
04 May 2023
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text
  Style Transfer
Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer
Fanchao Qi
Yangyi Chen
Xurui Zhang
Mukai Li
Zhiyuan Liu
Maosong Sun
AAML
SILM
75
171
0
14 Oct 2021
Data Poisoning Attacks and Defenses to Crowdsourcing Systems
Data Poisoning Attacks and Defenses to Crowdsourcing Systems
Minghong Fang
Minghao Sun
Qi Li
Neil Zhenqiang Gong
Jinhua Tian
Jia-Wei Liu
39
34
0
18 Feb 2021
1