
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Abstract

Safety alignment is critical when training large language models (LLMs) so that they generate responses aligned with human values and refuse harmful queries. Unlike LLMs, the current safety alignment of vision language models (VLMs) is often achieved through post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose \textit{Adversary-aware DPO (ADPO)}, a novel training framework that explicitly accounts for adversarial perturbations. \textit{ADPO} integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. \textit{ADPO} introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, \textit{ADPO} ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that \textit{ADPO} outperforms baselines in both the safety alignment and general utility of VLMs.
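The abstract describes ADPO only at a high level. The sketch below is a minimal illustration of how an adversary-aware DPO objective could look in PyTorch, under our own assumptions rather than the authors' released code: `policy.logp` and `ref_model.logp` are hypothetical helpers returning the summed log-probability of a response given an image and prompt, the PGD inner loop and the L-infinity budget `eps` are illustrative choices, and the reference model is assumed to have been adversarially trained beforehand (component (1) in the abstract).

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss over (winner, loser) response log-probabilities."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def adversary_aware_dpo_step(policy, ref_model, image, prompt,
                             chosen, rejected, beta=0.1,
                             eps=8 / 255, alpha=2 / 255, pgd_steps=3):
    """One ADPO-style update (sketch): craft a worst-case image perturbation
    with PGD, then apply the DPO preference loss on the perturbed input."""
    delta = torch.zeros_like(image, requires_grad=True)

    # Inner loop: maximize the preference loss w.r.t. the image perturbation.
    for _ in range(pgd_steps):
        adv_image = (image + delta).clamp(0, 1)
        loss = dpo_loss(
            policy.logp(adv_image, prompt, chosen),
            policy.logp(adv_image, prompt, rejected),
            ref_model.logp(adv_image, prompt, chosen).detach(),
            ref_model.logp(adv_image, prompt, rejected).detach(),
            beta,
        )
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)        # stay inside the L-infinity ball
    delta = delta.detach()

    # Outer step: minimize the DPO loss on the worst-case perturbed image;
    # the caller backpropagates this loss into the policy's parameters.
    adv_image = (image + delta).clamp(0, 1)
    return dpo_loss(
        policy.logp(adv_image, prompt, chosen),
        policy.logp(adv_image, prompt, rejected),
        ref_model.logp(adv_image, prompt, chosen).detach(),
        ref_model.logp(adv_image, prompt, rejected).detach(),
        beta,
    )
```

In this reading, the inner maximization plays the role of the worst-case perturbation described in the abstract, while the outer minimization keeps the preference (winner-loser) ordering intact on the perturbed image; the exact attack, budget, and pairing strategy used in the paper may differ.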

@article{weng2025_2502.11455,
  title={Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training},
  author={Fenghua Weng and Jian Lou and Jun Feng and Minlie Huang and Wenjie Wang},
  journal={arXiv preprint arXiv:2502.11455},
  year={2025}
}