Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Safety alignment is critical in pre-training large language models (LLMs) so that they generate responses aligned with human values and refuse harmful queries. Unlike LLMs, the current safety alignment of vision language models (VLMs) is often achieved with post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly accounts for adversarial perturbations. ADPO integrates adversarial training into direct preference optimization (DPO) to enhance the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, ADPO ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that ADPO outperforms baselines in both the safety alignment and the general utility of VLMs.
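The framework described above can be pictured as a min-max objective: an inner step searches for a worst-case perturbation of the input image, and an outer DPO step optimizes the policy against that perturbation with the adversarially trained reference model held fixed. The sketch below is a minimal PyTorch-style illustration of that reading, not the authors' implementation; the helper `sequence_logprob`, the model interface (`model(image, response_ids) -> logits`), and the PGD hyperparameters (`eps`, `alpha`, `steps`) are assumptions introduced here for clarity.

```python
# Minimal sketch (not the paper's code) of combining PGD-style adversarial
# training with a DPO preference loss on the image input of a VLM.
import torch
import torch.nn.functional as F


def sequence_logprob(model, image, response_ids):
    """Sum of per-token log-probabilities of `response_ids` given `image`.

    Assumes `model` maps (image, response_ids) -> logits of shape
    [batch, seq_len, vocab]; any VLM wrapper with that interface fits.
    """
    logits = model(image, response_ids)
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(dim=-1)


def adversary_aware_dpo_loss(policy, ref, image, chosen_ids, rejected_ids,
                             beta=0.1, eps=8 / 255, alpha=2 / 255, steps=3):
    # Inner maximization: PGD on the image to find a perturbation that
    # shrinks the preference margin (a simplified adversarial objective).
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        margin = (sequence_logprob(policy, image + delta, chosen_ids)
                  - sequence_logprob(policy, image + delta, rejected_ids))
        grad = torch.autograd.grad(margin.sum(), delta)[0]
        # Step against the margin so the chosen response becomes less preferred.
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)
    adv_image = (image + delta).detach()

    # Outer minimization: standard DPO loss evaluated on the perturbed image,
    # with the (adversarially trained) reference model frozen.
    pi_w = sequence_logprob(policy, adv_image, chosen_ids)
    pi_l = sequence_logprob(policy, adv_image, rejected_ids)
    with torch.no_grad():
        ref_w = sequence_logprob(ref, adv_image, chosen_ids)
        ref_l = sequence_logprob(ref, adv_image, rejected_ids)
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()
```

The design mirrors standard adversarial training: the perturbation is recomputed per batch, and only the outer DPO step updates the policy parameters. The paper's exact inner objective, perturbation budget, and reference-model training procedure may differ from this sketch.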
@article{weng2025_2502.11455,
  title   = {Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training},
  author  = {Fenghua Weng and Jian Lou and Jun Feng and Minlie Huang and Wenjie Wang},
  journal = {arXiv preprint arXiv:2502.11455},
  year    = {2025}
}