Toward Breaking Watermarks in Distortion-free Large Language Models

25 February 2025
Shayleen Reynolds, Saheed O. Obitayo, Niccolò Dalmasso, Dung Daniel Ngo, Vamsi K. Potluru, Manuela Veloso
    AAML
Abstract

In recent years, LLM watermarking has emerged as an attractive safeguard for identifying AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that current LLM watermarking schemes are vulnerable to expert adversaries who wish to reverse-engineer the watermarking mechanism. Prior work on "breaking" or "stealing" LLM watermarks focuses mainly on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under this more sophisticated scheme, it is possible to "compromise" the LLM and carry out a "spoofing" attack. Specifically, we propose a mixed integer linear programming framework that accurately estimates the secret key used for watermarking from only a few samples of watermarked text. Our initial findings challenge current theoretical claims about the robustness and usability of existing LLM watermarking techniques.
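The abstract does not describe the mixed integer linear programming formulation itself, so it is not reproduced here. As a rough illustration of the kind of mechanism such an attack targets, the sketch below implements a simplified distortion-free decoder driven by a hidden key sequence, in the spirit of the exponential-minimum (Gumbel-style) sampling used in this line of work; the vocabulary size, key length, and toy next-token distribution are illustrative assumptions, not the paper's actual setup.

import numpy as np

# Illustrative sketch only: a simplified distortion-free sampler.
# Vocabulary size, key length, and the toy distribution are assumptions for this demo.
VOCAB_SIZE = 8
KEY_LEN = 16

rng = np.random.default_rng(0)
# Hidden watermarking key: one uniform [0, 1) value per (position, token) pair.
secret_key = rng.random((KEY_LEN, VOCAB_SIZE))

def watermarked_sample(probs: np.ndarray, step: int) -> int:
    """Pick argmax_v key[step, v] ** (1 / p_v).

    Marginally (over a random key) this draws token v with probability p_v,
    so the output distribution is unchanged ("distortion-free"), yet the choice
    is deterministic given the secret key -- which is what a spoofing attack
    tries to recover from watermarked samples.
    """
    r = secret_key[step % KEY_LEN]
    return int(np.argmax(r ** (1.0 / np.maximum(probs, 1e-12))))

# Toy usage: decode a short sequence from a fixed next-token distribution.
probs = np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)
tokens = [watermarked_sample(probs, t) for t in range(10)]
print(tokens)

An attacker who observes enough tokens generated under the same key positions could, in principle, pose constraints relating the observed argmax choices to the unknown key entries; the paper's MILP framework is one concrete way to formulate that key-recovery problem.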

View on arXiv: https://arxiv.org/abs/2502.18608
@article{reynolds2025_2502.18608,
  title={Toward Breaking Watermarks in Distortion-free Large Language Models},
  author={Shayleen Reynolds and Saheed Obitayo and Niccolò Dalmasso and Dung Daniel T. Ngo and Vamsi K. Potluru and Manuela Veloso},
  journal={arXiv preprint arXiv:2502.18608},
  year={2025}
}