A Watermark for Black-Box Language Models (WaLM)
Main: 12 pages, Appendix: 19 pages, Bibliography: 3 pages; 9 figures, 7 tables
Abstract
Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not available to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., black-box access), is distortion-free, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how the scheme can be leveraged when white-box access is available, and show via comprehensive experiments when it can outperform existing white-box schemes.
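To make the black-box setting concrete, one generic way to watermark with nothing but sampling access is to draw several i.i.d. candidate sequences and keep the one whose keyed hash score is largest: over a random secret key, each distinct candidate is equally likely to win, so the output distribution marginalized over keys matches the model's. The sketch below illustrates this candidate-selection idea only; the helper names (`keyed_uniform`, `watermark_sample`, `sample_fn`), the HMAC-SHA256 scoring, and the choice of `k` are illustrative assumptions, not the paper's exact algorithm.

```python
import hmac
import hashlib
import random

def keyed_uniform(text: str, key: bytes) -> float:
    """Hash a text to a pseudorandom score in [0, 1) via an HMAC under the secret key.
    Over a random key, the scores of distinct texts behave like i.i.d. uniforms."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermark_sample(sample_fn, key: bytes, k: int = 4) -> str:
    """Draw k i.i.d. sequences from the black-box model and return the one with
    the highest keyed score. Distinct candidates are equally likely to score
    highest under a random key, so the output distribution (averaged over keys)
    equals the model's -- distortion-free in this averaged sense."""
    candidates = [sample_fn() for _ in range(k)]
    return max(candidates, key=lambda text: keyed_uniform(text, key))

def detection_p_value(text: str, key: bytes) -> float:
    """p-value against the null of unwatermarked text. Under the null the keyed
    score is uniform on [0, 1), so P(score >= s) = 1 - s; a watermarked sequence
    scores like the max of k uniforms, which concentrates near 1."""
    return 1.0 - keyed_uniform(text, key)

# Toy usage: a random-choice sampler stands in for an LLM API call.
key = b"secret-key"
sample_fn = lambda: random.choice(
    ["the cat sat", "a dog ran", "birds fly south", "rain fell hard"]
)
text = watermark_sample(sample_fn, key, k=8)
print(detection_p_value(text, key))  # small p-value suggests a watermark
```

Note that this construction composes: passing one `watermark_sample(..., key1)` call as the `sample_fn` of another call with `key2` nests two watermarks, which loosely echoes the chaining property mentioned in the abstract.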
