Proving membership in LLM pretraining data via data watermarks
arXiv:2402.10892 · 16 February 2024
Johnny Tian-Zheng Wei, Ryan Yixiang Wang, Robin Jia

Papers citing "Proving membership in LLM pretraining data via data watermarks" (9 papers shown)
1. The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text (24 Feb 2025)
   Matthieu Meeus, Lukas Wutschitz, Santiago Zanella Béguelin, Shruti Tople, Reza Shokri
2. Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions (24 Oct 2024)
   Yujuan Fu, Özlem Uzuner, Meliha Yetisgen, Fei Xia
3. Ward: Provable RAG Dataset Inference via LLM Watermarks (04 Oct 2024)
   Nikola Jovanović, Robin Staab, Maximilian Baader, Martin Vechev
4. Large Language Models Memorize Sensor Datasets! Implications on Human Activity Recognition Research (09 Jun 2024)
   H. Haresamudram, Hrudhai Rajasekhar, Nikhil Murlidhar Shanbhogue, Thomas Ploetz
5. The Mosaic Memory of Large Language Models (24 May 2024)
   Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye
6. Data Portraits: Recording Foundation Model Training Data (06 Mar 2023)
   Marc Marone, Benjamin Van Durme
7. Training language models to follow instructions with human feedback (04 Mar 2022)
   Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
8. Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers (06 Aug 2021)
   Kenny Peng, Arunesh Mathur, Arvind Narayanan
9. The Pile: An 800GB Dataset of Diverse Text for Language Modeling (31 Dec 2020)
   Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy