How Do Semantically Equivalent Code Transformations Impact Membership Inference on LLMs for Code?

17 December 2025

Hua Yang

Alejandro Velasco

Thanh Le-Cong

Md Nazmul Haque

Bowen Xu

Denys Poshyvanyk


Main: 11 pages; Bibliography: 2 pages; 3 figures; 7 tables

Abstract

The success of large language models for code relies on vast amounts of code data, including public open-source repositories, such as those hosted on GitHub, and private, confidential code from companies. This raises concerns about intellectual property compliance and the potential unauthorized use of license-restricted code. While membership inference (MI) techniques have been proposed to detect such unauthorized usage, their effectiveness can be undermined by semantically equivalent code transformations, which modify the syntax of code while preserving its semantics.
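To make the notion of a semantically equivalent transformation concrete, the following is a minimal sketch of one such rewrite, variable renaming, built on Python's standard `ast` module. It is an illustration only and is not taken from the paper: the helper names (`RenameVariables`, `rename_variables`, `run_total`), the sample function, and the replacement identifiers are all hypothetical. The sketch also assumes the new names do not collide with existing ones and that callers pass arguments positionally.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Hypothetical helper: rename identifiers using a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers variable reads and writes, e.g. `result += p`.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Covers function parameters, e.g. `def total(prices, tax)`.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def rename_variables(source, mapping):
    """Return `source` with identifiers renamed (requires Python 3.9+ for ast.unparse)."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


# Illustrative input; not an example from the paper.
ORIGINAL = """
def total(prices, tax):
    result = 0
    for p in prices:
        result += p
    return result * (1 + tax)
"""

TRANSFORMED = rename_variables(
    ORIGINAL, {"prices": "v0", "tax": "v1", "result": "v2", "p": "v3"}
)


def run_total(source, *args):
    """Execute `source` in a fresh namespace and call its `total` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](*args)
```

The two versions differ as text, yet `run_total(ORIGINAL, [1, 2, 3], 0.5)` and `run_total(TRANSFORMED, [1, 2, 3], 0.5)` return the same value. This is the property the abstract refers to: the surface form changes while the behavior does not.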
