Disproving Program Equivalence with LLMs

5 February 2025
Miltiadis Allamanis
Pengcheng Yin
Abstract

To evaluate large language models (LLMs) for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.
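
The abstract describes the approach only at a high level, so the following is an illustrative sketch, not the authors' ProbeGen implementation. It assumes a hypothetical query_llm(prompt) -> str helper standing in for any LLM call, and single-function Python programs whose behaviour is compared by direct execution. The core loop mirrors the idea above: the LLM proposes probe inputs, both programs are run on each probe with the results fed back into the next prompt, and any observed disagreement is a counterexample disproving equivalence.

import ast

def run_program(source: str, func_name: str, args: tuple):
    # Execute `func_name` defined in `source` on `args`.
    # NOTE: exec() on untrusted code is unsafe; a real tool would sandbox this.
    namespace = {}
    try:
        exec(source, namespace)
        result = namespace[func_name](*args)
        return ("ok", result)
    except Exception as exc:
        # Crashes are observable behaviour too, so record them.
        return ("error", type(exc).__name__)

def find_counterexample(src_a: str, src_b: str, func_name: str,
                        query_llm, rounds: int = 5):
    # Ask the LLM for probe inputs and return the first one on which the
    # two programs disagree, or None if no disagreement was found.
    feedback = ""
    for _ in range(rounds):
        prompt = (
            f"Propose arguments (as one Python tuple literal) that might "
            f"distinguish these two implementations of `{func_name}`:\n\n"
            f"# Program A\n{src_a}\n\n# Program B\n{src_b}\n" + feedback
        )
        try:
            probe = ast.literal_eval(query_llm(prompt))
        except (ValueError, SyntaxError):
            continue  # unparsable suggestion; ask again
        if not isinstance(probe, tuple):
            probe = (probe,)
        out_a = run_program(src_a, func_name, probe)
        out_b = run_program(src_b, func_name, probe)
        if out_a != out_b:
            return probe  # counterexample found: equivalence disproved
        # Execution feedback for the next round.
        feedback = (f"\nOn input {probe!r} both programs returned "
                    f"{out_a[1]!r}; try a different input.\n")
    return None

The same disagreement oracle could, in principle, drive the clustering the abstract mentions: samples that are never distinguished from one another are grouped into a single cluster, and a representative of the largest cluster is returned as the self-consistent answer.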

View on arXiv
@article{allamanis2025_2502.18473,
  title={Disproving Program Equivalence with LLMs},
  author={Miltiadis Allamanis and Pengcheng Yin},
  journal={arXiv preprint arXiv:2502.18473},
  year={2025}
}