APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries
- AIMat
While frontier formal mathematics systems now routinely produce repository-scale proof engineering artifacts that require multi-file coordination and semantic correctness beyond mere compilation, existing evaluation benchmarks remain focused on isolated theorem proving. We introduce Automated Proof Engineering (APE), the first systematic framework for evaluating repository-scale proof engineering through dual verification, which validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. We present a complete infrastructure comprising APE-Bench, which automatically extracts proof engineering tasks from real library commit histories, and APE-Harness, a unified execution framework based on a task contract abstraction. This contract-based design enables standardized evaluation across diverse formal mathematics tasks and fair, systematic comparison of different agent implementations (including our APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. We demonstrate the framework's effectiveness through comprehensive evaluation. All code and the benchmark dataset are released as open source at this https URL.
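To make the task-contract and dual-verification ideas concrete, below is a minimal Python sketch of what such an abstraction might look like. All names here (`ProofTask`, `verify_compiles`, `verify_semantics`, `evaluate`) are hypothetical illustrations, not the actual APE-Harness API; the only grounded detail is that Lean 4 libraries are built with `lake build`.

```python
from dataclasses import dataclass
import subprocess


@dataclass
class ProofTask:
    """One proof-engineering task extracted from a library commit.

    All field names are illustrative; the real APE-Harness contract
    may differ.
    """
    task_id: str
    instruction: str         # natural-language requirement for the change
    base_commit: str         # pinned library state the task starts from
    target_files: list[str]  # files the agent is expected to modify


def verify_compiles(repo_dir: str) -> bool:
    """Syntactic check: the pinned library must still build after the patch."""
    result = subprocess.run(
        ["lake", "build"],  # standard Lean 4 build command
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def verify_semantics(task: ProofTask, patch: str) -> bool:
    """Semantic check: does the patch satisfy the task's stated requirement?

    A real harness might use an LLM judge or declaration-level checks;
    this stub only marks where that second verification stage plugs in.
    """
    raise NotImplementedError


def evaluate(task: ProofTask, patch: str, repo_dir: str) -> bool:
    """Dual verification: a patch passes only if it both compiles
    and satisfies the semantic requirement."""
    return verify_compiles(repo_dir) and verify_semantics(task, patch)
```

Under this reading, the contract is what lets heterogeneous agents (APE-Agent, Claude Code, Codex CLI) be scored identically: each receives the same `ProofTask` and is judged by the same two-stage `evaluate` gate.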