AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Main: 8 pages · Appendix: 21 pages · Bibliography: 3 pages · 14 figures · 26 tables
Abstract
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet data science tasks often admit multiple statistically valid solutions, e.g., different modeling strategies, making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows, i.e., the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations.
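To make the workflow-level evaluation idea concrete, the sketch below shows one possible analyst-inspector loop: an analyst model produces a workflow and code, an inspector model independently re-implements the workflow, and reproducibility is judged by whether the two implementations agree. This is an illustrative reading of the abstract only; the role prompts, the `results`-variable convention, and the output-equality check are assumptions, not the paper's actual protocol or metrics.

```python
"""Minimal, hypothetical analyst-inspector sketch (not the paper's protocol).

`llm` is a stand-in for any text-in/text-out LLM client the reader supplies.
"""
from typing import Callable


def analyst(task: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Analyst role: write a step-by-step workflow, then code that follows it."""
    workflow = llm(f"Write a step-by-step analysis workflow for this task:\n{task}")
    code = llm(f"Write Python code that implements this workflow exactly:\n{workflow}")
    return workflow, code


def inspector(task: str, workflow: str, llm: Callable[[str], str]) -> str:
    """Inspector role: independently re-implement the analysis from the workflow alone."""
    return llm(
        f"Task:\n{task}\n\nIndependently write Python code implementing this workflow:\n{workflow}"
    )


def run(code: str) -> dict:
    """Execute generated code in a fresh namespace and collect its outputs.

    A real harness would sandbox execution; plain exec() is used here only for
    brevity, and we assume the code stores its outputs in a `results` dict.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace.get("results", {})


def workflow_is_reproducible(task: str, llm: Callable[[str], str]) -> bool:
    """Judge a workflow by whether two independent implementations agree."""
    workflow, analyst_code = analyst(task, llm)
    inspector_code = inspector(task, workflow, llm)
    return run(analyst_code) == run(inspector_code)
```

In this reading, agreement between the analyst's and inspector's executions serves as a proxy for whether the workflow is specified precisely enough to support reproducible implementations; the actual framework may use different comparison criteria.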
