AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Main: 8 pages · Appendix: 21 pages · Bibliography: 3 pages · 14 figures · 26 tables
Abstract
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet data science tasks often admit multiple statistically valid solutions, e.g., different modeling strategies, making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows, i.e., the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations.
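To make the workflow-level evaluation idea concrete, the sketch below shows one possible analyst-inspector loop: an analyst model produces a workflow and code, an inspector model independently re-implements the workflow, and reproducibility is judged by whether the two implementations agree. This is an illustrative reading of the abstract only; the role prompts, the `results`-variable convention, and the output-equality check are assumptions, not the paper's actual protocol or metrics.

```python
"""Minimal, hypothetical analyst-inspector sketch (not the paper's protocol).

`llm` is a stand-in for any text-in/text-out LLM client the reader supplies.
"""
from typing import Callable


def analyst(task: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Analyst role: write a step-by-step workflow, then code that follows it."""
    workflow = llm(f"Write a step-by-step analysis workflow for this task:\n{task}")
    code = llm(f"Write Python code that implements this workflow exactly:\n{workflow}")
    return workflow, code


def inspector(task: str, workflow: str, llm: Callable[[str], str]) -> str:
    """Inspector role: independently re-implement the analysis from the workflow alone."""
    return llm(
        f"Task:\n{task}\n\nIndependently write Python code implementing this workflow:\n{workflow}"
    )


def run(code: str) -> dict:
    """Execute generated code in a fresh namespace and collect its outputs.

    A real harness would sandbox execution; plain exec() is used here only for
    brevity, and we assume the code stores its outputs in a `results` dict.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace.get("results", {})


def workflow_is_reproducible(task: str, llm: Callable[[str], str]) -> bool:
    """Judge a workflow by whether two independent implementations agree."""
    workflow, analyst_code = analyst(task, llm)
    inspector_code = inspector(task, workflow, llm)
    return run(analyst_code) == run(inspector_code)
```

In this reading, agreement between the analyst's and inspector's executions serves as a proxy for whether the workflow is specified precisely enough to support reproducible implementations; the actual framework may use different comparison criteria.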
