MACEval: A Multi-Agent Continual Evaluation Network for Large Models
- ALMELM
Hundreds of benchmarks dedicated to evaluating large models have been presented over the past few years. However, most of them remain closed-ended and are prone to overfitting due to the potential data contamination. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define new metrics to quantify performance longitudinally. MACEval employs an interactive and autonomous evaluation mode, utilizing role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 23 large models demonstrate the effectiveness of MACEval, which also lightens the evaluation process and reduces a considerable amount of overhead. We hope that MACEval can broaden future directions of large model evaluation. Project page:this https URL.
View on arXiv