
SPHERE: An Evaluation Card for Human-AI Systems

Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Main: 8 pages
Bibliography: 9 pages
Appendix: 9 pages
5 figures
6 tables
Abstract

In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and to facilitate discussion of design options for human-AI system evaluation, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? Using SPHERE, we review 39 human-AI systems, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.
