Position: AI Evaluation Should Learn from How We Test Humans

18 June 2023
Yan Zhuang
Q. Liu
Yuting Ning
Wei Huang
Rui Lv
Zhenya Huang
Guanhao Zhao
Zheng-Wei Zhang
Abstract

As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine these systems' capabilities, typically evaluating them against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark, and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. We analyze the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
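
To make the idea of adaptive testing concrete, the following is a minimal sketch and not the authors' method. The abstract does not name a specific model, so the sketch assumes a two-parameter logistic (2PL) item response model, maximum Fisher information item selection, and a grid-based MAP ability estimate with a standard normal prior, all of which are common psychometric choices. The item bank, its discrimination and difficulty parameters, and every function name (`p_correct`, `fisher_information`, `estimate_ability`, `adaptive_test`, `simulated_model`) are hypothetical and simulated for illustration.

```python
import math
import random


def p_correct(theta, a, b):
    """2PL model: probability that a test taker with ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item provides about ability at the point theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate_ability(responses):
    """Grid-search MAP estimate of ability with a standard normal prior.
    responses is a list of (a, b, correct) tuples."""
    best_theta, best_lp = 0.0, -math.inf
    for step in range(-200, 201):  # grid from -4.0 to 4.0 in steps of 0.02
        theta = step / 50.0
        lp = -0.5 * theta * theta  # log prior, up to a constant
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            lp += math.log(p) if correct else math.log(1.0 - p)
        if lp > best_lp:
            best_theta, best_lp = theta, lp
    return best_theta


def adaptive_test(items, answer_fn, n_items=20):
    """Administer up to n_items items, each time choosing the unused item
    that is most informative at the current ability estimate.
    items maps item_id -> (a, b); answer_fn(item_id) returns True/False."""
    theta = 0.0
    remaining = dict(items)
    responses = []
    for _ in range(min(n_items, len(remaining))):
        item_id = max(
            remaining,
            key=lambda i: fisher_information(theta, *remaining[i]),
        )
        a, b = remaining.pop(item_id)
        responses.append((a, b, bool(answer_fn(item_id))))
        theta = estimate_ability(responses)
    return theta, responses


# Simulated item bank and simulated "model under test" (assumptions only).
rng = random.Random(0)
items = {i: (rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for i in range(500)}
true_theta = 1.0


def simulated_model(item_id):
    a, b = items[item_id]
    return rng.random() < p_correct(true_theta, a, b)


theta_hat, responses = adaptive_test(items, simulated_model, n_items=30)
print(f"estimated ability after {len(responses)} of {len(items)} items: {theta_hat:.2f}")
```

In this sketch the model under test answers only 30 of the 500 items in the bank, and the output is an ability estimate on the latent scale instead of an average score over a fixed test set. In practice the item parameters would first have to be estimated from response data, a step this example skips.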

@article{zhuang2025_2306.10512,
  title={ Position: AI Evaluation Should Learn from How We Test Humans },
  author={ Yan Zhuang and Qi Liu and Zachary A. Pardos and Patrick C. Kyllonen and Jiyun Zu and Zhenya Huang and Shijin Wang and Enhong Chen },
  journal={arXiv preprint arXiv:2306.10512},
  year={ 2025 }
}