arXiv:2410.02054

Comparing Criteria Development Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation

2 October 2024
Annalisa Szymanski
Simret Araya Gebreegziabher
Oghenemaro Anuyah
Ronald A Metoyer
T. Li
Abstract

Large Language Models (LLMs) are increasingly utilized for domain-specific tasks, yet integrating domain expertise into evaluating their outputs remains challenging. A common approach to evaluating LLMs is to use metrics, or criteria: assertions used to assess performance that help ensure that model outputs align with domain-specific standards. Previous efforts have involved developers, lay users, or the LLMs themselves in creating these criteria; however, evaluation from a domain-expertise perspective in particular remains understudied. This study explores how domain experts contribute to LLM evaluation by comparing their criteria with those generated by LLMs and lay users. We further investigate how the criteria-setting process evolves, analyzing changes between a priori and a posteriori stages. Our findings emphasize the importance of involving domain experts early in the evaluation process while utilizing the complementary strengths of lay users and LLMs. We suggest implications for designing workflows that leverage these strengths at different evaluation stages.
