Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

25 April 2024
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
Emanuele Bugliarello
Yasumasa Onoe
Pinelopi Papalampidi
Ira Ktena
Chris Knutsen
Cyrus Rashtchian
Anant Nawalgaria
Jordi Pont-Tuset
Aida Nematzadeh
Abstract

While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.

@article{wiles2025_2404.16820,
  title={Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings},
  author={Olivia Wiles and Chuhan Zhang and Isabela Albuquerque and Ivana Kajić and Su Wang and Emanuele Bugliarello and Yasumasa Onoe and Pinelopi Papalampidi and Ira Ktena and Chris Knutsen and Cyrus Rashtchian and Anant Nawalgaria and Jordi Pont-Tuset and Aida Nematzadeh},
  journal={arXiv preprint arXiv:2404.16820},
  year={2025}
}