
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

Yu-Chung Hsiao
Fedir Zubach
Gilles Baechler
Srinivas Sunkara
Victor Carbune
Jason Lin
Maria Wang
Yun Zhu
Jindong Chen
Abstract

We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. Existing screen datasets focus either on low-level structural and component understanding or on much higher-level composite tasks such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86k question-answer pairs over the RICO dataset, we aim to benchmark screen reading comprehension capability, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks that address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We further demonstrate positive transfer to web applications, highlighting its potential beyond mobile applications.
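
To make the annotation structure concrete, the sketch below models a single hypothetical ScreenQA record in Python. The class and field names (ScreenQARecord, full_answer, short_answer, ui_elements, bbox) and all example values are illustrative assumptions, not the dataset's actual schema or file format.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UIElement:
    # A piece of on-screen UI content that supports the answer.
    # The (left, top, right, bottom) pixel convention is an assumption,
    # not the dataset's documented coordinate format.
    text: str
    bbox: Tuple[int, int, int, int]

@dataclass
class ScreenQARecord:
    # One hypothetical question-answer annotation over a RICO screenshot.
    screenshot_id: str   # identifier of the RICO screenshot
    question: str        # natural-language question about the screen
    full_answer: str     # full-sentence answer
    short_answer: str    # short answer phrase
    ui_elements: List[UIElement] = field(default_factory=list)  # supporting UI contents

# Illustrative instance; all values are invented for demonstration only.
example = ScreenQARecord(
    screenshot_id="12345",
    question="What time does the first flight depart?",
    full_answer="The first flight departs at 7:45 AM.",
    short_answer="7:45 AM",
    ui_elements=[UIElement(text="7:45 AM", bbox=(120, 480, 260, 510))],
)

Such a record groups one question with a full answer, a short answer phrase, and the bounding-boxed UI contents that ground it, mirroring the annotation components described in the abstract.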

@article{hsiao2025_2209.08199,
  title={ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots},
  author={Yu-Chung Hsiao and Fedir Zubach and Gilles Baechler and Srinivas Sunkara and Victor Carbune and Jason Lin and Maria Wang and Yun Zhu and Jindong Chen},
  journal={arXiv preprint arXiv:2209.08199},
  year={2025}
}