ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

16 September 2022

Victor Carbune

Jason Lin

Maria Wang

Yun Zhu

Jindong Chen

Papers citing "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots"

50 / 67 papers shown

Jina-VLM: Small Multilingual Vision Language Model

Andreas Koukounas

Georgios Mastrapas

Florian Hönicke

Sedigheh Eslami

Guillaume Roncari

335

0

0

03 Dec 2025

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

121

0

0

25 Nov 2025

NVIDIA Nemotron Nano V2 VL

Amala Sanjay Deshmukh

Kateryna Chumachenko

Tuomas Rintamaki

...

Krzysztof Pawelec

309

2

0

06 Nov 2025

Composition-Grounded Instruction Synthesis for Visual Reasoning

84

0

0

16 Oct 2025

Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents

...

191

2

0

03 Aug 2025

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

...

268

20

0

25 Jul 2025

MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

...

413

6

0

19 Jul 2025

Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Zhuosheng Zhang

214

6

0

10 Jun 2025

VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

...

425

3

0

03 Jun 2025

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

...

335

13

0

29 May 2025

Data Metabolism: An Efficient Data Design Schema For Vision Language Model

381

2

0

10 Apr 2025

Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

302

1

0

10 Apr 2025

MP-GUI: Modality Perception with MLLMs for GUI Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

338

9

0

18 Mar 2025

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks

452

2

0

06 Mar 2025

SpiritSight Agent: Advanced GUI Agent with One Look

Computer Vision and Pattern Recognition (CVPR), 2025

412

11

0

05 Mar 2025

RWKV-UI: UI Understanding with Enhanced Perception and Reasoning

147

0

0

06 Feb 2025

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Computer Vision and Pattern Recognition (CVPR), 2024

...

197

19

0

16 Nov 2024

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Computer Vision and Pattern Recognition (CVPR), 2024

...

390

264

0

17 Oct 2024

Harnessing Webpage UIs for Text-Rich Visual Understanding

International Conference on Learning Representations (ICLR), 2024

Chenyan Xiong

Graham Neubig

368

21

0

17 Oct 2024

TinyClick: Single-Turn Agent for Empowering GUI Automation

Pawel Pawlowski

Krystian Zawistowski

Wojciech Lapacz

Sebastien Postansque

Jakub Hoscilowicz

391

9

0

09 Oct 2024

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

357

12

0

02 Oct 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang

Mingfei Gao

...

Zirui Wang

Yinfei Yang

303

66

1

30 Sep 2024

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Qinzhuo Wu

Weikai Xu

Wei Liu

Tao Tan

Jianfeng Liu

Jian Luan

Bin Wang

294

42

0

23 Sep 2024

MobileViews: A Large-Scale Mobile GUI Dataset

Li Zhang

Shangguang Wang

Yuanchun Li

Mengwei Xu

206

13

0

22 Sep 2024

POINTS: Improving Your Vision-language Model with Affordable Strategies

259

12

0

07 Sep 2024

WebQuest: A Benchmark for Multimodal QA on Web Page Sequences

Srinivas Sunkara

Gilles Baechler

Jindong Chen

300

12

0

06 Sep 2024

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Sanghyun Woo

Manoj Middepogu

...

359

626

0

24 Jun 2024

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

Minjoon Seo

235

4

0

17 Jun 2024

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

...

Yujiu Yang

Yingchun Wang

277

32

0

11 Jun 2024

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team

Thomas Mesnard

Surya Bhupatiraju

...

Kathleen Kenealy

589

836

0

13 Mar 2024

DeepSeek-VL: Towards Real-World Vision-Language Understanding

...

Chengqi Deng

434

642

0

08 Mar 2024

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Gilles Baechler

Srinivas Sunkara

Abhanshu Sharma

829

96

0

07 Feb 2024

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Graham Neubig

Ruslan Salakhutdinov

Daniel Fried

273

0

0

24 Jan 2024

WebVLN: Vision-and-Language Navigation on Websites

Hsiang-Ting Chen

Qi Wu

232

18

0

25 Dec 2023

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Alexander Kolesnikov

Jialin Wu

...

291

139

0

13 Oct 2023

Referring to Screen Texts with Voice Assistants

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Shruti Bhargava

Hoang Long Nguyen

Vincent Renkens

205

2

0

10 Jun 2023

WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics

International Conference on Human Factors in Computing Systems (CHI), 2023

Jeffrey Nichols

Jeffrey P. Bigham

227

85

0

30 Jan 2023

Towards Better Semantic Understanding of Mobile Interfaces

International Conference on Computational Linguistics (COLING), 2022

Srinivas Sunkara

Gilles Baechler

Abhanshu Sharma

222

32

0

06 Oct 2022

Enabling Conversational Interaction with Mobile UI using Large Language Models

International Conference on Human Factors in Computing Systems (CHI), 2022

400

173

0

18 Sep 2022

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Findings (Findings), 2022

415

1,126

0

19 Mar 2022

A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

European Conference on Computer Vision (ECCV), 2022

Bryan A. Plummer

409

80

0

04 Feb 2022

Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale

International Conference on Human Factors in Computing Systems (CHI), 2022

Gilles Baechler

270

56

0

11 Jan 2022

FinQA: A Dataset of Numerical Reasoning over Financial Data

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

...

Matthew I. Beane

Ting-Hao 'Kenneth' Huang

Bryan R. Routledge

580

516

0

01 Sep 2021

Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

ACM Symposium on User Interface Software and Technology (UIST), 2021

817

193

0

07 Aug 2021

UIBert: Learning Generic Multimodal Representations for UI Understanding

International Joint Conference on Artificial Intelligence (IJCAI), 2021

Srinivas Sunkara

Abhinav Rastogi

Blaise Agüera y Arcas

258

111

0

29 Jul 2021

Multimodal Icon Annotation For Mobile Applications

International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), 2021

182

20

0

09 Jul 2021

ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019

Dimosthenis Karatzas

197

380

0

18 Mar 2021

Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

160

135

0

08 Oct 2020

DocVQA: A Dataset for VQA on Document Images

Dimosthenis Karatzas

677

1,094

0

01 Jul 2020

Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning

International Conference on Software Engineering (ICSE), 2020

Liming Zhu

243

144

0

01 Mar 2020