ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

North American Chapter of the Association for Computational Linguistics (NAACL), 2022
16 September 2022
Yu-Chung Hsiao
Fedir Zubach
Maria Wang
Jindong Chen
Victor Carbune
Jason Lin
Maria Wang
Yun Zhu
Jindong Chen
    RALM

Papers citing "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots"

50 / 67 papers shown
Jina-VLM: Small Multilingual Vision Language Model
Andreas Koukounas
Georgios Mastrapas
Florian Hönicke
Sedigheh Eslami
Guillaume Roncari
Scott Martens
Han Xiao
MLLM
335
0
0
03 Dec 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
121
0
0
25 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
309
2
0
06 Nov 2025
Composition-Grounded Instruction Synthesis for Visual Reasoning
Xinyi Gu
Jiayuan Mao
Zhang-Wei Hong
Zhuoran Yu
Pengyuan Li
Dhiraj Joshi
Rogerio Feris
Zexue He
ReLM
LRM
84
0
0
16 Oct 2025
Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
Yuhan Guo
Cong Guo
Aiwen Sun
Hongliang He
Xinyu Yang
...
Jiang Duan
Yijia Xiao
Liangjian Wen
Hai-Ming Xu
Yong Dai
LRM
191
2
0
03 Aug 2025
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
X. Wang
Zhenyu Wu
JingJing Xie
Zichen Ding
Bowen Yang
...
Weijie Su
X. Zhu
Wei Shen
Jifeng Dai
Wenhai Wang
LLMAG
268
20
0
25 Jul 2025
MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning
Liujian Tang
Shaokang Dong
Y. Huang
Minqi Xiang
Hongtao Ruan
...
Qi Zhang
Kang Wang
Y. Zhang
Y. Wang
Yuran Wang
LM&Ro
413
6
0
19 Jul 2025
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Yuan Guo
Tingjia Miao
Zheng Wu
Pengzhou Cheng
Ming Zhou
Zhuosheng Zhang
214
6
0
10 Jun 2025
VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Hao Yan
Handong Zheng
Hao Wang
Liang Yin
Xingchen Liu
...
Minghui Liao
Chao Weng
Wei Chen
Yuliang Liu
Xiang Bai
LRM
425
3
0
03 Jun 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Chenyu Yang
Shiqian Su
Shi-Qi Liu
Xuan Dong
Yue Yu
...
Hao Li
Wenhai Wang
Yu Qiao
Xizhou Zhu
Jifeng Dai
OffRL
335
13
0
29 May 2025
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
Jingyuan Zhang
Hongzhi Zhang
Zhou Haonan
Chenxi Sun
Xingguang Ji
Jiakang Wang
Fanheng Kong
Wenshu Fan
Qi Wang
Fuzheng Zhang
VLM
381
2
0
10 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Wenshu Fan
Qi Wang
Fuzheng Zhang
MLLM
VLM
302
1
0
10 Apr 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
338
9
0
18 Mar 2025
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Feng Ni
Kui Huang
Yao Lu
Wenyu Lv
Guanzhong Wang
Zeyu Chen
Wenshu Fan
VLM
452
2
0
06 Mar 2025
SpiritSight Agent: Advanced GUI Agent with One Look
Computer Vision and Pattern Recognition (CVPR), 2025
Zhiyuan Huang
Ziming Cheng
Junting Pan
Zhaohui Hou
Mingjie Zhan
LLMAG
412
11
0
05 Mar 2025
RWKV-UI: UI Understanding with Enhanced Perception and Reasoning
Jiaxi Yang
Haowen Hou
ReLM
LRM
147
0
0
06 Feb 2025
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Computer Vision and Pattern Recognition (CVPR), 2024
Xudong Lu
Yinghao Chen
Cheng Chen
Hui Tan
Boheng Chen
...
Aojun Zhou
Yafei Wen
Xiaoxin Chen
Shuai Ren
Jiaming Song
197
19
0
16 Nov 2024
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Computer Vision and Pattern Recognition (CVPR), 2024
Chengyue Wu
Xiaokang Chen
Z. F. Wu
Yiyang Ma
Xingchao Liu
...
Wen Liu
Zhenda Xie
Xingkai Yu
Chong Ruan
Ping Luo
AI4TS
390
264
0
17 Oct 2024
Harnessing Webpage UIs for Text-Rich Visual Understanding
International Conference on Learning Representations (ICLR), 2024
Junpeng Liu
Tianyue Ou
Yifan Song
Yuxiao Qu
Wai Lam
Chenyan Xiong
Lei Ma
Graham Neubig
Xiang Yue
368
21
0
17 Oct 2024
TinyClick: Single-Turn Agent for Empowering GUI Automation
Pawel Pawlowski
Krystian Zawistowski
Wojciech Lapacz
Marcin Skorupa
Adam Wiacek
Sebastien Postansque
Jakub Hoscilowicz
LRM
LLMAG
MLLM
391
9
0
09 Oct 2024
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Mengzhao Jia
Wenhao Yu
Kaixin Ma
Tianqing Fang
Z. Zhang
Siru Ouyang
Hongming Zhang
Meng Jiang
Dong Yu
VLM
357
12
0
02 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
303
66
1
30 Sep 2024
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Qinzhuo Wu
Weikai Xu
Wei Liu
Tao Tan
Jianfeng Liu
Ang Li
Jian Luan
Bin Wang
Shuo Shang
VLM
294
42
0
23 Sep 2024
MobileViews: A Large-Scale Mobile GUI Dataset
Longxi Gao
Li Zhang
Shihe Wang
Shangguang Wang
Yuanchun Li
Mengwei Xu
206
13
0
22 Sep 2024
POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu
Zhongyin Zhao
Ziyuan Zhuang
Le Tian
Xiao Zhou
Jie Zhou
VLM
259
12
0
07 Sep 2024
WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
Maria Wang
Srinivas Sunkara
Gilles Baechler
Jason Lin
Yun Zhu
Fedir Zubach
Lei Shu
Jindong Chen
LRM
LLMAG
300
12
0
06 Sep 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DV
MLLM
359
626
0
24 Jun 2024
On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning
Geewook Kim
Minjoon Seo
VLM
235
4
0
17 Jun 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu
Zeyang Zhou
Kexin Huang
Dandan Liang
Yixu Wang
...
Keqing Wang
Yujiu Yang
Yan Teng
Botian Shi
Yingchun Wang
ELM
277
32
0
11 Jun 2024
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team
Thomas Mesnard
Cassidy Hardin
Robert Dadashi
Surya Bhupatiraju
...
Armand Joulin
Noah Fiedel
Evan Senter
Alek Andreev
Kathleen Kenealy
VLM
LLMAG
589
836
0
13 Mar 2024
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu
Wen Liu
Bo Zhang
Bing-Li Wang
Kai Dong
...
Yaofeng Sun
Chengqi Deng
Hanwei Xu
Zhenda Xie
Chong Ruan
VLM
434
642
0
08 Mar 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
829
96
0
07 Feb 2024
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Jing Yu Koh
Robert Lo
Lawrence Jang
Vikram Duvvur
Ming Chong Lim
Po-Yu Huang
Graham Neubig
Shuyan Zhou
Ruslan Salakhutdinov
Daniel Fried
273
0
0
24 Jan 2024
WebVLN: Vision-and-Language Navigation on Websites
Qi Chen
D. Pitawela
Chongyang Zhao
Gengze Zhou
Hsiang-Ting Chen
Qi Wu
232
18
0
25 Dec 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM
VLM
291
139
0
13 Oct 2023
Referring to Screen Texts with Voice Assistants
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Shruti Bhargava
Anand Dhoot
I. Jonsson
Hoang Long Nguyen
Alkesh Patel
Hong-ye Yu
Vincent Renkens
205
2
0
10 Jun 2023
WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics
International Conference on Human Factors in Computing Systems (CHI), 2023
Jason Wu
Siyan Wang
Siman Shen
Yi-Hao Peng
Jeffrey Nichols
Jeffrey P. Bigham
227
85
0
30 Jan 2023
Towards Better Semantic Understanding of Mobile Interfaces
International Conference on Computational Linguistics (COLING), 2022
Srinivas Sunkara
Maria Wang
Lijuan Liu
Gilles Baechler
Yu-Chung Hsiao
Jindong Chen
Abhanshu Sharma
James Stout
222
32
0
06 Oct 2022
Enabling Conversational Interaction with Mobile UI using Large Language Models
International Conference on Human Factors in Computing Systems (CHI), 2022
Bryan Wang
Gang Li
Yang Li
400
173
0
18 Sep 2022
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Findings (Findings), 2022
Ahmed Masry
Do Xuan Long
J. Tan
Shafiq Joty
Enamul Hoque
AIMat
415
1,126
0
19 Mar 2022
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
European Conference on Computer Vision (ECCV), 2022
Andrea Burns
Deniz Arsan
Sanjna Agrawal
Ranjitha Kumar
Kate Saenko
Bryan A. Plummer
409
80
0
04 Feb 2022
Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale
International Conference on Human Factors in Computing Systems (CHI), 2022
Gang Li
Gilles Baechler
Manuel Tragut
Yang Li
270
56
0
11 Jan 2022
FinQA: A Dataset of Numerical Reasoning over Financial Data
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Zhiyu Chen
Wenhu Chen
Charese Smiley
Sameena Shah
Iana Borova
...
Reema N Moussa
Matthew I. Beane
Ting-Hao 'Kenneth' Huang
Bryan R. Routledge
Wenjie Wang
AIMat
580
516
0
01 Sep 2021
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
ACM Symposium on User Interface Software and Technology (UIST), 2021
Bryan Wang
Gang Li
Xin Zhou
Zhourong Chen
Tovi Grossman
Yang Li
817
193
0
07 Aug 2021
UIBert: Learning Generic Multimodal Representations for UI Understanding
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Chongyang Bai
Xiaoxue Zang
Ying Xu
Srinivas Sunkara
Abhinav Rastogi
Jindong Chen
Blaise Agüera y Arcas
258
111
0
29 Jul 2021
Multimodal Icon Annotation For Mobile Applications
International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), 2021
Xiaoxue Zang
Ying Xu
Jindong Chen
182
20
0
09 Jul 2021
ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019
Zheng Huang
Kai Chen
Jianhua He
X. Bai
Dimosthenis Karatzas
Shijian Lu
C. V. Jawahar
197
380
0
18 Mar 2021
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Yongqian Li
Gang Li
Luheng He
Jingjie Zheng
Hong Li
Zhiwei Guan
160
135
0
08 Oct 2020
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
677
1,094
0
01 Jul 2020
Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning
International Conference on Software Engineering (ICSE), 2020
Jieshan Chen
Chunyang Chen
Zhenchang Xing
Xiwei Xu
Liming Zhu
Guoqiang Li
Jinshui Wang
243
144
0
01 Mar 2020