
PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

Xudong Lu
Huankang Guan
Yang Bo
Jinpeng Chen
Xintong Guo
Shuhan Li
Fang Liu
Peiwen Sun
Xueying Li
Wei Zhang
Xue Yang
Rui Liu
Hongsheng Li
Main: 8 Pages
Bibliography: 3 Pages
Appendix: 7 Pages
15 Figures
9 Tables
Abstract

Multimodal Large Language Models (MLLMs) excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or short videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline with LLM-as-a-Judge scoring of open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely because they respond early, before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at this https URL.
