Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings and propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric that evaluates TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present VLM-TSI, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest, highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data are available.
@article{yu2025_2505.11326,
  title={Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models},
  author={Keunwoo Peter Yu and Joyce Chai},
  journal={arXiv preprint arXiv:2505.11326},
  year={2025}
}