SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

10 April 2025
Yangliu Hu
Zikai Song
Na Feng
Yawei Luo
Junqing Yu
Yi-Ping Phoebe Chen
Wei Yang
Abstract

Video-based Large Language Models (Video-LLMs) have seen substantial advancements in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at providing overall descriptions of videos, they struggle with fine-grained understanding, particularly of visual dynamics and queries about video details. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence, we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF²T), a novel, effortless fine-tuning method that employs the rich inherent characteristics of videos for training, unlocking stronger fine-grained understanding in Video-LLMs. Moreover, it relieves researchers from labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) FineVidBench, a novel benchmark dataset for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF²T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.
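
The abstract does not spell out how the self-supervised fragment tasks are constructed. Purely as an illustrative sketch, and not the paper's actual task design, the Python snippet below shows one way annotation-free fragment training samples could be generated from a video: cut it into fragments, shuffle them, and ask the model to recover the temporal order, with the answer known by construction. All names, parameters, and the choice of ordering task are assumptions made for illustration.

```python
# Hypothetical sketch of building self-supervised "fragment" training samples,
# in the spirit of the SF²T idea described in the abstract: supervision comes
# from the video itself, with no human annotation. The temporal-ordering task
# shown here is an illustrative assumption, not the paper's actual design.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class FragmentSample:
    fragment_paths: List[str]   # shuffled video fragments shown to the model
    instruction: str            # prompt given to the Video-LLM
    target: str                 # answer derived automatically from the shuffle


def make_ordering_sample(video_id: str, num_fragments: int = 4) -> FragmentSample:
    """Cut a video into equal fragments, shuffle them, and ask the model to
    recover the original temporal order. The correct answer is known by
    construction, so no manual labeling is needed."""
    original = [f"{video_id}_frag{i}.mp4" for i in range(num_fragments)]
    order = list(range(num_fragments))
    random.shuffle(order)                       # shuffled[j] = original[order[j]]
    shuffled = [original[i] for i in order]
    instruction = (
        "The following video fragments are shuffled. "
        "List them in their original temporal order."
    )
    # Target: for each original position, the 1-based shuffled slot that holds it.
    target = " ".join(str(order.index(i) + 1) for i in range(num_fragments))
    return FragmentSample(shuffled, instruction, target)


if __name__ == "__main__":
    sample = make_ordering_sample("demo_video")
    print(sample.instruction)
    print("Fragments:", sample.fragment_paths)
    print("Answer:", sample.target)
```

Other fragment-level pretext tasks (e.g., playback-speed or spatial-crop prediction) could be generated the same way; the key property is that the target is derived mechanically from how the fragments were produced.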

View on arXiv
@article{hu2025_2504.07745,
  title={SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding},
  author={Yangliu Hu and Zikai Song and Na Feng and Yawei Luo and Junqing Yu and Yi-Ping Phoebe Chen and Wei Yang},
  journal={arXiv preprint arXiv:2504.07745},
  year={2025}
}