F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to accurately identify events that satisfy all three F$^3$ criteria, owing to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and the framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detection, which achieves superior performance. The dataset, model, and benchmark code are available at this https URL.
@article{liu2025_2504.08222,
  title={F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos},
  author={Zhaoyu Liu and Kan Jiang and Murong Ma and Zhe Hou and Yun Lin and Jin Song Dong},
  journal={arXiv preprint arXiv:2504.08222},
  year={2025}
}