SEAL: Semantic Attention Learning for Long Video Representation
arXiv:2412.01798
Computer Vision and Pattern Recognition (CVPR), 2025
2 December 2024
Lan Wang, Yujia Chen, Wen-Sheng Chu, Vishnu Boddeti, Du Tran

Papers citing "SEAL: Semantic Attention Learning for Long Video Representation"

42 / 42 papers shown
SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Yuan Sheng, Y. Hao, Chenxu Li, Shuo Wang, Xiangnan He
23 Oct 2025

Poisoning Prompt-Guided Sampling in Video Large Language Models
Yuxin Cao, Wei Song, Jingling Xue, Jin Song Dong
25 Sep 2025

Training-Free Multi-Style Fusion Through Reference-Based Adaptive Modulation
Xu Liu, Yibo Lu, Xinxian Wang, Xinyu Wu
23 Sep 2025

Prompt-Guided Dual Latent Steering for Inversion Problems
Yichen Wu, Xu Liu, Chenxuan Zhao, Xinyu Wu
23 Sep 2025

Failures to Surface Harmful Contents in Video Large Language Models
Yuxin Cao, Wei Song, Derui Wang, Jingling Xue, Jin Song Dong
14 Aug 2025

Enhancing Long Video Question Answering with Scene-Localized Frame Grouping
Xuyi Yang, Wenhao Zhang, Hongbo Jin, Lin Liu, Hongbo Xu, Yongwei Nie, Fei Richard Yu, Fei Ma
05 Aug 2025

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
International Conference on Learning Representations (ICLR), 2025
Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
19 Sep 2024

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, ..., Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao
26 Jun 2024

Hallucination Mitigation Prompts Long-term Video Understanding
Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie
17 Jun 2024

LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, ..., Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
12 Jun 2024

Streaming Long Video Understanding with Large Language Models
Rui Qian, Xiao-wen Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Yuan Liu
25 May 2024

YOLOv10: Real-Time End-to-End Object Detection
Neural Information Processing Systems (NeurIPS), 2024
Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding
23 May 2024

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
Enxin Song, Wenhao Chai, Tianbo Ye, Lei Li, Xi Li, Gaoang Wang
26 Apr 2024

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng
25 Apr 2024

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, ..., Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
25 Apr 2024

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid
09 Apr 2024

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
08 Apr 2024

Koala: Key frame-conditioned long video-LLM
Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan C. Russell, Kate Saenko
05 Apr 2024

LongVLM: Efficient Long Video Understanding via Large Language Models
European Conference on Computer Vision (ECCV), 2024
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang
04 Apr 2024

SnAG: Scalable and Accurate Video Grounding
Computer Vision and Pattern Recognition (CVPR), 2024
Fangzhou Mu, Sicheng Mo, Yin Li
02 Apr 2024

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
18 Mar 2024

NetTrack: Tracking Highly Dynamic Objects with a Net
Guang-Zheng Zheng, Shijie Lin, Haobo Zuo, Changhong Fu, Jia Pan
17 Mar 2024

Yi: Open Foundation Models by 01.AI
01.AI, Alex Young, Bei Chen, Chao Li, ..., Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
07 Mar 2024

Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar, Yongqin Xian, A. Tonioni, Andrew Zisserman, Federico Tombari
19 Dec 2023

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2024
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou
04 Dec 2023

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2024
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, ..., Tianbo Ye, Yanting Zhang, Yang Lu, Lei Li, Gaoang Wang
31 Jul 2023

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Muhammad Maaz, H. Rasheed, Salman Khan, Fahad Shahbaz Khan
08 Jun 2023

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Hang Zhang, Xin Li, Lidong Bing
05 Jun 2023

VideoChat: Chat-Centric Video Understanding
Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao
10 May 2023

Segment Anything
IEEE International Conference on Computer Vision (ICCV), 2023
A. Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, ..., Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross B. Girshick
05 Apr 2023

Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos
IEEE International Conference on Computer Vision (ICCV), 2023
Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, Deli Zhao
15 Mar 2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
International Conference on Machine Learning (ICML), 2023
Junnan Li, Dongxu Li, Silvio Savarese, Steven C. H. Hoi
30 Jan 2023

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
14 Nov 2022

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, W. Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
22 Sep 2022

BoT-SORT: Robust Associations Multi-Pedestrian Tracking
Nir Aharon, Roy Orfaig, B. Bobrovsky
29 Jun 2022

Egocentric Video-Language Pretraining
Neural Information Processing Systems (NeurIPS), 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, ..., Hongfa Wang, Dima Damen, Guohao Li, Wei Liu, Mike Zheng Shou
03 Jun 2022

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
Computer Vision and Pattern Recognition (CVPR), 2022
Jiangtong Li, Li Niu, Liqing Zhang
30 May 2022

Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Minghao Chen, Fangyun Wei, Chong Li, Deng Cai
28 Mar 2022

Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, ..., Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
13 Oct 2021

The "something something" video database for learning and evaluating visual common sense
IEEE International Conference on Computer Vision (ICCV), 2017
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, S. Westphal, ..., Moritz Mueller-Freitag, F. Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic
13 Jun 2017

The Kinetics Human Action Video Dataset
W. Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, ..., Tim Green, T. Back, Apostol Natsev, Mustafa Suleyman, Andrew Zisserman
19 May 2017

The THUMOS Challenge on Action Recognition for Videos "in the Wild"
Haroon Idrees, Amir Zamir, Yu-Gang Jiang, Alexander N. Gorban, Ivan Laptev, Rahul Sukthankar, M. Shah
21 Apr 2016