2406.05615
Cited By
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
9 June 2024
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
Papers citing
"Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives"
13 / 13 papers shown
Title
Agentic Keyframe Search for Video Question Answering
Sunqi Fan
Meng-Hao Guo
Shuojin Yang
45
0
0
20 Mar 2025
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
Khang H. N. Vo
D. Q. Nguyen
T. Nguyen
Tho Quan
45
0
0
09 Mar 2025
Do Language Models Understand Time?
Xi Ding
Lei Wang
158
0
0
18 Dec 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
31
6
0
27 Sep 2024
Topic-aware Causal Intervention for Counterfactual Detection
Thong Nguyen
Truc-My Nguyen
20
1
0
25 Sep 2024
GalleryGPT: Analyzing Paintings with Large Multimodal Models
Yi Bin
Wenhao Shi
Yujuan Ding
Zhiqiang Hu
Zheng Wang
Yang Yang
See-Kiong Ng
H. Shen
MLLM
20
11
0
01 Aug 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
26
5
0
04 Jul 2024
Encoding and Controlling Global Semantics for Long-form Video Question Answering
Thong Nguyen
Zhiyuan Hu
Xiaobao Wu
Cong-Duy Nguyen
See-Kiong Ng
A. Luu
32
2
0
30 May 2024
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
CLIP
113
60
0
17 May 2022
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
59
44
0
21 Sep 2021
Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering
Jungin Park
Jiyoung Lee
K. Sohn
114
99
0
29 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Jie Lei
Licheng Yu
Tamara L. Berg
Mohit Bansal
106
268
0
24 Jan 2020