UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

15 February 2020
Huaishao Luo
Lei Ji
Ding Wang
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
    VLM

Papers citing "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"

50 / 294 papers shown
Leveraging Foundation Models for Multimodal Graph-Based Action Recognition
Fatemeh Ziaeetabar
Florentin Wörgötter
474
3
0
21 May 2025
ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting
Jian Hu
Dimitrios Korkinof
S. Gong
Mariano Beguerisse-Díaz
VLM
259
0
0
22 Apr 2025
Parameter-Efficient Continual Fine-Tuning: A Survey
Eric Nuertey Coleman
Luigi Quarantiello
Ziyue Liu
Qinwen Yang
Samrat Mukherjee
J. Hurtado
Vincenzo Lomonaco
CLL
466
8
0
18 Apr 2025
FocusedAD: Character-centric Movie Audio Description
Xiaojun Ye
C. Wang
Yiren Song
Sheng Zhou
Liangcheng Li
Jiajun Bu
VGen
457
5
0
16 Apr 2025
F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
International Conference on Learning Representations (ICLR), 2025
Zhaoyu Liu
Kan Jiang
Murong Ma
Zhe Hou
Yun Lin
Jin Song Dong
364
5
0
11 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
936
1
0
07 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Mario Sznaier
320
0
0
07 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
337
5
0
31 Mar 2025
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Ramanathan Rajendiran
Debaditya Roy
Basura Fernando
VGen
369
1
0
03 Mar 2025
CrossOver: 3D Scene Cross-Modal Alignment
Computer Vision and Pattern Recognition (CVPR), 2025
S. Sarkar
O. Mikšík
Marc Pollefeys
Daniel Barath
Iro Armeni
3DPC
507
12
0
20 Feb 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
484
3
0
31 Dec 2024
Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track
D. Gupta
Dina Demner-Fushman
LM&MA
280
3
0
15 Dec 2024
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Yunkai Dang
Kaichen Huang
Jiahao Huo
Yibo Yan
Shijie Huang
...
Kun Wang
Yong Liu
Jing Shao
Hui Xiong
Xuming Hu
LRM
476
62
0
03 Dec 2024
TechCoach: Towards Technical-Point-Aware Descriptive Action Coaching
Yuan-Ming Li
An-Lan Wang
Kun-Yu Lin
Yu-Ming Tang
Ling-an Zeng
Jian-Fang Hu
Wei-Shi Zheng
584
8
0
26 Nov 2024
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Jungeun Kim
Hyeongwoo Jeon
Jongseong Bae
Ha Young Kim
SLR
353
7
0
25 Nov 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
282
1
0
11 Nov 2024
Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024
Wenqiang Chen
Jiaxuan Cheng
Leyao Wang
Wei Zhao
Wojciech Matusik
335
18
0
26 Oct 2024
It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Asian Conference on Computer Vision (ACCV), 2024
Toby Perrett
Tengda Han
Dima Damen
Andrew Zisserman
287
3
0
15 Oct 2024
Bridging Text and Image for Artist Style Transfer via Contrastive Learning
Zhi-Song Liu
Li-Wen Wang
Jun Xiao
Vicky Kalogeiton
CLIP, VLM
290
0
0
12 Oct 2024
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
IEEE Transactions on Image Processing (TIP), 2024
Ting Yu
Kunhao Fu
Jian Zhang
Qingming Huang
Jun Yu
273
11
0
12 Oct 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
272
1
0
12 Oct 2024
Exploring Efficient Foundational Multi-modal Models for Video Summarization
Karan Samel
Apoorva Beedu
Nitish Sontakke
Irfan Essa
200
2
0
09 Oct 2024
Grounding is All You Need? Dual Temporal Grounding for Video Dialog
You Qin
Wei Ji
Xinze Lan
Hao Fei
Xun Yang
Dan Guo
Roger Zimmermann
Lizi Liao
VGen
353
2
0
08 Oct 2024
EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts
Yuto Haneji
Taichi Nishimura
Hirotaka Kameko
Keisuke Shirai
Tomoya Yoshida
Keiya Kajimura
Koki Yamamoto
Taiyu Cui
Tomohiro Nishimoto
Shinsuke Mori
EgoV
432
0
0
07 Oct 2024
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Neural Information Processing Systems (NeurIPS), 2024
Kun Yuan
V. Srivastav
Nassir Navab
N. Padoy
510
37
0
30 Sep 2024
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
European Conference on Computer Vision (ECCV), 2024
Yuxiao Chen
Keqin Li
Wentao Bao
Deep Patel
Yu Kong
Martin Renqiang Min
Dimitris N. Metaxas
DiffM
358
8
0
22 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
468
6
0
19 Sep 2024
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Yongqi Wang
Xinxiao Wu
Shuo Yang
Jiebo Luo
1.0K
4
0
19 Sep 2024
Recent Advances in Multimodal Affective Computing: An NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
435
23
0
11 Sep 2024
Enhancing Long Video Understanding via Hierarchical Event-Based Memory
Dingxin Cheng
Mingda Li
Jingyu Liu
Yongxin Guo
Bin Jiang
Qingbin Liu
Xi Chen
Bo Zhao
313
15
0
10 Sep 2024
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Jean Park
Kuk Jin Jang
Basam Alasaly
Sriharsha Mopidevi
Andrew Zolensky
Eric Eaton
Insup Lee
Kevin Johnson
322
18
0
22 Aug 2024
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
ACM Multimedia (MM), 2024
Yili Li
Jing Yu
Keke Gai
Bang Liu
Gang Xiong
Qi Wu
DiffM, VGen
270
6
0
21 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
European Conference on Computer Vision (ECCV), 2024
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
311
3
0
05 Aug 2024
Language-driven Grasp Detection with Mask-guided Attention
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024
Tuan V. Vo
M. Vu
Baoru Huang
An Vuong
Ngan Le
T. Vo
Anh Nguyen
234
6
0
29 Jul 2024
MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili
ACM Multimedia (MM), 2024
Han Wang
Tan Rui Yang
Usman Naseem
Roy Ka-wei Lee
308
32
0
28 Jul 2024
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Tz-Ying Wu
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
EgoV
569
0
0
28 Jul 2024
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
492
11
0
26 Jul 2024
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
300
22
0
22 Jul 2024
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks
Takumi Komatsu
Motonari Kambara
Shumpei Hatanaka
Haruka Matsuo
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
359
2
0
18 Jul 2024
SoupLM: Model Integration in Large Language and Multi-Modal Models
Yue Bai
Zichen Zhang
Jiasen Lu
Yun Fu
MoMe
206
1
0
11 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
553
11
0
04 Jul 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Min Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
310
73
0
27 Jun 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
International Conference on Learning Representations (ICLR), 2024
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
511
9
0
10 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
642
39
1
09 Jun 2024
Seeing the Unseen: Visual Metaphor Captioning for Videos
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Abisek Rajakumar Kalarani
Pushpak Bhattacharyya
Sumit Shekhar
VLM
179
1
0
07 Jun 2024
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification
Xun Guo
Yiheng Deng
Yunfeng Yan
Feng Zhu
Yizhou Wang
Mengwei He
Qingsong Xie
Donglian Qi
Wanli Ouyang
Weizhen He
566
7
0
28 May 2024
A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Tariq Adnan
Abdelrahman Abdelkader
Zipei Liu
Ekram Hossain
Sooyong Park
Md. Saiful Islam
Ehsan Hoque
180
5
0
21 May 2024
MICap: A Unified Model for Identity-aware Movie Descriptions
Computer Vision and Pattern Recognition (CVPR), 2024
Haran Raajesh
Naveen Reddy Desanur
Zeeshan Khan
Makarand Tapaswi
376
7
0
19 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
299
2
0
12 May 2024
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
Shiyi Zhang
Sule Bai
Guangyi Chen
Lei Chen
Jiwen Lu
Junle Wang
Yansong Tang
285
26
0
22 Apr 2024