1505.01861
Cited By
v1
v2
v3 (latest)
Jointly Modeling Embedding and Translation to Bridge Video and Language
7 May 2015
Yingwei Pan
Tao Mei
Ting Yao
Houqiang Li
Y. Rui
Papers citing
"Jointly Modeling Embedding and Translation to Bridge Video and Language"
50 / 199 papers shown
Title
Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
Junan Chen
Trung Thanh Nguyen
Takahiro Komamizu
Ichiro Ide
52
0
0
11 Oct 2025
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui
Cong Ma
Zongming Ma
SSL
305
2
0
18 May 2025
MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2023
Zijian Győző Yang
Zhongwei Qiu
Chang Xu
Dongmei Fu
361
3
0
28 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
European Conference on Computer Vision (ECCV), 2024
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffM
VLM
269
2
0
03 Jan 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
387
4
0
31 Dec 2024
Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference
Yuqi Liu
Mengyu Li
Di Liang
Ximing Li
Fausto Giunchiglia
Lan Huang
Xiaoyue Feng
Renchu Guan
186
10
0
21 May 2024
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Minkuk Kim
Hyeon Bae Kim
Jinyoung Moon
Jinwoo Choi
Seong Tae Kim
159
39
0
11 Apr 2024
Cross-Modal Reasoning with Event Correlation for Video Question Answering
Chengxiang Yin
Zhengping Che
Kun Wu
Zhiyuan Xu
Qinru Qiu
Jian Tang
170
0
0
20 Dec 2023
Multi Sentence Description of Complex Manipulation Action Videos
Machine Vision and Applications (MVA), 2023
Fatemeh Ziaeetabar
Reza Safabakhsh
S. Momtazi
M. Tamosiunaite
Florentin Wörgötter
208
7
0
13 Nov 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
308
21
0
23 Sep 2023
Zero-shot Composed Text-Image Retrieval
British Machine Vision Conference (BMVC), 2023
Yikun Liu
Jiangchao Yao
Ya Zhang
Yanfeng Wang
Weidi Xie
181
31
0
12 Jun 2023
SEM-POS: Grammatically and Semantically Correct Video Captioning
Asmar Nadeem
A. Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
185
10
0
26 Mar 2023
ADAPT: Action-aware Driving Caption Transformer
IEEE International Conference on Robotics and Automation (ICRA), 2023
Bu Jin
Xinyi Liu
Yupeng Zheng
Pengfei Li
Hao Zhao
Tong Zhang
Yuhang Zheng
Guyue Zhou
Jingjing Liu
375
93
0
01 Feb 2023
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Fenglin Liu
Xian Wu
Chenyu You
Shen Ge
Yuexian Zou
Xu Sun
238
28
0
22 Nov 2022
Prophet Attention: Predicting Attention with Future Attention for Image Captioning
Neural Information Processing Systems (NeurIPS), 2022
Fenglin Liu
Xuancheng Ren
Xian Wu
Wei Fan
Yuexian Zou
Xu Sun
224
50
0
19 Oct 2022
TLDW: Extreme Multimodal Summarisation of News Videos
Peggy Tang
Kun Hu
Lei Zhang
Jiebo Luo
Zhiyong Wang
185
11
0
16 Oct 2022
Cross Modal Compression: Towards Human-comprehensible Semantic Compression
ACM Multimedia (MM), 2021
Jiguo Li
Chuanmin Jia
Xinfeng Zhang
Siwei Ma
Wen Gao
136
27
0
06 Sep 2022
Video Captioning: a comparative review of where we are and which could be the route
Computer Vision and Image Understanding (CVIU), 2022
Daniela Moctezuma
Tania A. Ramirez-delreal
Guillermo Ruiz
Othón González-Chávez
193
14
0
12 Apr 2022
Temporal Alignment Networks for Long-term Video
Computer Vision and Pattern Recognition (CVPR), 2022
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
155
103
0
06 Apr 2022
Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
W. Ramos
M. Silva
Edson R. Araujo
Victor Moura
Keller Clayderman Martins de Oliveira
Leandro Soriano Marcolino
Erickson R. Nascimento
VGen
171
4
0
29 Mar 2022
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
Computer Vision and Pattern Recognition (CVPR), 2022
Tomáš Souček
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
214
43
0
22 Mar 2022
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
European Conference on Computer Vision (ECCV), 2022
Yang Jiao
Shaoxiang Chen
Zequn Jie
Wenke Huang
Lin Ma
Yu-Gang Jiang
3DPC
223
58
0
10 Mar 2022
Exploiting long-term temporal dynamics for video captioning
World Wide Web (Bussum) (WWW), 2018
Yuyu Guo
Jingqiu Zhang
Lianli Gao
126
18
0
22 Feb 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
362
49
0
20 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
209
24
0
11 Jan 2022
Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation
IEEE International Conference on Image Processing (ICIP), 2021
Philipp Harzig
Moritz Einfalt
Rainer Lienhart
ViT
153
3
0
28 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
145
45
0
14 Dec 2021
Controllable Video Captioning with an Exemplar Sentence
Yitian Yuan
Lin Ma
Jingwen Wang
Wenwu Zhu
173
21
0
02 Dec 2021
Syntax Customized Video Captioning by Imitating Exemplar Sentences
Yitian Yuan
Lin Ma
Wenwu Zhu
152
8
0
02 Dec 2021
Hierarchical Modular Network for Video Captioning
Hanhua Ye
Guorong Li
Yuankai Qi
Shuhui Wang
Qingming Huang
Ming-Hsuan Yang
218
88
0
24 Nov 2021
Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks
Computer Vision and Image Understanding (CVIU), 2021
Arulkumar Subramaniam
Jayesh Vaidya
Muhammed Ameen
Athira M. Nambiar
Anurag Mittal
324
7
0
14 Nov 2021
CLIP4Caption: CLIP for Video Caption
Mingkang Tang
Zhanyu Wang
Zhenhua Liu
Fengyun Rao
Dian Li
Xiu Li
CLIP
VLM
241
173
0
13 Oct 2021
A Survey on Temporal Sentence Grounding in Videos
Xiaohan Lan
Yitian Yuan
Xin Eric Wang
Zhi Wang
Wenwu Zhu
307
57
0
16 Sep 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
ACM Multimedia (ACM MM), 2021
Katsuyuki Nakamura
Hiroki Ohashi
Mitsuhiro Okada
EgoV
204
14
0
07 Sep 2021
Maximum Likelihood Estimation for Multimodal Learning with Missing Modality
Fei Ma
Xiangxiang Xu
Shao-Lun Huang
Lin Zhang
167
16
0
24 Aug 2021
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Yehao Li
Yingwei Pan
Jingwen Chen
Ting Yao
Tao Mei
VLM
178
36
0
18 Aug 2021
End-to-End Dense Video Captioning with Parallel Decoding
Teng Wang
Ruimao Zhang
Zhichao Lu
Feng Zheng
Ran Cheng
Ping Luo
3DV
235
223
0
17 Aug 2021
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning
Findings of the Association for Computational Linguistics (Findings), 2021
Fenglin Liu
Xuancheng Ren
Xian Wu
Bang-ju Yang
Shen Ge
Yuexian Zou
Xu Sun
235
37
0
05 Aug 2021
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
Interspeech (Interspeech), 2021
Chiori Hori
Takaaki Hori
Jonathan Le Roux
110
4
0
04 Aug 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
Information Fusion (Inf. Fusion), 2021
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
365
174
0
29 Jul 2021
Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video Footage
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2021
Tao Jiang
Necati Cihan Camgöz
Richard Bowden
99
14
0
21 Jul 2021
VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Srijan Das
Rui Dai
Di Yang
Francois Bremond
ViT
310
84
0
17 May 2021
Video Corpus Moment Retrieval with Contrastive Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Hao Zhang
Aixin Sun
Wei Jing
Guoshun Nan
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
269
102
0
13 May 2021
A Bi-Encoder LSTM Model For Learning Unstructured Dialogs
Diwanshu Shekhar
P. Negi
Mohammad H. Mahoor
100
2
0
25 Apr 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Computer Vision and Pattern Recognition (CVPR), 2021
Xiaohan Wang
Linchao Zhu
Yi Yang
365
210
0
20 Apr 2021
Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding
Computer Vision and Pattern Recognition (CVPR), 2021
Hao Zhou
Chongyang Zhang
Yan Luo
Yanjun Chen
Chuanping Hu
155
55
0
31 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
326
159
0
30 Mar 2021
A Comprehensive Review of the Video-to-Text Problem
Artificial Intelligence Review (AIR), 2021
Jesus Perez-Martin
B. Bustos
S. Guimarães
I. Sipiran
Jorge A. Pérez
Grethel Coello Said
261
18
0
27 Mar 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
156
58
0
27 Jan 2021
End-to-End Video Question-Answer Generation with Generator-Pretester Network
Hung-Ting Su
Chen-Hsi Chang
Po-Wei Shen
Yu-Siang Wang
Ya-Liang Chang
Yu-Cheng Chang
Pu-Jen Cheng
Winston H. Hsu
131
37
0
05 Jan 2021