ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1505.01861
  4. Cited By
Jointly Modeling Embedding and Translation to Bridge Video and Language
v1v2v3 (latest)

Jointly Modeling Embedding and Translation to Bridge Video and Language

7 May 2015
Yingwei Pan
Tao Mei
Ting Yao
Houqiang Li
Y. Rui
ArXiv (abs)PDFHTML

Papers citing "Jointly Modeling Embedding and Translation to Bridge Video and Language"

50 / 199 papers shown
Title
Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
Junan Chen
Trung Thanh Nguyen
Takahiro Komamizu
Ichiro Ide
52
0
0
11 Oct 2025
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui
Cong Ma
Zongming Ma
SSL
305
2
0
18 May 2025
MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field
MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance FieldIEEE Transactions on Visualization and Computer Graphics (TVCG), 2023
Zijian Győző Yang
Zhongwei Qiu
Chang Xu
Dongmei Fu
361
3
0
28 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image CaptioningEuropean Conference on Computer Vision (ECCV), 2024
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffMVLM
269
2
0
03 Jan 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Hierarchical Banzhaf Interaction for General Video-Language Representation LearningIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
387
4
0
31 Dec 2024
Resolving Word Vagueness with Scenario-guided Adapter for Natural
  Language Inference
Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference
Yuqi Liu
Mengyu Li
Di Liang
Ximing Li
Fausto Giunchiglia
Lan Huang
Xiaoyue Feng
Renchu Guan
186
10
0
21 May 2024
Do You Remember? Dense Video Captioning with Cross-Modal Memory
  Retrieval
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Minkuk Kim
Hyeon Bae Kim
Jinyoung Moon
Jinwoo Choi
Seong Tae Kim
159
39
0
11 Apr 2024
Cross-Modal Reasoning with Event Correlation for Video Question
  Answering
Cross-Modal Reasoning with Event Correlation for Video Question Answering
Chengxiang Yin
Zhengping Che
Kun Wu
Zhiyuan Xu
Qinru Qiu
Jian Tang
170
0
0
20 Dec 2023
Multi Sentence Description of Complex Manipulation Action Videos
Multi Sentence Description of Complex Manipulation Action VideosMachine Vision and Applications (MVA), 2023
Fatemeh Ziaeetabar
Reza Safabakhsh
S. Momtazi
M. Tamosiunaite
Florentin Wörgötter
208
7
0
13 Nov 2023
A Survey on Image-text Multimodal Models
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
308
21
0
23 Sep 2023
Zero-shot Composed Text-Image Retrieval
Zero-shot Composed Text-Image RetrievalBritish Machine Vision Conference (BMVC), 2023
Yikun Liu
Jiangchao Yao
Ya Zhang
Yanfeng Wang
Weidi Xie
181
31
0
12 Jun 2023
SEM-POS: Grammatically and Semantically Correct Video Captioning
SEM-POS: Grammatically and Semantically Correct Video Captioning
Asmar Nadeem
A. Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
185
10
0
26 Mar 2023
ADAPT: Action-aware Driving Caption Transformer
ADAPT: Action-aware Driving Caption TransformerIEEE International Conference on Robotics and Automation (ICRA), 2023
Bu Jin
Xinyi Liu
Yupeng Zheng
Pengfei Li
Hao Zhao
Tong Zhang
Yuhang Zheng
Guyue Zhou
Jingjing Liu
375
93
0
01 Feb 2023
Aligning Source Visual and Target Language Domains for Unpaired Video
  Captioning
Aligning Source Visual and Target Language Domains for Unpaired Video CaptioningIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Fenglin Liu
Xian Wu
Chenyu You
Shen Ge
Yuexian Zou
Xu Sun
238
28
0
22 Nov 2022
Prophet Attention: Predicting Attention with Future Attention for Image
  Captioning
Prophet Attention: Predicting Attention with Future Attention for Image CaptioningNeural Information Processing Systems (NeurIPS), 2022
Fenglin Liu
Xuancheng Ren
Xian Wu
Wei Fan
Yuexian Zou
Xu Sun
224
50
0
19 Oct 2022
TLDW: Extreme Multimodal Summarisation of News Videos
TLDW: Extreme Multimodal Summarisation of News Videos
Peggy Tang
Kun Hu
Lei Zhang
Jiebo Luo
Zhiyong Wang
185
11
0
16 Oct 2022
Cross Modal Compression: Towards Human-comprehensible Semantic
  Compression
Cross Modal Compression: Towards Human-comprehensible Semantic CompressionACM Multimedia (MM), 2021
Jiguo Li
Chuanmin Jia
Xinfeng Zhang
Siwei Ma
Wen Gao
136
27
0
06 Sep 2022
Video Captioning: a comparative review of where we are and which could
  be the route
Video Captioning: a comparative review of where we are and which could be the routeComputer Vision and Image Understanding (CVIU), 2022
Daniela Moctezuma
Tania A. Ramirez-delreal
Guillermo Ruiz
Othón González-Chávez
193
14
0
12 Apr 2022
Temporal Alignment Networks for Long-term Video
Temporal Alignment Networks for Long-term VideoComputer Vision and Pattern Recognition (CVPR), 2022
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
155
103
0
06 Apr 2022
Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement
  Learning Method
Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning MethodIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
W. Ramos
M. Silva
Edson R. Araujo
Victor Moura
Keller Clayderman Martins de Oliveira
Leandro Soriano Marcolino
Erickson R. Nascimento
VGen
171
4
0
29 Mar 2022
Look for the Change: Learning Object States and State-Modifying Actions
  from Untrimmed Web Videos
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web VideosComputer Vision and Pattern Recognition (CVPR), 2022
Tomávs Souvcek
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
214
43
0
22 Mar 2022
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
MORE: Multi-Order RElation Mining for Dense Captioning in 3D ScenesEuropean Conference on Computer Vision (ECCV), 2022
Yang Jiao
Shaoxiang Chen
Zequn Jie
Wenke Huang
Lin Ma
Yu-Gang Jiang
3DPC
223
58
0
10 Mar 2022
Exploiting long-term temporal dynamics for video captioning
Exploiting long-term temporal dynamics for video captioningWorld wide web (Bussum) (WWW), 2018
Yuyu Guo
Jingqiu Zhang
Lianli Gao
126
18
0
22 Feb 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal Sentence Grounding in Videos: A Survey and Future DirectionsIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
362
49
0
20 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
  Vision-Language Pre-training
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLMObjD
209
24
0
11 Jan 2022
Synchronized Audio-Visual Frames with Fractional Positional Encoding for
  Transformers in Video-to-Text Translation
Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text TranslationInternational Conference on Information Photonics (ICIP), 2021
Philipp Harzig
Moritz Einfalt
Rainer Lienhart
ViT
153
3
0
28 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive
  Cross-modal Matching and Denoising
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
145
45
0
14 Dec 2021
Controllable Video Captioning with an Exemplar Sentence
Controllable Video Captioning with an Exemplar Sentence
Yitian Yuan
Lin Ma
Jingwen Wang
Wenwu Zhu
173
21
0
02 Dec 2021
Syntax Customized Video Captioning by Imitating Exemplar Sentences
Syntax Customized Video Captioning by Imitating Exemplar Sentences
Yitian Yuan
Lin Ma
Wenwu Zhu
152
8
0
02 Dec 2021
Hierarchical Modular Network for Video Captioning
Hierarchical Modular Network for Video Captioning
Hanhua Ye
Guorong Li
Yuankai Qi
Shuhui Wang
Qingming Huang
Ming-Hsuan Yang
218
88
0
24 Nov 2021
Co-segmentation Inspired Attention Module for Video-based Computer
  Vision Tasks
Co-segmentation Inspired Attention Module for Video-based Computer Vision TasksComputer Vision and Image Understanding (CVIU), 2021
Arulkumar Subramaniam
Jayesh Vaidya
Muhammed Ameen
Athira M. Nambiar
Anurag Mittal
324
7
0
14 Nov 2021
CLIP4Caption: CLIP for Video Caption
CLIP4Caption: CLIP for Video Caption
Mingkang Tang
Zhanyu Wang
Zhenhua Liu
Fengyun Rao
Dian Li
Xiu Li
CLIPVLM
241
173
0
13 Oct 2021
A Survey on Temporal Sentence Grounding in Videos
A Survey on Temporal Sentence Grounding in Videos
Xiaohan Lan
Yitian Yuan
Xin Eric Wang
Zhi Wang
Wenwu Zhu
307
57
0
16 Sep 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal
  Attention
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal AttentionACM Multimedia (ACM MM), 2021
Katsuyuki Nakamura
Hiroki Ohashi
Mitsuhiro Okada
EgoV
204
14
0
07 Sep 2021
Maximum Likelihood Estimation for Multimodal Learning with Missing
  Modality
Maximum Likelihood Estimation for Multimodal Learning with Missing Modality
Fei Ma
Xiangxiang Xu
Shao-Lun Huang
Lin Zhang
167
16
0
24 Aug 2021
X-modaler: A Versatile and High-performance Codebase for Cross-modal
  Analytics
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Yehao Li
Yingwei Pan
Jingwen Chen
Ting Yao
Tao Mei
VLM
178
36
0
18 Aug 2021
End-to-End Dense Video Captioning with Parallel Decoding
End-to-End Dense Video Captioning with Parallel Decoding
Teng Wang
Ruimao Zhang
Zhichao Lu
Feng Zheng
Ran Cheng
Ping Luo
3DV
235
223
0
17 Aug 2021
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable
  Video Captioning
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video CaptioningFindings (Findings), 2021
Fenglin Liu
Xuancheng Ren
Xian Wu
Bang-ju Yang
Shen Ge
Yuexian Zou
Xu Sun
235
37
0
05 Aug 2021
Optimizing Latency for Online Video CaptioningUsing Audio-Visual
  Transformers
Optimizing Latency for Online Video CaptioningUsing Audio-Visual TransformersInterspeech (Interspeech), 2021
Chiori Hori
Takaaki Hori
Jonathan Le Roux
110
4
0
04 Aug 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent
  Advances and Future Directions
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future DirectionsInformation Fusion (Inf. Fusion), 2021
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
365
174
0
29 Jul 2021
Looking for the Signs: Identifying Isolated Sign Instances in Continuous
  Video Footage
Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video FootageIEEE International Conference on Automatic Face & Gesture Recognition (FG), 2021
Tao Jiang
Necati Cihan Camgöz
Richard Bowden
99
14
0
21 Jul 2021
VPN++: Rethinking Video-Pose embeddings for understanding Activities of
  Daily Living
VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily LivingIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Srijan Das
Rui Dai
Di Yang
Francois Bremond
ViT
310
84
0
17 May 2021
Video Corpus Moment Retrieval with Contrastive Learning
Video Corpus Moment Retrieval with Contrastive LearningAnnual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Hao Zhang
Aixin Sun
Wei Jing
Guoshun Nan
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
269
102
0
13 May 2021
A Bi-Encoder LSTM Model For Learning Unstructured Dialogs
A Bi-Encoder LSTM Model For Learning Unstructured Dialogs
Diwanshu Shekhar
P. Negi
Mohammad H. Mahoor
100
2
0
25 Apr 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
T2VLAD: Global-Local Sequence Alignment for Text-Video RetrievalComputer Vision and Pattern Recognition (CVPR), 2021
Xiaohan Wang
Linchao Zhu
Yi Yang
365
210
0
20 Apr 2021
Embracing Uncertainty: Decoupling and De-bias for Robust Temporal
  Grounding
Embracing Uncertainty: Decoupling and De-bias for Robust Temporal GroundingComputer Vision and Pattern Recognition (CVPR), 2021
Hao Zhou
Chongyang Zhang
Yan Luo
Yanjun Chen
Chuanping Hu
155
55
0
31 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
  Transformers
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with TransformersComputer Vision and Pattern Recognition (CVPR), 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
326
159
0
30 Mar 2021
A Comprehensive Review of the Video-to-Text Problem
A Comprehensive Review of the Video-to-Text ProblemArtificial Intelligence Review (AIR), 2021
Jesus Perez-Martin
B. Bustos
S. Guimarães
I. Sipiran
Jorge A. Pérez
Grethel Coello Said
261
18
0
27 Mar 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled
  Encoder-Decoder Network
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder NetworkAAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
156
58
0
27 Jan 2021
End-to-End Video Question-Answer Generation with Generator-Pretester
  Network
End-to-End Video Question-Answer Generation with Generator-Pretester Network
Hung-Ting Su
Chen-Hsi Chang
Po-Wei Shen
Yu-Siang Wang
Ya-Liang Chang
Yu-Cheng Chang
Pu-Jen Cheng
Winston H. Hsu
131
37
0
05 Jan 2021
1234
Next