Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

15 April 2018
Xin Wang
Yuan-fang Wang
William Yang Wang

Papers citing "Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning"

41 / 41 papers shown
Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Yukun Zuo
Hantao Yao
Liansheng Zhuang
Changsheng Xu
324
5
0
11 Jan 2024
Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)
Kabita Parajuli
S. R. Joshi
262
0
0
12 Dec 2023
Student Classroom Behavior Detection based on Spatio-Temporal Network and Multi-Model Fusion
Fan Yang
Xiaofei Wang
291
2
0
25 Oct 2023
SCB-Dataset3: A Benchmark for Detecting Student Classroom Behavior
Fan Yang
Tao Wang
122
31
0
04 Oct 2023
Collaborative Three-Stream Transformers for Video Captioning
Computer Vision and Image Understanding (CVIU), 2023
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
193
8
0
18 Sep 2023
Audio-Visual Class-Incremental Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Weiguo Pian
Shentong Mo
Yunhui Guo
Yapeng Tian
CLL, VLM
219
33
0
21 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe, MLLM
308
54
0
30 Jul 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Computer Vision and Image Understanding (CVIU), 2023
Shih-Han Chou
James J. Little
Leonid Sigal
171
3
0
14 Mar 2023
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
International Journal of Computer Vision (IJCV), 2022
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
303
2
0
05 Dec 2022
Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation
International Conference on Learning Representations (ICLR), 2022
Ye Zhu
Yuehua Wu
Kyle Olszewski
Jian Ren
Sergey Tulyakov
Yan Yan
DiffM
378
56
0
15 Jun 2022
Quantized GAN for Complex Music Generation from Dance Videos
European Conference on Computer Vision (ECCV), 2022
Ye Zhu
Kyle Olszewski
Yuehua Wu
Panos Achlioptas
Menglei Chai
Yan Yan
Sergey Tulyakov
MGen
219
56
0
01 Apr 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
281
184
0
20 Jan 2022
Space-Time Memory Network for Sounding Object Localization in Videos
British Machine Vision Conference (BMVC), 2021
Sizhe Li
Yapeng Tian
Chenliang Xu
123
12
0
10 Nov 2021
Contrastive Learning of Visual-Semantic Embeddings
Anurag Jain
Yashaswi Verma
SSL
143
1
0
17 Oct 2021
Feature-Supervised Action Modality Transfer
International Conference on Pattern Recognition (ICPR), 2021
Fida Mohammad Thoker
Cees G. M. Snoek
101
2
0
06 Aug 2021
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
Computer Vision and Pattern Recognition (CVPR), 2021
Yapeng Tian
Di Hu
Chenliang Xu
ObjD
189
92
0
05 Apr 2021
A Comprehensive Review of the Video-to-Text Problem
Artificial Intelligence Review (AIR), 2021
Jesus Perez-Martin
B. Bustos
S. Guimarães
I. Sipiran
Jorge A. Pérez
Grethel Coello Said
264
18
0
27 Mar 2021
Repetitive Activity Counting by Sight and Sound
Computer Vision and Pattern Recognition (CVPR), 2021
Yunhua Zhang
Ling Shao
Cees G. M. Snoek
84
59
0
24 Mar 2021
The MSR-Video to Text Dataset with Clean Annotations
Computer Vision and Image Understanding (CVIU), 2021
Haoran Chen
Jianmin Li
Simone Frintrop
Xiaolin Hu
235
18
0
12 Feb 2021
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
Yapeng Tian
Dingzeyu Li
Chenliang Xu
261
209
0
21 Jul 2020
Adversarial Robustness of Deep Sensor Fusion Models
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Shaojie Wang
Tong Wu
Ayan Chakrabarti
Yevgeniy Vorobeychik
AAML
201
16
0
23 Jun 2020
Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020
Tosho Hirasawa
Zhishen Yang
Mamoru Komachi
Naoaki Okazaki
VGen
63
11
0
23 Jun 2020
Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020
Ke Lin
Zhuoxin Gan
Liwei Wang
115
8
0
05 Jun 2020
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
Vladimir E. Iashin
Esa Rahtu
220
128
0
17 May 2020
Multi-modal Dense Video Captioning
Vladimir E. Iashin
Esa Rahtu
325
199
0
17 Mar 2020
Video Caption Dataset for Describing Human Actions in Japanese
International Conference on Language Resources and Evaluation (LREC), 2020
Yutaro Shigeto
Yuya Yoshikawa
Jiaqing Lin
A. Takeuchi
92
3
0
10 Mar 2020
Spatio-Temporal Ranked-Attention Networks for Video Captioning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
A. Cherian
Jue Wang
Chiori Hori
Tim K. Marks
AI4TS
117
22
0
17 Jan 2020
Delving Deeper into the Decoder for Video Captioning
European Conference on Artificial Intelligence (ECAI), 2020
Haoran Chen
Jianmin Li
Xiaolin Hu
188
38
0
16 Jan 2020
Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Tao Jin
Siyu Huang
Yingming Li
Zhongfei Zhang
204
22
0
01 Nov 2019
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
IEEE International Conference on Computer Vision (ICCV), 2019
Tanzila Rahman
Bicheng Xu
Leonid Sigal
193
86
0
22 Sep 2019
A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling
Frontiers in Robotics and AI (Front. Robot. AI), 2019
Haoran Chen
Ke Lin
A. Maye
Jianmin Li
Xiaolin Hu
155
49
0
31 Aug 2019
Watch It Twice: Video Captioning with a Refocused Video Encoder
ACM Multimedia (ACM MM), 2019
Xiangxi Shi
Jianfei Cai
Shafiq Joty
Jiuxiang Gu
146
29
0
21 Jul 2019
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2019
Junchao Zhang
Yuxin Peng
176
188
0
11 Jun 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
501
639
0
06 Apr 2019
Attending Category Disentangled Global Context for Image Classification
Keke Tang
Guodong Wei
Runnan Chen
Jie Zhu
Zhaoquan Gu
Wenping Wang
235
0
0
17 Dec 2018
An Attempt towards Interpretable Audio-Visual Video Captioning
Yapeng Tian
Chenxiao Guan
Justin Goodman
Marc Moore
Chenliang Xu
168
21
0
07 Dec 2018
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Computer Vision and Pattern Recognition (CVPR), 2018
Xin Eric Wang
Qiuyuan Huang
Asli Celikyilmaz
Jianfeng Gao
Dinghan Shen
Yuan-fang Wang
William Yang Wang
Lei Zhang
LM&Ro, SSL
402
598
0
25 Nov 2018
Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
AAAI Conference on Artificial Intelligence (AAAI), 2018
Yoonchang Sung
Jiawei Wu
Da Zhang
Yu-Chuan Su
Erfaun Noorani
224
39
0
07 Nov 2018
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
Xin Eric Wang
Wenhu Chen
Yuan-fang Wang
William Yang Wang
208
164
0
24 Apr 2018
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
Xin Eric Wang
Wenhan Xiong
Hongmin Wang
William Yang Wang
272
212
0
21 Mar 2018
Video Captioning via Hierarchical Reinforcement Learning
Xin Eric Wang
Wenhu Chen
Jiawei Wu
Yuan-fang Wang
William Yang Wang
205
249
0
29 Nov 2017