1804.05448
Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning
15 April 2018
Xin Eric Wang
Yuan-fang Wang
William Yang Wang
Papers citing "Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning"
41 / 41 papers shown
Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Yukun Zuo
Hantao Yao
Liansheng Zhuang
Changsheng Xu
240
5
0
11 Jan 2024
Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)
Kabita Parajuli
S. R. Joshi
211
0
0
12 Dec 2023
Student Classroom Behavior Detection based on Spatio-Temporal Network and Multi-Model Fusion
Fan Yang
Xiaofei Wang
263
2
0
25 Oct 2023
SCB-Dataset3: A Benchmark for Detecting Student Classroom Behavior
Fan Yang
Tao Wang
98
31
0
04 Oct 2023
Collaborative Three-Stream Transformers for Video Captioning
Computer Vision and Image Understanding (CVIU), 2023
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
127
8
0
18 Sep 2023
Audio-Visual Class-Incremental Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Weiguo Pian
Shentong Mo
Yunhui Guo
Yapeng Tian
CLL
VLM
154
33
0
21 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
271
54
0
30 Jul 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Computer Vision and Image Understanding (CVIU), 2023
Shih-Han Chou
James J. Little
Leonid Sigal
150
3
0
14 Mar 2023
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
International Journal of Computer Vision (IJCV), 2022
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
227
2
0
05 Dec 2022
Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation
International Conference on Learning Representations (ICLR), 2022
Ye Zhu
Yuehua Wu
Kyle Olszewski
Jian Ren
Sergey Tulyakov
Yan Yan
DiffM
342
56
0
15 Jun 2022
Quantized GAN for Complex Music Generation from Dance Videos
European Conference on Computer Vision (ECCV), 2022
Ye Zhu
Kyle Olszewski
Yuehua Wu
Panos Achlioptas
Menglei Chai
Yan Yan
Sergey Tulyakov
MGen
204
55
0
01 Apr 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
232
184
0
20 Jan 2022
Space-Time Memory Network for Sounding Object Localization in Videos
British Machine Vision Conference (BMVC), 2021
Sizhe Li
Yapeng Tian
Chenliang Xu
111
12
0
10 Nov 2021
Contrastive Learning of Visual-Semantic Embeddings
Anurag Jain
Yashaswi Verma
SSL
134
1
0
17 Oct 2021
Feature-Supervised Action Modality Transfer
International Conference on Pattern Recognition (ICPR), 2021
Fida Mohammad Thoker
Cees G. M. Snoek
77
2
0
06 Aug 2021
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
Computer Vision and Pattern Recognition (CVPR), 2021
Yapeng Tian
Di Hu
Chenliang Xu
ObjD
171
92
0
05 Apr 2021
A Comprehensive Review of the Video-to-Text Problem
Artificial Intelligence Review (AIR), 2021
Jesus Perez-Martin
B. Bustos
S. Guimarães
I. Sipiran
Jorge A. Pérez
Grethel Coello Said
221
18
0
27 Mar 2021
Repetitive Activity Counting by Sight and Sound
Computer Vision and Pattern Recognition (CVPR), 2021
Yunhua Zhang
Ling Shao
Cees G. M. Snoek
45
53
0
24 Mar 2021
The MSR-Video to Text Dataset with Clean Annotations
Computer Vision and Image Understanding (CVIU), 2021
Haoran Chen
Jianmin Li
Simone Frintrop
Xiaolin Hu
202
18
0
12 Feb 2021
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
Yapeng Tian
Dingzeyu Li
Chenliang Xu
228
207
0
21 Jul 2020
Adversarial Robustness of Deep Sensor Fusion Models
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Shaojie Wang
Tong Wu
Ayan Chakrabarti
Yevgeniy Vorobeychik
AAML
170
16
0
23 Jun 2020
Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020
Tosho Hirasawa
Zhishen Yang
Mamoru Komachi
Naoaki Okazaki
VGen
63
11
0
23 Jun 2020
Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020
Ke Lin
Zhuoxin Gan
Liwei Wang
111
8
0
05 Jun 2020
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
Vladimir E. Iashin
Esa Rahtu
219
148
0
17 May 2020
Multi-modal Dense Video Captioning
Vladimir E. Iashin
Esa Rahtu
273
198
0
17 Mar 2020
Video Caption Dataset for Describing Human Actions in Japanese
International Conference on Language Resources and Evaluation (LREC), 2020
Yutaro Shigeto
Yuya Yoshikawa
Jiaqing Lin
A. Takeuchi
84
3
0
10 Mar 2020
Spatio-Temporal Ranked-Attention Networks for Video Captioning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
A. Cherian
Jue Wang
Chiori Hori
Tim K. Marks
AI4TS
117
22
0
17 Jan 2020
Delving Deeper into the Decoder for Video Captioning
European Conference on Artificial Intelligence (ECAI), 2020
Haoran Chen
Jianmin Li
Xiaolin Hu
155
38
0
16 Jan 2020
Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Tao Jin
Siyu Huang
Yingming Li
Zhongfei Zhang
148
21
0
01 Nov 2019
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
IEEE International Conference on Computer Vision (ICCV), 2019
Tanzila Rahman
Bicheng Xu
Leonid Sigal
133
86
0
22 Sep 2019
A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling
Frontiers in Robotics and AI (Front. Robot. AI), 2019
Haoran Chen
Ke Lin
A. Maye
Jianmin Li
Xiaolin Hu
153
49
0
31 Aug 2019
Watch It Twice: Video Captioning with a Refocused Video Encoder
ACM Multimedia (ACM MM), 2019
Xiangxi Shi
Jianfei Cai
Shafiq Joty
Jiuxiang Gu
134
28
0
21 Jul 2019
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2019
Junchao Zhang
Yuxin Peng
160
187
0
11 Jun 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
454
632
0
06 Apr 2019
Attending Category Disentangled Global Context for Image Classification
Keke Tang
Guodong Wei
Runnan Chen
Jie Zhu
Zhaoquan Gu
Wenping Wang
217
0
0
17 Dec 2018
An Attempt towards Interpretable Audio-Visual Video Captioning
Yapeng Tian
Chenxiao Guan
Justin Goodman
Marc Moore
Chenliang Xu
164
20
0
07 Dec 2018
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Computer Vision and Pattern Recognition (CVPR), 2018
Xin Eric Wang
Qiuyuan Huang
Asli Celikyilmaz
Jianfeng Gao
Dinghan Shen
Yuan-fang Wang
William Yang Wang
Lei Zhang
LM&Ro
SSL
339
590
0
25 Nov 2018
Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
AAAI Conference on Artificial Intelligence (AAAI), 2018
Yoonchang Sung
Jiawei Wu
Da Zhang
Yu-Chuan Su
Erfaun Noorani
204
39
0
07 Nov 2018
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
Xin Eric Wang
Wenhu Chen
Yuan-fang Wang
William Yang Wang
174
164
0
24 Apr 2018
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
Xin Eric Wang
Wenhan Xiong
Hongmin Wang
William Yang Wang
208
210
0
21 Mar 2018
Video Captioning via Hierarchical Reinforcement Learning
Xin Eric Wang
Wenhu Chen
Jiawei Wu
Yuan-fang Wang
William Yang Wang
190
248
0
29 Nov 2017