1701.03126
Cited By
Attention-Based Multimodal Fusion for Video Description
11 January 2017
Chiori Hori
Takaaki Hori
Teng-Yok Lee
Kazuhiro Sumi
J. Hershey
Tim K. Marks
Papers citing
"Attention-Based Multimodal Fusion for Video Description"
31 / 31 papers shown
Title
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
27
44
0
20 Feb 2024
Modality Mixer Exploiting Complementary Information for Multi-modal Action Recognition
Sumin Lee
Sangmin Woo
Muhammad Adi Nugroho
Changick Kim
25
0
0
21 Nov 2023
SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion
Danish Nazir
Marcus Liwicki
D. Stricker
Muhammad Zeshan Afzal
VLM
MDE
13
45
0
28 Apr 2022
Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation
Damien Robert
Bruno Vallet
Loic Landrieu
3DPC
31
69
0
15 Apr 2022
Guiding Attention using Partial-Order Relationships for Image Captioning
Murad Popattia
Muhammad Rafi
Rizwan Qureshi
Shah Nawaz
19
4
0
15 Apr 2022
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
Dan Berrebbi
Jiatong Shi
Brian Yan
Osbel López-Francisco
Jonathan D. Amith
Shinji Watanabe
8
26
0
05 Apr 2022
Dynamic Multimodal Fusion
Zihui Xue
R. Marculescu
34
47
0
31 Mar 2022
Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels
Zipeng Ye
Mengfei Xia
Ran Yi
Juyong Zhang
Yu-Kun Lai
Xuanteng Huang
Guoxin Zhang
Yong-jin Liu
CVBM
22
39
0
16 Jan 2022
Attention-based Multi-hypothesis Fusion for Speech Summarization
Takatomo Kano
A. Ogawa
Marc Delcroix
Shinji Watanabe
22
13
0
16 Nov 2021
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
Darius Petermann
G. Wichern
Zhong-Qiu Wang
Jonathan Le Roux
21
37
0
19 Oct 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
Katsuyuki Nakamura
Hiroki Ohashi
Mitsuhiro Okada
EgoV
31
12
0
07 Sep 2021
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
Chiori Hori
Takaaki Hori
Jonathan Le Roux
17
4
0
04 Aug 2021
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Wubo Li
Dongwei Jiang
Wei Zou
Xiangang Li
18
6
0
21 Oct 2020
Exploiting Multi-Modal Features From Pre-trained Networks for Alzheimer's Dementia Recognition
Junghyun Koo
Jie Hwan Lee
Jaewoo Pyo
Yujin Jo
Kyogu Lee
11
58
0
09 Sep 2020
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
Shaoxiang Chen
Wenhao Jiang
Wei Liu
Yu-Gang Jiang
23
101
0
28 Jul 2020
SBAT: Video Captioning with Sparse Boundary-Aware Transformer
Tao Jin
Siyu Huang
Ming Chen
Yingming Li
Zhongfei Zhang
30
52
0
23 Jul 2020
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Shijie Geng
Peng Gao
Moitreya Chatterjee
Chiori Hori
Jonathan Le Roux
Yongfeng Zhang
Hongsheng Li
A. Cherian
19
11
0
08 Jul 2020
Multi-modal Automated Speech Scoring using Attention Fusion
Manraj Singh Grover
Yaman Kumar Singla
Sumit Sarin
Payman Vafaee
Mika Hama
R. Shah
11
11
0
17 May 2020
How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
George Sterpu
Christian Saam
N. Harte
29
28
0
17 Apr 2020
Spatio-Temporal Ranked-Attention Networks for Video Captioning
A. Cherian
Jue Wang
Chiori Hori
Tim K. Marks
AI4TS
20
19
0
17 Jan 2020
Delving Deeper into the Decoder for Video Captioning
Haoran Chen
Jianmin Li
Xiaolin Hu
26
34
0
16 Jan 2020
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
Tanzila Rahman
Bicheng Xu
Leonid Sigal
25
77
0
22 Sep 2019
Selective Sensor Fusion for Neural Visual-Inertial Odometry
Changhao Chen
Stefano Rosa
Yishu Miao
Chris Xiaoxuan Lu
Wei Yu Wu
Andrew Markham
A. Trigoni
14
132
0
04 Mar 2019
Weakly Supervised Dense Event Captioning in Videos
Xuguang Duan
Wen-bing Huang
Chuang Gan
Jingdong Wang
Wenwu Zhu
Junzhou Huang
25
148
0
10 Dec 2018
An Attempt towards Interpretable Audio-Visual Video Captioning
Yapeng Tian
Chenxiao Guan
Justin Goodman
Marc Moore
Chenliang Xu
22
20
0
07 Dec 2018
Stream attention-based multi-array end-to-end speech recognition
Xiaofei Wang
Ruizhi Li
Sri Harish Reddy Mallidi
Takaaki Hori
Shinji Watanabe
H. Hermansky
9
21
0
12 Nov 2018
PVNet: A Joint Convolutional Network of Point Cloud and Multi-View for 3D Shape Recognition
Haoxuan You
Yifan Feng
R. Ji
Yue Gao
3DPC
34
169
0
23 Aug 2018
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features
Chiori Hori
Huda AlAmri
Jue Wang
G. Wichern
Takaaki Hori
...
Raphael Gontijo-Lopes
Abhishek Das
Irfan Essa
Dhruv Batra
Devi Parikh
VGen
16
125
0
21 Jun 2018
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7
Huda AlAmri
Vincent Cartillier
Raphael Gontijo-Lopes
Abhishek Das
Jue Wang
...
Dhruv Batra
Devi Parikh
A. Cherian
Tim K. Marks
Chiori Hori
17
32
0
01 Jun 2018
ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari
Kamaljeet Singh
Thomas Brox
125
496
0
24 Apr 2018
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
Silvio Giancola
Mohieddine Amine
Tarek Dghaily
Bernard Ghanem
AI4TS
19
193
0
12 Apr 2018