Attention-Based Multimodal Fusion for Video Description

11 January 2017

Papers citing "Attention-Based Multimodal Fusion for Video Description"


Video ReCap: Recursive Captioning of Hour-Long Videos Md. Mohaiminul Islam Ngan Ho Xitong Yang Tushar Nagarajan Lorenzo Torresani Gedas Bertasius VGen VLM 27 44 0 20 Feb 2024
Modality Mixer Exploiting Complementary Information for Multi-modal Action Recognition Sumin Lee Sangmin Woo Muhammad Adi Nugroho Changick Kim 25 0 0 21 Nov 2023
SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion Danish Nazir Marcus Liwicki D. Stricker Muhammad Zeshan Afzal VLM MDE 13 45 0 28 Apr 2022
Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation Damien Robert Bruno Vallet Loic Landrieu 3DPC 31 69 0 15 Apr 2022
Guiding Attention using Partial-Order Relationships for Image Captioning Murad Popattia Muhammad Rafi Rizwan Qureshi Shah Nawaz 19 4 0 15 Apr 2022
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation Dan Berrebbi Jiatong Shi Brian Yan Osbel López-Francisco Jonathan D. Amith Shinji Watanabe 8 26 0 05 Apr 2022
Dynamic Multimodal Fusion Zihui Xue R. Marculescu 34 47 0 31 Mar 2022
Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels Zipeng Ye Mengfei Xia Ran Yi Juyong Zhang Yu-Kun Lai Xuanteng Huang Guoxin Zhang Yong-jin Liu CVBM 22 39 0 16 Jan 2022
Attention-based Multi-hypothesis Fusion for Speech Summarization Takatomo Kano A. Ogawa Marc Delcroix Shinji Watanabe 22 13 0 16 Nov 2021
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks Darius Petermann G. Wichern Zhong-Qiu Wang Jonathan Le Roux 21 37 0 19 Oct 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention Katsuyuki Nakamura Hiroki Ohashi Mitsuhiro Okada EgoV 31 12 0 07 Sep 2021
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers Chiori Hori Takaaki Hori Jonathan Le Roux 17 4 0 04 Aug 2021
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog Wubo Li Dongwei Jiang Wei Zou Xiangang Li 18 6 0 21 Oct 2020
Exploiting Multi-Modal Features From Pre-trained Networks for Alzheimer's Dementia Recognition Junghyun Koo Jie Hwan Lee Jaewoo Pyo Yujin Jo Kyogu Lee 11 58 0 09 Sep 2020
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos Shaoxiang Chen Wenhao Jiang Wei Liu Yu-Gang Jiang 23 101 0 28 Jul 2020
SBAT: Video Captioning with Sparse Boundary-Aware Transformer Tao Jin Siyu Huang Ming Chen Yingming Li Zhongfei Zhang 30 52 0 23 Jul 2020
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers Shijie Geng Peng Gao Moitreya Chatterjee Chiori Hori Jonathan Le Roux Yongfeng Zhang Hongsheng Li A. Cherian 19 11 0 08 Jul 2020
Multi-modal Automated Speech Scoring using Attention Fusion Manraj Singh Grover Yaman Kumar Singla Sumit Sarin Payman Vafaee Mika Hama R. Shah 11 11 0 17 May 2020
How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition George Sterpu Christian Saam N. Harte 29 28 0 17 Apr 2020
Spatio-Temporal Ranked-Attention Networks for Video Captioning A. Cherian Jue Wang Chiori Hori Tim K. Marks AI4TS 20 19 0 17 Jan 2020
Delving Deeper into the Decoder for Video Captioning Haoran Chen Jianmin Li Xiaolin Hu 26 34 0 16 Jan 2020
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning Tanzila Rahman Bicheng Xu Leonid Sigal 25 77 0 22 Sep 2019
Selective Sensor Fusion for Neural Visual-Inertial Odometry Changhao Chen Stefano Rosa Yishu Miao Chris Xiaoxuan Lu Wei Yu Wu Andrew Markham A. Trigoni 14 132 0 04 Mar 2019
Weakly Supervised Dense Event Captioning in Videos Xuguang Duan Wen-bing Huang Chuang Gan Jingdong Wang Wenwu Zhu Junzhou Huang 25 148 0 10 Dec 2018
An Attempt towards Interpretable Audio-Visual Video Captioning Yapeng Tian Chenxiao Guan Justin Goodman Marc Moore Chenliang Xu 22 20 0 07 Dec 2018
Stream attention-based multi-array end-to-end speech recognition Xiaofei Wang Ruizhi Li Sri Harish Reddy Mallidi Takaaki Hori Shinji Watanabe H. Hermansky 9 21 0 12 Nov 2018
PVNet: A Joint Convolutional Network of Point Cloud and Multi-View for 3D Shape Recognition Haoxuan You Yifan Feng R. Ji Yue Gao 3DPC 34 169 0 23 Aug 2018
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features Chiori Hori Huda AlAmri Jue Wang G. Wichern Takaaki Hori ... Raphael Gontijo-Lopes Abhishek Das Irfan Essa Dhruv Batra Devi Parikh VGen 16 125 0 21 Jun 2018
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7 Huda AlAmri Vincent Cartillier Raphael Gontijo-Lopes Abhishek Das Jue Wang ... Dhruv Batra Devi Parikh A. Cherian Tim K. Marks Chiori Hori 17 32 0 01 Jun 2018
ECO: Efficient Convolutional Network for Online Video Understanding Mohammadreza Zolfaghari Kamaljeet Singh Thomas Brox 125 496 0 24 Apr 2018
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos Silvio Giancola Mohieddine Amine Tarek Dghaily Bernard Ghanem AI4TS 19 193 0 12 Apr 2018