1811.00347
How2: A Large-scale Dataset for Multimodal Language Understanding
1 November 2018
Ramon Sanabria
Ozan Caglayan
Shruti Palaskar
Desmond Elliott
Loïc Barrault
Lucia Specia
Florian Metze
VGen
MLLM
Papers citing
"How2: A Large-scale Dataset for Multimodal Language Understanding"
50 / 53 papers shown
Title
TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries
Jinze Lv
Jian Chen
Zi Long
Xianghua Fu
Yin Chen
VGen
42
0
0
09 May 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
81
2
0
10 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
43
2
0
01 Jan 2025
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
Weijia Wu
Mingyu Liu
Zeyu Zhu
Xi Xia
Haoen Feng
Wen Wang
Kevin Qinghong Lin
Chunhua Shen
Mike Zheng Shou
DiffM
VGen
122
1
0
22 Nov 2024
The State and Fate of Summarization Datasets: A Survey
Noam Dahan
Gabriel Stanovsky
HILM
52
0
0
07 Nov 2024
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Qiuheng Wang
Yukai Shi
Jiarong Ou
R. J. Chen
Ke Lin
...
Mingwu Zheng
Xin Tao
Fei Yang
Pengfei Wan
Di Zhang
VGen
88
19
0
10 Oct 2024
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
Guangzhi Sun
Wenyi Yu
Changli Tang
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
30
12
0
09 Oct 2023
Deep Variational Multivariate Information Bottleneck -- A Framework for Variational Losses
Eslam Abdelaleem
I. Nemenman
K. M. Martini
30
5
0
05 Oct 2023
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization
Kohei Matsuura
Takanori Ashihara
Takafumi Moriya
Tomohiro Tanaka
Takatomo Kano
A. Ogawa
Marc Delcroix
29
9
0
07 Jun 2023
Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video
Zenan Xu
Xiaojun Meng
Yasheng Wang
Qinliang Su
Zexuan Qiu
Xin Jiang
Qun Liu
33
3
0
08 May 2023
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
29
15
0
29 Mar 2023
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Bo He
Jun Wang
Jielin Qiu
Trung Bui
Abhinav Shrivastava
Zhaowen Wang
22
65
0
13 Mar 2023
Multimodal Speech Recognition for Language-Guided Embodied Agents
Allen Chang
Xiaoyuan Zhu
Aarav Monga
Seoho Ahn
Tejas Srinivasan
Jesse Thomason
AuLLM
24
3
0
27 Feb 2023
Factual Consistency Oriented Speech Recognition
Naoyuki Kanda
Takuya Yoshioka
Yang Liu
43
0
0
24 Feb 2023
CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization
Litian Zhang
Xiaoming Zhang
Ziming Guo
Zhipeng Liu
27
7
0
20 Feb 2023
A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction
Chongshan Lu
Fukun Yin
Xin Chen
Tao Chen
Gang Yu
Jiayuan Fan
25
31
0
17 Jan 2023
Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization
Yunlong Liang
Fandong Meng
Jinan Xu
Jiaan Wang
Yufeng Chen
Jie Zhou
33
19
0
15 Dec 2022
Hierarchical3D Adapters for Long Video-to-text Summarization
Pinelopi Papalampidi
Mirella Lapata
VGen
29
12
0
10 Oct 2022
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko
Yung-Sung Chuang
Nina Shvetsova
Samuel Thomas
Rogerio Feris
Brian Kingsbury
Leonid Karlinsky
David Harwath
Hilde Kuehne
James R. Glass
VLM
34
4
0
07 Oct 2022
Topic Detection in Continuous Sign Language Videos
Álvaro Budria
Laia Tarrés
Gerard I. Gállego
Francesc Moreno-Noguer
Jordi Torres
Xavier Giró-i-Nieto
SLR
VLM
47
1
0
01 Sep 2022
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Dan Oneaţă
H. Cucu
19
19
0
27 Apr 2022
Summarizing a virtual robot's past actions in natural language
Chad DeChant
Daniel Bauer
LM&Ro
31
4
0
13 Mar 2022
VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation
Yihang Li
Shuichiro Shimizu
Weiqi Gu
Chenhui Chu
Sadao Kurohashi
24
13
0
20 Jan 2022
InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition
A. Glavan
Estefanía Talavera
21
10
0
23 Dec 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
31
189
0
19 Nov 2021
Attention-based Multi-hypothesis Fusion for Speech Summarization
Takatomo Kano
A. Ogawa
Marc Delcroix
Shinji Watanabe
22
13
0
16 Nov 2021
Speech Summarization using Restricted Self-Attention
Roshan S. Sharma
Shruti Palaskar
A. Black
Florian Metze
30
33
0
12 Oct 2021
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Tiezheng Yu
Wenliang Dai
Zihan Liu
Pascale Fung
32
73
0
06 Sep 2021
N24News: A New Dataset for Multimodal News Classification
Zhen Wang
Xu Shan
Xiangxie Zhang
Jie Yang
VLM
23
33
0
30 Aug 2021
ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization
Zijian Zhang
Chang Shu
Youxin Chen
Jing Xiao
Qian Zhang
Lu Zheng
23
5
0
11 Aug 2021
Between Flexibility and Consistency: Joint Generation of Captions and Subtitles
Alina Karakanta
Marco Gaido
Matteo Negri
Marco Turchi
30
9
0
13 Jul 2021
Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition
Christian Huber
Juan Hussain
Sebastian Stüker
A. Waibel
21
24
0
05 Jul 2021
Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation
Marco Gaido
Matteo Negri
Mauro Cettolo
Marco Turchi
VLM
58
25
0
23 Apr 2021
Weakly-supervised Multi-task Learning for Multimodal Affect Recognition
Wenliang Dai
Samuel Cahyawijaya
Yejin Bang
Pascale Fung
CVBM
33
11
0
23 Apr 2021
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Mingyang Zhou
Luowei Zhou
Shuohang Wang
Yu Cheng
Linjie Li
Zhou Yu
Jingjing Liu
MLLM
VLM
31
89
0
01 Apr 2021
An Empirical Study of End-to-end Simultaneous Speech Translation Decoding Strategies
H. Nguyen
Yannick Esteve
Laurent Besacier
24
19
0
04 Mar 2021
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
21
81
0
10 Nov 2020
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Wubo Li
Dongwei Jiang
Wei Zou
Xiangang Li
23
6
0
21 Oct 2020
Fine-Grained Grounding for Multimodal Speech Recognition
Tejas Srinivasan
Ramon Sanabria
Florian Metze
Desmond Elliott
23
11
0
05 Oct 2020
On Target Segmentation for Direct Speech Translation
Mattia Antonino Di Gangi
Marco Gaido
Matteo Negri
Marco Turchi
37
14
0
10 Sep 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
...
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
SSL
22
141
0
16 Jun 2020
End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
Marco Gaido
Mattia Antonino Di Gangi
Matteo Negri
Marco Turchi
19
53
0
04 Jun 2020
A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks
Angela S. Lin
Sudha Rao
Asli Celikyilmaz
E. Nouri
Chris Brockett
Debadeepta Dey
Bill Dolan
23
24
0
19 May 2020
Multiresolution and Multimodal Speech Recognition with Transformers
Georgios Paraskevopoulos
Srinivas Parthasarathy
Aparna Khare
Shiva Sundaram
25
29
0
29 Apr 2020
Speech Translation and the End-to-End Promise: Taking Stock of Where We Are
Matthias Sperber
Matthias Paulik
30
106
0
14 Apr 2020
Multimodal Machine Translation through Visuals and Speech
U. Sulubacak
Ozan Caglayan
Stig-Arne Gronroos
Aku Rouhe
Desmond Elliott
Lucia Specia
Jörg Tiedemann
49
73
0
28 Nov 2019
Transformer-based Cascaded Multimodal Speech Translation
Zixiu "Alex" Wu
Ozan Caglayan
Julia Ive
Josiah Wang
Lucia Specia
25
7
0
29 Oct 2019
Predicting Actions to Help Predict Translations
Zixiu "Alex" Wu
Julia Ive
Josiah Wang
Pranava Madhyastha
Lucia Specia
11
7
0
05 Aug 2019
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
Marcely Zanon Boito
William N. Havard
Mahault Garnerin
Éric Le Ferrand
Laurent Besacier
32
47
0
30 Jul 2019
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
Aditya Mogadala
M. Kalimuthu
Dietrich Klakow
VLM
20
132
0
22 Jul 2019