ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1802.06424
  4. Cited By
End-to-end Audiovisual Speech Recognition

End-to-end Audiovisual Speech Recognition

18 February 2018
Stavros Petridis
Themos Stafylakis
Pingchuan Ma
Feipeng Cai
Georgios Tzimiropoulos
M. Pantic
ArXivPDFHTML

Papers citing "End-to-end Audiovisual Speech Recognition"

50 / 101 papers shown
Title
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo
Hyeongseop Rha
Se Jin Park
Y. Ro
56
0
0
14 Mar 2025
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Andrew Rouditchenko
Saurabhchand Bhati
Samuel Thomas
Hilde Kuehne
Rogerio Feris
116
1
0
03 Feb 2025
Uncovering the Visual Contribution in Audio-Visual Speech Recognition
Uncovering the Visual Contribution in Audio-Visual Speech Recognition
Zhaofeng Lin
Naomi Harte
86
1
0
20 Jan 2025
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
Bowen Hao
Dongliang Zhou
Xiaojie Li
Xingyu Zhang
Liang Xie
Jianlong Wu
Erwei Yin
42
1
0
08 Jan 2025
Unified Speech Recognition: A Single Model for Auditory, Visual, and
  Audiovisual Inputs
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
M. Pantic
SSL
37
5
0
04 Nov 2024
RAL:Redundancy-Aware Lipreading Model Based on Differential Learning
  with Symmetric Views
RAL:Redundancy-Aware Lipreading Model Based on Differential Learning with Symmetric Views
Zejun gu
Junxia jiang
36
0
0
09 Sep 2024
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust
  Audio-Visual Speech Recognition
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
Sungnyun Kim
Kangwook Jang
Sangmin Bae
Hoirin Kim
Se-Young Yun
50
3
0
04 Jul 2024
MSRS: Training Multimodal Speech Recognition Models from Scratch with
  Sparse Mask Optimization
MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
Adriana Fernandez-Lopez
Honglie Chen
Pingchuan Ma
Lu Yin
Q. Xiao
Stavros Petridis
Shiwei Liu
Maja Pantic
46
2
0
25 Jun 2024
Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
Se Jin Park
Chae Won Kim
Hyeongseop Rha
Minsu Kim
Joanna Hong
Jeong Hun Yeo
Yong Man Ro
CVBM
AuLLM
48
6
0
12 Jun 2024
Visually Grounded Speech Models have a Mutual Exclusivity Bias
Visually Grounded Speech Models have a Mutual Exclusivity Bias
Leanne Nortje
Dan Oneaţă
Yevgen Matusevych
Herman Kamper
SSL
47
0
0
20 Mar 2024
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and
  Context-Aware Visual Speech Processing
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
Jeong Hun Yeo
Seunghee Han
Minsu Kim
Y. Ro
56
32
0
23 Feb 2024
Cross-Attention Fusion of Visual and Geometric Features for Large
  Vocabulary Arabic Lipreading
Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading
Samar Daou
Ahmed Rekik
A. Ben-Hamadou
Abdelaziz Kallel
31
3
0
18 Feb 2024
Efficient Training for Multilingual Visual Speech Recognition:
  Pre-training with Discretized Visual Speech Representation
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
Minsu Kim
Jeong Hun Yeo
Se Jin Park
J. Choi
Y. Ro
27
5
0
18 Jan 2024
Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model
  for Audio-Visual Speech Recognition
Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition
Fan Yu
Haoxu Wang
Ziyang Ma
Shiliang Zhang
57
2
0
14 Dec 2023
On Robustness to Missing Video for Audiovisual Speech Recognition
On Robustness to Missing Video for Audiovisual Speech Recognition
Oscar Chang
Otavio Braga
H. Liao
Dmitriy Serdyuk
Olivier Siohan
26
11
0
13 Dec 2023
Enabling Resource-efficient AIoT System with Cross-level Optimization: A
  survey
Enabling Resource-efficient AIoT System with Cross-level Optimization: A survey
Sicong Liu
Bin Guo
Cheng Fang
Ziqi Wang
Shiyan Luo
Zimu Zhou
Zhiwen Yu
AI4CE
37
22
0
27 Sep 2023
Visual Speech Recognition for Languages with Limited Labeled Data using
  Automatic Labels from Whisper
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper
Jeong Hun Yeo
Minsu Kim
Shinji Watanabe
Y. Ro
VLM
34
12
0
15 Sep 2023
Let There Be Sound: Reconstructing High Quality Speech from Silent
  Videos
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
Ji-Hoon Kim
Jaehun Kim
Joon Son Chung
30
5
0
29 Aug 2023
Lip Reading for Low-resource Languages by Learning and Combining General
  Speech Knowledge and Language-specific Knowledge
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
Minsu Kim
Jeong Hun Yeo
J. Choi
Y. Ro
34
16
0
18 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by
  Compressing Audio Knowledge of a Pretrained Model
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
26
18
0
15 Aug 2023
Audio-visual video-to-speech synthesis with synthesized input audio
Audio-visual video-to-speech synthesis with synthesized input audio
Triantafyllos Kefalas
Yannis Panagakis
M. Pantic
VGen
DiffM
38
1
0
31 Jul 2023
SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
Adriana Fernandez-Lopez
Honglie Chen
Pingchuan Ma
A. Haliassos
Stavros Petridis
M. Pantic
VLM
33
7
0
10 Jul 2023
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation
  and Recognition
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Guinan Li
Jiajun Deng
Mengzhe Geng
Zengrui Jin
Tianzi Wang
Shujie Hu
Mingyu Cui
Helen M. Meng
Xunying Liu
37
10
0
06 Jul 2023
Large-scale unsupervised audio pre-training for video-to-speech
  synthesis
Large-scale unsupervised audio pre-training for video-to-speech synthesis
Triantafyllos Kefalas
Yannis Panagakis
M. Pantic
VGen
32
3
0
27 Jun 2023
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes
  with Spatiotemporal Annotations of Sound Events
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Kazuki Shimada
A. Politis
Parthasaarathy Sudarsanam
D. Krause
Kengo Uchida
...
Yuichiro Koyama
Naoya Takahashi
Shusuke Takahashi
Tuomas Virtanen
Yuki Mitsufuji
63
36
0
15 Jun 2023
A multimodal dynamical variational autoencoder for audiovisual speech
  representation learning
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
Samir Sadok
Simon Leglaive
Laurent Girin
Xavier Alameda-Pineda
Renaud Séguier
28
11
0
05 May 2023
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
  AV-ASR
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
29
15
0
29 Mar 2023
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Pingchuan Ma
A. Haliassos
Adriana Fernandez-Lopez
Honglie Chen
Stavros Petridis
M. Pantic
27
106
0
25 Mar 2023
ModEFormer: Modality-Preserving Embedding for Audio-Video
  Synchronization using Transformers
ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers
Akash Gupta
Rohun Tripathi
Won-Kap Jang
29
6
0
21 Mar 2023
Learning Cross-lingual Visual Speech Representations
Learning Cross-lingual Visual Speech Representations
Andreas Zinonos
A. Haliassos
Pingchuan Ma
Stavros Petridis
M. Pantic
SSL
22
8
0
14 Mar 2023
Deep Visual Forced Alignment: Learning to Align Transcription with
  Talking Face Video
Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video
Minsu Kim
Chae Won Kim
Y. Ro
CVBM
DiffM
35
3
0
27 Feb 2023
Lip-to-Speech Synthesis in the Wild with Multi-task Learning
Lip-to-Speech Synthesis in the Wild with Multi-task Learning
Minsu Kim
Joanna Hong
Y. Ro
6
21
0
17 Feb 2023
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech
  Recognition
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Minsu Kim
Hyungil Kim
Y. Ro
VLM
13
18
0
16 Feb 2023
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
Zixiong Su
Shitao Fang
Jun Rekimoto
18
26
0
12 Feb 2023
LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark
  Transformers
LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers
Feng Xue
Yu Li
Deyin Liu
Yincen Xie
Lin Wu
Richang Hong
28
12
0
04 Feb 2023
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
  Speech Representation Learning
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Qiu-shi Zhu
Long Zhou
Zi-Hua Zhang
Shujie Liu
Binxing Jiao
Jie Zhang
Lirong Dai
Daxin Jiang
Jinyu Li
Furu Wei
33
37
0
21 Nov 2022
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
Rodrigo Mira
Buye Xu
Jacob Donley
Anurag Kumar
Stavros Petridis
V. Ithapu
M. Pantic
22
13
0
20 Nov 2022
AVATAR submission to the Ego4D AV Transcription Challenge
AVATAR submission to the Ego4D AV Transcription Challenge
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
22
0
0
18 Nov 2022
Streaming Audio-Visual Speech Recognition with Alignment Regularization
Streaming Audio-Visual Speech Recognition with Alignment Regularization
Pingchuan Ma
Niko Moritz
Stavros Petridis
Christian Fuegen
M. Pantic
37
2
0
03 Nov 2022
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Junjie Li
Meng Ge
Zexu Pan
Longbiao Wang
J. Dang
18
10
0
09 Oct 2022
Training Strategies for Improved Lip-reading
Training Strategies for Improved Lip-reading
Pingchuan Ma
Yujiang Wang
Stavros Petridis
Jie Shen
M. Pantic
28
46
0
03 Sep 2022
Speaker-adaptive Lip Reading with User-dependent Padding
Speaker-adaptive Lip Reading with User-dependent Padding
Minsu Kim
Hyunjun Kim
Y. Ro
17
20
0
09 Aug 2022
MM-ALT: A Multimodal Automatic Lyric Transcription System
MM-ALT: A Multimodal Automatic Lyric Transcription System
Xiangming Gu
Longshen Ou
Danielle Ong
Ye Wang
16
13
0
13 Jul 2022
Visual Context-driven Audio Feature Enhancement for Robust End-to-End
  Audio-Visual Speech Recognition
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Joanna Hong
Minsu Kim
Daehun Yoo
Y. Ro
26
20
0
13 Jul 2022
AVATAR: Unconstrained Audiovisual Speech Recognition
AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur
Paul Hongsuck Seo
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
15
11
0
15 Jun 2022
Is Lip Region-of-Interest Sufficient for Lipreading?
Is Lip Region-of-Interest Sufficient for Lipreading?
Jing-Xuan Zhang
Genshun Wan
Jia-Yu Pan
24
6
0
28 May 2022
Deep Learning for Visual Speech Analysis: A Survey
Deep Learning for Visual Speech Analysis: A Survey
Changchong Sheng
Gangyao Kuang
L. Bai
Chen Hou
Y. Guo
Xin Xu
M. Pietikäinen
Li Liu
VLM
29
33
0
22 May 2022
SVTS: Scalable Video-to-Speech Synthesis
SVTS: Scalable Video-to-Speech Synthesis
Rodrigo Mira
A. Haliassos
Stavros Petridis
Björn W. Schuller
M. Pantic
14
32
0
04 May 2022
Improving Multimodal Speech Recognition by Data Augmentation and Speech
  Representations
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Dan Oneaţă
H. Cucu
19
19
0
27 Apr 2022
Lip to Speech Synthesis with Visual Context Attentional GAN
Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim
Joanna Hong
Y. Ro
25
51
0
04 Apr 2022
123
Next