Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

Interspeech (Interspeech), 2022
25 January 2022
Dmitriy Serdyuk
Otavio Braga
Olivier Siohan
    ViT

Papers citing "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video"

26 / 26 papers shown
Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
Matthew Kit Khinn Teng
Haibo Zhang
Takeshi Saitoh
200
1
0
25 Jul 2025
CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge
Zehua Liu
Xiaolou Li
Chen Chen
Lantian Li
D. Wang
336
1
0
27 May 2025
VALLR: Visual ASR Language Model for Lip Reading
Marshall Thomas
Edward Fish
Richard Bowden
389
7
0
27 Mar 2025
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models
Pattern Recognition (Pattern Recogn.), 2025
Jing-Xuan Zhang
Genshun Wan
Jianqing Gao
Zhen-Hua Ling
355
13
0
09 Feb 2025
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
IEEE Signal Processing Letters (IEEE SPL), 2025
Andrew Rouditchenko
Saurabhchand Bhati
Samuel Thomas
Hilde Kuehne
Rogerio Feris
598
4
0
03 Feb 2025
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
Joanna Hong
Sanjeel Parekh
Honglie Chen
Jacob Donley
Ke Tan
Buye Xu
Anurag Kumar
286
0
0
30 Jan 2025
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
Neural Information Processing Systems (NeurIPS), 2024
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
Maja Pantic
SSL
420
16
0
04 Nov 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
467
40
0
18 Sep 2024
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
Young Jin Ahn
Jungwoo Park
Sangha Park
Jonghyun Choi
Kee-Eung Kim
256
15
0
18 Jun 2024
Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder
He Wang
Pengcheng Guo
Xucheng Wan
Huan Zhou
Lei Xie
291
5
0
08 Apr 2024
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
A. Haliassos
Andreas Zinonos
Rodrigo Mira
Stavros Petridis
Maja Pantic
VLM
SSL
AI4TS
344
26
0
02 Apr 2024
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Maxime Burchi
Krishna C. Puvvada
Jagadeesh Balam
Boris Ginsburg
Radu Timofte
266
19
0
14 Mar 2024
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
Automatic Speech Recognition & Understanding (ASRU), 2023
Jeff Hwang
Moto Hira
Caroline Chen
Xiaohui Zhang
Zhaoheng Ni
...
Yumeng Tao
Robin Scheibler
Samuele Cornell
Sean Kim
Stavros Petridis
308
37
0
27 Oct 2023
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Andrew Rouditchenko
R. Collobert
Tatiana Likhomanenko
VLM
272
6
0
29 Sep 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
IEEE Transactions on Multimedia (IEEE TMM), 2023
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
260
27
0
15 Aug 2023
Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks
Interspeech (Interspeech), 2023
L. Tóth
Amin Honarmandi Shandiz
G. Gosztolya
T. Csapó
351
9
0
30 May 2023
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Computer Vision and Pattern Recognition (CVPR), 2023
Xubo Liu
Egor Lakomkin
Konstantinos Vougioukas
Pingchuan Ma
Honglie Chen
...
Niko Moritz
J. Kolár
Stavros Petridis
Maja Pantic
Christian Fuegen
513
27
0
30 Mar 2023
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Pingchuan Ma
A. Haliassos
Adriana Fernandez-Lopez
Honglie Chen
Stavros Petridis
Maja Pantic
412
191
0
25 Mar 2023
Conformers are All You Need for Visual Speech Recognition
Oscar Chang
H. Liao
Dmitriy Serdyuk
Ankit Parag Shah
Olivier Siohan
VLM
330
16
0
17 Feb 2023
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Automatic Speech Recognition & Understanding (ASRU), 2023
Jiachen Lian
Alexei Baevski
Wei-Ning Hsu
Michael Auli
SSL
429
46
0
10 Feb 2023
Jointly Learning Visual and Auditory Speech Representations from Raw Data
International Conference on Learning Representations (ICLR), 2022
A. Haliassos
Pingchuan Ma
Rodrigo Mira
Stavros Petridis
Maja Pantic
SSL
338
73
0
12 Dec 2022
Streaming Audio-Visual Speech Recognition with Alignment Regularization
Interspeech (Interspeech), 2022
Pingchuan Ma
Niko Moritz
Stavros Petridis
Christian Fuegen
Maja Pantic
258
2
0
03 Nov 2022
Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2022
Jiadong Wang
Xinyuan Qian
Haizhou Li
209
18
0
05 Sep 2022
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Interspeech (Interspeech), 2022
Joanna Hong
Minsu Kim
Daehun Yoo
Y. Ro
304
29
0
13 Jul 2022
FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
ACM Multimedia (ACM MM), 2022
Yongqiang Wang
Zhou Zhao
345
12
0
08 Jul 2022
Visual Speech Recognition for Multiple Languages in the Wild
Nature Machine Intelligence (Nat. Mach. Intell.), 2022
Pingchuan Ma
Stavros Petridis
Maja Pantic
VLM
457
202
0
26 Feb 2022