Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

21 September 2018
Soo-Whan Chung
Joon Son Chung
Hong-Goo Kang
arXiv:1809.08001

Papers citing "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation"

50 / 78 papers shown
Seeing What You Say: Expressive Image Generation from Speech
Jiyoung Lee
S. Park
Sanghyuk Chun
Soo-Whan Chung
DiffM, VGen
236
1
0
05 Nov 2025
Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm
Lin Zhang
Zefan Cai
Jiuxiang Gu
Shentong Mo
Jinhong Lin
...
Ruiyi Zhang
Wen Xiao
Tong Sun
Junjie Hu
Pedro Morgado
VGen
171
1
0
05 Aug 2025
Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation
Dogucan Yaman
Fevziye Irem Eyiokur
Leonard Barmann
H. K. Ekenel
Alexander H. Waibel
CVBM
196
0
0
28 Jul 2025
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
Huilai Li
Yonghao Dang
Ying Xing
Yiming Wang
Jianqin Yin
192
0
0
14 Jul 2025
UniSync: A Unified Framework for Audio-Visual Synchronization
Tao Feng
Yifan Xie
Xun Guan
Jiyuan Song
Z. Liu
Fei Ma
Fei Richard Yu
305
4
0
20 Mar 2025
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Shota Nakada
Taichi Nishimura
Hokuto Munakata
Masayoshi Kondo
Tatsuya Komatsu
CLIP, VLM
187
2
0
18 Sep 2024
Interpretable Convolutional SyncNet
Sungjoon Park
Jaesub Yun
Donggeon Lee
Minsik Park
291
1
0
02 Sep 2024
Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization
Luyao Cheng
Hui Wang
Siqi Zheng
Yafeng Chen
Rongjie Huang
Qinglin Zhang
Qian Chen
Xihao Li
220
5
0
22 Aug 2024
A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
Kyungbok Lee
You Zhang
Zhiyao Duan
348
3
0
20 Jun 2024
Audio-Visual Talker Localization in Video for Spatial Sound Reproduction
Davide Berghi
Philip J. B. Jackson
222
1
0
01 Jun 2024
Audio-Synchronized Visual Animation
European Conference on Computer Vision (ECCV), 2024
Lin Zhang
Shentong Mo
Yijing Zhang
Pedro Morgado
DiffM
242
33
0
08 Mar 2024
Pretext Training Algorithms for Event Sequence Data
Yimu Wang
He Zhao
Ruizhi Deng
Frederick Tung
Greg Mori
AI4TS
158
0
0
16 Feb 2024
Synchformer: Efficient Synchronization from Sparse Cues
Vladimir E. Iashin
Weidi Xie
Esa Rahtu
Andrew Zisserman
242
57
0
29 Jan 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
International Journal of Computer Vision (IJCV), 2024
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
345
9
0
08 Jan 2024
GestSync: Determining who is speaking without a talking head
British Machine Vision Conference (BMVC), 2023
Sindhu B. Hegde
Andrew Zisserman
157
2
0
08 Oct 2023
Audio-driven Talking Face Generation with Stabilized Synchronization Loss
European Conference on Computer Vision (ECCV), 2023
Dogucan Yaman
Fevziye Irem Eyiokur
Leonard Barmann
H. K. Ekenel
Alexander Waibel
CVBM
414
11
0
18 Jul 2023
Backchannel Detection and Agreement Estimation from Video with Transformer Networks
IEEE International Joint Conference on Neural Network (IJCNN), 2023
A. Amer
Chirag Bhuvaneshwara
G. Addluri
Mohammed Maqsood Shaik
Vedant Bonde
Philippe Muller
225
9
0
02 Jun 2023
ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Akash Gupta
Rohun Tripathi
Won-Kap Jang
221
9
0
21 Mar 2023
WASD: A Wilder Active Speaker Detection Dataset
IEEE Transactions on Biometrics Behavior and Identity Science (TBBIS), 2023
Tiago Roxo
Joana Cabral Costa
Pedro R. M. Inácio
Hugo Manuel Proença
177
5
0
09 Mar 2023
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Chao Feng
Ziyang Chen
Andrew Owens
272
112
0
04 Jan 2023
Jointly Learning Visual and Auditory Speech Representations from Raw Data
International Conference on Learning Representations (ICLR), 2022
A. Haliassos
Pingchuan Ma
Rodrigo Mira
Stavros Petridis
Maja Pantic
SSL
309
70
0
12 Dec 2022
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
IEEE International Conference on Computer Vision (ICCV), 2022
Zhentao Yu
Zixin Yin
Deyu Zhou
Duomin Wang
Finn Wong
Baoyuan Wang
DiffM
213
55
0
07 Dec 2022
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
AAAI Conference on Artificial Intelligence (AAAI), 2022
Se Jin Park
Minsu Kim
Joanna Hong
J. Choi
Y. Ro
CVBM
279
103
0
02 Nov 2022
Multimodal Transformer Distillation for Audio-Visual Synchronization
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Xuan-Bo Chen
Haibin Wu
Chung-Che Wang
Hung-yi Lee
J. Jang
155
6
0
27 Oct 2022
Towards Effective Image Manipulation Detection with Proposal Contrastive Learning
Yuyuan Zeng
Bowen Zhao
Shanzhao Qiu
Tao Dai
Shutao Xia
169
41
0
16 Oct 2022
Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors
British Machine Vision Conference (BMVC), 2022
Vladimir E. Iashin
Weidi Xie
Esa Rahtu
Andrew Zisserman
149
32
0
13 Oct 2022
Learning State-Aware Visual Representations from Audible Interactions
Neural Information Processing Systems (NeurIPS), 2022
Himangi Mittal
Pedro Morgado
Unnat Jain
Abhinav Gupta
224
28
0
27 Sep 2022
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
ACM Multimedia (ACM MM), 2022
Sindhu B. Hegde
Prajwal K R
Rudrabha Mukhopadhyay
Vinay P. Namboodiri
C. V. Jawahar
224
16
0
01 Sep 2022
Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors
ACM Multimedia (ACM MM), 2022
Sindhu B. Hegde
Rudrabha Mukhopadhyay
Vinay P. Namboodiri
C. V. Jawahar
CVBM
178
2
0
17 Aug 2022
End-To-End Audiovisual Feature Fusion for Active Speaker Detection
International Conference on Digital Image Processing (ICDIP), 2022
Fiseha B. Tesema
Zheyuan Lin
Shiqiang Zhu
Wei Song
J. Gu
Hong-Chuan Wu
159
4
0
27 Jul 2022
Deep Learning for Visual Speech Analysis: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Changchong Sheng
Gangyao Kuang
L. Bai
Chen Hou
Yike Guo
Xin Xu
M. Pietikäinen
Tianpeng Liu
VLM
321
53
0
22 May 2022
End-to-End Multi-Person Audio/Visual Automatic Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Otavio Braga
Takaki Makino
Olivier Siohan
H. Liao
CVBM
136
20
0
11 May 2022
A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
Otavio Braga
Olivier Siohan
185
9
0
11 May 2022
Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Otavio Braga
Olivier Siohan
CVBM
153
12
0
10 May 2022
VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
Interspeech (Interspeech), 2022
V. S. Kadandale
Juan F. Montesinos
G. Haro
230
30
0
05 Apr 2022
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video
IEEE International Conference on Computer Vision (ICCV), 2021
Minsu Kim
Joanna Hong
Se Jin Park
Yong Man Ro
CVBM
179
48
0
04 Apr 2022
Speaker Extraction with Co-Speech Gestures Cue
IEEE Signal Processing Letters (SPL), 2022
Zexu Pan
Xinyuan Qian
Haizhou Li
SLR
176
33
0
31 Mar 2022
End to End Lip Synchronization with a Temporal AutoEncoder
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Yoav Shalev
Lior Wolf
84
9
0
30 Mar 2022
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
International Conference on Information Photonics (ICIP), 2022
Zitian Zhang
Jie Zhang
Jian-Shu Zhang
Ming Wu
Xin Fang
Lirong Dai
SSL
270
12
0
15 Feb 2022
Data standardization for robust lip sync
IEEE International Conference on Multimedia and Expo (ICME), 2022
C. Wang
259
0
0
13 Feb 2022
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection
Computer Vision and Pattern Recognition (CVPR), 2022
A. Haliassos
Rodrigo Mira
Stavros Petridis
Maja Pantic
CVBM
385
173
0
18 Jan 2022
End-to-end speaker diarization with transformer
Yongquan Lai
Xin Tang
Yuanyuan Fu
Rui Fang
159
1
0
14 Dec 2021
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
Leyuan Qu
C. Weber
S. Wermter
255
33
0
09 Dec 2021
Audio-Visual Synchronisation in the wild
Honglie Chen
Weidi Xie
Triantafyllos Afouras
Arsha Nagrani
Andrea Vedaldi
Andrew Zisserman
201
49
0
08 Dec 2021
AVA-AVD: Audio-Visual Speaker Diarization in the Wild
ACM Multimedia (MM), 2021
Eric Z. Xu
Zeyang Song
Satoshi Tsutsui
C. Feng
Mang Ye
Mike Zheng Shou
VGen
426
55
0
29 Nov 2021
Structure from Silence: Learning Scene Structure from Ambient Sound
Conference on Robot Learning (CoRL), 2021
Ziyang Chen
Xixi Hu
Andrew Owens
178
31
0
10 Nov 2021
Look Who's Talking: Active Speaker Detection in the Wild
You Jin Kim
Hee-Soo Heo
Soyeon Choe
Soo-Whan Chung
Yoohwan Kwon
Bong-Jin Lee
Youngki Kwon
Joon Son Chung
209
27
0
17 Aug 2021
UniCon: Unified Context Network for Robust Active Speaker Detection
ACM Multimedia (ACM MM), 2021
Yuanhang Zhang
Susan Liang
Shuang Yang
Xiao-Chang Liu
Zhongqin Wu
Shiguang Shan
Xilin Chen
CVBM
154
43
0
05 Aug 2021
Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
ACM Multimedia (ACM MM), 2021
Ruijie Tao
Zexu Pan
Rohan Kumar Das
Xinyuan Qian
Mike Zheng Shou
Haizhou Li
208
218
0
14 Jul 2021
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion
Interspeech (Interspeech), 2021
Baptiste Pouthier
L. Pilati
Leela K. Gudupudi
C. Bouveyron
F. Precioso
165
12
0
07 Jun 2021