WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

26 October 2021
Sanyuan Chen
Chengyi Wang
Zhengyang Chen
Yu-Huan Wu
Shujie Liu
Zhuo Chen
Jinyu Li
Naoyuki Kanda
Takuya Yoshioka
Xiong Xiao
Jian Wu
Long Zhou
Shuo Ren
Y. Qian
Yao Qian
Jian Wu
Michael Zeng
Xiangzhan Yu
Furu Wei
    SSL

Papers citing "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing"

50 / 1,022 papers shown
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
Gaoxiang Cong
Yuankai Qi
Liang-Sheng Li
Amin Beheshti
Zhedong Zhang
A. Hengel
Ming-Hsuan Yang
Chenggang Yan
Qingming Huang
35
12
0
20 Feb 2024
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
Shengpeng Ji
Minghui Fang
Ziyue Jiang
Siqi Zheng
Qian Chen
Rongjie Huang
Jialung Zuo
Shulei Wang
Zhou Zhao
AuLLM
24
16
0
19 Feb 2024
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Junyi Peng
Marc Delcroix
Tsubasa Ochiai
Oldrich Plchot
Shoko Araki
J. Černocký
26
8
0
17 Feb 2024
Probing Self-supervised Learning Models with Target Speech Extraction
Junyi Peng
Marc Delcroix
Tsubasa Ochiai
Oldrich Plchot
Takanori Ashihara
Shoko Araki
J. Černocký
40
2
0
17 Feb 2024
When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection
Xiangyu Zhang
Hexin Liu
Kaishuai Xu
Qiquan Zhang
Daijiao Liu
Beena Ahmed
Julien Epps
11
7
0
17 Feb 2024
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
Shengpeng Ji
Ziyue Jiang
Hanting Wang
Jia-li Zuo
Zhou Zhao
26
9
0
14 Feb 2024
UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
Ruchao Fan
Natarajan Balaji Shankar
Abeer Alwan
14
0
0
14 Feb 2024
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Ziyang Ma
Guanrou Yang
Yifan Yang
Zhifu Gao
Jiaming Wang
...
Fan Yu
Qian Chen
Siqi Zheng
Shiliang Zhang
Xie Chen
AuLLM
47
38
0
13 Feb 2024
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Lajszczak
Guillermo Cámbara
Yang Li
Fatih Beyhan
Arent van Korlaar
...
Bartosz Putrycz
Soledad López Gambino
Kayeon Yoo
Elena Sokolova
Thomas Drugman
LM&MA
33
72
0
12 Feb 2024
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
Naoyuki Kanda
Xiaofei Wang
Sefik Emre Eskimez
Manthan Thakker
Hemin Yang
...
Yufei Xia
Jinzhu Li
Yanqing Liu
Sheng Zhao
Michael Zeng
21
8
0
12 Feb 2024
SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
Hsuan-Fu Wang
Yi-Jen Shih
Heng-Jui Chang
Layne Berry
Puyuan Peng
Hung-yi Lee
Hsin-Min Wang
David F. Harwath
VLM
32
2
0
10 Feb 2024
SpiRit-LM: Interleaved Spoken and Written Language Model
Tu Nguyen
Benjamin Muller
Bokai Yu
Marta R. Costa-jussá
Maha Elbayad
...
Itai Gat
Gabriel Synnaeve
Juan Pino
Benoît Sagot
Emmanuel Dupoux
AuLLM
VLM
44
32
0
08 Feb 2024
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Liang-Hsuan Tseng
En-Pei Hu
Cheng-Han Chiang
Yuan Tseng
Hung-yi Lee
Lin-shan Lee
Shao-Hua Sun
59
1
0
06 Feb 2024
Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations
Álvaro Martín-Cortinas
Daniel Sáez-Trigueros
Iván Vallés-Pérez
Biel Tura Vecino
Piotr Bilinski
Mateusz Lajszczak
Grzegorz Beringer
Roberto Barra-Chicote
Jaime Lorenzo-Trueba
16
5
0
05 Feb 2024
Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?
Orchid Chetia Phukan
Gautam Siddharth Kashyap
Arun Balaji Buduru
Rajesh Sharma
23
0
0
02 Feb 2024
Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations
Panos Kakoulidis
Nikolaos Ellinas
G. Vamvoukakis
Myrsini Christidou
Alexandra Vioni
...
Junkwang Oh
Gunu Jho
Inchul Hwang
Pirros Tsiakoulis
Aimilios Chalamandaris
13
1
0
02 Feb 2024
On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification
Calum Heggan
S. Budgett
Timothy M. Hospedales
Mehrdad Yaghoobi
SSL
19
1
0
02 Feb 2024
STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition
Yi Chang
Zhao Ren
Zixing Zhang
Xin Jing
Kun Qian
Xi Shao
Bin Hu
Tanja Schultz
Björn W. Schuller
AAML
25
4
0
02 Feb 2024
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Zakaria Aldeneh
Takuya Higuchi
Jee-weon Jung
Skyler Seto
Tatiana Likhomanenko
Stephen Shum
Ahmed Hussen Abdelaziz
Shinji Watanabe
B. Theobald
SSL
26
2
0
01 Feb 2024
What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Takanori Ashihara
Marc Delcroix
Takafumi Moriya
Kohei Matsuura
Taichi Asami
Yusuke Ijima
SSL
6
7
0
31 Jan 2024
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Jee-weon Jung
Wangyou Zhang
Jiatong Shi
Zakaria Aldeneh
Takuya Higuchi
B. Theobald
Ahmed Hussen Abdelaziz
Shinji Watanabe
63
21
0
30 Jan 2024
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
Takaaki Saeki
Soumi Maiti
Shinnosuke Takamichi
Shinji Watanabe
Hiroshi Saruwatari
22
11
0
30 Jan 2024
Speech foundation models on intelligibility prediction for hearing-impaired listeners
Santiago Cuervo
R. Marxer
22
6
0
24 Jan 2024
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
Jiajun He
Xiaohan Shi
Xingfeng Li
T. Toda
37
12
0
24 Jan 2024
Towards Hierarchical Spoken Language Dysfluency Modeling
Jiachen Lian
Gopala Anumanchipalli
11
9
0
18 Jan 2024
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
Minsu Kim
Jeong Hun Yeo
Se Jin Park
J. Choi
Y. Ro
17
5
0
18 Jan 2024
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
Alexander H. Liu
Sung-Lin Yeh
James R. Glass
SSL
16
3
0
16 Jan 2024
An Explainable Proxy Model for Multilabel Audio Segmentation
Théo Mariotte
Antonio Almudévar
Marie Tahon
Alfonso Ortega Giménez
21
1
0
16 Jan 2024
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
Haobin Tang
Xulong Zhang
Ning Cheng
Jing Xiao
Jianzong Wang
15
10
0
16 Jan 2024
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Yimin Deng
Huaizhen Tang
Xulong Zhang
Ning Cheng
Jing Xiao
Jianzong Wang
DRL
15
1
0
16 Jan 2024
DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment
Hyoung-Seok Oh
Sang-Hoon Lee
Deok-Hyun Cho
Seong-Whan Lee
34
1
0
16 Jan 2024
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Ya-Zhen Song
Zhuo Chen
Xiaofei Wang
Ziyang Ma
Xie Chen
AuLLM
16
35
0
14 Jan 2024
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Licai Sun
Zheng Lian
Bin Liu
Jianhua Tao
51
29
0
11 Jan 2024
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters
Kenichi Fujita
Hiroshi Sato
Takanori Ashihara
Hiroki Kanagawa
Marc Delcroix
Takafumi Moriya
Yusuke Ijima
20
8
0
10 Jan 2024
Singer Identity Representation Learning using Self-Supervised Techniques
Bernardo Torres
Stefan Lattner
Gaël Richard
SSL
27
8
0
10 Jan 2024
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Wenxi Chen
Yuzhe Liang
Ziyang Ma
Zhisheng Zheng
Xie Chen
ViT
35
17
0
07 Jan 2024
Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness
Sicheng Yang
Zunnan Xu
Haiwei Xue
Yongkang Cheng
Shaoli Huang
Mingming Gong
Zhiyong Wu
DiffM
VGen
27
11
0
07 Jan 2024
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Qiu-shi Zhu
Jie Zhang
Yu Gu
Yuli Hu
Lirong Dai
SSL
23
11
0
07 Jan 2024
MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition
Zheng Lian
Licai Sun
Yong Ren
Hao Gu
Haiyang Sun
Lan Chen
Bin Liu
Jianhua Tao
11
12
0
07 Jan 2024
StreamVC: Real-Time Low-Latency Voice Conversion
Yang Yang
Y. Kartynnik
Yunpeng Li
Jiuqiang Tang
Xing Li
George Sung
Matthias Grundmann
28
12
0
05 Jan 2024
Pheme: Efficient and Conversational Speech Generation
Paweł Budzianowski
Taras Sereda
Tomasz Cichy
Ivan Vulić
21
7
0
05 Jan 2024
Self-supervised Reflective Learning through Self-distillation and Online Clustering for Speaker Representation Learning
Danwei Cai
Zexin Cai
Ming Li
17
0
0
03 Jan 2024
Efficient Parallel Audio Generation using Group Masked Language Modeling
Myeonghun Jeong
Minchan Kim
Joun Yeop Lee
Nam Soo Kim
16
5
0
02 Jan 2024
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Huimeng Wang
Zengrui Jin
Mengzhe Geng
Shujie Hu
Guinan Li
Tianzi Wang
Haoning Xu
Xunying Liu
11
9
0
01 Jan 2024
Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision
Chih-Kai Yang
Kuan-Po Huang
Ke-Han Lu
Chun-Yi Kuan
Chi-Yuan Hsiao
Hung-yi Lee
48
7
0
30 Dec 2023
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Hong-ping Hao
Long Zhou
Shujie Liu
Jinyu Li
Shujie Hu
Rui Wang
Furu Wei
29
18
0
30 Dec 2023
Self-supervised Pretraining for Decision Foundation Model: Formulation, Pipeline and Challenges
Xiaoqian Liu
Jianbin Jiao
Junge Zhang
OffRL
LRM
34
2
0
29 Dec 2023
Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions
H. S. Bovbjerg
Jesper Jensen
Jan Østergaard
Zheng-Hua Tan
VLM
19
3
0
27 Dec 2023
Frame-level emotional state alignment method for speech emotion recognition
Qifei Li
Yingming Gao
Cong Wang
Yayue Deng
Jinlong Xue
Yichen Han
Ya Li
14
2
0
27 Dec 2023
Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition
Chengxin Chen
Pengyuan Zhang
26
5
0
26 Dec 2023