ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2204.02152
  4. Cited By
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

5 April 2022
Takaaki Saeki
Detai Xin
Wataru Nakata
Tomoki Koriyama
Shinnosuke Takamichi
Hiroshi Saruwatari
ArXivPDFHTML

Papers citing "UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022"

50 / 116 papers shown
Title
Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding
Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding
Dianwen Ng
Kun Zhou
Yi-Wen Chao
Zhiwei Xiong
B. Ma
E. Chng
23
0
0
12 May 2025
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Xueyao Zhang
Y. Wang
Chaoren Wang
Z. Li
Zhuo Chen
Zhizheng Wu
46
0
0
07 May 2025
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Qingkai Fang
Yan Zhou
Shoutao Guo
Shaolei Zhang
Yang Feng
AuLLM
51
0
0
05 May 2025
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
Gaoxiang Cong
Liang-Sheng Li
Jiadong Pan
Zhedong Zhang
Amin Beheshti
A. Hengel
Yuankai Qi
Qingming Huang
46
0
0
02 May 2025
APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech
APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech
Zhicheng Lian
Lizhi Wang
Hua Huang
49
0
0
29 Apr 2025
DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue
DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue
X. Li
Duyi Pan
Hongru Xiao
J. Han
Jing Tang
Jiabao Ma
W. Wang
Bo Cheng
32
0
0
20 Apr 2025
EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Guanrou Yang
Chen Yang
Qian Chen
Ziyang Ma
Wenxi Chen
...
Fan Yu
Zhihao Du
Zhifu Gao
Shiliang Zhang
Xie Chen
AuLLM
53
0
0
17 Apr 2025
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation
Yan Rong
Shan Yang
Guangzhi Lei
Li Liu
23
0
0
15 Apr 2025
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Dongchao Yang
Songxiang Liu
Haohan Guo
Jiankun Zhao
Yuanyuan Wang
...
Xubo Liu
Xueyuan Chen
Xu Tan
Xixin Wu
H. Meng
37
0
0
14 Apr 2025
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng
Yi-Chang Chen
Kuan-Yi Lee
Da-shan Shiu
Hung-yi Lee
AuLLM
52
0
0
09 Apr 2025
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Yuhao Wang
Heyang Liu
Ziyang Cheng
Ronghua Wu
Qunshan Gu
Yanfeng Wang
Yu Wang
52
0
0
05 Apr 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Kim Sung-Bin
Jeongsoo Choi
Puyuan Peng
Joon Son Chung
Tae-Hyun Oh
David F. Harwath
VGen
45
1
0
03 Apr 2025
STFTCodec: High-Fidelity Audio Compression through Time-Frequency Domain Representation
STFTCodec: High-Fidelity Audio Compression through Time-Frequency Domain Representation
Tao Feng
Zhiyuan Zhao
Yifan Xie
Yuqi Ye
Xiangyang Luo
Xun Guan
Y. Li
45
0
0
21 Mar 2025
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Ji-Hoon Kim
Jeongsoo Choi
Jaehun Kim
Chaeyoung Jung
Joon Son Chung
CVBM
37
1
0
21 Mar 2025
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Zhedong Zhang
Liang-Sheng Li
C. Yan
Chunshan Liu
A. Hengel
Yuankai Qi
62
2
0
15 Mar 2025
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
Xue Jiang
Xiulian Peng
Yuan Zhang
Yan-Heng Lu
SSL
79
0
0
15 Mar 2025
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
Feng Jiang
Zhiyu Lin
Fan Bu
Yuhao Du
Benyou Wang
H. Li
AuLLM
ELM
88
0
0
07 Mar 2025
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Alexander H. Liu
Sang-gil Lee
Chao-Han Huck Yang
Yuan Gong
Yu-Chun Wang
James Glass
Rafael Valle
Bryan Catanzaro
SSL
42
0
0
02 Mar 2025
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Tianpeng Li
J. Liu
Tao Zhang
Yuanbo Fang
Da Pan
...
Guosheng Dong
Jianhua Xu
Haoze Sun
Zenan Zhou
Weipeng Chen
AuLLM
53
3
0
24 Feb 2025
AudioMiXR: Spatial Audio Object Manipulation with 6DoF for Sound Design in Augmented Reality
AudioMiXR: Spatial Audio Object Manipulation with 6DoF for Sound Design in Augmented Reality
Brandon Woodard
Margarita Geleta
Joseph J. LaViola Jr.
Andrea Fanelli
Rhonda Wilson
55
2
0
05 Feb 2025
VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching
VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching
Ha-Yeong Choi
Jaehan Park
32
0
0
29 Jan 2025
A Non-autoregressive Model for Joint STT and TTS
A Non-autoregressive Model for Joint STT and TTS
Vishal Sunder
Brian Kingsbury
G. Saon
Samuel Thomas
Slava Shechtman Hagai Aronowitz
Hagai Aronowitz
Eric Fosler-Lussier
Luis A. Lastras
54
0
0
15 Jan 2025
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Ji-Hoon Kim
Hong-Sun Yang
Yoon-Cheol Ju
Il-Hwan Kim
Byeong-Yeol Kim
Joon Son Chung
BDL
42
0
0
31 Dec 2024
FreeCodec: A disentangled neural speech codec with fewer tokens
FreeCodec: A disentangled neural speech codec with fewer tokens
Youqiang Zheng
Weiping Tu
Yueteng Kang
Jie Chen
Yike Zhang
Li Xiao
Yuhong Yang
Long Ma
62
1
0
02 Dec 2024
ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram
Xiao-Hang Jiang
Hui-Peng Du
Yang Ai
Ye-Xin Lu
Zhen-Hua Ling
23
0
0
18 Nov 2024
SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features
Yu-Fei Shi
Yang Ai
Ye-Xin Lu
Hui-Peng Du
Zhen-Hua Ling
25
0
0
18 Nov 2024
Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion
Yu-Fei Shi
Yang Ai
Ye-Xin Lu
Hui-Peng Du
Zhen-Hua Ling
28
0
0
17 Nov 2024
Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution
  and Harmonic Prior for Reliable Complex Spectrogram Estimation
Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation
Reo Yoneyama
Atsushi Miyashita
Ryuichi Yamamoto
T. Toda
22
0
0
11 Nov 2024
MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech
  Quality Assessment Models
MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
Wen-Chin Huang
Erica Cooper
T. Toda
26
4
0
06 Nov 2024
Addressing Representation Collapse in Vector Quantized Models with One
  Linear Layer
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Yongxin Zhu
B. Li
Yifei Xin
Linli Xu
36
10
0
04 Nov 2024
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
Heng-Jui Chang
Hongyu Gong
Changhan Wang
James R. Glass
Yu-An Chung
26
0
0
31 Oct 2024
APCodec+: A Spectrum-Coding-Based High-Fidelity and
  High-Compression-Rate Neural Audio Codec with Staged Training Paradigm
APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm
Hui-Peng Du
Yang Ai
Rui Zheng
Zhen-Hua Ling
31
0
0
30 Oct 2024
Optimizing Neural Speech Codec for Low-Bitrate Compression via
  Multi-Scale Encoding
Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding
Peiji Yang
Fengping Wang
Yicheng Zhong
Huawei Wei
Zhisheng Wang
15
0
0
21 Oct 2024
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction
  and Speculative Decoding
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
Tan Dat Nguyen
Ji-Hoon Kim
Jeongsoo Choi
Shukjae Choi
Jinseok Park
Younglo Lee
Joon Son Chung
26
0
0
17 Oct 2024
GAN-Based Speech Enhancement for Low SNR Using Latent Feature
  Conditioning
GAN-Based Speech Enhancement for Low SNR Using Latent Feature Conditioning
Shrishti Saha Shetu
Emanuël A. P. Habets
Andreas Brendel
16
1
0
17 Oct 2024
Enhancing Crowdsourced Audio for Text-to-Speech Models
Enhancing Crowdsourced Audio for Text-to-Speech Models
José Giraldo
Martí Llopart-Font
Alex Peiró-Lilja
Carme Armentano-Oller
Gerard Sant
Baybars Külebi
DiffM
16
0
0
17 Oct 2024
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice
  Interaction Abilities
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Xin Zhang
Xiang Lyu
Zhihao Du
Qian Chen
Dong Zhang
...
Yuxuan Wang
Bin Zhang
Heng Lu
Yaqian Zhou
Xipeng Qiu
AuLLM
28
5
0
09 Oct 2024
Sylber: Syllabic Embedding Representation of Speech from Raw Audio
Sylber: Syllabic Embedding Representation of Speech from Raw Audio
Cheol Jun Cho
Nicholas Lee
Akshat Gupta
Dhruv Agarwal
Ethan Chen
Alan W Black
Gopala K. Anumanchipalli
32
0
0
09 Oct 2024
FINALLY: fast and universal speech enhancement with studio-like quality
FINALLY: fast and universal speech enhancement with studio-like quality
Nicholas Babaev
Kirill Tamogashev
Azat Saginbaev
Ivan Shchekotov
Hanbin Bae
Hosang Sung
WonJun Lee
Hoon-Young Cho
Pavel Andreev
24
2
0
08 Oct 2024
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for
  Neural Codec Language Models
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Wenrui Liu
Zhifang Guo
Jin Xu
Yuanjun Lv
Yunfei Chu
Zhou Zhao
Junyang Lin
41
1
0
28 Sep 2024
Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds
Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds
Hanbin Bae
Pavel Andreev
Azat Saginbaev
Nicholas Babaev
Won-Jun Lee
Hosang Sung
Hoon-Young Cho
13
0
0
27 Sep 2024
Exploring synthetic data for cross-speaker style transfer in style
  representation based TTS
Exploring synthetic data for cross-speaker style transfer in style representation based TTS
Lucas Ueda
Leonardo B. de M. M. Marques
Flávio O. Simões
Mário Uliani Neto
Fernando Runstein
Bianca Dal Bó
Paula D. P. Costa
18
0
0
25 Sep 2024
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
Siyin Wang
Wenyi Yu
Yudong Yang
Changli Tang
Yixuan Li
...
Jun Zhang
Guangzhi Sun
Lu Lu
Yuxuan Wang
Chao Zhang
AuLLM
LM&MA
65
5
0
25 Sep 2024
FastTalker: Jointly Generating Speech and Conversational Gestures from
  Text
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
Zixin Guo
Jian Zhang
24
1
0
24 Sep 2024
Preference Alignment Improves Language Model-Based TTS
Preference Alignment Improves Language Model-Based TTS
Jinchuan Tian
Chunlei Zhang
Jiatong Shi
Hao Zhang
Jianwei Yu
Shinji Watanabe
Dong Yu
25
7
0
19 Sep 2024
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Sijing Chen
Yuan Feng
Laipeng He
Tianwei He
Wendi He
...
Huimin Zhang
Xiang Zhang
Guangcheng Zhao
Hongbin Zhou
Pengpeng Zou
25
4
0
18 Sep 2024
Simulating Native Speaker Shadowing for Nonnative Speech Assessment with
  Latent Speech Representations
Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations
Haopeng Geng
Daisuke Saito
Nobuaki Minematsu
25
0
0
18 Sep 2024
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
Jee-weon Jung
Yihan Wu
Xin Wang
Ji-Hoon Kim
Soumi Maiti
...
Joon Son Chung
Wangyou Zhang
Seyun Um
Shinnosuke Takamichi
Shinji Watanabe
58
1
0
18 Sep 2024
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis
  with Distilled Time-Varying Style Diffusion
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li
Xilin Jiang
Cong Han
N. Mesgarani
DiffM
29
4
0
16 Sep 2024
ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
Masao Someki
Kwanghee Choi
Siddhant Arora
William Chen
Samuele Cornell
Jionghao Han
Yifan Peng
Jiatong Shi
Vaibhav Srivastav
Shinji Watanabe
VLM
28
0
0
14 Sep 2024
123
Next