2308.16692
Cited By
v1
v2 (latest)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
International Conference on Learning Representations (ICLR), 2023
31 August 2023
Xin Zhang
Dong Zhang
Shimin Li
Yaqian Zhou
Xipeng Qiu
HuggingFace (1 upvote)
Github (560★)
Papers citing
"SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models"
50 / 72 papers shown
Title
LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models
Xiaohan Zhao
Hongyu Xiang
Shengze Ye
Song Li
Zhengkun Tian
Guanyu Chen
Ke Ding
Guanglu Wan
AuLLM
132
1
0
17 Oct 2025
FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
Jiaqi Li
Y. Qian
Yuxuan Hu
Leying Zhang
Xiaofei Wang
Heng Lu
Manthan Thakker
Jinyu Li
Sheng Zhao
Zhizheng Wu
162
0
0
01 Oct 2025
AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook
Yihao Chen
Kai Hu
Long Zhou
Shulin Feng
Xusheng Yang
Hangting Chen
Xie Chen
96
2
0
26 Sep 2025
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Junjie Cao
Yichen Han
Ruonan Zhang
Xiaoyang Hao
Hongxiang Li
Shuaijiang Zhao
Yue Liu
Xiao-Ping Zhang
83
0
0
26 Sep 2025
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Yuhan Song
Linhao Zhang
Chuhan Wu
Aiwei Liu
Wei Jia
Houfeng Wang
Xiao-bin Zhou
109
0
0
26 Sep 2025
MBCodec: Thorough disentangle for high-fidelity audio compression
Ruonan Zhang
Xiaoyang Hao
Yichen Han
Junjie Cao
Yue Liu
Kai Zhang
76
1
0
21 Sep 2025
FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
Luca Della Libera
Cem Subakan
Mirco Ravanelli
88
0
0
19 Sep 2025
DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
Xiaoxue Luo
Jinwei Huang
Runyan Yang
Yingying Gao
Junlan Feng
Chao Deng
Shilei Zhang
106
2
0
11 Sep 2025
FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie
Feiyu Shen
Junjie Li
Fenglong Xie
Xu Tang
Yao Hu
111
8
0
02 Sep 2025
Analysing the Language of Neural Audio Codecs
J. S. Park
Shinnosuke Takamichi
David M. Chan
Shunsuke Kando
Yuki Saito
Hiroshi Saruwatari
60
0
0
01 Sep 2025
CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation
Ruifan Deng
Yitian Gong
Qinghui Gao
Luozhijie Jin
Qinyuan Cheng
Zhaoye Fei
Shimin Li
Xipeng Qiu
AuLLM
105
2
0
28 Aug 2025
TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
Yuancheng Wang
Dekun Chen
Xueyao Zhang
Junan Zhang
Jiaqi Li
Zhizheng Wu
208
4
0
22 Aug 2025
Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
Andrea Di Pierno
Luca Guarnera
D. Allegra
Sebastiano Battiato
108
0
0
04 Aug 2025
SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec
Chunyu Qiang
Haoyu Wang
Cheng Gong
Tianrui Wang
Ruibo Fu
...
Zhengqi Wen
C. Zhang
Longbiao Wang
Jianwu Dang
Jianhua Tao
116
4
0
04 Aug 2025
ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
Kaizhi Qian
Xulin Fan
Junrui Ni
Slava Shechtman
M. Hasegawa-Johnson
Chuang Gan
Yang Zhang
126
0
0
27 Jul 2025
Step-Audio 2 Technical Report
Boyong Wu
Chao Yan
Chen Hu
Cheng Yi
Chengli Feng
...
Yuanwei Lu
Yuchu Luo
Yuhe Yin
Yumeng Zhan
Y. Zhang
AuLLM
183
0
0
22 Jul 2025
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Hang Shao
Heting Gao
Yunhang Shen
Jiawei Chen
Zuwei Long
Dong Yang
Ke Li
Xing Sun
AuLLM
MoE
151
2
0
27 Jun 2025
MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation
Yakun Song
Jiawei Chen
Xiaobin Zhuang
Chenpeng Du
Ziyang Ma
...
Dongya Jia
Zhuo Chen
Yuping Wang
Yuping Wang
Xie Chen
166
3
0
31 May 2025
Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
Haoyang Zhang
Hexin Liu
Xiangyu Zhang
Qiquan Zhang
Yuchen Hu
Junqi Zhao
Fei Tian
Xuerui Yang
Eng Siong Chng
342
0
0
20 May 2025
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
Zhengrui Ma
Yang Feng
Chenze Shao
Fandong Meng
Jie Zhou
Min Zhang
192
3
0
19 May 2025
Universal Semantic Disentangled Privacy-preserving Speech Representation Learning
Biel Tura Vecino
Subhadeep Maji
Aravind Varier
Antonio Bonafonte
Ivan Valles
...
Roberto Barra-Chicote
Ariya Rastrow
C. Papayiannis
Volker Leutnant
Trevor Wood
232
0
0
19 May 2025
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
Yemin Shi
Yu Shu
Siwei Dong
Guangyi Liu
Jaward Sesay
Jingwen Li
Zhiting Hu
AuLLM
VLM
220
2
0
05 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
975
25
0
05 May 2025
Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech
P. O'Reilly
Zeyu Jin
Jiaqi Su
Bryan Pardo
204
6
0
15 Apr 2025
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng
Yi-Chang Chen
Kuan-Yi Lee
Da-shan Shiu
Hung-yi Lee
AuLLM
376
11
0
09 Apr 2025
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
International Conference on Learning Representations (ICLR), 2025
Alexander H. Liu
Sang-gil Lee
Chao-Han Huck Yang
Yuan Gong
Yu-Chun Wang
James Glass
Rafael Valle
Bryan Catanzaro
SSL
216
4
0
02 Mar 2025
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
Jian Jia
Jingtong Gao
Ben Xue
Junhao Wang
Qingpeng Cai
Quan Chen
Xiangyu Zhao
Peng Jiang
Kun Gai
OffRL
281
5
0
18 Feb 2025
AudioMiXR: Spatial Audio Object Manipulation with 6DoF for Sound Design in Augmented Reality
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2025
Brandon Woodard
Margarita Geleta
Joseph J. LaViola Jr.
Andrea Fanelli
Rhonda Wilson
718
22
0
05 Feb 2025
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Computer Vision and Pattern Recognition (CVPR), 2024
Jianping Jiang
Weiye Xiao
Zhengyu Lin
Han Zhang
Tianxiang Ren
Yang Gao
Zhiqian Lin
Zhongang Cai
Lei Yang
Ziwei Liu
293
7
0
29 Nov 2024
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Julian Parker
Anton Smirnov
Jordi Pons
CJ Carr
Zack Zukowski
Zach Evans
Xubo Liu
265
49
0
29 Nov 2024
MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios
Spoken Language Technology Workshop (SLT), 2024
Xiao-Hang Jiang
Yang Ai
Rui Zheng
Hui-Peng Du
Ye-Xin Lu
Zhen-Hua Ling
253
9
0
01 Nov 2024
Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding
Peiji Yang
Fengping Wang
Yicheng Zhong
Huawei Wei
Zhisheng Wang
151
1
0
21 Oct 2024
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
AuLLM
VLM
277
6
0
20 Oct 2024
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Md Mubtasim Ahasan
Md Fahim
Tasnim Mohiuddin
A. K. M. Mahbubur Rahman
Vasu Sharma
Tariq Iqbal
M. A. Amin
Md. Mofijul Islam
Amin Ahsan Ali
273
3
0
19 Oct 2024
Code Drift: Towards Idempotent Neural Audio Codecs
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
P. O'Reilly
Prem Seetharaman
Jiaqi Su
Zeyu Jin
Bryan Pardo
864
3
0
14 Oct 2024
Graded Suspiciousness of Adversarial Texts to Human
Shakila Mahjabin Tonni
Pedro Faustini
Mark Dras
AAML
152
0
0
06 Oct 2024
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
International Conference on Learning Representations (ICLR), 2024
Alan Baade
Puyuan Peng
David Harwath
276
19
0
05 Oct 2024
Recent Advances in Speech Language Models: A Survey
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Wenqian Cui
Dianzhi Yu
Xiaoqi Jiao
Ziqiao Meng
Guangyan Zhang
Qichao Wang
Yiwen Guo
Irwin King
AuLLM
413
61
0
01 Oct 2024
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Wenrui Liu
Zhifang Guo
Jin Xu
Yuanjun Lv
Yunfei Chu
Zhou Zhao
Junyang Lin
180
4
0
28 Sep 2024
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Computer Vision and Pattern Recognition (CVPR), 2024
Kai Chen
Yunhao Gou
Runhui Huang
Zhili Liu
Daxin Tan
...
Qun Liu
Jun Yao
Lu Hou
Hang Xu
AuLLM
MLLM
VLM
373
41
0
26 Sep 2024
Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
Robin Shing-Hei Yuen
Timothy Tin-Long Tse
Jian Zhu
AuLLM
141
4
0
25 Sep 2024
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2024
Zhiyong Chen
Xinnuo Li
Zhiqi Ai
Shugong Xu
DiffM
143
3
0
24 Sep 2024
Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models
Spoken Language Technology Workshop (SLT), 2024
Haibin Wu
Xuanjun Chen
Yi-Cheng Lin
Kaiwei Chang
Jiawei Du
...
Yi-Chiao Wu
Xu Tan
James Glass
Shinji Watanabe
Hung-yi Lee
151
14
0
21 Sep 2024
Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Xiaoyu Liu
Xu Li
Joan Serrà
Santiago Pascual
225
5
0
14 Sep 2024
Text-To-Speech Synthesis In The Wild
Jee-weon Jung
Wangyou Zhang
Soumi Maiti
Yihan Wu
Xin Eric Wang
...
Hye-jin Shim
Nicholas W. D. Evans
Joon Son Chung
Shinnosuke Takamichi
Shinji Watanabe
303
3
0
13 Sep 2024
LAST: Language Model Aware Speech Tokenization
A. Turetzky
Yossi Adi
231
8
0
05 Sep 2024
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
Spoken Language Technology Workshop (SLT), 2024
Haohan Guo
Fenglong Xie
Kun Xie
Dongchao Yang
Dake Guo
Xixin Wu
Helen Meng
142
11
0
02 Sep 2024
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
International Conference on Learning Representations (ICLR), 2024
Yuancheng Wang
Haoyue Zhan
Liwei Liu
Ruihong Zeng
Haotian Guo
Jiachen Zheng
Qiang Zhang
Shunsi Zhang
Zhizheng Wu
332
135
0
01 Sep 2024
Progressive Residual Extraction based Pre-training for Speech Representation Learning
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024
Tianrui Wang
Jin Li
Ziyang Ma
Rui Cao
Xie Chen
...
Meng Ge
Xiaobao Wang
Yuguang Wang
Jianwu Dang
Nyima Tashi
SSL
250
3
0
31 Aug 2024
SSDM: Scalable Speech Dysfluency Modeling
Neural Information Processing Systems (NeurIPS), 2024
Jiachen Lian
Xuanru Zhou
Z. Ezzes
Jet M J Vonk
Brittany Morin
D. Baquirin
Zachary Miller
M. G. Tempini
Gopala Anumanchipalli
AuLLM
231
18
0
29 Aug 2024