SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

International Conference on Learning Representations (ICLR), 2023

31 August 2023

Xipeng Qiu

Papers citing "SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models"

50 / 74 papers shown

PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning

140

27 Nov 2025

DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation

168

25 Nov 2025

LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models

179

17 Oct 2025

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

218

01 Oct 2025

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

119

26 Sep 2025

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

132

26 Sep 2025

AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

155

26 Sep 2025

MBCodec: Thorough disentangle for high-fidelity audio compression

111

21 Sep 2025

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Luca Della Libera

Cem Subakan

Mirco Ravanelli

112

19 Sep 2025

DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

142

11 Sep 2025

FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot

168

02 Sep 2025

Analysing the Language of Neural Audio Codecs

01 Sep 2025

CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation

133

28 Aug 2025

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

228

22 Aug 2025

Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework

182

04 Aug 2025

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

...

136

04 Aug 2025

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

149

27 Jul 2025

Step-Audio 2 Technical Report

...

288

22 Jul 2025

DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

222

27 Jun 2025

MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

...

192

31 May 2025

Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

418

20 May 2025

Universal Semantic Disentangled Privacy-preserving Speech Representation Learning

...

Roberto Barra-Chicote

307

19 May 2025

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

261

19 May 2025

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

...

1.1K

05 May 2025

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

263

05 May 2025

Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech

275

15 Apr 2025

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

452

09 Apr 2025

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

International Conference on Learning Representations (ICLR), 2025

268

02 Mar 2025

From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval

328

18 Feb 2025

AudioMiXR: Spatial Audio Object Manipulation with 6DoF for Sound Design in Augmented Reality

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2025

Brandon Woodard

Margarita Geleta

Joseph J. LaViola Jr.

Andrea Fanelli

Rhonda Wilson

874

05 Feb 2025

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Computer Vision and Pattern Recognition (CVPR), 2024

338

29 Nov 2024

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

305

29 Nov 2024

MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios

Spoken Language Technology Workshop (SLT), 2024

271

01 Nov 2024

Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

185

21 Oct 2024

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

340

20 Oct 2024

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Md Mubtasim Ahasan

Md Fahim

Tasnim Mohiuddin

A. K. M. Mahbubur Rahman

352

19 Oct 2024

Code Drift: Towards Idempotent Neural Audio Codecs

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

925

14 Oct 2024

Graded Suspiciousness of Adversarial Texts to Human

Shakila Mahjabin Tonni

Pedro Faustini

Mark Dras

AAML

206

06 Oct 2024

SyllableLM: Learning Coarse Semantic Units for Speech Language Models

International Conference on Learning Representations (ICLR), 2024

Alan Baade

Puyuan Peng

David Harwath

328

05 Oct 2024

Recent Advances in Speech Language Models: A Survey

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

537

01 Oct 2024

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

Jin Xu

213

28 Sep 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Computer Vision and Pattern Recognition (CVPR), 2024

Kai Chen

Zhili Liu

...

Jun Yao

433

26 Sep 2024

Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

179

25 Sep 2024

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2024

Xinnuo Li

192

24 Sep 2024

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

Spoken Language Technology Workshop (SLT), 2024

Haibin Wu

Xuanjun Chen

Yi-Cheng Lin

Kaiwei Chang

Jiawei Du

...

Yi-Chiao Wu

Xu Tan

James Glass

Shinji Watanabe

Hung-yi Lee

179

21 Sep 2024

Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Xiaoyu Liu

Xu Li

Joan Serrà

Santiago Pascual

253

14 Sep 2024

Text-To-Speech Synthesis In The Wild

...

390

13 Sep 2024

LAST: Language Model Aware Speech Tokenization

A. Turetzky

Yossi Adi

287

05 Sep 2024

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

Spoken Language Technology Workshop (SLT), 2024

Dongchao Yang

Xixin Wu

Helen Meng

177

02 Sep 2024

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

International Conference on Learning Representations (ICLR), 2024

Yuancheng Wang

Zhizheng Wu

440

148

01 Sep 2024