Super Monotonic Alignment Search

12 September 2024

Junhyeok Lee

Hyeongju Kim

Main:3 Pages

2 Figures

Bibliography:2 Pages

1 Table

Abstract

Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most popular algorithms in text-to-speech for estimating unknown alignments between text and speech. Since this algorithm needs to search for the most probable alignment with dynamic programming by caching all possible paths, the time complexity of the algorithm is $O(T \times S)$, where $T$ is the length of the text and $S$ is the length of the speech representation. The authors of Glow-TTS run this algorithm on CPU, and while they mentioned that it is difficult to parallelize, we found that MAS can be parallelized along the text-length dimension and that CPU execution consumes an inordinate amount of time for inter-device copy. Therefore, we implemented a Triton kernel and a PyTorch JIT script to accelerate MAS on GPU without inter-device copy. As a result, the Super-MAS Triton kernel is up to 72 times faster in the extreme-length case. The code is available at this https URL.
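To make the dynamic program concrete, here is a minimal pure-Python reference sketch of MAS. It is not the authors' Triton kernel or JIT script. The function name `monotonic_alignment_search` and the input layout `log_p[i][j]` (log-likelihood of speech frame `j` under text token `i`) are assumptions chosen for illustration. The sketch assumes the usual MAS constraints: the path starts at the first token, ends at the last token, and advances by at most one token per frame.

```python
NEG_INF = float("-inf")


def monotonic_alignment_search(log_p):
    """Return (path, score) for the most probable monotonic alignment.

    log_p[i][j] is the log-likelihood of speech frame j under text token i
    (hypothetical layout for this sketch). path[j] is the text token index
    assigned to speech frame j.
    """
    T = len(log_p)      # text length
    S = len(log_p[0])   # speech length
    if T > S:
        raise ValueError("text length must not exceed speech length")

    # q[j][i]: best cumulative score of a path ending at token i on frame j.
    q = [[NEG_INF] * T for _ in range(S)]
    q[0][0] = log_p[0][0]

    for j in range(1, S):
        prev = q[j - 1]
        # Every i below reads only the previous frame's column, so the
        # iterations of this inner loop are independent of each other.
        # This is the text-length dimension that can be parallelized.
        for i in range(min(j + 1, T)):
            stay = prev[i]                            # same token as frame j-1
            move = prev[i - 1] if i > 0 else NEG_INF  # advance by one token
            q[j][i] = log_p[i][j] + max(stay, move)

    # Backtrack from the last token on the last frame.
    path = [0] * S
    i = T - 1
    for j in range(S - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[j - 1][i - 1] > q[j - 1][i]):
            i -= 1
    return path, q[S - 1][T - 1]


if __name__ == "__main__":
    # Two text tokens, three speech frames.
    example = [[-1.0, -5.0, -5.0],
               [-5.0, -1.0, -1.0]]
    print(monotonic_alignment_search(example))
```

The outer loop over speech frames stays sequential because each column depends on the previous one. The table has $T \times S$ entries, matching the $O(T \times S)$ complexity stated above.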
