Phonology-Guided Speech-to-Speech Translation for African Languages
We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech \emph{without} transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6{,}000-hour East African news corpus spanning five languages, we show that \emph{within-phylum} language pairs exhibit 30--40\% lower pause variance and over 3 higher onset/offset correlation compared to cross-phylum pairs. These findings motivate \textbf{SPaDA}, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment by +3--4 points and eliminates up to 38\% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train \textbf{SegUniDiff}, a diffusion-based S2ST model guided by \emph{external gradients} from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs.\ 28.9 for UnitY), reduces speaker error rate (EER) from 12.5\% to 5.3\%, and runs at an RTF of 1.02. To support evaluation in low-resource settings, we also release a three-tier, transcript-free BLEU suite (M1--M3) that correlates strongly with human judgments. Together, our results show that prosodic cues in multilingual speech provide a reliable scaffold for scalable, non-autoregressive S2ST.
View on arXiv