Audio Samples from "Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions"

[arXiv] [GitHub Repo]
Abstract: Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sup-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
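The pretext objective described in the abstract combines two predictions per phoneme: the masked phoneme itself and its corresponding grapheme. The following is a minimal sketch of how such a combined loss might be computed, not the authors' implementation. The function names `cross_entropy` and `pretext_loss`, the unweighted sum of the two terms, and the choice to average the grapheme term over all positions while averaging the phoneme term only over masked positions are all assumptions made for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against class index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def pretext_loss(phoneme_logits, grapheme_logits,
                 phoneme_targets, grapheme_targets, masked_positions):
    """Return (total, mlm, p2g) for one sequence.

    Hypothetical helper: the unweighted sum and the averaging scheme
    are assumptions, not taken from the paper.
    """
    # Masked phoneme prediction: scored only at the masked positions.
    if masked_positions:
        mlm = sum(cross_entropy(phoneme_logits[i], phoneme_targets[i])
                  for i in masked_positions) / len(masked_positions)
    else:
        mlm = 0.0
    # Phoneme-to-grapheme prediction: scored at every position (assumed).
    p2g = sum(cross_entropy(logits, target)
              for logits, target in zip(grapheme_logits, grapheme_targets))
    p2g /= len(grapheme_targets)
    return mlm + p2g, mlm, p2g
```

Under these assumptions, dropping the `p2g` or `mlm` term from the sum would roughly correspond to the "w/o P2G Loss" and "w/o MLM Loss" conditions listed in the ablation study on this page.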

StyleTTS w/ PL-BERT | StyleTTS w/o PL-BERT

This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.

All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.

For more samples, please go to our survey used for MOS evaluation here.

Out-of-distribution Examples


This section contains OOD examples for which our subjective evaluation shows an improvement with PL-BERT over the models without it. We do not present in-distribution examples here because our subjective evaluation shows no significant improvement from BERT on in-distribution texts.

Text: "I'm seeing another Egyptian influence. The Yahwist rewriters were curiously pro-Egyptian and may be referencing the Egyptian serpent Apep, who represents Darkness and Chaos and is the great enemy of Amun-Ra."

StyleTTS w/ PL-BERT | StyleTTS w/ MP-BERT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2

Text: Obviously somebody was trying to tell me something, but although I tried, I could not make any sense of the messenger or the message.

StyleTTS w/ PL-BERT | StyleTTS w/ MP-BERT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2

Text: Cuthbert Dawkins urges us to share the searching and sensitivity of the young Africans. He urges us for the sake of those around us to be part of the flow of life-giving water which comes from Jesus.

StyleTTS w/ PL-BERT | StyleTTS w/ MP-BERT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2

Text: "You should know I'm not afraid of hell. If there is one. You're dead and you haven't even left this house."

StyleTTS w/ PL-BERT | StyleTTS w/ MP-BERT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2

Text: "Let me go alone. I can climb the mountain faster by myself. I can examine the summit and report back to you. It may be safer if you stay within the trees. Maybe they'll help to shield you."

StyleTTS w/ PL-BERT | StyleTTS w/ MP-BERT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2

Ablation Study


Text: So there are some really big issues there that people like can confront if they fully feel that truth has this power to free and make you happy, rather than cause hurt.

Baseline | w/o P2G Loss | w/o MLM Loss | w/o PL-BERT

Text: That night I did not sleep at all, struck by this new idea. All new strength had come to me from somewhere above. I felt that I had made another victory over my own self.

Baseline | w/o P2G Loss | w/o MLM Loss | w/o PL-BERT