Abstract
Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples and code are available at https://comospeech.github.io/.
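To make the "times faster than real-time" figure concrete, the following is a minimal sketch of how such a speed-up is commonly computed: the duration of the generated audio divided by the wall-clock time needed to synthesize it. The function names and the example durations are hypothetical and chosen only for illustration; they are not measurements or code from the paper.

```python
def realtime_speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return how many times faster than real time the synthesis ran."""
    if audio_seconds <= 0 or synthesis_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / synthesis_seconds


def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return the real-time factor (RTF), the inverse of the speed-up."""
    return 1.0 / realtime_speedup(audio_seconds, synthesis_seconds)


# Hypothetical numbers (not from the paper): 10 s of audio generated in 62.5 ms.
speedup = realtime_speedup(audio_seconds=10.0, synthesis_seconds=0.0625)
rtf = real_time_factor(audio_seconds=10.0, synthesis_seconds=0.0625)
print(f"speed-up: {speedup:.0f}x real time, RTF: {rtf:.5f}")
```

Under this definition, a speed-up above 150 corresponds to a real-time factor below 1/150, that is, less than about 6.7 ms of compute per second of generated audio.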
Original language | English |
---|---|
Title of host publication | MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia |
Publisher | Association for Computing Machinery (ACM) |
Pages | 1831–1839 |
Number of pages | 9 |
Edition | 1st |
ISBN (Electronic) | 9798400701085 |
DOIs | |
Publication status | Published - 27 Oct 2023 |
Event | 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada; Duration: 29 Oct 2023 → 3 Nov 2023; https://dl.acm.org/doi/proceedings/10.1145/3581783 (Conference proceedings); https://www.acmmm2023.org/ (Conference website) |
Publication series
Name | Proceedings of the ACM International Conference on Multimedia |
---|
Conference
Conference | 31st ACM International Conference on Multimedia, MM 2023 |
---|---|
Country/Territory | Canada |
City | Ottawa |
Period | 29/10/23 → 3/11/23 |
User-Defined Keywords
- Text-to-speech
- Singing Voice Synthesis
- Diffusion Model
- Consistency Model