CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Zhen Ye, Wei Xue*, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceedingConference proceedingpeer-review

Abstract

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step while achieving high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performances in the distilled CoMoSpeech. Our experiments show that by generating audio recordings by a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step sampling based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples and codes are available at https://comospeech.github. https://comospeech.github.io/.
Original languageEnglish
Title of host publicationMM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery (ACM)
Pages1831–1839
Number of pages9
Edition1st
ISBN (Electronic)9798400701085
DOIs
Publication statusPublished - 27 Oct 2023
Event31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 20233 Nov 2023
https://dl.acm.org/doi/proceedings/10.1145/3581783 (Conference proceedings)
https://www.acmmm2023.org/ (Conference website)

Publication series

NameProceedings of the ACM International Conference on Multimedia

Conference

Conference31st ACM International Conference on Multimedia, MM 2023
Country/TerritoryCanada
CityOttawa
Period29/10/233/11/23
Internet address

User-Defined Keywords

  • Text-to-speech
  • Singing Voice Synthesis
  • Diffusion Model
  • Consistency Model

Fingerprint

Dive into the research topics of 'CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model'. Together they form a unique fingerprint.

Cite this