Speech synthesis with face embeddings

Xing Wu*, Sihui Ji, Jianjia Wang, Yike Guo

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

6 Citations (Scopus)

Abstract

Human beings can imagine a person's voice from his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially speech synthesis from face images. On the basis of the implicit relationship between a speaker's face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder and an improved multi-speaker text-to-speech (TTS) engine. On the one hand, the voice encoder generates voice embeddings from the speaker's speech, and the face encoder extracts voice features from the speaker's face as f-voice embeddings. On the other hand, the multi-speaker TTS engine synthesizes speech conditioned on the voice embeddings and f-voice embeddings. We have conducted extensive experiments evaluating the proposed SSFE on synthesized speech quality and face-voice matching degree: the Mean Opinion Score of the SSFE is above 3.7 and the matching degree is about 1.7. The experimental results show that the proposed SSFE outperforms state-of-the-art methods in terms of both speech quality and face-voice matching degree.
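The abstract describes two encoders that project reference speech and face images into a shared speaker-embedding space, on which the multi-speaker TTS engine then conditions. The following is a minimal PyTorch sketch of that idea only; the module architectures, dimensions, and the cosine training objective are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the SSFE embedding stage. All layer choices and
# sizes below are assumptions for illustration, not the published model.

class VoiceEncoder(nn.Module):
    """Maps a mel-spectrogram of reference speech to a voice embedding."""
    def __init__(self, n_mels=80, embed_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, embed_dim, batch_first=True)

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        _, (h, _) = self.rnn(mel)
        return F.normalize(h[-1], dim=-1)   # (batch, embed_dim)

class FaceEncoder(nn.Module):
    """Maps a face image to an 'f-voice' embedding in the same space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, img):                 # img: (batch, 3, H, W)
        x = self.conv(img).flatten(1)
        return F.normalize(self.fc(x), dim=-1)

# Training would pull each speaker's f-voice embedding toward the voice
# embedding of the same speaker, e.g. with a cosine-similarity loss:
voice_enc, face_enc = VoiceEncoder(), FaceEncoder()
mel = torch.randn(4, 120, 80)               # dummy reference speech
img = torch.randn(4, 3, 64, 64)             # dummy face images
loss = 1 - F.cosine_similarity(voice_enc(mel), face_enc(img)).mean()
loss.backward()
```

Under this reading, the TTS engine (not shown) accepts either embedding as its speaker condition, so at inference the f-voice embedding can stand in for the voice embedding when only a face image of the speaker is available.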

Original language: English
Pages (from-to): 14839-14852
Number of pages: 14
Journal: Applied Intelligence
Volume: 52
Early online date: 18 Mar 2022
DOIs
Publication status: Published - Oct 2022

Scopus Subject Areas

  • Artificial Intelligence

User-Defined Keywords

  • Face to voice
  • Multi-speaker text-to-speech
  • Multi-view speech synthesis
  • Visual-audio
