
Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding

Milan Sečujski, Darko Pekar, Siniša Suzić, Anton A Smirnov, Tijana Nosek

Publication year
2020
Citations
11
Access
Open access

Abstract

The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e., the mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the similarities and differences between speakers and speaking styles more efficiently. The initial model from which speaker/style adaptation was carried out was a multi-speaker/multi-style model based on 8.5 hours of American English speech data, corresponding to 16 different speaker/style combinations. The results of the experiments show that both versions of the obtained system, one using 10 minutes and the other as little as 30 seconds of target data, outperform the state of the art in parametric speaker/style-dependent speech synthesis. This opens a wide range of applications of speaker/style-dependent speech synthesis based on small quantities of training data, in domains ranging from customer interaction in call centers to robot-assisted medical therapy.
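
The embedding idea described in the abstract can be illustrated with a small sketch. This is not the authors' architecture, which the abstract does not specify. It only shows, under stated assumptions, how discrete speaker/style IDs can be mapped to continuous vectors and appended to per-frame input features, and how a table might be extended with one extra row for a new target speaker/style. The count of 16 combinations comes from the abstract. The embedding size, the feature size, and all function names are hypothetical.

```python
import numpy as np

NUM_COMBINATIONS = 16  # speaker/style combinations in the initial model (from the abstract)
EMBEDDING_DIM = 8      # assumption: the abstract only says "low-dimensional"
FEATURE_DIM = 32       # assumption: size of a per-frame linguistic feature vector

rng = np.random.default_rng(0)

# One row per speaker/style combination. In a real system these rows would be
# learned jointly with the network; here they are just random placeholders.
embedding_table = rng.normal(scale=0.1, size=(NUM_COMBINATIONS, EMBEDDING_DIM))


def embed(table, combination_ids):
    """Map integer speaker/style IDs to their continuous vectors (row lookup)."""
    return table[np.asarray(combination_ids)]


def build_network_input(table, linguistic_features, combination_id):
    """Append the same speaker/style vector to every frame of an utterance."""
    num_frames = linguistic_features.shape[0]
    vectors = embed(table, [combination_id] * num_frames)
    return np.concatenate([linguistic_features, vectors], axis=1)


def add_target_row(table):
    """Extend the table with a row for a new target speaker/style.

    Hypothetical initialization: the mean of the existing rows. Adaptation on
    a small amount of target data would then update this row (and possibly
    other weights); that training step is not shown here.
    """
    new_row = table.mean(axis=0, keepdims=True)
    return np.vstack([table, new_row])


frames = rng.normal(size=(5, FEATURE_DIM))  # 5 placeholder frames
inputs = build_network_input(embedding_table, frames, combination_id=3)
adapted_table = add_target_row(embedding_table)
```

Because each combination is a point in a shared continuous space, similar voices or styles can end up close together, which is the property the abstract credits for efficient adaptation from very little target data.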

Keywords

Speech recognition, Computer science, Speaker diarisation, Speech synthesis, Artificial neural network, Speaker recognition, Embedding, Style (visual arts), Space (punctuation), Range (aeronautics)
