TTS-project-synthesis

From cslt Wiki
=Introduction=
 
We are interested in flexible speech synthesis based on neural models. The basic idea is that, since a neural model can be trained under multiple conditions, we can treat speaker and emotion as conditioning factors. We use a speaker vector and an emotion vector as additional inputs to the model, and then train a single model that can produce the voices of different speakers with different emotions.

In the following experiments, we use a simple DNN architecture to implement the training. The vocoder is WORLD.
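The conditioning idea can be sketched as follows. This is only an illustration with invented layer sizes and random (untrained) weights, not the actual network used in these experiments; the point is that the speaker and emotion vectors are simply concatenated with the linguistic features at the DNN input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sizes, for illustration only.
LING_DIM, SPK_DIM, EMO_DIM, HID_DIM, OUT_DIM = 100, 40, 4, 256, 66

# A one-hidden-layer DNN with random (untrained) weights.
W1 = rng.standard_normal((LING_DIM + SPK_DIM + EMO_DIM, HID_DIM)) * 0.01
b1 = np.zeros(HID_DIM)
W2 = rng.standard_normal((HID_DIM, OUT_DIM)) * 0.01
b2 = np.zeros(OUT_DIM)

def synthesize_frame(ling, spk_vec, emo_vec):
    """Predict acoustic features for one frame, conditioned on the
    speaker and emotion vectors appended to the linguistic input."""
    x = np.concatenate([ling, spk_vec, emo_vec])
    h = np.tanh(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # acoustic features handed to the vocoder

# Same linguistic input, different speaker vector -> different voice.
ling = rng.standard_normal(LING_DIM)
female_vec = rng.standard_normal(SPK_DIM)  # stand-ins for learned vectors
male_vec = rng.standard_normal(SPK_DIM)
neutral = np.zeros(EMO_DIM)

out_female = synthesize_frame(ling, female_vec, neutral)
out_male = synthesize_frame(ling, male_vec, neutral)
```

Because the condition vectors are ordinary inputs, one trained network can switch speakers or emotions at synthesis time just by swapping these vectors.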
=Experiments=

==Mono-speaker==

The first step is mono-speaker systems. We trained three systems: a female voice, a male voice and a child voice, each with a single network. The performance is illustrated by the following samples.
  
Synthesis text: 好雨知时节,当春乃发声,随风潜入夜,润物细无声
  
*Female[http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speakers/huilian/female01/female01_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
  
*Child[http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speakers/huilian/child01.neutral/child01-neutral_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
  
==Multi-speaker==

Now we combine all the data from the male, female and child speakers to train a single model.
 
 
===Without Speaker-vector===

In the first experiment, the data are blindly combined, without any indicator of the speaker.
 
*Female & Male[http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speakers/mix/female01-male01/female01-male01_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
*Female & Child
*Male & Child
  
  
===With Speaker-vector===

Now we use the speaker vector as an indicator of the speaker's traits. At synthesis time, we simply supply the speaker vector of the desired person.

*Specific person

First, we use the speaker vector to specify a particular person:
 
 
:*Female[http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speakers/mix/all.dvector40/female01.dvec40_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
 
  
 
:*Male[http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speakers/mix/all.dvector40/male01.dvec40_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
 
  
*Interpolate the speaker vectors of different persons

Now let's produce an interpolated voice by interpolating two speakers: female and male.

:* Female & Male with different ratios
  
::*(1) 0.0:1.0
::*(11) 1.0:0.0[http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speakers/mix/iterpolation/female01_male01/iterpolation_10_female01_male01_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
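The interpolated voices above come from linearly mixing the two speaker vectors before synthesis. A minimal sketch; the 40-dimensional random vectors here are stand-ins (the dimension matches the dvec40 naming of the samples), not the actual learned vectors:

```python
import numpy as np

def interpolate_speaker(vec_a, vec_b, ratio):
    """Linearly interpolate two speaker vectors.
    ratio is the weight of vec_a: 1.0 gives pure vec_a, 0.0 pure vec_b."""
    return ratio * vec_a + (1.0 - ratio) * vec_b

rng = np.random.default_rng(0)
female = rng.standard_normal(40)  # stand-in speaker vectors; the real
male = rng.standard_normal(40)    # ones would come from the trained model

# Ratios from 0.0:1.0 up to 1.0:0.0 in steps of 0.1, as in the samples.
mixtures = [interpolate_speaker(female, male, r)
            for r in np.linspace(0.0, 1.0, 11)]
# Each vector in `mixtures` replaces the speaker-vector input at synthesis
# time, moving the voice gradually between the two speakers.
```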
  
==Mono-speaker Multi-Emotion==

An emotion vector specifies which emotion to use, and emotions can also be interpolated.

*Specific emotion
 
:* Neutral emotion [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/emotion/roobo.child/x-neutral_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
 
:* Happy emotion
:* Sorrow emotion
:* Angry emotion

*Interpolation emotion
:* Angry & neutral with different ratios
  
 
==Multi-speaker Multi-emotion==
 
Finally, all the data (different speakers and different emotions) are combined. Note that only the child voice has training data with different emotions. We hope that the emotion can be learned in such a way that we can generate emotional voices for the other speakers, even though they have no emotional training data.
 
*Female
:* angry [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/female01_angry_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
:* happy [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/female01_happy_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
:* neutral [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/female01_neutral_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
:* sorrow [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/female01_sorrow_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]

*Male
:* angry [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/male01_angry_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
:* happy [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/male01_happy_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
:* neutral [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/male01_neutral_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
:* sorrow [http://zhangzy.cslt.org/categories/tts/sample-wav/mimic-wangd-front-end/multi-speaker_multi-emotion/male01_sorrow_final_5_amdurTanh_acTanh_mlpg1_postfilter1.world.wav01.wav]
 
=MLPG Comparison=

We compare different implementations of MLPG, following Merlin (mlpg.py and fast_mlpg.py). There are three implementations:

:*mlpg: as in mlpg.py, computing all dimensions of the delta features (including lf0/bap/mgc, whose dimensions are 1/5/60 respectively)
:*mlpg-lossy: an incorrect variant of mlpg.py that considers only the first dimension of the global covariance
:*fast-mlpg: as in fast_mlpg.py in Merlin
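For reference, generic MLPG for a single feature dimension can be sketched as follows (a textbook-style illustration, not Merlin's actual code): given predicted means and variances of the static and delta streams, the smoothed static trajectory c solves (W^T P W) c = W^T P mu, where W stacks the regression windows and P is the diagonal precision (inverse variance) matrix.

```python
import numpy as np

def mlpg(mu, var, windows):
    """MLPG for a single feature dimension.

    mu, var: (T, K) predicted means and variances for the K feature
    streams (static plus deltas) at each of T frames.
    windows: K regression windows, e.g. [[1.0], [-0.5, 0.0, 0.5]].
    Returns the smoothed static trajectory of shape (T,).
    """
    T, K = mu.shape
    W = np.zeros((K * T, T))
    for k, win in enumerate(windows):
        half = (len(win) - 1) // 2
        for t in range(T):
            for j, w in enumerate(win):
                tau = t + j - half
                if 0 <= tau < T:
                    W[k * T + t, tau] += w
    p = (1.0 / var).T.reshape(-1)  # precisions, stream-major like W's rows
    m = mu.T.reshape(-1)           # means, flattened the same way
    WtP = W.T * p                  # W^T P
    return np.linalg.solve(WtP @ W, WtP @ m)

# Example: 5 frames, static + delta streams; a tight delta variance
# pulls the noisy static means toward a smooth trajectory.
T = 5
mu = np.stack([np.array([0.0, 1.0, 0.0, 1.0, 0.0]),  # static means
               np.zeros(T)], axis=1)                 # delta means
var = np.stack([np.ones(T), 0.1 * np.ones(T)], axis=1)
traj = mlpg(mu, var, [[1.0], [-0.5, 0.0, 0.5]])
```

Solving the dense system costs O(T^3) per dimension, which is why fast implementations exploit the banded structure of W^T P W; that structural difference is what the timing comparison below reflects.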
 
*Computation Time (estimated)

 -----------------------------------------------------------------
    alg.    |   lf0 (dim=1)   |   bap (dim=5)   |   mgc (dim=60)
 mlpg-lossy |     100000      |     130000      |     160000
    mlpg    |     130000      |     500000      |    6200000
 fast-mlpg  |      60000      |     300000      |    3580000
  avg-rate  |    1:1.3:0.6    |     1:4:2+      |    1:40:20+
 -----------------------------------------------------------------
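The avg-rate row is simply each column normalized by the mlpg-lossy time; for the lf0 column, for example:

```python
# lf0 timings from the table above (estimated, arbitrary units).
lf0 = {"mlpg-lossy": 100000, "mlpg": 130000, "fast-mlpg": 60000}
base = lf0["mlpg-lossy"]
ratios = [round(lf0[name] / base, 1)
          for name in ("mlpg-lossy", "mlpg", "fast-mlpg")]
print(ratios)  # -> [1.0, 1.3, 0.6], i.e. the 1:1.3:0.6 in the table
```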
 
 +
* Synthesis waves
:*text
::*5 = '好雨知时节,当春乃发声,随风潜入夜,润物细无声。'
::*13 = '大熊猫最大的愿望就是拍一张自己的照片。'
 
* no-mlpg
:*5 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/mlpg-no_5.wav]
:*13 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/mlpg-no_13.wav]

* mlpg-lossy
:*5 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/mlpg01_5.wav]
:*13 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/mlpg01_13.wav]

* mlpg
:*5 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/mlpg60_5.wav]
:*13 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/mlpg60_13.wav]

* fast-mlpg
:*5 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/fast-mlpg_5.wav]
:*13 [http://zhangzy.cslt.org/categories/tts/sample-wav/mlpg-cmp/fast-mlpg_13.wav]

Latest revision as of 12:12, 18 February 2019

Project name

Text To Speech

Project members

Dong Wang, Zhiyong Zhang
