Sinovoice-2014-03-18

Environment setting

  • Raid215 is a bit slow; move some den-lattices and alignments to Raid212.

Corpora

  • PICC data are done (200h).
  • Hubei telecom data are done (108h).
  • In total, 1229h of telephone speech is now ready (470 + 346 + 105 BJ mobile + 200 PICC + 108 Hubei telecom).
  • 16k 6000h data: 978h online data from DataTang + 656h online mobile data + 4300h recorded data (the totals are double-checked in the sketch after this list).
  • LM corpus preparation done.
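
A quick arithmetic check of the corpus totals above (component hours taken from the bullets; the variable and set names are only for illustration):

<pre>
# Sanity check of the corpus totals listed in the Corpora section.
telephone_8k_hours = {"470h set": 470, "346h set": 346, "BJ mobile": 105,
                      "PICC": 200, "Hubei telecom": 108}
recording_16k_hours = {"DataTang online": 978, "online mobile": 656, "recording": 4300}

print(sum(telephone_8k_hours.values()))    # 1229 h of telephone speech
print(sum(recording_16k_hours.values()))   # 5934 h, i.e. the "6000h" 16k set
</pre>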

Acoustic modeling

Telephone model training

1000h Training

  • xEnt training completed. Compiling lattices.
  • Need to test the xEnt performance

PICC dedicated training

  • Need to collect financial text data and retrain the LM
  • Need to comb through the word list and training text (see the sketch after this list)
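
A minimal sketch of what "combing" the word list against the training text could look like: flag list entries that never occur in the text, and frequent text tokens missing from the list. The file names, the frequency threshold, and whitespace tokenization are assumptions, not from these notes:

<pre>
# Hypothetical file names; assumes the training text is already word-segmented.
from collections import Counter

with open("picc_train_text.txt", encoding="utf-8") as f:
    counts = Counter(tok for line in f for tok in line.split())

with open("picc_wordlist.txt", encoding="utf-8") as f:
    wordlist = {w.strip() for w in f if w.strip()}

unseen  = [w for w in wordlist if counts[w] == 0]        # list words never seen in the text
missing = [w for w, c in counts.most_common()
           if w not in wordlist and c >= 5]              # frequent tokens not in the list
print(len(unseen), "unseen list words;", len(missing), "frequent OOV tokens")
</pre>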


6000 hour 16k training

Training progress

  • 6000h / CSLT phone set: alignment and den-lattices completed
  • 6000h / jt phone set: alignment and den-lattices completed
  • MPE training has been kicked off


Training Analysis

  • The Qihang model used a subset of the 6k data
  • 2500 + 950h + tang500h* + 20131220, approximately 1700 + 2400 hours
  • GMM training using this subset achieved 22.47%. Xiaoming's result is 16.1%.
  • It seems the database is still not very consistent
  • Xiaoming will try to reproduce the Qihang training using this subset
  • Tested the 1700h model and 6000h model on the T test sets
  model / test set | ditu  | due1  | entity1 | rec1 | shiji | zaixian1 | zaixian2 | kuaisu
  -----------------+-------+-------+---------+------+-------+----------+----------+-------
  1700h_mpe        | 12.18 | 12.93 | 5.29    | 3.69 | 21.73 | 25.38    | 19.45    | 12.50
  6000h_xEnt       | 11.13 | 10.12 | 4.64    | 2.80 | 17.67 | 27.45    | 23.23    | 10.98
  • The 6000h model is generally better than the 1700h model for careful reading or domain-specific recordings (per-set differences are worked out in the sketch after this list)
  • The 6000h model with MPE and the jt phone set is still in training; better performance is expected
  • This indicates that we should prepare domain-specific AMs (not only 8k/16k); the online test sets favor online training data
  • Suggest testing the 6000h model on the jidong data
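
To make the table above easier to read, a small sketch that prints the per-set difference between the two models. The numbers are copied from the table (presumably WER in %, lower is better); the per-set diff framing is ours, not from the notes:

<pre>
# Error rate (%) per test set, copied from the table above (lower is better).
test_sets      = ["ditu", "due1", "entity1", "rec1", "shiji", "zaixian1", "zaixian2", "kuaisu"]
wer_1700h_mpe  = [12.18, 12.93, 5.29, 3.69, 21.73, 25.38, 19.45, 12.50]
wer_6000h_xent = [11.13, 10.12, 4.64, 2.80, 17.67, 27.45, 23.23, 10.98]

for name, a, b in zip(test_sets, wer_1700h_mpe, wer_6000h_xent):
    print(f"{name:9s}  1700h_mpe {a:5.2f}  6000h_xEnt {b:5.2f}  diff {b - a:+5.2f}")
# Negative diffs (most sets) favour the 6000h xEnt model; the two online sets
# (zaixian1/zaixian2) are the ones where the 1700h MPE model is still ahead.
</pre>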

Hubei telecom

  • Incremental training with the Hubei telecom data, based on the (470+300+BJmobile) model; MPE4 finished
  • The original model: 27.30
  • The adapted model: 25.42 (the relative improvement is worked out in the sketch below)
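
For reference, the improvement from the incremental training can be computed directly from the two numbers above (trivial, but the "relative reduction" framing is ours):

<pre>
# Relative error-rate reduction of the adapted Hubei telecom model.
original, adapted = 27.30, 25.42
print(f"absolute: {original - adapted:.2f}  relative: {(original - adapted) / original:.1%}")
# -> absolute: 1.88  relative: 6.9%
</pre>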

Language modeling

  • Training data ready
  • First focus on the PICC test set and try to improve the PPL (a minimal PPL sketch follows below)
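
Since the immediate goal is to improve PPL on the PICC set, a minimal sketch of how perplexity is computed from per-word log10 probabilities (as most n-gram toolkits report them); the function name and the example numbers are illustrative only:

<pre>
def perplexity(log10_probs):
    """Perplexity from per-word log10 probabilities."""
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg_log10)

# Illustrative only: three words with log10 prob -2 each -> PPL = 100.
print(perplexity([-2.0, -2.0, -2.0]))   # 100.0
</pre>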

DNN Decoder

Online decoder adaptation

  • Finished alignment and den-lattice generation
  • First round of MPE training is ongoing, at about 2 days/iteration (see the rough estimate below)
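
A rough timeline estimate from the 2 days/iteration figure, assuming the MPE pass runs to 4 iterations as the MPE4 runs elsewhere in these notes do (the iteration count is our assumption):

<pre>
days_per_iteration = 2   # from the note above
mpe_iterations     = 4   # assumption: 4 iterations, as with the MPE4 runs above
print(f"~{days_per_iteration * mpe_iterations} days for the MPE pass")   # ~8 days
</pre>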