Sinovoice-2014-03-18

Environment setting

  • Raid215 is a bit slow; move some den-lattices and alignments to Raid212.

Corpora

  • PICC data are done (200h).
  • Hubei telecom data are done (108h).
  • In total, 1229h of telephone speech is now ready (470 + 346 + 105 BJ mobile + 200 PICC + 108 Hubei telecom).
  • 16k 6000h data: 978h online data from DataTang + 656h online mobile data + 4300h recorded data (the totals are double-checked in the sketch after this list).
  • LM corpus preparation done.
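
A quick arithmetic check of the corpus totals above (component hours taken from the bullets; the variable and set names are only for illustration):

<pre>
# Sanity check of the corpus totals listed in the Corpora section.
telephone_8k_hours = {"470h set": 470, "346h set": 346, "BJ mobile": 105,
                      "PICC": 200, "Hubei telecom": 108}
recording_16k_hours = {"DataTang online": 978, "online mobile": 656, "recording": 4300}

print(sum(telephone_8k_hours.values()))    # 1229 h of telephone speech
print(sum(recording_16k_hours.values()))   # 5934 h, i.e. the "6000h" 16k set
</pre>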

Acoustic modeling

Telephone model training

1000h Training

  • xEnt training completed. Compiling lattices.
  • Need to test the xEnt performance

PICC dedicated training

  • Need to collect financial text data and retrain the LM
  • Need to comb through the word list and training text (see the sketch after this list)
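
A minimal sketch of what "combing" the word list against the training text could look like: flag list entries that never occur in the text, and frequent text tokens missing from the list. The file names, the frequency threshold, and whitespace tokenization are assumptions, not from these notes:

<pre>
# Hypothetical file names; assumes the training text is already word-segmented.
from collections import Counter

with open("picc_train_text.txt", encoding="utf-8") as f:
    counts = Counter(tok for line in f for tok in line.split())

with open("picc_wordlist.txt", encoding="utf-8") as f:
    wordlist = {w.strip() for w in f if w.strip()}

unseen  = [w for w in wordlist if counts[w] == 0]        # list words never seen in the text
missing = [w for w, c in counts.most_common()
           if w not in wordlist and c >= 5]              # frequent tokens not in the list
print(len(unseen), "unseen list words;", len(missing), "frequent OOV tokens")
</pre>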


6000 hour 16k training

Training progress

  • 6000h / CSLT phone set: alignment and den-lattices completed
  • 6000h / jt phone set: alignment and den-lattices completed
  • MPE training has been kicked off


Training Analysis

  • The Qihang model used a subset of the 6k data
  • 2500 + 950h + tang500h* + 20131220, approximately 1700 + 2400 hours
  • GMM training using this subset achieved 22.47%. Xiaoming's result is 16.1%.
  • It seems the database is still not very consistent
  • Xiaoming will try to reproduce the Qihang training using this subset
  • Tested the 1700h model and 6000h model on the T test sets
  model / test set | ditu  | due1  | entity1 | rec1 | shiji | zaixian1 | zaixian2 | kuaisu
  -----------------+-------+-------+---------+------+-------+----------+----------+-------
  1700h_mpe        | 12.18 | 12.93 | 5.29    | 3.69 | 21.73 | 25.38    | 19.45    | 12.50
  6000h_xEnt       | 11.13 | 10.12 | 4.64    | 2.80 | 17.67 | 27.45    | 23.23    | 10.98
  • The 6000h model is generally better than the 1700h model for careful reading or domain-specific recordings (per-set differences are worked out in the sketch after this list)
  • The 6000h model with MPE and the jt phone set is still in training; better performance is expected
  • This indicates that we should prepare domain-specific AMs (not only 8k/16k); the online test sets favor online training data
  • Suggest testing the 6000h model on the jidong data
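
To make the table above easier to read, a small sketch that prints the per-set difference between the two models. The numbers are copied from the table (presumably WER in %, lower is better); the per-set diff framing is ours, not from the notes:

<pre>
# Error rate (%) per test set, copied from the table above (lower is better).
test_sets      = ["ditu", "due1", "entity1", "rec1", "shiji", "zaixian1", "zaixian2", "kuaisu"]
wer_1700h_mpe  = [12.18, 12.93, 5.29, 3.69, 21.73, 25.38, 19.45, 12.50]
wer_6000h_xent = [11.13, 10.12, 4.64, 2.80, 17.67, 27.45, 23.23, 10.98]

for name, a, b in zip(test_sets, wer_1700h_mpe, wer_6000h_xent):
    print(f"{name:9s}  1700h_mpe {a:5.2f}  6000h_xEnt {b:5.2f}  diff {b - a:+5.2f}")
# Negative diffs (most sets) favour the 6000h xEnt model; the two online sets
# (zaixian1/zaixian2) are the ones where the 1700h MPE model is still ahead.
</pre>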

Hubei telecom

  • Incremental training with the Hubei telecom data, based on the (470+300+BJmobile) model; MPE4 finished
  • The original model: 27.30
  • The adapted model: 25.42 (the relative improvement is worked out in the sketch below)
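
For reference, the improvement from the incremental training can be computed directly from the two numbers above (trivial, but the "relative reduction" framing is ours):

<pre>
# Relative error-rate reduction of the adapted Hubei telecom model.
original, adapted = 27.30, 25.42
print(f"absolute: {original - adapted:.2f}  relative: {(original - adapted) / original:.1%}")
# -> absolute: 1.88  relative: 6.9%
</pre>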

Language modeling

  • Training data ready
  • First focus on the PICC test set and try to improve the PPL (a minimal PPL sketch follows below)
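
Since the immediate goal is to improve PPL on the PICC set, a minimal sketch of how perplexity is computed from per-word log10 probabilities (as most n-gram toolkits report them); the function name and the example numbers are illustrative only:

<pre>
def perplexity(log10_probs):
    """Perplexity from per-word log10 probabilities."""
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg_log10)

# Illustrative only: three words with log10 prob -2 each -> PPL = 100.
print(perplexity([-2.0, -2.0, -2.0]))   # 100.0
</pre>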

DNN Decoder

Online decoder adaptation

  • Finished alignment and den-lattice generation
  • First round of MPE training is ongoing, at about 2 days/iteration (see the rough estimate below)
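
A rough timeline estimate from the 2 days/iteration figure, assuming the MPE pass runs to 4 iterations as the MPE4 runs elsewhere in these notes do (the iteration count is our assumption):

<pre>
days_per_iteration = 2   # from the note above
mpe_iterations     = 4   # assumption: 4 iterations, as with the MPE4 runs above
print(f"~{days_per_iteration * mpe_iterations} days for the MPE pass")   # ~8 days
</pre>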