Sinovoice-2014-03-25
Environment setting
- Raid215 is a bit slow; move some denominator lattices and alignments to Raid212.
Corpora
- Labeling Beijing Mobile?
- Now a total of 1229h of telephone speech is ready (470h + 346h + 105h Beijing Mobile + 200h PICC + 108h HBTc).
- 16k 6000h data: 978h of online data from DataTang + 656h of online mobile data + 4300h of recording data (totals sanity-checked in the sketch after this list).
- LM corpus preparation done.
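A minimal bookkeeping sketch (the dictionary names are illustrative, not from our scripts) that re-adds the corpus hours quoted above:

  telephone_8k = {"original": 470, "new_telephone": 346, "bj_mobile": 105,
                  "picc": 200, "hbtc": 108}
  online_16k = {"datatang_online": 978, "online_mobile": 656, "recording": 4300}

  print(sum(telephone_8k.values()))  # 1229 -> matches the quoted 1229h
  print(sum(online_16k.values()))    # 5934 -> the "6000h" figure appears to be rounded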
Acoustic modeling
Telephone model training
1000h Training
- Jietong phone set, 200-hour seed, 10k-state training (compared against the baseline in the sketch after this list):
- Baseline: 8k states, 470+300 MPE4, 20.29
- MPE1: 21.91
- MPE2: 21.71
- MPE3: 21.68
- MPE4: 21.86
- CSLT phone set, 8k-state training
- MPE1: 20.60
- MPE2: 20.37
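A small illustrative sketch (numbers copied from the items above; labels are ad hoc) expressing each MPE iteration relative to the 20.29 baseline:

  baseline = 20.29
  mpe_results = {
      "jt_10k_MPE1": 21.91, "jt_10k_MPE2": 21.71, "jt_10k_MPE3": 21.68,
      "jt_10k_MPE4": 21.86, "cslt_8k_MPE1": 20.60, "cslt_8k_MPE2": 20.37,
  }
  for name, err in mpe_results.items():
      rel = (err - baseline) / baseline * 100.0
      print(f"{name}: {err:.2f} ({rel:+.1f}% relative to baseline)")
  # All iterations so far are still behind the 20.29 baseline.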
PICC dedicated training
- Need to collect financial text data and retrain the LM
- Need to comb the word list and training text
6000 hour 16k training
Training progress
- 6000h/CSLT phone set alignment/denlattice completed
- 6000h/jt phone set alignment/denlattice completed
- MPE training has been kicked off (objective recapped below)
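For reference, the standard MPE objective being maximized in this step (generic formulation, not copied from our training scripts; \kappa is the acoustic scale and A(s, s_r) the raw phone accuracy of hypothesis s against reference s_r over R utterances):

  \mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R}
      \frac{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, A(s, s_r)}
           {\sum_{s'} p_{\lambda}(O_r \mid s')^{\kappa}\, P(s')}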
Train Analysis
- The Qihang model used a subset of the 6k data
- 2500 + 950h + tang500h* + 20131220, approximately 1700 + 2400 hours
- GMM training using this subset achieved 22.47%. Xiaoming's result is 16.1%.
- It seems the database is still not very consistent
- Xiaoming will try to reproduce the Qihang training using this subset
- Tested the 1700h model and 6000h model on the T test sets
  model/testcase | ditu  | due1  | entity1 | rec1 | shiji | zaixian1 | zaixian2 | kuaisu
  1700h_mpe      | 12.18 | 12.93 | 5.29    | 3.69 | 21.73 | 25.38    | 19.45    | 12.50
  6000h_xEnt     | 11.13 | 10.12 | 4.64    | 2.80 | 17.67 | 27.45    | 23.23    | 10.98
- The 6000h model is generally better than the 1700h model on careful-reading and domain-specific recordings (see the per-set comparison sketch after this list)
- The 6000h model with MPE and the jt phone set is still in training; better performance is expected
- This indicates that we should prepare domain-specific AMs (not only 8k/16k); the online test sets favour online training data
- Suggest testing the 6000h model on the jidong data
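An illustrative per-set comparison of the two models, with values copied from the table above (dictionary names are hypothetical):

  wer_1700h_mpe = {"ditu": 12.18, "due1": 12.93, "entity1": 5.29, "rec1": 3.69,
                   "shiji": 21.73, "zaixian1": 25.38, "zaixian2": 19.45, "kuaisu": 12.50}
  wer_6000h_xent = {"ditu": 11.13, "due1": 10.12, "entity1": 4.64, "rec1": 2.80,
                    "shiji": 17.67, "zaixian1": 27.45, "zaixian2": 23.23, "kuaisu": 10.98}

  for t in wer_1700h_mpe:
      winner = "6000h_xEnt" if wer_6000h_xent[t] < wer_1700h_mpe[t] else "1700h_mpe"
      print(f"{t}: {wer_1700h_mpe[t]:.2f} vs {wer_6000h_xent[t]:.2f} -> {winner}")
  # Only the online sets (zaixian1, zaixian2) favour the 1700h MPE model.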
Language modeling
- Training data ready
- First focus on the PICC test set and try to improve the PPL (tracked as sketched below)
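A minimal sketch of the perplexity figure to track on the PICC text (the helper name is hypothetical; PPL = exp of the negative average log word probability):

  import math

  def perplexity(word_logprobs, num_words):
      # natural-log word probabilities; lower PPL is better
      return math.exp(-sum(word_logprobs) / num_words)

  # e.g. 100 words, each with probability 0.1, gives PPL = 10.0
  print(perplexity([math.log(0.1)] * 100, 100))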
DNN Decoder
Online decoder adaptation
- Incremental training finished (stream mode)
- 8k sentence test
  non-stream baseline MPE5: %WER 9.91 [ 4734 / 47753, 235 ins, 509 del, 3990 sub ]
  stream MPE1: %WER 9.66 [ 4612 / 47753, 252 ins, 490 del, 3870 sub ]
  stream MPE2: %WER 9.48 [ 4529 / 47753, 251 ins, 477 del, 3801 sub ]
  stream MPE3: %WER 9.43 [ 4502 / 47753, 230 ins, 484 del, 3788 sub ]
  stream MPE4: %WER 9.39 [ 4482 / 47753, 236 ins, 475 del, 3771 sub ]
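A quick sanity-check sketch for the bracketed counts above (%WER = (ins + del + sub) / reference words; the function name is illustrative):

  def wer_percent(ins, dels, subs, ref_words):
      return 100.0 * (ins + dels + subs) / ref_words

  print(round(wer_percent(236, 475, 3771, 47753), 2))  # 9.39, the stream MPE4 line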