Sinovoice-2014-02-25
DNN training
Environment setting
- Two queues: 100.q dedicated to decoding, all.q dedicated to GMM training/MPE lattice generation
- disk203-disk210: distributed disks, for parallel jobs
- /nfs/disk1: GPU tasks on train212; /nfs/disk2: GPU tasks on train215
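As an illustration of how jobs could be routed to the two queues above, here is a minimal sketch assuming a standard Sun Grid Engine setup; the helper and the job-script names are hypothetical, not the actual submission scripts.
<pre>
# Minimal sketch of routing jobs to the two SGE queues described above.
# Assumes a standard Sun Grid Engine installation; the job scripts are hypothetical.
import subprocess

QUEUES = {
    "decode": "100.q",   # dedicated to decoding
    "train": "all.q",    # dedicated to GMM training / MPE lattice generation
}

def submit(job_script, kind, log_dir="exp/log"):
    """Submit a job script to the queue that matches its kind ('decode' or 'train')."""
    cmd = ["qsub", "-q", QUEUES[kind], "-o", log_dir, "-e", log_dir, job_script]
    subprocess.run(cmd, check=True)

submit("decode_test_set.sh", kind="decode")   # goes to 100.q
submit("make_mpe_lattices.sh", kind="train")  # goes to all.q
</pre>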
Corpora
- PICC data (200h) are being labeled and should be ready in one week.
- 105h data from BJ mobile
- In total, 1121h (470 + 346 + 105 + 200) of telephone speech will be ready soon.
- 16k 6000h data: 978h of online data from DataTang + 656h of online mobile data + 4300h of recorded data
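A trivial check of the hour totals quoted above (the numbers are taken directly from the list):
<pre>
# Quick arithmetic check of the corpus totals listed above.
telephone_hours = [470, 346, 105, 200]        # existing sets + BJ mobile + PICC
print(sum(telephone_hours))                   # 1121 h, matching the "1121h" figure

wideband_16k_hours = [978, 656, 4300]         # DataTang online + online mobile + recorded
print(sum(wideband_16k_hours))                # 5934 h, i.e. roughly the "6000h" 16k set
</pre>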
Telephone model training
470h + 300h + BJ mobile 105h training
(1) 105h BJ mobile re-training without the NOISE phone: 33.97% WER
(2) 105h BJ mobile re-training with the NOISE phone in training, but decoding without NOISE: 34.27% WER
(3) Setup (2) + noise-decoding (with the NOISE phone in the lexicon/LM); still under investigation
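As a rough illustration of the lexicon side of item (3), the sketch below appends a noise word to a Kaldi-style lexicon; the symbols (&lt;NOISE&gt;, NSN) and the file path are assumptions for illustration, not the actual setup.
<pre>
# Hedged sketch: add a NOISE entry to a Kaldi-style lexicon so noise events can be
# decoded explicitly, as in item (3) above. Symbols and the path are illustrative.
from pathlib import Path

LEXICON = Path("data/local/dict/lexicon.txt")   # assumed "word phone ..." lines
NOISE_WORD, NOISE_PHONE = "<NOISE>", "NSN"      # assumed word / phone symbols

def add_noise_entry(lexicon_path):
    """Append the NOISE word to the lexicon if it is not already present."""
    lines = [l for l in lexicon_path.read_text(encoding="utf-8").splitlines() if l.strip()]
    if not any(l.split()[0] == NOISE_WORD for l in lines):
        lines.append(f"{NOISE_WORD} {NOISE_PHONE}")
        lexicon_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

add_noise_entry(LEXICON)
</pre>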
BJ mobile incremental training
(1) Original 470 + 300 model: 30.24% WER
(2) Incremental DT training with 105h BJ data: 27.01% WER
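For reference, the change from 30.24% to 27.01% is roughly a 10.7% relative WER reduction; a one-line check:
<pre>
# Relative WER reduction from incremental DT training with the 105h BJ data.
baseline, incremental_dt = 30.24, 27.01
print(f"{(baseline - incremental_dt) / baseline * 100:.1f}% relative")  # ~10.7%
</pre>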
6000 hour 16k training
Training progress
- Ran CE DNN training to iteration 11 (8400 states, 80000 pdfs)
- Test WER is down to 12.46% (Sinovoice's result: 10.46%); see the table below.
Model | WER (%) | RT |
---|---|---|
small LM, it 4, -5/-9 | 15.80 | 1.18 |
large LM, it 4, -5/-9 | 15.30 | 1.50 |
large LM, it 4, -6/-9 | 15.36 | 1.30 |
large LM, it 4, -7/-9 | 15.25 | 1.30 |
large LM, it 5, -5/-9 | 14.17 | 1.10 |
large LM, it 5, -5/-10 | 13.77 | 1.29 |
large LM, it 6, -5/-9 | 13.64 | 1.12 |
large LM, it 6, -5/-10 | 13.25 | 1.33 |
large LM, it 7, -5/-9 | 13.29 | 1.12 |
large LM, it 7, -5/-10 | 12.87 | 1.17 |
large LM, it 8, -5/-9 | 13.09 | - |
large LM, it 8, -5/-10 | 12.69 | - |
large LM, it 9, -5/-9 | 12.87 | - |
large LM, it 9, -5/-10 | 12.55 | - |
large LM, it 10, -5/-9 | 12.83 | 1.51 |
large LM, it 10, -5/-10 | 12.48 | 1.65 |
large LM, it 11, -5/-9 | 12.87 | 1.61 |
large LM, it 11, -5/-10 | 12.46 | 1.28 |
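To pick an operating point from a table like this one (trading WER against RT, presumably the real-time factor), a small hypothetical helper could filter by an RT budget; the data below are copied from a subset of the rows that report both WER and RT.
<pre>
# Hypothetical helper: choose the lowest-WER configuration from the table above
# under a real-time-factor budget. Only a subset of rows (those with RT) is shown.
RESULTS = [
    # (model, WER %, RT)
    ("large LM, it 5, -5/-10", 13.77, 1.29),
    ("large LM, it 6, -5/-10", 13.25, 1.33),
    ("large LM, it 7, -5/-10", 12.87, 1.17),
    ("large LM, it 10, -5/-10", 12.48, 1.65),
    ("large LM, it 11, -5/-9", 12.87, 1.61),
    ("large LM, it 11, -5/-10", 12.46, 1.28),
]

def best_under_rt(results, rt_budget):
    """Return the lowest-WER entry whose RT stays within the budget, or None."""
    feasible = [r for r in results if r[2] <= rt_budget]
    return min(feasible, key=lambda r: r[1]) if feasible else None

print(best_under_rt(RESULTS, rt_budget=1.3))
# -> ('large LM, it 11, -5/-10', 12.46, 1.28)
</pre>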
- Additional xEnt training with DNN alignments should be completed in 2 days
- DT training is still in the queue, waiting for lattice generation
- The first version of the DT model will be trained with the online data (1700h)
Training Analysis
- Shared-tree GMM model training completed; WER is similar to the non-shared model.
- Selected 100h of online data and trained two systems: (1) a di-syllable system, (2) a jt-phone system. WER results:
<pre>
        di-syl    jt-ph
Xent    15.42%    14.78%
MPE1    14.46%    14.23%
MPE2    14.22%    14.09%
MPE3    14.26%    13.80%
MPE4    14.24%    13.68%
</pre>
Auto Transcription
- PICC development set decoding obtained 45% WER.
- PICC auto-transcription incremental DT training completed; WER by confidence threshold:
<pre>
Threshold    WER
org:         45.03%
0.9:         41.89%
0.8:         41.64%
</pre>
- The current training data at threshold 0.8 comprise 80k sentences, amounting to about 60h of data (selection step sketched below).
- Sampling 60h of labelled data to enrich the training set.
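A hedged sketch of the confidence-based selection step implied above, assuming a simple per-utterance confidence file; the file formats, paths, and threshold handling are hypothetical, not the actual pipeline.
<pre>
# Hedged sketch: keep auto-transcribed utterances whose confidence clears the
# threshold (cf. the 0.9 / 0.8 results above). File formats are hypothetical:
# CONF_FILE lines are assumed to be "utt-id confidence", TEXT_FILE lines
# "utt-id transcription".
CONF_FILE = "auto_trans/confidence"
TEXT_FILE = "auto_trans/text"
OUT_FILE = "auto_trans/text.selected"
THRESHOLD = 0.8

def select_utterances(threshold):
    """Write utterances with confidence >= threshold to OUT_FILE; return the count."""
    with open(CONF_FILE, encoding="utf-8") as f:
        keep = {line.split()[0] for line in f
                if line.strip() and float(line.split()[1]) >= threshold}
    kept = 0
    with open(TEXT_FILE, encoding="utf-8") as fin, open(OUT_FILE, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip() and line.split()[0] in keep:
                fout.write(line)
                kept += 1
    return kept

print(select_utterances(THRESHOLD), "utterances kept")  # ~80k sentences at 0.8 in the report
</pre>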
DNN Decoder
- Online decoder
  - Integration almost completed
  - Initial CMN implementation finished
  - The first step is to tune the prior probability of the global CMN, and then consider re-training with DT; a rough sketch of the global-prior idea follows.
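A minimal sketch of that idea, assuming the online CMN subtracts a mean that interpolates between a global (prior) mean and the running per-utterance mean, with the prior weight as the quantity to tune; this is an illustration under those assumptions, not the decoder's actual implementation.
<pre>
# Minimal sketch of online CMN with a global prior (illustration only).
# The prior weight -- a pseudo-frame count given to the global mean -- plays the
# role of the "prior" to be tuned; early frames lean on the global mean, and the
# per-utterance estimate takes over as frames accumulate.
import numpy as np

class OnlineCMN:
    def __init__(self, global_mean, prior_weight=100.0):
        self.global_mean = np.asarray(global_mean, dtype=float)
        self.prior_weight = prior_weight
        self.frame_sum = np.zeros_like(self.global_mean)
        self.n_frames = 0

    def normalize(self, frame):
        """Subtract an interpolated mean from one incoming feature frame."""
        frame = np.asarray(frame, dtype=float)
        self.frame_sum += frame
        self.n_frames += 1
        mean = (self.prior_weight * self.global_mean + self.frame_sum) / (
            self.prior_weight + self.n_frames)
        return frame - mean

# Usage: feed frames as they arrive from the online front-end (13-dim MFCCs assumed).
cmn = OnlineCMN(global_mean=np.zeros(13), prior_weight=100.0)
normalized = cmn.normalize(np.random.randn(13))
</pre>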