DNN training

Environment setting

Two queues: 100.q dedicated to decoding, all.q dedicated to GMM training/MPE lattice generation
disk203-disk210: distributed disks, for parallel jobs
/nfs/disk1: train212 GPU task; /nfs/disk2: train215 GPU task

Corpora

PICC data (200h) are being labeled and should be ready in one week.
105h data from BJ mobile
In total, 1121h (470 + 346 + 105 + 200) of telephone speech will be ready soon.
16k 6000h data: 978h online data from DataTang + 656h online mobile data + 4300h recording data

Telephone model training

470 + 300h + BJ mobile 105h training

(1) 105h BJ mobile re-training without NOISE: 33.97% WER

(2) 105h BJ mobile re-training with the NOISE phone in training, but decoding without NOISE: 34.27% WER

(3) Setup (2) plus noise decoding (NOISE phone in lexicon/LM): still under investigation

BJ mobile incremental training

(1) Original 470 + 300 model: 30.24% WER

(2) Incremental DT training with 105h BJ data: 27.01% WER
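
The gain from incremental training can also be read as a relative WER reduction. A minimal sketch of the arithmetic, using only the two figures above; the function name is illustrative and not part of any toolkit.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures from the BJ mobile incremental training results above.
baseline = 30.24     # original 470 + 300 model
incremental = 27.01  # after incremental DT training with 105h BJ data

absolute = baseline - incremental
relative = relative_wer_reduction(baseline, incremental)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1f}%")
```

With these numbers the absolute gain is 3.23 points, roughly 10.7% relative.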

6000 hour 16k training

Training progress

Ran CE DNN to iteration 11 (8400 states, 80000 pdf)
Test WER has come down to 12.46% (Sinovoice result: 10.46%).

Model	WER (%)	RT
small LM, it 4, -5/-9	15.80	1.18
large LM, it 4, -5/-9	15.30	1.50
large LM, it 4, -6/-9	15.36	1.30
large LM, it 4, -7/-9	15.25	1.30
large LM, it 5, -5/-9	14.17	1.10
large LM, it 5, -5/-10	13.77	1.29
large LM, it 6, -5/-9	13.64	1.12
large LM, it 6, -5/-10	13.25	1.33
large LM, it 7, -5/-9	13.29	1.12
large LM, it 7, -5/-10	12.87	1.17
large LM, it 8, -5/-9	13.09	-
large LM, it 8, -5/-10	12.69	-
large LM, it 9, -5/-9	12.87	-
large LM, it 9, -5/-10	12.55	-
large LM, it 10, -5/-9	12.83	1.51
large LM, it 10, -5/-10	12.48	1.65
large LM, it 11, -5/-9	12.87	1.61
large LM, it 11, -5/-10	12.46	1.28
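
To make the table easier to compare, the rows can be loaded into a small script. This is only a sketch: the tuple layout and variable names are assumptions for illustration, and the third field is copied verbatim as an opaque setting label without interpreting it.

```python
# Rows copied from the table above:
# (LM, iteration, setting, WER %, RT or None when not reported).
ROWS = [
    ("small", 4, "-5/-9", 15.80, 1.18),
    ("large", 4, "-5/-9", 15.30, 1.50),
    ("large", 4, "-6/-9", 15.36, 1.30),
    ("large", 4, "-7/-9", 15.25, 1.30),
    ("large", 5, "-5/-9", 14.17, 1.10),
    ("large", 5, "-5/-10", 13.77, 1.29),
    ("large", 6, "-5/-9", 13.64, 1.12),
    ("large", 6, "-5/-10", 13.25, 1.33),
    ("large", 7, "-5/-9", 13.29, 1.12),
    ("large", 7, "-5/-10", 12.87, 1.17),
    ("large", 8, "-5/-9", 13.09, None),
    ("large", 8, "-5/-10", 12.69, None),
    ("large", 9, "-5/-9", 12.87, None),
    ("large", 9, "-5/-10", 12.55, None),
    ("large", 10, "-5/-9", 12.83, 1.51),
    ("large", 10, "-5/-10", 12.48, 1.65),
    ("large", 11, "-5/-9", 12.87, 1.61),
    ("large", 11, "-5/-10", 12.46, 1.28),
]

# Row with the lowest WER overall.
best = min(ROWS, key=lambda row: row[3])

# WER difference between the -5/-9 and -5/-10 settings, per iteration,
# for the iterations where the large LM was tested with both (5 to 11).
wer = {(lm, it, setting): w for lm, it, setting, w, _ in ROWS}
gains = [
    wer[("large", it, "-5/-9")] - wer[("large", it, "-5/-10")]
    for it in range(5, 12)
]
mean_gain = sum(gains) / len(gains)

print("best row:", best)
print(f"mean WER gain of -5/-10 over -5/-9: {mean_gain:.2f} points")
```

The lowest WER is the iteration 11 row with -5/-10 (12.46%), and the -5/-10 setting is about 0.38 points better than -5/-9 on average over iterations 5 to 11.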

Additional xEnt training with DNN alignment should be completed in 2 days.
DT training is still in the queue, waiting for lattice generation.
The first version of the DT model will be trained with online data (1700h).

Training Analysis

Shared-tree GMM model training completed; WER is similar to the non-shared model.
Selected 100h online data, trained two systems: (1) di-syllable system (2) jt-phone system

        di-syl      jt-ph
xEnt    15.42%      14.78%
MPE1    14.46%      14.23%
MPE2    14.22%      14.09%
MPE3    14.26%      13.80%
MPE4    14.24%      13.68%

Auto Transcription

PICC development set decoding obtained 45% WER.
PICC auto-trans incremental DT training completed

Threshold  WER
org:     45.03%
0.9:     41.89%
0.8:     41.64%

With threshold 0.8, the current training data comprise 80k sentences, about 60h.
Sampling 60h labelled data to enrich the training
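
The thresholds above presumably select automatically transcribed utterances by a confidence score. A minimal sketch of such a filter, assuming each utterance carries a single confidence value in [0, 1]; the `Utterance` type, its fields, and the sample values are hypothetical and not taken from the actual pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    utt_id: str
    text: str          # automatic transcript
    confidence: float  # assumed utterance-level score in [0, 1]
    duration_s: float  # duration in seconds


def select_by_confidence(utts: Iterable[Utterance], threshold: float) -> List[Utterance]:
    """Keep utterances whose confidence is at least the threshold."""
    return [u for u in utts if u.confidence >= threshold]


def total_hours(utts: Iterable[Utterance]) -> float:
    """Total duration of the given utterances in hours."""
    return sum(u.duration_s for u in utts) / 3600.0


# Made-up sample data, for illustration only.
sample = [
    Utterance("utt1", "hello", 0.95, 1800.0),
    Utterance("utt2", "world", 0.85, 1800.0),
    Utterance("utt3", "noise", 0.40, 3600.0),
]

kept_08 = select_by_confidence(sample, 0.8)
kept_09 = select_by_confidence(sample, 0.9)
print(len(kept_08), total_hours(kept_08))
print(len(kept_09), total_hours(kept_09))
```

A lower threshold keeps more data at the cost of noisier transcripts, which is the trade-off the WER column above measures.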

DNN Decoder

Online decoder

Integration almost completed
Initial CMN implementation finished
The first step is to tune the prior prob of the global CMN; re-training with DT will be considered after that.
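
One common way to use a global mean in online CMN is to treat it as a prior on the running mean, weighted by a pseudo-count of frames. The sketch below illustrates that idea only; it is an assumption about what the prior refers to here, not the decoder's actual implementation, and `online_cmn` and `prior_weight` are hypothetical names.

```python
from typing import List, Sequence


def online_cmn(frames: Sequence[Sequence[float]],
               global_mean: Sequence[float],
               prior_weight: float) -> List[List[float]]:
    """Subtract a running mean that starts from the global mean.

    prior_weight acts as a pseudo-count of frames: the larger it is,
    the longer the global mean dominates the running estimate.
    """
    running_sum = [prior_weight * m for m in global_mean]
    count = prior_weight
    normalized = []
    for frame in frames:
        # Update the statistics with the current frame, then normalize it.
        running_sum = [s + x for s, x in zip(running_sum, frame)]
        count += 1.0
        mean = [s / count for s in running_sum]
        normalized.append([x - m for x, m in zip(frame, mean)])
    return normalized


# Toy one-dimensional features, for illustration only.
frames = [[1.0], [3.0]]
with_prior = online_cmn(frames, global_mean=[0.0], prior_weight=2.0)
no_prior = online_cmn(frames, global_mean=[0.0], prior_weight=0.0)
print(with_prior)
print(no_prior)
```

With `prior_weight` set to 0 the first frame is normalized by itself alone; a larger value keeps early frames close to the global statistics, so this weight is the kind of parameter that would need tuning.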

Sinovoice-2014-02-25
