2013-09-06

Data sharing

  • LM count files still undelivered!

DNN progress

Discriminative DNN

  • 1000 hour training not yet finished.

Sparse DNN

  • Cut 50% of the weights, then continued training with learning rate 0.0025; training finished after 6 iterations (a magnitude-pruning sketch follows this list). %WER comparison:
set          no-sparse   sparse (1/2)
map          23.75       23.90
2044         21.47       21.45
notetp3      13.17       13.65
record1900    8.10        8.18
general      34.41       34.34
online1      33.02       32.92
online2      25.99       26.06
speedup      23.52       23.58
  • The comparison shows very similar performance.
  • Next, cut more weights from the current sparse model, leading to iterative sparsification.
  • Test the sparse model on noisy data.
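
A minimal sketch of the weight-cutting step, assuming numpy weight matrices (the actual experiments used the lab's own DNN training tools):

    import numpy as np

    def prune_by_magnitude(w, fraction=0.5):
        # Zero the `fraction` of weights with the smallest magnitudes.
        threshold = np.percentile(np.abs(w), fraction * 100.0)
        mask = (np.abs(w) >= threshold).astype(w.dtype)
        return w * mask, mask

    # Iterative sparsity: retrain while re-applying `mask` after every update
    # (so pruned weights stay zero), then call prune_by_magnitude again on the
    # surviving weights to cut more.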

FBank features

Tested on 100-hour data, with both clean and noisy speech.

1) MFCC 100_1200_1200_1200_1200_3580

      map: %WER 23.75 [ 3474 / 14628, 134 ins, 373 del, 2967 sub ]
      2044: %WER 21.47 [ 4991 / 23241, 304 ins, 664 del, 4023 sub ]
      notetp3: %WER 13.17 [ 244 / 1853, 10 ins, 26 del, 208 sub ]
      record1900: %WER 8.10 [ 963 / 11888, 217 ins, 299 del, 447 sub ]
      general: %WER 34.41 [ 12943 / 37619, 779 ins, 785 del, 11379 sub ]
      online1: %WER 33.02 [ 9388 / 28433, 522 ins, 1465 del, 7401 sub ]
      online2: %WER 25.99 [ 15363 / 59101, 873 ins, 2408 del, 12082 sub ]
      speedup: %WER 23.52 [ 1236 / 5255, 72 ins, 213 del, 951 sub ]

2) GFCC 100_1200_1200_1200_1200_3625

      map: %WER 22.95 [ 3357 / 14628, 109 ins, 471 del, 2777 sub ]
      2044: %WER 20.93 [ 4865 / 23241, 387 ins, 748 del, 3730 sub ]
      notetp3: %WER 15.43 [ 286 / 1853, 41 ins, 26 del, 219 sub ]
      record1900: %WER 7.32 [ 870 / 11888, 107 ins, 266 del, 497 sub ]
      general: %WER 31.57 [ 11878 / 37619, 587 ins, 861 del, 10430 sub ]
      online1: %WER 31.83 [ 9049 / 28433, 519 ins, 1506 del, 7024 sub ]
      online2: %WER 25.20 [ 14894 / 59101, 839 ins, 2434 del, 11621 sub ]
      speedup: %WER 22.97 [ 1207 / 5255, 73 ins, 221 del, 913 sub ]

3) FB 100_1200_1200_1200_1200_3625

      map: %WER 20.88 [ 3055 / 14628, 88 ins, 385 del, 2582 sub ]
      2044: %WER 19.69 [ 4576 / 23241, 296 ins, 643 del, 3637 sub ]
      notetp3: %WER 12.79 [ 237 / 1853, 12 ins, 25 del, 200 sub ]
      record1900: %WER 7.38 [ 877 / 11888, 221 ins, 277 del, 379 sub ]
      general: %WER 31.88 [ 11993 / 37619, 752 ins, 740 del, 10501 sub ]
      online1: %WER 31.54 [ 8969 / 28433, 491 ins, 1455 del, 7023 sub ]
      online2: %WER 24.89 [ 14711 / 59101, 733 ins, 2394 del, 11584 sub ]
      speedup: %WER 21.54 [ 1132 / 5255, 55 ins, 210 del, 867 sub ]
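
For reference, each %WER line above reads as errors / total words followed by the insertion, deletion, and substitution counts; a quick Python check against the MFCC "map" line:

    # %WER = 100 * (ins + del + sub) / total_words
    ins, dels, subs, words = 134, 373, 2967, 14628
    errors = ins + dels + subs            # 3474, matching "3474 / 14628"
    wer = 100.0 * errors / words          # 23.75, matching "%WER 23.75"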

  • The FB feature is clearly better than both MFCC and GFCC, probably because less information is lost when the DCT is omitted.
  • We need to investigate how many filter banks are most appropriate.
  • Following this assumption, we should test whether LDA causes a similar information loss, find a suitable output dimension for LDA, and look for a simple non-linear discriminative transform that loses less information.
  • We also need to verify that the advantage of FB really comes from preserving information. An alternative hypothesis is that FB is more suitable for CMN: the DCT mixes a number of noisy channels, making each cepstral coefficient more uncertain and thus harder to normalize by CMN. We should compare how FB and MFCC perform with and without CMN (see the sketch after this list).
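
To make the two hypotheses concrete: MFCC is essentially a truncated DCT over the log filter-bank energies, and CMN subtracts the per-utterance mean of each coefficient. A minimal sketch, assuming `logfbank` is a (frames x n_filters) numpy array of log mel filter-bank energies:

    import numpy as np
    from scipy.fftpack import dct

    def fbank_to_mfcc(logfbank, num_ceps=13):
        # Type-II DCT along the filter-bank axis; truncating to num_ceps
        # coefficients is exactly where information is discarded.
        return dct(logfbank, type=2, axis=1, norm='ortho')[:, :num_ceps]

    def cmn(feats):
        # Per-utterance cepstral mean normalization: remove the mean of each
        # dimension over time, which cancels stationary channel effects.
        return feats - feats.mean(axis=0, keepdims=True)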

15 dB noisy data:

1) FB: 100_1200_1200_1200_1200_3580

   map: %WER 62.20 [ 9098 / 14628, 33 ins, 2917 del, 6148 sub ]
   2044: %WER 45.75 [ 10632 / 23241, 183 ins, 2740 del, 7709 sub ]
   notetp3: %WER 54.56 [ 1011 / 1853, 11 ins, 471 del, 529 sub ]
   record1900: %WER 23.97 [ 2849 / 11888, 28 ins, 1276 del, 1545 sub ]
   general: %WER 65.93 [ 24804 / 37619, 125 ins, 5136 del, 19543 sub ]
   online1: %WER 48.06 [ 13666 / 28433, 411 ins, 3824 del, 9431 sub ]
   online2: %WER 45.83 [ 27086 / 59101, 678 ins, 7441 del, 18967 sub ]
   speedup: %WER 61.52 [ 3233 / 5255, 5 ins, 1038 del, 2190 sub ]

2) MFCC 100_1200_1200_1200_1200_3580

   map: %WER 65.24 [ 9544 / 14628, 48 ins, 2841 del, 6655 sub ]
   2044: %WER 48.93 [ 11372 / 23241, 176 ins, 2803 del, 8393 sub ]
   notetp3: %WER 55.91 [ 1036 / 1853, 9 ins, 476 del, 551 sub ]
   record1900: %WER 25.43 [ 3023 / 11888, 27 ins, 1387 del, 1609 sub ]
   general: %WER 70.05 [ 26352 / 37619, 141 ins, 5336 del, 20875 sub ]
   online1: %WER 50.40 [ 14329 / 28433, 431 ins, 3827 del, 10071 sub ]
   online2: %WER 48.45 [ 28632 / 59101, 664 ins, 7930 del, 20038 sub ]
   speedup: %WER 64.78 [ 3404 / 5255, 13 ins, 1084 del, 2307 sub ]
  • Need to investigate Gammatone filter banks (GFB): the same idea as FB, keeping as much information as possible. It may also be possible to combine FB and GFB for better performance (see the sketch below).
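
One simple form the FB + GFB combination could take (an assumption, not a settled design) is frame-level concatenation of the two streams before the DNN input:

    import numpy as np

    def combine_streams(fb, gfb):
        # fb and gfb: (frames x dims) features from the same utterance.
        assert fb.shape[0] == gfb.shape[0], "streams must be frame-aligned"
        return np.hstack([fb, gfb])  # (frames, fb_dims + gfb_dims)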

Tencent exps

DNN Confidence estimation

  • Lattice-based confidence shows better performance with the DNN model than before.
  • Accumulated DNN confidence is done.
  • Preparing MLP-based confidence integration (a sketch follows this list).
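
One possible shape for that integration, sketched here with scikit-learn (the toolkit and variable names are assumptions, not the actual implementation): feed each word's confidence scores into a small MLP trained on correct/incorrect labels.

    from sklearn.neural_network import MLPClassifier

    # Each row of X holds one word's scores (e.g. lattice posterior,
    # accumulated DNN confidence); y marks whether the word was correct.
    # X_train, y_train, and X_test are hypothetical placeholders.
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
    mlp.fit(X_train, y_train)
    confidence = mlp.predict_proba(X_test)[:, 1]  # integrated score in [0, 1]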


Noise robustness

1) Trained with 15 dB noisy data, tested with noise at various levels (%WER on online1):

SNR            MFCC     GFCC
clean          45.63    38.12
20 dB          32.41    30.54
15 dB (match)  35.05    32.80
10 dB          41.06    38.53

  • Next we plan noisy training, which mixes artificial noise into the training data so the DNN is trained on noise-corrupted speech, as sketched below.
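
A minimal sketch of that data preparation, assuming `clean` and `noise` are 1-D numpy sample arrays at the same sampling rate:

    import numpy as np

    def add_noise(clean, noise, snr_db):
        # Tile or trim the noise to the utterance length.
        reps = int(np.ceil(len(clean) / float(len(noise))))
        noise = np.tile(noise, reps)[:len(clean)]
        # Scale noise so that 10*log10(P_clean / P_noise) equals snr_db.
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise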

Stream decoding

  • The server-side interface is done; the embedded-side interface is under development.
  • Fixed a bug that emitted intermediate results when silence was encountered.
  • Fixed a CMN bug for the last segment.

To do:

  • Global CMN initialization (a sketch of one possible scheme follows).
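
One possible scheme (an assumption about what global CMN initialization will look like): seed the running mean with a global mean estimated offline, weighted as a prior, so the first frames of a stream are already normalized sensibly.

    import numpy as np

    class OnlineCMN:
        def __init__(self, global_mean, prior_weight=100.0):
            # Start from the offline global mean, counted as `prior_weight`
            # pseudo-frames, then let per-stream statistics take over.
            self.sum = global_mean * prior_weight
            self.count = prior_weight

        def normalize(self, frame):
            self.sum = self.sum + frame
            self.count += 1.0
            return frame - self.sum / self.count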