2013-09-06

From cslt Wiki
Revision as of 06:22, 6 September 2013

Data sharing

  • LM count files still undelivered!

DNN progress

Discriminative DNN

  • The 1000-hour training is not yet finished.

Sparse DNN

  • Cut 50% of the weights, then ran sticky training with learning rate 0.0025; training completed after 6 iterations.
{| class="wikitable"
|+ %WER
! set !! no-sparse !! sparse (1/2)
|-
| map || 23.75 || 23.90
|-
| 2044 || 21.47 || 21.45
|-
| notetp3 || 13.17 || 13.65
|-
| record1900 || 8.10 || 8.18
|-
| general || 34.41 || 34.34
|-
| online1 || 33.02 || 32.92
|-
| online2 || 25.99 || 26.06
|-
| speedup || 23.52 || 23.58
|}
  • The comparison shows very similar performance.
  • Cut more weights based on the up-to-now sparse model, leading to iterative sparsity (a pruning sketch is given after this list).
  • Test the sparse model on noisy data.
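
For reference, a minimal sketch of the weight-cutting step as magnitude-based pruning with a sticky mask that keeps the cut weights at zero during subsequent updates. The numpy representation, threshold choice, and function names are illustrative assumptions, not the actual training recipe.

<pre>
import numpy as np

def prune_by_magnitude(W, fraction=0.5):
    """Zero out the smallest `fraction` of weights; return the sticky mask."""
    thresh = np.quantile(np.abs(W), fraction)    # magnitude cut-off
    mask = (np.abs(W) > thresh).astype(W.dtype)  # 1 = keep, 0 = pruned
    return W * mask, mask

def sticky_update(W, grad, mask, lr=0.0025):
    """One SGD step that keeps the pruned weights stuck at zero."""
    return (W - lr * grad) * mask

# Iterative sparsity: cut, retrain with sticky updates, then cut more
# weights based on the up-to-now sparse model.
W = np.random.randn(1200, 1200)                  # one 1200x1200 hidden layer
W, mask = prune_by_magnitude(W, 0.5)             # first 50% cut
# ... retrain with sticky_update(...), then:
W, mask = prune_by_magnitude(W, 0.75)            # cut more on the sparse model
</pre>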

FBank features

Test on the 100-hour data, structure 100_1200_1200_1200_1200_3580. Test on clean and 15 dB noisy speech.

{| class="wikitable"
|+ %WER
! set !! MFCC !! GFCC !! FB !! MFCC + 15db !! GFCC + 15db !! FB + 15db
|-
| map || 23.75 || 22.95 || 20.88 || 65.24 || 62.99 || 62.20
|-
| 2044 || 21.47 || 20.93 || 19.69 || 48.93 || 46.34 || 45.75
|-
| notetp3 || 13.17 || 15.43 || 12.79 || 55.91 || 52.46 || 54.56
|-
| record1900 || 8.10 || 7.32 || 7.38 || 25.43 || 26.62 || 23.97
|-
| general || 34.41 || 31.57 || 31.88 || 70.95 || 66.04 || 65.93
|-
| online1 || 33.02 || 31.83 || 31.54 || 50.40 || 46.61 || 48.06
|-
| online2 || 25.99 || 25.20 || 24.89 || 48.45 || 44.49 || 45.83
|-
| speedup || 23.52 || 22.97 || 21.54 || 64.78 || 60.38 || 61.52
|}
  • The FB feature is much better than both MFCC and GFCC, probably because less information is lost without the DCT.
  • We need to investigate how many filter banks are most appropriate.
  • Inspired by the assumption of information loss with the DCT, we need to test whether another transform, LDA, leads to similar information loss. We need to investigate which dimensionality is suitable for the LDA, and to investigate non-linear discriminative approaches that are simple but lose less information.
  • Another assumption for the better performance with FB is that FB is more suitable for CMN: the DCT accumulates a number of noisy channels and thus exhibits more uncertainty, which in turn can hardly be normalized by CMN. We need to test how FB and MFCC perform when no CMN is applied.
  • We can also test a simple same-dimension DCT (keeping all coefficients). If the performance is still worse than FB, we can confirm that the problem is due to noisy-channel accumulation; see the sketch after this list.
  • Need to investigate Gammatone filter banks (GFB), following the same idea as FB: keep as much information as possible. It is also possible to combine FB and GFB to pursue better performance.
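
To make the DCT discussion concrete, here is a minimal sketch using only numpy/scipy and a textbook HTK-style mel filter bank (not our actual front end). It shows that MFCC is simply a truncated DCT of the log FBank energies, so the truncation is exactly where information is lost, and the same-dimension DCT test amounts to keeping all coefficients:

<pre>
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular mel filter bank (textbook HTK-style construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def fbank_and_mfcc(frame, n_filters=40, n_ceps=13, sr=16000):
    """FBank = log mel filter-bank energies; MFCC = truncated DCT of FBank."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=512)) ** 2
    fbank = np.log(mel_filterbank(n_filters, 512, sr) @ spec + 1e-10)
    coeffs = dct(fbank, type=2, norm='ortho')
    mfcc = coeffs[:n_ceps]      # truncation: this is where information is lost
    same_dim = coeffs           # the 'same-dimension DCT' keeps everything
    return fbank, mfcc, same_dim
</pre>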

Tencent exps

DNN Confidence estimation

  • Lattice-based confidence shows better performance with the DNN than before.
  • Accumulated DNN confidence is done. The confidence values are much more reasonable.
  • Prepare MLP/DNN-based confidence integration (a minimal sketch follows this list).
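
As a starting point for that integration, a minimal sketch of a tiny MLP that fuses several per-word confidence scores (for example, lattice posterior and accumulated DNN confidence) into a single correctness probability. The feature set, network size, and plain-numpy training loop are illustrative assumptions:

<pre>
import numpy as np

def train_confidence_mlp(X, y, hidden=8, lr=0.1, epochs=500, seed=0):
    """X: one row per word, columns = individual confidence scores;
    y: 1 if the word is correct, else 0. Returns a predict() function."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, hidden); b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                    # hidden layer
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # fused confidence
        g = (p - y) / len(y)                        # grad of log-loss wrt logit
        gh = np.outer(g, W2) * (1.0 - h ** 2)       # back-prop through tanh
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
    return lambda Z: 1.0 / (1.0 + np.exp(-(np.tanh(Z @ W1 + b1) @ W2 + b2)))
</pre>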


Noisy training

Reading the table in the FBank section above, we observe a very disappointing performance degradation with noise, and we did not see much advantage for FB and GFCC. We therefore examine what happens if we introduce noise in training. In this experiment, 15 dB noise is introduced into all the training data (100 hours), and the test utterances are at various noise levels. Only the performance on the test set online1 is given here; more results are here:

http://cslt.riit.tsinghua.edu.cn/cgi-bin/cvss/cvss_request.pl?account=wangd&step=view_request&cvssid=118

{| class="wikitable"
|+ %WER on online1
! SNR !! MFCC !! GFCC
|-
| clean || 45.63 || 38.12
|-
| 20db || 32.41 || 30.54
|-
| 15db (matched training) || 35.05 || 32.80
|-
| 10db || 41.06 || 38.53
|}
  • It is interesting to see that two factors are important in noisy training: (1) the test speech should be clean, and (2) the test speech should match the training condition. The best performance is obtained at 20 dB, which is neither very noisy nor strongly mismatched with the 15 dB training data.
  • We look forward to noisy training that introduces noise into the training data at random levels, as sketched below.
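
A minimal sketch of such corruption, assuming waveforms as numpy arrays, a single noise source, and an illustrative SNR set:

<pre>
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise signal into speech at a given SNR (in dB)."""
    noise = np.resize(noise, speech.shape)        # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def corrupt_for_training(utterances, noise, snrs=(10, 15, 20, None)):
    """Random-SNR noisy training: corrupt each utterance at a randomly
    drawn SNR; None keeps the utterance clean."""
    rng = np.random.default_rng(0)
    out = []
    for utt in utterances:
        snr = snrs[rng.integers(len(snrs))]
        out.append(utt if snr is None else add_noise(utt, noise, float(snr)))
    return out
</pre>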

Stream decoding

  • The interface for the server side is done; the embedded side is under development.
  • Fixed a bug which prompted intermediate results when silence was encountered.
  • Fixed a CMN bug for the last segment.

To do:
  • Global CMN initialization (see the sketch below).
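
For the global CMN initialization item, a minimal sketch of one possible scheme: start from a global cepstral mean estimated offline on training data and update it online, so that short segments (including the last one) still get a reliable mean. The pseudo-count prior is an assumption for illustration, not the decoder's actual code:

<pre>
import numpy as np

class StreamingCMN:
    """Online CMN initialized from a global mean with a pseudo-count prior."""
    def __init__(self, global_mean, prior_weight=100.0):
        # the global mean counts as `prior_weight` already-seen frames
        self.sum = np.asarray(global_mean, dtype=float) * prior_weight
        self.count = prior_weight

    def normalize(self, frame):
        frame = np.asarray(frame, dtype=float)
        self.sum += frame                        # update running statistics
        self.count += 1.0
        return frame - self.sum / self.count     # subtract current mean estimate
</pre>

With a large prior weight, early frames are normalized almost entirely by the global mean; as more frames arrive, the estimate shifts toward the utterance's own statistics.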