2013-09-27

Data sharing

  • LM count files still undelivered!

DNN progress

Sparse DNN

  • Optimal Brain Damage based sparsity is ongoing; the algorithm is being prepared.
  • An interesting investigation is to drop out 50% of the weights after each iteration and then re-train without sticky. The performance is a bit better than the previous best, which might be attributed to the noisy perturbation helping the model move out of a local minimum (a toy sketch of the procedure is given after the report link below).

Report here: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/文件:Chart1.png
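
The drop-50%-and-retrain idea is easy to prototype. Below is a minimal NumPy sketch on a toy least-squares "layer"; the toy objective, matrix sizes, learning rate, and iteration counts are illustrative assumptions, not the actual DNN recipe or toolkit.

  import numpy as np

  rng = np.random.default_rng(0)

  # Toy stand-in for one weight matrix trained on a least-squares objective.
  X = rng.standard_normal((2000, 40))        # toy inputs
  Y = X @ rng.standard_normal((40, 10))      # toy targets
  W = np.zeros((40, 10))                     # weights being trained

  def train_one_iteration(W, lr=0.01, steps=50):
      for _ in range(steps):
          grad = X.T @ (X @ W - Y) / len(X)  # least-squares gradient, averaged over examples
          W = W - lr * grad
      return W

  for it in range(5):
      W = train_one_iteration(W)             # normal training iteration
      mask = rng.random(W.shape) < 0.5       # randomly drop 50% of the weights
      W = W * mask                           # re-training resumes from this perturbed model
      loss = 0.5 * np.mean((X @ W - Y) ** 2)
      print(f"iteration {it}: loss after dropping half the weights = {loss:.4f}")

The point of the sketch is only the control flow: train, randomly zero half of the weights, and keep training from the perturbed model rather than restarting.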

FBank features

The 1000-hour test is done. The performance is significantly better than with MFCC, and iteration 14 is better than the final iteration, which may be attributed to some over-fitting in the later iterations.

Performance chart: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/%E6%96%87%E4%BB%B6:Chart2.png

Tencent exps

N/A


Noisy training

Noise segments are sampled randomly for each utterance: a Dirichlet distribution is used to sample the mixture weights over the noise types, and a Gaussian is used to sample the SNR.
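
Below is a minimal NumPy sketch of this sampling scheme, assuming a fixed noise-type inventory, a uniform Dirichlet concentration, and Gaussian SNR parameters of 15 ± 5 dB; all of these values are illustrative assumptions, not the settings actually used.

  import numpy as np

  rng = np.random.default_rng(0)

  NOISE_TYPES = ["white", "car", "babble"]       # assumed noise inventory
  DIRICHLET_ALPHA = np.ones(len(NOISE_TYPES))    # assumed uniform concentration
  SNR_MEAN_DB, SNR_STD_DB = 15.0, 5.0            # assumed Gaussian SNR parameters

  def sample_noise_plan(num_utterances):
      """For each utterance, sample the noise-type mixture and the target SNR."""
      plans = []
      for _ in range(num_utterances):
          weights = rng.dirichlet(DIRICHLET_ALPHA)       # proportions over noise types
          snr_db = rng.normal(SNR_MEAN_DB, SNR_STD_DB)   # SNR drawn from a Gaussian
          plans.append({"weights": dict(zip(NOISE_TYPES, weights)),
                        "snr_db": float(snr_db)})
      return plans

  def add_noise(speech, noise, snr_db):
      """Scale a sampled noise segment to the target SNR and mix it into the speech."""
      p_speech = np.mean(speech ** 2)
      p_noise = np.mean(noise ** 2) + 1e-12
      scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
      return speech + scale * noise

  print(sample_noise_plan(2))

Sampling a fresh noise mixture and SNR for every utterance is what provides the variance mentioned in the conclusions below.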

The first initial test uses white noise and car noise at 1/3 each. The performance report is here:

http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/%E6%96%87%E4%BB%B6:Chart3.png

The conclusions are:

  1. By sampling noises, most of the noise patterns can be learned efficiently, which improves performance on noisy test data.
  2. By sampling noises with high variance, performance on clean speech is largely retained.

Continuous LM

1. SogouQ n-gram building: 500M of text data, 110k-word vocabulary. Two tests:

(1) Using the Tencent online1 and online2 transcriptions: ppl 1651 on online1, ppl 1512 on online2.
(2) Using the 70k SogouQ test set: ppl 33.
 This means the SogouQ text is significantly different from the Tencent online1 and online2 sets, due to the domain mismatch.
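
For reference, the numbers above are presumably the standard per-word perplexity of the language model on the test text, so lower is better; in LaTeX:

  \mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1,\dots,w_{i-1})\right)

The gap between ppl 33 on in-domain SogouQ text and ppl 1500+ on the Tencent transcriptions is what supports the domain-mismatch conclusion.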

2. NN LM

  The NN LM uses an 11k-word input vocabulary and 192 hidden units, trained on 500M of QA text data and tested on the online2 transcription. In each test the NN LM predicts a shortlist of words and the 4-gram predicts the rest (the assumed combination scheme is sketched after the results).
 (1) The most frequent words 1-1024 are predicted by the NN LM, the rest by the 4-gram. n-gram baseline: 402.37; NN+ngram: 122.54
 (2) The most frequent words 1-2048 are predicted by the NN LM, the rest by the 4-gram. n-gram baseline: 402.37; NN+ngram: 127.59
 (3) Words ranked 1024-2048 by frequency are predicted by the NN LM, the rest by the 4-gram. n-gram baseline: 402.37; NN+ngram: 118.92
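
The notes do not spell out how the NN LM and the 4-gram are combined over the shortlist; a common scheme (an assumption here, not necessarily the exact setup used) is to let the NN LM, whose softmax is normalized over the shortlist S, redistribute the 4-gram probability mass of S, and to back off to the 4-gram for all other words:

  P(w \mid h) =
  \begin{cases}
    P_{\mathrm{ng}}(\mathcal{S} \mid h)\, P_{\mathrm{NN}}(w \mid h), & w \in \mathcal{S},\\
    P_{\mathrm{ng}}(w \mid h), & w \notin \mathcal{S},
  \end{cases}
  \qquad
  P_{\mathrm{ng}}(\mathcal{S} \mid h) = \sum_{v \in \mathcal{S}} P_{\mathrm{ng}}(v \mid h).

Under this assumption the combined model remains properly normalized, since the NN LM sums to one over the shortlist.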


Conclusions: the NN LM is much better than the n-gram, due to its smoothing capacity. It seems to help more for the less frequent words, which confirms its strength in smoothing.