2013-09-06
From cslt Wiki
Latest revision as of 06:36, 6 September 2013
Data sharing
- LM count files still undelivered!
DNN progress
Sparse DNN
- Cut 50% of the weights, then continued training with learning rate 0.0025; the run completed after 6 iterations.
set | no-sparse | sparse (1/2) |
---|---|---|
map | 23.75 | 23.90 |
2044 | 21.47 | 21.45 |
notetp3 | 13.17 | 13.65 |
record1900 | 8.10 | 8.18 |
general | 34.41 | 34.34 |
online1 | 33.02 | 32.92 |
online2 | 25.99 | 26.06 |
speedup | 23.52 | 23.58 |
- The comparison shows very similar performance.
- Next: cut more weights from the current sparse model, leading to iterative sparsification (a pruning sketch is given below).
- Test the sparse model on noisy data.
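
Below is a minimal sketch of the magnitude-based pruning step described above, assuming the DNN weight matrices are accessible as numpy arrays; the actual experiments use the group's own training tools, so the function name and the retraining loop are illustrative only.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude entries of one weight matrix.

    weights  : 2-D numpy array (one DNN layer)
    sparsity : fraction of weights to cut (0.5 = cut 50%)
    Returns the pruned matrix and the binary mask to re-apply after
    every update during the follow-up retraining (lr 0.0025 above).
    """
    threshold = np.percentile(np.abs(weights), sparsity * 100.0)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

# Iterative sparsification: repeat prune-and-retrain with growing sparsity.
W = np.random.randn(1200, 1200)            # stand-in for one hidden layer
for sparsity in (0.5, 0.7, 0.8):
    W, mask = prune_by_magnitude(W, sparsity)
    # ... retrain here, multiplying each gradient update by `mask` ...
```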
FBank features
Experiments on the 100-hour data set with network structure 100_1200_1200_1200_1200_3580; tested on clean and 15 dB noisy speech. A toy reading of the structure string is sketched below, followed by the results table.
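
A toy reading of the structure string, under the assumption that it lists layer sizes (100-dim input, four 1200-unit hidden layers, 3580 output targets); the sigmoid/softmax choice below is only illustrative.

```python
import numpy as np

# Read "100_1200_1200_1200_1200_3580" as layer sizes.
dims = [int(d) for d in "100_1200_1200_1200_1200_3580".split("_")]
weights = [0.01 * np.random.randn(i, o) for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]

def forward(x):
    """Toy forward pass: sigmoid hidden layers, softmax output."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    logits = x @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

probs = forward(np.random.randn(1, dims[0]))   # one 100-dim input frame
```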
set | MFCC | GFCC | FB | MFCC + 15 dB | GFCC + 15 dB | FB + 15 dB |
---|---|---|---|---|---|---|
map | 23.75 | 22.95 | 20.88 | 65.24 | 62.99 | 62.20 |
2044 | 21.47 | 20.93 | 19.69 | 48.93 | 46.34 | 45.75 |
notetp3 | 13.17 | 15.43 | 12.79 | 55.91 | 52.46 | 54.56 |
record1900 | 8.10 | 7.32 | 7.38 | 25.43 | 26.62 | 23.97 |
general | 34.41 | 31.57 | 31.88 | 70.95 | 66.04 | 65.93 |
online1 | 33.02 | 31.83 | 31.54 | 50.40 | 46.61 | 48.06 |
online2 | 25.99 | 25.20 | 24.89 | 48.45 | 44.49 | 45.83 |
speedup | 23.52 | 22.97 | 21.54 | 64.78 | 60.38 | 61.52 |
- The FB feature is much better than both MFCC and GFCC, probably because less information is lost when the DCT is omitted.
- In noisy environments, GFCC obtains performance comparable to or better than FB.
- We need to investigate how many filter banks are most appropriate.
- Following the assumption that the DCT loses information, we should test whether another transform, LDA, causes a similar loss, determine a suitable dimensionality for the LDA, and investigate a simple non-linear discriminative approach that loses less information.
- Another hypothesis for the better FB performance is that FB is more suitable for CMN: the DCT accumulates a number of noisy channels and therefore exhibits more uncertainty, which CMN can hardly normalize. We need to test the performance of FB and MFCC without CMN.
- We can also test a same-dimension DCT (keeping all coefficients). If the performance is still worse than FB, this confirms that the problem is caused by noisy-channel accumulation.
- We also need to investigate Gammatone filter banks (GFB), following the same idea as FB: keep as much information as possible. Combining FB and GFB may give further gains. A sketch of the FB / MFCC / same-dimension DCT / CMN variants is given after this list.
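
A minimal sketch of the feature variants to compare, assuming log Mel filter-bank energies are already extracted per utterance (frames × banks). The truncated DCT, same-dimension DCT, and per-utterance CMN below correspond to the tests proposed above; names and dimensions are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_fbank(log_fbank, n_ceps=13):
    """Standard MFCC: DCT of log filter-bank energies, truncated."""
    return dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]

def same_dim_dct(log_fbank):
    """DCT keeping all dimensions: separates the effect of channel
    accumulation from the effect of coefficient truncation."""
    return dct(log_fbank, type=2, axis=1, norm='ortho')

def cmn(feats):
    """Per-utterance mean normalization."""
    return feats - feats.mean(axis=0, keepdims=True)

log_fbank = np.random.randn(300, 24)          # stand-in: 300 frames, 24 banks

fb_feats   = cmn(log_fbank)                   # FB feature (no DCT)
mfcc_feats = cmn(mfcc_from_fbank(log_fbank))  # standard MFCC
fulldct    = cmn(same_dim_dct(log_fbank))     # same-dimension DCT test
fb_no_cmn  = log_fbank                        # FB without CMN, for the CMN test
```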
Tencent exps
N/A
DNN Confidence estimation
- Lattice-based confidence shows better performance with DNN than before.
- Accumulated DNN confidence is done; the confidence values are much more reasonable (a minimal sketch of the accumulation idea is given below).
- Prepare MLP/DNN-based confidence integration.
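
A minimal sketch of one plausible reading of "accumulated DNN confidence": average the DNN posterior of the aligned state over the frames of each word. This is an assumption about the method, not the actual implementation; array names and shapes are illustrative.

```python
import numpy as np

def accumulated_dnn_confidence(posteriors, alignment, word_spans):
    """Average DNN posterior of the aligned state over each word's frames.

    posteriors : (num_frames, num_states) softmax outputs of the DNN
    alignment  : (num_frames,) state index decoded for each frame
    word_spans : list of (start_frame, end_frame) per word, end exclusive
    """
    frame_conf = posteriors[np.arange(len(alignment)), alignment]
    return [float(frame_conf[s:e].mean()) for s, e in word_spans]

# toy usage
post  = np.random.dirichlet(np.ones(3580), size=200)   # 200 frames, 3580 states
ali   = post.argmax(axis=1)
spans = [(0, 80), (80, 200)]                            # two words
print(accumulated_dnn_confidence(post, ali, spans))
```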
Noisy training
From the table in the previous section we observe a very disappointing performance drop with noise, and neither FB nor GFCC shows much advantage there. We therefore examine what happens if noise is introduced in training. In this experiment, 15 dB noise was added to all of the training data (100 hours), and the test utterances cover various noise levels. Only the results on the test set online1 are given here; more results are available at http://cslt.riit.tsinghua.edu.cn/cgi-bin/cvss/cvss_request.pl?account=wangd&step=view_request&cvssid=118
SNR | MFCC | GFCC |
---|---|---|
clean | 45.63 | 38.12 |
20 dB | 32.41 | 30.54 |
15 dB (matched training) | 35.05 | 32.80 |
10 dB | 41.06 | 38.53 |
- It is interesting that two factors matter in noisy training: (1) the test speech should be clean, and (2) the test speech should match the training condition. The best performance comes from 20 dB, which is neither very clean nor strongly mismatched.
- Next we will try noisy training that injects noise at random levels during training (a minimal mixing sketch is given below).
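
A minimal sketch of mixing noise into training speech at a target SNR, covering both the fixed 15 dB setup above and the planned random-SNR variant; the noise source, signal lengths, and SNR range are placeholders.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then mix."""
    reps = int(np.ceil(len(speech) / len(noise)))      # tile/trim noise to length
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

speech = np.random.randn(16000)        # stand-in for one training utterance
noise  = np.random.randn(8000)         # stand-in for a noise recording

# fixed-SNR noisy training (15 dB, as in the table above)
noisy_15db = add_noise_at_snr(speech, noise, 15.0)

# random-SNR noisy training: draw an SNR per utterance
snr = np.random.uniform(5.0, 25.0)     # illustrative range
noisy_rand = add_noise_at_snr(speech, noise, snr)
```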
Stream decoding
- The server-side interface is done; the embedded-side interface is under development.
- Fixed a bug that emitted intermediate results when a short pause was encountered.
- Fixed a CMN bug for the last segment; a generic streaming-CMN sketch is given below for context.
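
For context, a generic streaming-CMN sketch: each arriving chunk is normalized with the running mean accumulated so far, and the final (possibly very short) segment is treated the same way. This is a generic illustration, not the project's actual decoder code.

```python
import numpy as np

class StreamingCMN:
    """Running cepstral mean normalization for stream decoding."""

    def __init__(self, dim):
        self.sum = np.zeros(dim)
        self.count = 0

    def process(self, chunk):
        # chunk: (num_frames, dim) features of one streamed segment
        self.sum += chunk.sum(axis=0)
        self.count += chunk.shape[0]
        mean = self.sum / max(self.count, 1)
        return chunk - mean

# usage over a stream of segments, including a short final one
cmn = StreamingCMN(dim=13)
for seg_len in (100, 100, 7):
    feats = np.random.randn(seg_len, 13)
    normed = cmn.process(feats)
```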