Difference between revisions of "2014-06-20"

From cslt Wiki

Revision as of 02:39, 20 June 2014 (Fri)

Resource Building

  • release management combing done.

Leftover questions

  • Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set.
  • Multi-GPU training: error encountered.
  • Multilanguage training
  • Investigating LOUDS FST.
  • CLG embedded decoder plus online compiler.
  • DNN-GMM co-training

AM development

Sparse DNN

  • GA-based block sparsity (+++++++)
  • Paper revision done.

Noise training

  • Paper writing will be started this week

GFbank

  • Running Sinovoice 8k 1400 + 100 mixture training.
  • GFbank 14 xEnt iteration completed:
                                  Huawei disanpi     BJ mobile   8k English data
FBank non-stream (MPE4)           20.44%             22.28%      24.36%
GFbank stream (MPE4)              -                  -           -
GFbank non-stream (MPE)           -                  -           -

Multilingual ASR

                                   HW 30h (HW TR LM not involved)     HW30h (HW TR LM involved)
FBank non-stream (MPE4)             22.23                                   21.38
Fbank stream (monolang)             21.64                                   20.72

GFbank stream    (MPE4)             -                     -           -
GFbank non-stream (MPE)             -                     -           -

Denoising & Farfield ASR

  • Replay may cause time delay. This should be solved by cross-correlation detection.
  • Single-layer network with more hidden units: failed.
  • The problem appears to lie in the large magnitude of the output data.
  • New recordings (one close to the microphone and one far-field at 2 meters)
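The cross-correlation detection mentioned above for the replay time delay could look roughly like the sketch below. It is only an illustration under assumptions: the function names `estimate_delay` and `align`, the `max_lag` search range, and the use of raw (unnormalized) correlation are all hypothetical and not taken from the actual setup.

```python
def estimate_delay(reference, recorded, max_lag):
    """Return the lag (in samples) at which `recorded` best matches `reference`.

    A positive lag means the recording is delayed relative to the reference.
    Raw cross-correlation is used here; normalization is left out for brevity.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, ref_sample in enumerate(reference):
            m = n + lag
            if 0 <= m < len(recorded):
                score += ref_sample * recorded[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(recorded, lag):
    """Shift `recorded` so that it lines up with the reference signal."""
    if lag >= 0:
        return recorded[lag:]          # drop the leading delay
    return [0.0] * (-lag) + recorded   # pad when the recording is early


# Toy example: the recording is the reference delayed by 3 samples.
reference = [0, 0, 1, 2, 3, 0, 0, 0]
recorded = [0, 0, 0, 0, 0, 1, 2, 3]
lag = estimate_delay(reference, recorded, max_lag=5)
aligned = align(recorded, lag)
```

For real recordings an FFT-based correlation would be much faster than this quadratic loop; the loop form is kept only because it makes the lag convention explicit.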

Original model:

xEnt model:
               middle-field    far-field
    dev93       74.79          96.68
    eval92      63.42          94.75

MPE model:


MPE adaptation: 

               middle-field    far-field
    dev93       63.71          94.84
    eval92      52.67          90.45

VAD

  • DNN-based VAD (7.49) shows much better performance than energy-based VAD (45.74)
  • 100 X n (n<=3) hidden units with 2 output units seem sufficient for VAD
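To make the network shape above concrete, here is a minimal forward-pass sketch of a VAD net with n hidden layers of 100 units and 2 output units (speech / non-speech). Everything beyond those sizes is an assumption for illustration: the 40-dimensional input, the sigmoid hidden activation, the random untrained weights, and the helper names.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Randomly initialized affine layer: (weights, biases). Untrained."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def build_vad(n_features, n_hidden_layers, rng, hidden_size=100):
    """Stack `n_hidden_layers` layers of 100 units plus a 2-unit output layer."""
    sizes = [n_features] + [hidden_size] * n_hidden_layers + [2]
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def vad_posteriors(net, frame):
    """Return two posteriors for one feature frame (class order is arbitrary here)."""
    h = frame
    for layer in net[:-1]:
        h = sigmoid(affine(layer, h))
    return softmax(affine(net[-1], h))


rng = random.Random(0)
net = build_vad(n_features=40, n_hidden_layers=2, rng=rng)  # 100 x 2 hidden units
posteriors = vad_posteriors(net, [0.1] * 40)
```

With trained weights, a frame would be labelled speech when the speech posterior exceeds a chosen threshold; the threshold and any smoothing over frames are not specified in these notes.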



Scoring

  • Collect more data with human scoring to train discriminative models


Embedded decoder

FSA size: 
threshold  1e-5    1e-6   1e-7   1e-8    1e-9
5k         480k    5.5M   44M     -      1.1G
10k        731k     7M    61M
20k        1.2M    8.8M   78M(301M)
600 X 4+800 AM, beam9: 
        150k       20k     10k      5k 
WER     15.96       -       -       -
RT       X         0.94     -       -

LM development

Domain specific LM

  • Baidu Zhidao + Weibo extraction done with various thresholds
  • The extracted text seems to help to some extent, but the major change appears to come from pre-processing.
  • Check the proportion of tags in the HW 30h data

Word2Vector

W2V based doc classification

  • Full Gaussian based doc vector
  • represent each doc with a Gaussian distribution over the word vectors it contains.
  • using k-nn to conduct classification
               mean Euclidean distance    KL distance    baseline (NB with mean)
Acc (50dim)    81.84                      79.65          69.7
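The Gaussian document representation with KL distance and k-NN described above can be sketched as follows. This is a simplified illustration, not the method used for the numbers in the table: it assumes a diagonal covariance (the notes say full Gaussian), a variance floor, a symmetrized KL as the distance, and hypothetical function names and toy 2-dimensional word vectors.

```python
import math
from collections import Counter


def fit_diag_gaussian(word_vectors, floor=1e-3):
    """Fit per-dimension mean and variance to a document's word vectors.

    Assumption: diagonal covariance with a variance floor, to stay
    well-defined for short documents.
    """
    n = len(word_vectors)
    dim = len(word_vectors[0])
    mean = [sum(v[i] for v in word_vectors) / n for i in range(dim)]
    var = [max(sum((v[i] - mean[i]) ** 2 for v in word_vectors) / n, floor)
           for i in range(dim)]
    return mean, var


def kl_diag(p, q):
    """KL(p || q) between two diagonal Gaussians given as (mean, var)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def symmetric_kl(p, q):
    """Symmetrized KL, used here as the k-NN distance (an assumption)."""
    return kl_diag(p, q) + kl_diag(q, p)


def knn_classify(query, labelled_docs, k=3):
    """Majority vote over the k labelled Gaussians closest to `query`."""
    ranked = sorted(labelled_docs, key=lambda item: symmetric_kl(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: two labelled documents and one query document.
sports = fit_diag_gaussian([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
finance = fit_diag_gaussian([[-1.0, -1.0], [-1.1, -0.8], [-0.9, -1.2]])
query = fit_diag_gaussian([[0.9, 1.0], [1.1, 1.0]])
predicted = knn_classify(query, [(sports, "sports"), (finance, "finance")], k=1)
```

The mean-Euclidean variant in the table would replace `symmetric_kl` with the Euclidean distance between the two mean vectors and ignore the variances.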

Semantic word tree

  • First version based on pattern match done
  • Filter with query log
  • Further refinement with Baidu Baike hierarchy


NN LM

  • Character-based NNLM (6700 chars, 7gram), 500M data training done.
  • Inconsistent WER patterns were found on the Tencent test sets
  • Probably need another test set for further investigation.
  • Investigate MS RNN LM training