2014-06-20

From cslt Wiki

Resource Building

  • Release management sorted out.

Leftover questions

  • Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set.
  • Multi-GPU training: error encountered
  • Multilanguage training
  • Investigating LOUDS FST.
  • CLG embedded decoder plus online compiler.
  • DNN-GMM co-training

AM development

Sparse DNN

  • GA-based block sparsity (+++++++)
  • Paper revision done.

Noise training

  • Paper writing will start this week

GFbank

  • Running Sinovoice 8k 1400 + 100 mixture training.
  • FBank/GFbank, stream/non-stream MPE completed:
                            Huawei disanpi    BJ mobile    8k English data
FBank non-stream (MPE4)     20.44%            22.28%       24.36%
FBank stream (MPE1)         20.17%            22.50%       21.63%
GFbank stream (MPE4)        20.69%            22.84%       24.45%
GFbank non-stream (MPE)     -                 -            -

Multilingual ASR

                            HW 30h (HW TR LM not involved)    HW 30h (HW TR LM involved)
FBank non-stream (MPE4)     22.23                             21.38
FBank stream (monolang)     21.64                             20.72
GFbank stream (MPE4)        -                                 -
GFbank non-stream (MPE)     -                                 -

Denoising & Farfield ASR

  • Replay may introduce a time delay. This should be solved by cross-correlation detection.
  • Single-layer network with more hidden units: failed.
  • The problem appears to lie in the large magnitude of the output data.
  • New recordings (one almost at the mic and one far-field at 2 meters)
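
The cross-correlation idea mentioned above can be sketched as follows. This is only a minimal illustration, assuming the original and the replayed recording are available as 1-D sample arrays at the same sampling rate; the function name `estimate_delay` is hypothetical and not taken from the actual tooling.

```python
import numpy as np

def estimate_delay(reference, recorded):
    """Estimate by how many samples `recorded` lags behind `reference`.

    Hypothetical helper: picks the lag with the largest cross-correlation.
    A positive result means `recorded` starts later than `reference`.
    """
    reference = np.asarray(reference, dtype=float)
    recorded = np.asarray(recorded, dtype=float)
    # mode="full" covers every lag from -(len(reference) - 1) to len(recorded) - 1
    corr = np.correlate(recorded, reference, mode="full")
    lags = np.arange(-(len(reference) - 1), len(recorded))
    return int(lags[np.argmax(corr)])

# Toy check with synthetic noise standing in for speech (assumption).
rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
replayed = np.concatenate([np.zeros(37), clean])  # 37-sample replay delay
delay = estimate_delay(clean, replayed)
aligned = replayed[delay:] if delay >= 0 else replayed
```

Once the lag is known, the replayed signal can be trimmed (as in `aligned`) so that the input and target frames line up. For long recordings an FFT-based correlation would probably be preferable to the direct form used here.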

Original model:

xEnt model:
               middle-field    far-field
    dev93       74.79          96.68
    eval92      63.42          94.75

MPE model:


MPE adaptation: 

               middle-field    far-field
    dev93       63.71          94.84
    eval92      52.67          90.45

VAD

  • DNN-based VAD (7.49) shows much better performance than energy-based VAD (45.74)
  • 100 X n (n<=3) hidden units with 2 output units seem sufficient for VAD
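
To make the network size above concrete, here is a rough sketch of the forward pass of such a VAD net: n hidden layers (n <= 3) of 100 units each and 2 output units. The 40-dim input, the sigmoid activation, the class order and the random (untrained) weights are all assumptions made for illustration; the notes do not specify them.

```python
import numpy as np

def init_vad_net(input_dim, n_hidden_layers, hidden_units=100, seed=0):
    """Random weights for an input -> (100 x n) -> 2 network (untrained)."""
    if not 1 <= n_hidden_layers <= 3:
        raise ValueError("the notes only consider n <= 3 hidden layers")
    rng = np.random.default_rng(seed)
    sizes = [input_dim] + [hidden_units] * n_hidden_layers + [2]
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def vad_posteriors(params, frames):
    """Return per-frame posteriors over the 2 output classes."""
    h = np.asarray(frames, dtype=float)
    for weights, bias in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ weights + bias)))  # sigmoid (assumed)
    weights, bias = params[-1]
    logits = h @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

# 5 dummy frames of 40-dim features (the input dimension is an assumption).
params = init_vad_net(input_dim=40, n_hidden_layers=2)
post = vad_posteriors(params, np.zeros((5, 40)))
```

With n = 2 and a 40-dim input this is about 14k parameters, which gives a sense of how small the model is.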



Scoring

  • Collect more data with human scoring to train discriminative models


Embedded decoder

FSA size:
threshold    1e-5    1e-6    1e-7         1e-8    1e-9
5k           480k    5.5M    44M          -       1.1G
10k          731k    7M      61M
20k          1.2M    8.8M    78M(301M)

600 X 4 + 800 AM, beam 9:
         150k     20k     10k    5k
WER      15.96    -       -      -
RT       X        0.94    -      -

LM development

Domain specific LM

  • Baiduzhidao + Weibo extraction done with various thresholds
  • The extracted text seems to bring some improvement, but the major change seems to come from pre-processing.
  • Check the proportion of tags in the HW 30h data

Word2Vector

W2V based doc classification

  • Full-Gaussian-based doc vector
  • Represent each doc with a Gaussian distribution over the word vectors it contains.
  • Use k-NN for classification
                 mean Euclidean distance    KL distance    baseline (NB with mean)
Acc (50 dim)     81.84                      79.65          69.7
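
The Gaussian doc representation with k-NN can be sketched roughly as below. This is not the actual experiment code: the ridge term on the covariance, the direction of the KL divergence (query to training doc), the value of k and all function names are assumptions made for the sake of a runnable example.

```python
import numpy as np
from collections import Counter

def doc_gaussian(word_vectors, ridge=1e-3):
    """Fit a full-covariance Gaussian to the word vectors of one doc."""
    x = np.asarray(word_vectors, dtype=float)
    mean = x.mean(axis=0)
    # The ridge keeps the covariance invertible for short docs (assumption).
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    return mean, cov + ridge * np.eye(x.shape[1])

def kl_gaussian(p, q):
    """KL(p || q) between two Gaussians given as (mean, cov) pairs."""
    (m0, s0), (m1, s1) = p, q
    s1_inv = np.linalg.inv(s1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    return 0.5 * (np.trace(s1_inv @ s0) + diff @ s1_inv @ diff
                  - len(m0) + logdet1 - logdet0)

def mean_euclidean(p, q):
    """Euclidean distance between the two doc means only."""
    return float(np.linalg.norm(p[0] - q[0]))

def knn_classify(query, train_docs, train_labels, k=3, distance=kl_gaussian):
    """Majority vote among the k training docs closest to `query`."""
    dists = [distance(query, doc) for doc in train_docs]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two classes of 3-dim "word vectors" around different centers.
rng = np.random.default_rng(0)
train_docs, train_labels = [], []
for label, center in (("A", 0.0), ("B", 5.0)):
    for _ in range(4):
        train_docs.append(doc_gaussian(rng.standard_normal((30, 3)) + center))
        train_labels.append(label)
query = doc_gaussian(rng.standard_normal((30, 3)) + 5.0)
pred_kl = knn_classify(query, train_docs, train_labels)
pred_euc = knn_classify(query, train_docs, train_labels, distance=mean_euclidean)
```

Passing `distance=mean_euclidean` corresponds to the mean-only variant in the first column of the table, while the default uses the full Gaussian through the KL divergence.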

Semantic word tree

  • First version based on pattern match done
  • Filter with query log
  • Further refinement with Baidu Baike hierarchy


NN LM

  • Character-based NNLM (6700 chars, 7-gram): training on 500M data done.
  • Inconsistent WER patterns were found on the Tencent test sets
  • Probably need another test set for the investigation.
  • Investigate MS RNN LM training