2014-02-28

From cslt Wiki

Contents

1 Resource Building
2 AM development
3 Word to Vector
4 LM development
- 4.1 NN LM
- 4.2 3T Sogou LM
5 Embedded development
6 Speech QA

Resource Building

The current text resources have been re-arranged and listed.

AM development

Sparse DNN

Optimal Brain Damage (OBD).

GA-based block sparsity
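
As a reminder of what the OBD criterion computes, here is a minimal sketch in plain Python. It assumes the standard OBD saliency, 0.5 * h_kk * w_k^2, where h_kk is the diagonal of the Hessian of the loss with respect to weight w_k. The function names and the fixed pruning fraction are hypothetical and not taken from the group's code; how the Hessian diagonal is estimated is left out.

```python
def obd_saliencies(weights, hessian_diag):
    """Return the OBD saliency 0.5 * h_kk * w_k**2 for each weight."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]


def prune_lowest(weights, hessian_diag, fraction):
    """Zero out the given fraction of weights with the smallest saliency.

    `fraction` is an illustrative knob (assumption), not a setting from the report.
    """
    saliencies = obd_saliencies(weights, hessian_diag)
    n_prune = int(len(weights) * fraction)
    # Indices sorted from least to most salient.
    order = sorted(range(len(weights)), key=lambda k: saliencies[k])
    pruned = list(weights)
    for k in order[:n_prune]:
        pruned[k] = 0.0
    return pruned


if __name__ == "__main__":
    w = [1.0, -0.1, 0.5, 2.0]
    h = [1.0, 1.0, 4.0, 0.01]
    print(prune_lowest(w, h, 0.5))
```

Note that in this toy input the largest weight (2.0) is pruned because its curvature is small, which is the point where OBD differs from plain magnitude-based pruning.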

Efficient DNN training

Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Possibly overfitting?
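
The report does not say how the asymmetric window is configured. The sketch below assumes it means splicing each frame with more past frames than future frames when building the DNN input; the function name `splice` and the `left`/`right` parameters are hypothetical, and edge frames are simply repeated.

```python
def splice(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.

    frames: list of per-frame feature vectors (lists of floats).
    Frames beyond the utterance boundary are replaced by the nearest edge frame.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        context = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to [0, n - 1]
            context.extend(frames[idx])
        spliced.append(context)
    return spliced


if __name__ == "__main__":
    feats = [[0.0], [1.0], [2.0], [3.0]]
    # Asymmetric context: 2 frames of history, 1 frame of look-ahead.
    for row in splice(feats, left=2, right=1):
        print(row)
```

With `left != right` the input dimension is `(left + 1 + right)` times the per-frame dimension, so a longer history can be added without increasing look-ahead latency.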

Multi-GPU training

Error encountered

GMM-DNN co-training

Error encountered

Multilanguage training

Pure Chinese training reached 4.9%
Chinese + English reduced to 7.9%
The English phone set should distinguish beginning phones from ending phones
Should set up a multilingual network structure that shares the low layers but separates the languages at the high layers
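
The proposed structure (shared low layers, language-specific high layers) can be sketched as follows. This is only an illustration in plain Python under stated assumptions: the class name `MultilingualDNN`, the layer sizes, the sigmoid hidden units and the single per-language softmax output layer are all hypothetical choices, and training is not shown.

```python
import math
import random


def dense(x, weights, bias):
    """Affine transform: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid_layer(x, weights, bias):
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(x, weights, bias)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultilingualDNN:
    """Hidden layers shared by all languages, one output layer per language."""

    def __init__(self, n_in, hidden_sizes, outputs_per_language, seed=0):
        rng = random.Random(seed)
        self.shared = []
        prev = n_in
        for size in hidden_sizes:
            self.shared.append(init_layer(prev, size, rng))
            prev = size
        # One language-specific softmax layer on top of the shared stack.
        self.heads = {
            lang: init_layer(prev, n_out, rng)
            for lang, n_out in outputs_per_language.items()
        }

    def forward(self, x, language):
        h = x
        for weights, bias in self.shared:
            h = sigmoid_layer(h, weights, bias)
        weights, bias = self.heads[language]
        return softmax(dense(h, weights, bias))


if __name__ == "__main__":
    net = MultilingualDNN(4, [8, 8], {"zh": 5, "en": 3})
    print(net.forward([0.1, 0.2, 0.3, 0.4], "zh"))
```

Each training frame would update the shared layers plus only the output layer of its own language, so the two phone sets never compete in the same softmax.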

Noise training

Train on the WSJ database, corrupting the data with various noise types

White noise training completed. All results are fine.
Car noise training almost finished. Large-variance training in progress.
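
For reference, corrupting clean speech with a noise signal at a chosen signal-to-noise ratio can be sketched as below. This is a generic illustration, not the group's actual data pipeline: the function name `mix_at_snr` and the choice to control the noise level via SNR in dB are assumptions, and the signals are plain lists of samples.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal has zero power")
    # Noise power needed to hit the requested SNR.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    clean = [math.sin(0.1 * t) for t in range(1000)]
    # Deterministic stand-in for a recorded noise segment.
    noise = [float((t * 7919) % 13) - 6.0 for t in range(1000)]
    noisy = mix_at_snr(clean, noise, snr_db=10.0)
    print(len(noisy))
```

Varying `snr_db` per utterance is one simple way to expose the model to a wider range of noise levels during training.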

Engine optimization

Investigating LOUDS FST.

Word to Vector

Test a training toolkit from Stanford University that can incorporate global information into word-vector training

C++ implementation (instead of Python) for data pre-processing: failed. Just use Python.

Basic wordvector plus global sense

1 MB corpus costs 5 mins, vocab size 16698
10 MB corpus costs about 82 mins, vocab size 56287

Improved wordvector with multi sense

Almost impossible with the toolkit
Could pre-train the vectors and then do clustering

WordVector-based keyword extraction

Word-vector keyword extraction seems more reasonable if the keywords are in the lexicon
For OOV words, word-vector-based extraction is limited by the vocabulary
Need a standard method for new-word extraction

Investigating the SENNA toolkit from NEC, with the intention of implementing POS tagging based on word vectors.

LM development

NN LM

Character-based NNLM (6700 chars, 7-gram): training on 500M data done.

3 hours per iteration
For the word-based NNLM: 1 hour/iteration for 1024 words, 4 hours/iteration for 10240 words
Performance is lower than the word-based NNLM

WordVector-based word and char NNLM training done

The Google word-vector-based NNLM is worse than the randomly initialized NNLM

3T Sogou LM

Improved training

Re-segmentation with the Tencent 110k lexicon
Re-training with 4G text blocks
1/6 merge done. PPL reduced to 466 (vs. Tencent 8w8: 213.74)
Need to check the OOV problem
Need to finish the final merge.
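
Since the PPL figures above are compared across models, it may help to recall how perplexity is computed. The sketch below assumes natural-log per-token probabilities as input; the function name `perplexity` is hypothetical. How OOV tokens are scored (skipped, or mapped to an unknown-word token) changes the result, which is one reason the OOV issue needs checking before comparing numbers.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)) over the N scored tokens.
    """
    if not log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token.
    print(perplexity([math.log(0.25)] * 8))
```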

Embedded development

CLG embedded decoder is almost done. Online compiler is in progress.
Zhiyong is working on layer-by-layer DNN training.

Speech QA

N-best with entity LM was analyzed
Entity-class LM comparison

re-segmentation & re-train
SRILM class-based LM ???
Subgraph integration from Zhiyong

Retrieved from "http://index.cslt.org/mediawiki/index.php?title=2014-02-28&oldid=9235"