2014-03-07

Resource Building

The current text resources have been rearranged and listed

AM development

Sparse DNN

Optimal Brain Damage (OBD).

GA-based block sparsity
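
For reference on the OBD item above: OBD ranks each weight by the saliency h_kk * w_k^2 / 2, where h_kk is the diagonal of the Hessian of the loss, and removes the weights with the lowest saliency. The sketch below assumes the Hessian diagonal has already been estimated. The function name `obd_prune` and the `fraction` parameter are hypothetical and are not taken from our training code.

```python
def obd_prune(weights, hessian_diag, fraction):
    """Zero out the given fraction of weights with the lowest OBD saliency."""
    if len(weights) != len(hessian_diag):
        raise ValueError("weights and hessian_diag must have the same length")
    # OBD saliency: estimated increase in loss when weight k is set to zero.
    saliency = [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]
    n_prune = int(len(weights) * fraction)
    # Indices sorted from least to most salient.
    order = sorted(range(len(weights)), key=lambda k: saliency[k])
    pruned = list(weights)
    for k in order[:n_prune]:
        pruned[k] = 0.0
    return pruned
```

In practice the network is usually retrained after each pruning pass, which this sketch leaves out.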

Efficient DNN training

Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?
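
A minimal sketch of what an asymmetric window means at the feature level, assuming it refers to splicing each frame with a different number of left and right context frames. The function name `splice` and the edge padding (repeating the first or last frame) are assumptions for illustration, not a description of the actual setup.

```python
def splice(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.

    Edges are padded by repeating the first or last frame.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to the valid range
            window.extend(frames[idx])
        spliced.append(window)
    return spliced
```

With `left=2, right=1`, each spliced vector covers four frames, three of which are at or before the current frame.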

Multi GPU training

Error encountered

GMM - DNN co-training

Error encountered

Multilanguage training

Pure Chinese training reached 4.9%
Chinese + English training: performance dropped to 7.9%
The English phone set should distinguish beginning phones from ending phones
Should set up a multilingual network structure that shares the low layers but separates the languages at the high layers
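
The proposed structure can be sketched as a forward pass in which the hidden layers are shared and each language has its own output layer. This is an illustration only: the layer shapes, the sigmoid activation, and the names `forward`, `shared_layers`, and `heads` are all assumptions.

```python
import math

def affine(x, weights, bias):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, shared_layers, heads, language):
    """Run the shared low layers, then the output layer of one language."""
    h = x
    for weights, bias in shared_layers:
        h = sigmoid(affine(h, weights, bias))
    out_weights, out_bias = heads[language]
    return softmax(affine(h, out_weights, out_bias))
```

During training, a frame from one language would update the shared layers and only that language's output layer.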

Noise training

Train with the WSJ database, corrupting the data with various noise types

White noise + car noise training partially completed
Mixture training produces better performance for both car and white noise
Unknown noise testing is in progress
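
A minimal sketch of how an utterance could be corrupted with noise at a chosen SNR, assuming both signals are lists of samples at the same sampling rate. The function name `mix_at_snr` is hypothetical, and looping the noise to cover the utterance is an assumption rather than the procedure actually used.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise to reach the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in tiled) / len(tiled)
    if noise_power == 0.0:
        raise ValueError("noise is silent")
    # Choose the gain so that speech_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, tiled)]
```

Mixture training would then draw the noise type (for example white or car noise) and the SNR at random for each utterance.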

AMR compression re-training

WeChat uses the AMR compression method, so our AM needs to be adapted
Tested AMR and non-AMR models:

        model \ test-wav      WAV      AMR
        WAV                   4.31    26.09
        AMR                  13.80     6.77

Prepare to do adaptation

GFbank

Finished the first round of GFbank training and testing
The same GMM model (MFCC features) was used to obtain the alignment
Trained Fbank and GFbank models based on the MFCC alignment
Clean training, noisy test

          clean    25 dB    5 dB
GFbank    4.22     5.60     73.03
Fbank     4.31     5.87     84.12
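
To compare the two front ends per condition, the relative reduction can be computed from the table. The sketch assumes the figures are error rates in percent (lower is better), which the table does not state explicitly. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline

# Figures copied from the table: (Fbank, GFbank) for each test condition.
results = {"clean": (4.31, 4.22), "25 dB": (5.87, 5.60), "5 dB": (84.12, 73.03)}
reductions = {cond: relative_reduction(fb, gf) for cond, (fb, gf) in results.items()}
```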

Engine optimization

Investigating LOUDS FST.

Word to Vector

Tested a training toolkit from Stanford University that can incorporate global information into word-vector training

C++ implementation (instead of Python) for data pre-processing failed. Just use Python.

Basic word vector plus global sense

1 MB corpus takes 5 min, vocab size 16698
10 MB corpus takes about 82 min, vocab size 56287

Improved word vector with multiple senses

Almost impossible with the toolkit
Could pre-train vectors and then do clustering
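
One way the pre-train-then-cluster idea could work: for each occurrence of an ambiguous word, average the pre-trained vectors of its context words, then cluster those context vectors so that each cluster stands for one sense. The sketch below is an assumption about the approach, not a tested design. The names `context_vector` and `kmeans` are hypothetical, and starting the centroids at the first k points is chosen only to keep the example deterministic.

```python
def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def context_vector(context_words, vectors):
    """Average the pre-trained vectors of the in-vocabulary context words."""
    known = [vectors[w] for w in context_words if w in vectors]
    if not known:
        raise ValueError("no context word has a pre-trained vector")
    return mean_vector(known)

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return labels, centroids
```

Each resulting centroid could then serve as the vector of one sense of the word.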

Word-vector-based keyword extraction

Prepared 500+ articles in 7 categories
A problem was found in keyword identification. Fix it by using the article vector space
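
One possible reading of using the article vector space: represent the article by the average of its word vectors and rank candidate words by cosine similarity to that average. The sketch below follows this reading only as an assumption. The function name `extract_keywords` is hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def extract_keywords(words, vectors, top_n):
    """Rank the distinct in-vocabulary words of an article by similarity to the article vector."""
    known = [w for w in words if w in vectors]
    if not known:
        return []
    dim = len(vectors[known[0]])
    # Article vector: average over all word occurrences, so frequent words weigh more.
    article = [sum(vectors[w][d] for w in known) / len(known) for d in range(dim)]
    candidates = sorted(set(known))
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], article), reverse=True)
    return ranked[:top_n]
```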

Investigating the Senna toolkit from NEC, with the intention of implementing POS tagging based on word vectors.

LM development

NN LM

Character-based NNLM (6700 chars, 7-gram): training on 500M data done.

Performance lower than word-based NNLM
Prepare to run boundary-involved char NNLM

Word-vector-based word and char NNLM training done

The NNLM initialized with Google word vectors is worse than the randomly initialized NNLM

3T Sogou LM

Improved training

3T LM + Tencent 80K LM: performance worse than the original 80K LM
Need to check whether this is caused by the mismatched vocabulary
3T LM + QA LM: used online1 as the EM target, performance worse than the QA LM
Probably due to the incorrect EM target
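
Why the EM target matters: the interpolation weight is fitted to maximize the likelihood of the target text, so a target that does not match the test domain gives a weight that does not suit the test data. A minimal sketch of the EM update for two models follows, assuming each model's per-token probabilities on the target text are already available and strictly positive. The function name `em_interpolation_weight` is hypothetical.

```python
def em_interpolation_weight(probs_a, probs_b, iterations=50, init=0.5):
    """Estimate lambda in p = lambda * p_a + (1 - lambda) * p_b by EM.

    probs_a and probs_b hold each model's probability for every token
    of the target text, in the same order.
    """
    if len(probs_a) != len(probs_b) or not probs_a:
        raise ValueError("need two equally long, non-empty probability lists")
    lam = init
    for _ in range(iterations):
        # E-step: posterior that each token was generated by model A.
        posteriors = [lam * pa / (lam * pa + (1.0 - lam) * pb)
                      for pa, pb in zip(probs_a, probs_b)]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam
```

If model A assigns higher probability to most target tokens, the weight moves toward 1, and toward 0 in the opposite case.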

QA Matching

Working on edit FST for fuzzy matching
TF/IDF score matching completed
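
For reference, TF/IDF score matching can be sketched as cosine similarity between TF-IDF vectors of the query and of each stored question. The smoothed IDF formula and the names `build_idf`, `tfidf`, and `best_match` are assumptions for illustration and do not describe the completed implementation.

```python
import math
from collections import Counter

def build_idf(documents):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def best_match(query, questions):
    """Return the index of the stored question most similar to the query."""
    idf = build_idf(questions)
    q_vec = tfidf(query, idf)
    scores = [cosine(q_vec, tfidf(doc, idf)) for doc in questions]
    return max(range(len(questions)), key=lambda i: scores[i])
```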

Embedded development

CLG embedded decoder is almost done. Online compiler is in progress.
English scoring is under way

Speech QA

N-best with entity LM was analyzed
Entity-class LM comparison

re-segmentation & re-train
SRILM class-based LM ???
Subgraph integration from Zhiyong

WER summary is done
Prepare to compose a paper
