2014-03-21

From cslt Wiki
==Resource Building==

* Current text resources have been re-arranged and listed.

==Leftover questions==

* Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?
* Multi-GPU training: error encountered.
* Multilanguage training.
* Investigating LOUDS FST.
* CLG embedded decoder plus online compiler.
  
 
==AM development==

===Sparse DNN===

* GA-based block sparsity (a toy GA sketch follows this list)
:* code ready; testing on pure matrix multiplication
* Optimal Brain Damage (OBD)
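
Below is a minimal, self-contained sketch of what GA-based block sparsity can look like: a population of binary block masks is evolved so that the masked weight matrix preserves the layer's outputs while zeroing as many blocks as possible. Plain numpy; all sizes, rates, and the fitness trade-off are invented for illustration, not the lab's actual setup.

<pre>
# Hypothetical GA over block masks for one weight matrix (all settings assumed).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))          # a dense layer weight matrix
X = rng.standard_normal((64, 200))         # sample activations
B = 8                                      # the mask works on 8x8 blocks
GB = (W.shape[0] // B, W.shape[1] // B)    # mask grid shape

def expand(mask):
    """Blow a block mask up to a full weight-sized 0/1 matrix."""
    return np.kron(mask, np.ones((B, B)))

def fitness(mask, lam=0.05):
    """Negative output distortion plus a reward for sparsity."""
    Wm = W * expand(mask)
    err = np.mean((W @ X - Wm @ X) ** 2)
    return -err + lam * (1.0 - mask.mean())

pop = (rng.random((30,) + GB) > 0.3).astype(float)    # 30 random masks
for gen in range(100):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the 10 fittest
    children = []
    for _ in range(20):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(GB[0])
        child = np.vstack([a[:cut], b[cut:]])         # one-point crossover on block rows
        flip = rng.random(GB) < 0.02                  # mutation: flip a few blocks
        children.append(np.abs(child - flip))
    pop = np.concatenate([parents, np.array(children)])

best = pop[np.argmax([fitness(m) for m in pop])]
print("kept blocks: %.0f%%" % (100 * best.mean()))
</pre>

The appeal of block (rather than element-wise) sparsity is that whole zero blocks can be skipped in the matrix multiplication, which is exactly what the pure matrix-multiplication test above measures.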

===GMM/DNN co-training===

* Initial DNN test done on WSJ:
:* tri4b -> DNN (org)
:* DNN alignment -> tri4b
:* tri4b alignment -> DNN (re-train)

<pre>
 model/testcase            |  test_dev93 (cv)  |  test_eval92
 --------------------------+-------------------+-------------
 8400-80000 (org)          |       7.41        |     4.13
 re-train (keep state #)   |       7.20        |     4.24
 re-train (free state #)   |       7.29        |     4.31
</pre>

* Co-training using Tencent data (a toy version of the alternation follows this list):
:* slightly better GMM modeling when using the DNN alignment
:* worse performance when using the re-trained GMMs
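
A toy of the co-training alternation on synthetic 1-D, two-class data: a softmax-regression "DNN" is trained on the current alignment, its labels re-fit per-state Gaussians (the "GMM"), and the Gaussian realignment feeds the next round. Entirely fabricated data; it only illustrates the loop, not the Kaldi recipes actually used.

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 500), rng.normal(1, 1, 500)])
align = (x > 0).astype(int)                      # initial (tri4b-style) alignment

def fit_gmm(x, align):
    """One Gaussian per state, fit on the current alignment."""
    return [(x[align == k].mean(), x[align == k].std() + 1e-6) for k in (0, 1)]

def gmm_align(x, gmm):
    ll = np.stack([-0.5 * ((x - m) / s) ** 2 - np.log(s) for m, s in gmm])
    return ll.argmax(axis=0)

def train_dnn(x, align, steps=200, lr=0.5):
    w, b = np.zeros(2), np.zeros(2)              # softmax regression as a stand-in
    onehot = np.eye(2)[align]
    for _ in range(steps):
        z = np.outer(x, w) + b
        p = np.exp(z - z.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        g = p - onehot
        w -= lr * (x[:, None] * g).mean(0)
        b -= lr * g.mean(0)
    return w, b

for it in range(3):                              # the co-training alternation
    w, b = train_dnn(x, align)
    dnn_align = (np.outer(x, w) + b).argmax(axis=1)
    gmm = fit_gmm(x, dnn_align)                  # retrain GMM on the DNN alignment
    align = gmm_align(x, gmm)                    # realign for the next round
    print(f"round {it}: agreement {np.mean(align == (x > 0)):.2%}")
</pre>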

===Noise training===

* Train with the WSJ database, corrupting the data with various noise types (SNR-controlled mixing is sketched after this list); almost all training conditions are completed.
* Single noise injection:
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/7/7e/White-eps-converted-to.pdf White noise training]
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/e/ec/Cafe-eps-converted-to.pdf Cafe noise training]
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/3/39/Car-eps-converted-to.pdf Car noise training]
* Multi noise injection:
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f/fc/White_cafe_clean-eps-converted-to.pdf White + cafe noise training]
:* interesting results in multi-conditional training (white + cafe) tested on park/station noise
 
 
===Multilanguage training===

* Pure Chinese training reached 4.9%.
* Chinese + English training came to 7.9%.
* The English phone set should discriminate word-beginning phones from word-ending phones.
* A multilingual network structure should be set up that shares the low layers but separates the languages at the high layers (a topology sketch follows this list).
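
A back-of-the-envelope sketch of the proposed multilingual topology: shared low layers feeding one softmax head per language. Pure-numpy forward pass; all layer sizes and senone counts are invented.

<pre>
import numpy as np

rng = np.random.default_rng(0)
D, H, S_CN, S_EN = 40, 128, 3000, 2000   # input dim, hidden, senones per language

W1 = rng.standard_normal((H, D)) * 0.1
W2 = rng.standard_normal((H, H)) * 0.1
heads = {"cn": rng.standard_normal((S_CN, H)) * 0.1,
         "en": rng.standard_normal((S_EN, H)) * 0.1}

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, lang):
    h = np.tanh(W1 @ x)                   # shared layer 1: language-independent
    h = np.tanh(W2 @ h)                   # shared layer 2
    return softmax(heads[lang] @ h)       # language-specific output layer

x = rng.standard_normal(D)
print(forward(x, "cn").shape, forward(x, "en").shape)   # (3000,), (2000,)
</pre>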
 
  
 
===AMR compression re-training===

* WeChat uses AMR compression, which requires adapting our AM.
* Tested AMR vs. non-AMR models (WER in %; an AMR round-trip sketch follows this section):

<pre>
 model              wav    amr
 xent baseline      4.47
 wav_mpe baseline   4.20  36.77

 amr_mpe_lr_1e-5    6.27   8.95
 amr_mpe_lr_1e-4    7.58   8.68

 amr_xEnt_lr_1e-5   6.89   7.99
 amr_xEnt_lr_1e-4   6.61   7.28
 amr_xEnt_lr_0.08   5.72   6.20
</pre>

* 1700h AMR training is ongoing.
* Prepare to do adaptation on the 1700h set.
* Prepare to do the mixed xEnt test.
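
One way to produce matched-condition training data is to round-trip clean audio through the AMR codec. The sketch below assumes an ffmpeg build with libopencore_amrnb (AMR-NB is 8 kHz only, hence the resampling); the paths and the 12.2 kbps bitrate are hypothetical.

<pre>
import subprocess, pathlib

def amr_roundtrip(wav_in: str, wav_out: str, bitrate: str = "12.2k"):
    amr = pathlib.Path(wav_in).with_suffix(".amr")
    # encode: downsample to 8 kHz and compress with AMR-NB
    subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-ar", "8000",
                    "-c:a", "libopencore_amrnb", "-b:a", bitrate, str(amr)],
                   check=True)
    # decode back to 16 kHz PCM so the front-end sees the usual format
    subprocess.run(["ffmpeg", "-y", "-i", str(amr), "-ar", "16000", wav_out],
                   check=True)

amr_roundtrip("train/utt0001.wav", "train_amr/utt0001.wav")  # hypothetical paths
</pre>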
  
 
===GFbank===

* Finished the first round of gfbank training & test (a gammatone filterbank sketch follows this section).
* The same GMM model (MFCC features) was used to get the alignment; the fbank and gfbank systems are trained on this MFCC alignment.
* gfbank is better than gfcc.
* gfbank is better than fbank.
* gfbank + fbank seems to outperform the others.
* Clean training, noisy test (WER in %):

<pre>
             clean    5dB   10dB   15dB   20dB   25dB
 gfbank       4.22  73.03  39.20  16.41   8.36   5.60
 gfbank_80    4.36  74.41  42.94  18.13   8.59   5.85
 fbank_zmy    3.97  74.78  44.57  18.80   8.54   5.30
</pre>

* gfbank + fbank 80-dim training/test.
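
A rough sketch of gammatone filterbank ("gfbank") features, assuming gfbank means log energies of ERB-spaced gammatone filters, by analogy with mel fbank. The constants (4th order, 40 channels, 25 ms frames) are assumptions, not the lab's exact front-end.

<pre>
import numpy as np

SR, NCH, FRAME, HOP = 16000, 40, 400, 160   # 25 ms frames, 10 ms hop

def erb_centers(fmin=50.0, fmax=7600.0, n=NCH):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(fmin), erb(fmax), n))

def gammatone_ir(fc, dur=0.025):
    """4th-order gammatone impulse response at center frequency fc."""
    t = np.arange(int(dur * SR)) / SR
    b = 1.019 * 24.7 * (0.00437 * fc + 1)          # ERB bandwidth
    return t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfbank(sig):
    feats = []
    for fc in erb_centers():
        y = np.convolve(sig, gammatone_ir(fc), mode="same") ** 2
        # frame the filter-output energy and take logs
        frames = [y[i:i + FRAME].sum() for i in range(0, len(y) - FRAME, HOP)]
        feats.append(np.log(np.asarray(frames) + 1e-10))
    return np.stack(feats, axis=1)                  # (n_frames, NCH)

sig = np.random.default_rng(0).standard_normal(SR)  # 1 s of noise as a stand-in
print(gfbank(sig).shape)
</pre>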

===Engine optimization===

* Investigating LOUDS FST.
  
 
==Word to Vector==

* Data preparation:
:* prepared 7 categories, 500+ articles in total
:* prepared the Sogou 9-class text, 9 × 2000 articles in total
:* acquired the Fudan 11-class text data, for testing only
* Improve word vectors with multiple senses:
:* almost impossible with the current toolkit
:* could pre-train vectors and then do clustering
* Word-vector-based keyword extraction:
:* decided to use the Sogou data for the extraction
:* evaluate the keywords on the classification task
* Word-vector-based classification (a toy sketch follows this list):
:* decided to use the Sogou data for the extraction
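
A toy of word-vector-based classification: represent a document by the mean of its word vectors and assign the nearest class centroid. The tiny hand-made 2-D vectors stand in for real word2vec output; all words, classes, and numbers are fabricated.

<pre>
import numpy as np

vec = {"stock": [0.9, 0.1], "market": [0.8, 0.2], "fund": [0.7, 0.1],
       "match": [0.1, 0.9], "coach": [0.2, 0.8], "goal":  [0.1, 0.7]}
vec = {w: np.asarray(v) for w, v in vec.items()}

def doc_vec(words):
    return np.mean([vec[w] for w in words if w in vec], axis=0)

train = {"finance": [["stock", "market"], ["fund", "market"]],
         "sports":  [["match", "coach"], ["goal", "match"]]}
centroids = {c: np.mean([doc_vec(d) for d in docs], axis=0)
             for c, docs in train.items()}

def classify(words):
    v = doc_vec(words)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(classify(["stock", "fund"]))   # -> finance
print(classify(["coach", "goal"]))   # -> sports
</pre>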
 
==LM development==

===NN LM===

* Character-based NNLM (6,700 chars, 7-gram): training on the 500M data set is done (a shape sketch follows this list).
* Boundary-involved char NNLM training done.
* Testing is ongoing.
* Investigate MS RNN LM training.
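
A back-of-the-envelope sketch of the model shape described above: a 7-gram character NNLM predicts the 7th character from the previous 6 over a 6,700-character vocabulary. Pure-numpy forward pass; the embedding and hidden sizes are assumptions.

<pre>
import numpy as np

V, CTX, E, H = 6700, 6, 50, 256        # vocab, context chars, emb dim, hidden
rng = np.random.default_rng(0)
emb = rng.standard_normal((V, E)) * 0.01
W1 = rng.standard_normal((H, CTX * E)) * 0.01
W2 = rng.standard_normal((V, H)) * 0.01

def next_char_dist(context_ids):
    """P(char_7 | previous 6 chars) for one context window."""
    x = emb[context_ids].reshape(-1)       # concatenate the 6 embeddings
    h = np.tanh(W1 @ x)
    z = W2 @ h
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

p = next_char_dist(rng.integers(V, size=CTX))
print(p.shape, p.sum())                    # (6700,) 1.0
</pre>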
  
===3T Sogou LM===

* 3T + Tencent LM combination:
:* combine the 3T vocabulary (110k) and the Tencent 80k vocabulary
:* re-segmentation
:* compute PPL with the 3T and Tencent LMs
:* compute the best mixing weights; the estimated mixing weight looks wrong
:* mixing the two with equal weights (0.5/0.5) performs better than either LM alone
* 3T + QA model combination

==Pronunciation scoring==

* G-score done on the 16k English model (a posterior-scoring sketch follows this list).
* The distribution of frames over phone/frame posterior scores seems highly discriminative.
* The distribution of distances between the test utterance and the reference utterance also seems to be a highly discriminative score.
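
If the G-score is a GOP-style posterior measure (my reading, which may differ from the actual definition), the core computation is the average log frame posterior of the intended phone over its aligned frames. A minimal sketch with fabricated posteriors and a fake alignment:

<pre>
import numpy as np

rng = np.random.default_rng(0)
n_phones, n_frames = 40, 120
post = rng.dirichlet(np.ones(n_phones), size=n_frames)  # fake DNN posteriors

def phone_score(post, seg, phone_id):
    """Mean log posterior of phone_id over frames [seg[0], seg[1])."""
    return float(np.mean(np.log(post[seg[0]:seg[1], phone_id] + 1e-10)))

# a fake alignment: (phone_id, start_frame, end_frame)
alignment = [(3, 0, 40), (17, 40, 90), (8, 90, 120)]
scores = [phone_score(post, (s, e), p) for p, s, e in alignment]
print([round(s, 2) for s in scores])   # higher (closer to 0) = better
</pre>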

==QA==

* FST-based matching (a toy union/determinization demo follows this list):
:* code done; a simple test is done
:* ready for a large-scale test
:* investigating why the OpenFst union does not lead to a determinizable graph
:* test the pattern label
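
A self-contained toy of the union-then-determinize step under discussion, in pure Python standing in for OpenFst: two keyword acceptors are unioned with epsilon arcs and determinized by subset construction. Unweighted and tiny, so it sidesteps the weighted-determinizability issue noted above; it only illustrates the operations.

<pre>
from itertools import count

def acceptor(word, ids):
    """Chain automaton accepting exactly `word`."""
    states = [next(ids) for _ in range(len(word) + 1)]
    arcs = {(states[i], ch): {states[i + 1]} for i, ch in enumerate(word)}
    return states[0], arcs, {states[-1]}

ids = count()
s1, a1, f1 = acceptor("song", ids)
s2, a2, f2 = acceptor("singer", ids)
start = next(ids)                        # union: new start, eps arcs to both
arcs = {**a1, **a2, (start, ""): {s1, s2}}
finals = f1 | f2

def eps_closure(states):
    out, frontier = set(states), set(states)
    while frontier:
        nxt = set().union(*(arcs.get((q, ""), set()) for q in frontier)) - out
        out |= nxt
        frontier = nxt
    return frozenset(out)

def determinize(start):
    """Subset construction over the epsilon-free alphabet."""
    alphabet = {ch for (_, ch) in arcs if ch}
    init = eps_closure({start})
    dfa, todo = {}, [init]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for ch in alphabet:
            T = eps_closure(set().union(*(arcs.get((q, ch), set()) for q in S)))
            if T:
                dfa[S][ch] = T
                todo.append(T)
    return init, dfa

init, dfa = determinize(start)

def accepts(word):
    S = init
    for ch in word:
        S = dfa.get(S, {}).get(ch)
        if S is None:
            return False
    return bool(S & finals)

print(accepts("song"), accepts("singer"), accepts("sing"))  # True True False
</pre>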
 
* TF/IDF weight (a minimal sketch follows):
:* code is done; the TF/IDF weights can be used right now
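
A minimal TF-IDF weighting sketch, assuming (as is typical) that the weights score word overlap between a query and the candidate questions; the example questions are made up.

<pre>
import math
from collections import Counter

questions = ["who sings this song", "play a song", "who is the singer"]
docs = [q.split() for q in questions]
df = Counter(w for d in docs for w in set(d))   # document frequency per word
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log((1 + N) / (1 + df[w]))
            for w in tf}

def score(query, doc_vec):
    return sum(doc_vec.get(w, 0.0) for w in query.split())

vecs = [tfidf(d) for d in docs]
q = "who sings the song"
best = max(range(N), key=lambda i: score(q, vecs[i]))
print(questions[best])
</pre>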
 

==Embedded development==

* The CLG embedded decoder is almost done; the online compiler is in progress.
* English scoring is under way.

==Speech QA==

* N-best with entity LM was analyzed:
:* the WER vs. QA-accuracy analysis is done
:* the figure shows that WER and QA accuracy are positively related
:* adding song names and singer names improves performance in most cases
:* there are exceptions in the figure: (a) higher WER does not necessarily reduce QA accuracy, and (b) adding entity names does not always improve QA
:* results: [[媒体文件:Music_QA_wer.pdf]]
  
  
* Class LM QA:
:* use the QA LM as the baseline; tag singer and song names, build a tag LM, and resolve the tags by graph integration
:* with a smaller weight on the class FST, better performance is obtained
:* it is very difficult to retrieve words that cannot be found by the original FST
:* test negative weights
:* adjusting the in-tag weight: a smaller weight produces more entity recognition
:* check whether the recognized songs/singers are correct or wrong (a toy cost sketch follows the table below)
 
<pre>
1. non-merge
   Baseline:   qa-singer-song
   songs      41
   singers    23

2. HCLG-merge
   ("weight" is the multiplier applied to the sub-graph entry)
   (1) LM: 1e-5
   weight   0.00000001  0.0001  0.001  0.01    1   10
   songs        20        20     21     19     9    4
   singers      13        13     13     13     2    2
</pre>
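
A toy of the class-LM decoding idea behind the table: singer/song names live in a tagged sub-graph, and the "weight" multiplies the sub-graph entry cost (tropical semiring, so lower cost wins). It mirrors the finding that a smaller entry weight yields more entity recognition; all probabilities are made up.

<pre>
import math

# negative log-probabilities (costs) for two competing arcs after "play"
cost_plain = -math.log(0.05)          # "play music" via the plain word LM
cost_entry = -math.log(0.01)          # entering the song-tag sub-graph
cost_name  = -math.log(0.10)          # a song name inside the sub-graph

for w in (10, 1, 0.01, 0.0001):
    entity_cost = w * cost_entry + cost_name
    print(f"entry multiplier {w:>7}: entity path wins -> "
          f"{entity_cost < cost_plain}")
</pre>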
