2014-03-21

From cslt Wiki
==Resource Building==

* Current text resources have been re-arranged and listed.

==Leftover questions==

* Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?
* Multi-GPU training: error encountered.
* Multilanguage training.
* Investigating LOUDS FST.
* CLG embedded decoder plus online compiler.
  
 
==AM development==

===Sparse DNN===

* GA-based block sparsity (a toy GA sketch follows this list)
:* code ready; testing on pure matrix multiplication
* Optimal Brain Damage (OBD)
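
Below is a minimal, self-contained sketch of what GA-based block sparsity can look like: a population of binary block masks is evolved so that the masked weight matrix preserves the layer's outputs while zeroing as many blocks as possible. Plain numpy; all sizes, rates, and the fitness trade-off are invented for illustration, not the lab's actual setup.

<pre>
# Hypothetical GA over block masks for one weight matrix (all settings assumed).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))          # a dense layer weight matrix
X = rng.standard_normal((64, 200))         # sample activations
B = 8                                      # the mask works on 8x8 blocks
GB = (W.shape[0] // B, W.shape[1] // B)    # mask grid shape

def expand(mask):
    """Blow a block mask up to a full weight-sized 0/1 matrix."""
    return np.kron(mask, np.ones((B, B)))

def fitness(mask, lam=0.05):
    """Negative output distortion plus a reward for sparsity."""
    Wm = W * expand(mask)
    err = np.mean((W @ X - Wm @ X) ** 2)
    return -err + lam * (1.0 - mask.mean())

pop = (rng.random((30,) + GB) > 0.3).astype(float)    # 30 random masks
for gen in range(100):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the 10 fittest
    children = []
    for _ in range(20):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(GB[0])
        child = np.vstack([a[:cut], b[cut:]])         # one-point crossover on block rows
        flip = rng.random(GB) < 0.02                  # mutation: flip a few blocks
        children.append(np.abs(child - flip))
    pop = np.concatenate([parents, np.array(children)])

best = pop[np.argmax([fitness(m) for m in pop])]
print("kept blocks: %.0f%%" % (100 * best.mean()))
</pre>

The appeal of block (rather than element-wise) sparsity is that whole zero blocks can be skipped in the matrix multiplication, which is exactly what the pure matrix-multiplication test above measures.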

===GMM/DNN co-training===

* Initial DNN test done on WSJ:
:* tri4b -> DNN (org)
:* DNN alignment -> tri4b
:* tri4b alignment -> DNN (re-train)

<pre>
 model/testcase            |  test_dev93 (cv)  |  test_eval92
 --------------------------+-------------------+-------------
 8400-80000 (org)          |       7.41        |     4.13
 re-train (keep state #)   |       7.20        |     4.24
 re-train (free state #)   |       7.29        |     4.31
</pre>

* Co-training using Tencent data (a toy version of the alternation follows this list):
:* slightly better GMM modeling when using the DNN alignment
:* worse performance when using the re-trained GMMs
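
A toy of the co-training alternation on synthetic 1-D, two-class data: a softmax-regression "DNN" is trained on the current alignment, its labels re-fit per-state Gaussians (the "GMM"), and the Gaussian realignment feeds the next round. Entirely fabricated data; it only illustrates the loop, not the Kaldi recipes actually used.

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 500), rng.normal(1, 1, 500)])
align = (x > 0).astype(int)                      # initial (tri4b-style) alignment

def fit_gmm(x, align):
    """One Gaussian per state, fit on the current alignment."""
    return [(x[align == k].mean(), x[align == k].std() + 1e-6) for k in (0, 1)]

def gmm_align(x, gmm):
    ll = np.stack([-0.5 * ((x - m) / s) ** 2 - np.log(s) for m, s in gmm])
    return ll.argmax(axis=0)

def train_dnn(x, align, steps=200, lr=0.5):
    w, b = np.zeros(2), np.zeros(2)              # softmax regression as a stand-in
    onehot = np.eye(2)[align]
    for _ in range(steps):
        z = np.outer(x, w) + b
        p = np.exp(z - z.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        g = p - onehot
        w -= lr * (x[:, None] * g).mean(0)
        b -= lr * g.mean(0)
    return w, b

for it in range(3):                              # the co-training alternation
    w, b = train_dnn(x, align)
    dnn_align = (np.outer(x, w) + b).argmax(axis=1)
    gmm = fit_gmm(x, dnn_align)                  # retrain GMM on the DNN alignment
    align = gmm_align(x, gmm)                    # realign for the next round
    print(f"round {it}: agreement {np.mean(align == (x > 0)):.2%}")
</pre>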

===Noise training===

* Train with the WSJ database, corrupting the data with various noise types (SNR-controlled mixing is sketched after this list); almost all training conditions are completed.
* Single noise injection:
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/7/7e/White-eps-converted-to.pdf White noise training]
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/e/ec/Cafe-eps-converted-to.pdf Cafe noise training]
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/3/39/Car-eps-converted-to.pdf Car noise training]
* Multi noise injection:
:* [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f/fc/White_cafe_clean-eps-converted-to.pdf White + cafe noise training]
:* interesting results in multi-conditional training (white + cafe) tested on park/station noise
 
 
===Multilanguage training===

* Pure Chinese training reached 4.9%.
* Chinese + English training came to 7.9%.
* The English phone set should discriminate word-beginning phones from word-ending phones.
* A multilingual network structure should be set up that shares the low layers but separates the languages at the high layers (a topology sketch follows this list).
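
A back-of-the-envelope sketch of the proposed multilingual topology: shared low layers feeding one softmax head per language. Pure-numpy forward pass; all layer sizes and senone counts are invented.

<pre>
import numpy as np

rng = np.random.default_rng(0)
D, H, S_CN, S_EN = 40, 128, 3000, 2000   # input dim, hidden, senones per language

W1 = rng.standard_normal((H, D)) * 0.1
W2 = rng.standard_normal((H, H)) * 0.1
heads = {"cn": rng.standard_normal((S_CN, H)) * 0.1,
         "en": rng.standard_normal((S_EN, H)) * 0.1}

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, lang):
    h = np.tanh(W1 @ x)                   # shared layer 1: language-independent
    h = np.tanh(W2 @ h)                   # shared layer 2
    return softmax(heads[lang] @ h)       # language-specific output layer

x = rng.standard_normal(D)
print(forward(x, "cn").shape, forward(x, "en").shape)   # (3000,), (2000,)
</pre>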
 
  
 
===AMR compression re-training===

* WeChat uses AMR compression, which requires adapting our AM.
* Tested AMR vs. non-AMR models (WER in %; an AMR round-trip sketch follows this section):

<pre>
 model              wav    amr
 xent baseline      4.47
 wav_mpe baseline   4.20  36.77

 amr_mpe_lr_1e-5    6.27   8.95
 amr_mpe_lr_1e-4    7.58   8.68

 amr_xEnt_lr_1e-5   6.89   7.99
 amr_xEnt_lr_1e-4   6.61   7.28
 amr_xEnt_lr_0.08   5.72   6.20
</pre>

* 1700h AMR training is ongoing.
* Prepare to do adaptation on the 1700h set.
* Prepare to do the mixed xEnt test.
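
One way to produce matched-condition training data is to round-trip clean audio through the AMR codec. The sketch below assumes an ffmpeg build with libopencore_amrnb (AMR-NB is 8 kHz only, hence the resampling); the paths and the 12.2 kbps bitrate are hypothetical.

<pre>
import subprocess, pathlib

def amr_roundtrip(wav_in: str, wav_out: str, bitrate: str = "12.2k"):
    amr = pathlib.Path(wav_in).with_suffix(".amr")
    # encode: downsample to 8 kHz and compress with AMR-NB
    subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-ar", "8000",
                    "-c:a", "libopencore_amrnb", "-b:a", bitrate, str(amr)],
                   check=True)
    # decode back to 16 kHz PCM so the front-end sees the usual format
    subprocess.run(["ffmpeg", "-y", "-i", str(amr), "-ar", "16000", wav_out],
                   check=True)

amr_roundtrip("train/utt0001.wav", "train_amr/utt0001.wav")  # hypothetical paths
</pre>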
  
 
===GFbank===

* Finished the first round of gfbank training & test (a gammatone filterbank sketch follows this section).
* The same GMM model (MFCC features) was used to get the alignment; the fbank and gfbank systems are trained on this MFCC alignment.
* gfbank is better than gfcc.
* gfbank is better than fbank.
* gfbank + fbank seems to outperform the others.
* Clean training, noisy test (WER in %):

<pre>
             clean    5dB   10dB   15dB   20dB   25dB
 gfbank       4.22  73.03  39.20  16.41   8.36   5.60
 gfbank_80    4.36  74.41  42.94  18.13   8.59   5.85
 fbank_zmy    3.97  74.78  44.57  18.80   8.54   5.30
</pre>

* gfbank + fbank 80-dim training/test.
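
A rough sketch of gammatone filterbank ("gfbank") features, assuming gfbank means log energies of ERB-spaced gammatone filters, by analogy with mel fbank. The constants (4th order, 40 channels, 25 ms frames) are assumptions, not the lab's exact front-end.

<pre>
import numpy as np

SR, NCH, FRAME, HOP = 16000, 40, 400, 160   # 25 ms frames, 10 ms hop

def erb_centers(fmin=50.0, fmax=7600.0, n=NCH):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(fmin), erb(fmax), n))

def gammatone_ir(fc, dur=0.025):
    """4th-order gammatone impulse response at center frequency fc."""
    t = np.arange(int(dur * SR)) / SR
    b = 1.019 * 24.7 * (0.00437 * fc + 1)          # ERB bandwidth
    return t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfbank(sig):
    feats = []
    for fc in erb_centers():
        y = np.convolve(sig, gammatone_ir(fc), mode="same") ** 2
        # frame the filter-output energy and take logs
        frames = [y[i:i + FRAME].sum() for i in range(0, len(y) - FRAME, HOP)]
        feats.append(np.log(np.asarray(frames) + 1e-10))
    return np.stack(feats, axis=1)                  # (n_frames, NCH)

sig = np.random.default_rng(0).standard_normal(SR)  # 1 s of noise as a stand-in
print(gfbank(sig).shape)
</pre>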

===Engine optimization===

* Investigating LOUDS FST.
  
 
==Word to Vector==

* Data preparation:
:* prepared 7 categories, 500+ articles in total
:* prepared the Sogou 9-class text, 9 × 2000 articles in total
:* acquired the Fudan 11-class text data, for testing only
* Improve word vectors with multiple senses:
:* almost impossible with the current toolkit
:* could pre-train vectors and then do clustering
* Word-vector-based keyword extraction:
:* decided to use the Sogou data for the extraction
:* evaluate the keywords on the classification task
* Word-vector-based classification (a toy sketch follows this list):
:* decided to use the Sogou data for the extraction
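
A toy of word-vector-based classification: represent a document by the mean of its word vectors and assign the nearest class centroid. The tiny hand-made 2-D vectors stand in for real word2vec output; all words, classes, and numbers are fabricated.

<pre>
import numpy as np

vec = {"stock": [0.9, 0.1], "market": [0.8, 0.2], "fund": [0.7, 0.1],
       "match": [0.1, 0.9], "coach": [0.2, 0.8], "goal":  [0.1, 0.7]}
vec = {w: np.asarray(v) for w, v in vec.items()}

def doc_vec(words):
    return np.mean([vec[w] for w in words if w in vec], axis=0)

train = {"finance": [["stock", "market"], ["fund", "market"]],
         "sports":  [["match", "coach"], ["goal", "match"]]}
centroids = {c: np.mean([doc_vec(d) for d in docs], axis=0)
             for c, docs in train.items()}

def classify(words):
    v = doc_vec(words)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(classify(["stock", "fund"]))   # -> finance
print(classify(["coach", "goal"]))   # -> sports
</pre>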
 
==LM development==

===NN LM===

* Character-based NNLM (6,700 chars, 7-gram): training on the 500M data set is done (a shape sketch follows this list).
* Boundary-involved char NNLM training done.
* Testing is ongoing.
* Investigate MS RNN LM training.
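
A back-of-the-envelope sketch of the model shape described above: a 7-gram character NNLM predicts the 7th character from the previous 6 over a 6,700-character vocabulary. Pure-numpy forward pass; the embedding and hidden sizes are assumptions.

<pre>
import numpy as np

V, CTX, E, H = 6700, 6, 50, 256        # vocab, context chars, emb dim, hidden
rng = np.random.default_rng(0)
emb = rng.standard_normal((V, E)) * 0.01
W1 = rng.standard_normal((H, CTX * E)) * 0.01
W2 = rng.standard_normal((V, H)) * 0.01

def next_char_dist(context_ids):
    """P(char_7 | previous 6 chars) for one context window."""
    x = emb[context_ids].reshape(-1)       # concatenate the 6 embeddings
    h = np.tanh(W1 @ x)
    z = W2 @ h
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

p = next_char_dist(rng.integers(V, size=CTX))
print(p.shape, p.sum())                    # (6700,) 1.0
</pre>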
  
===3T Sogou LM===

* 3T + Tencent LM combination:
:* combine the 3T vocabulary (110k) and the Tencent 80k vocabulary
:* re-segmentation
:* compute PPL with the 3T and Tencent LMs
:* compute the best mixing weights; the estimated mixing weight looks wrong
:* mixing the two with equal weights (0.5/0.5) performs better than either LM alone
* 3T + QA model combination

==Pronunciation scoring==

* G-score done on the 16k English model (a posterior-scoring sketch follows this list).
* The distribution of frames over phone/frame posterior scores seems highly discriminative.
* The distribution of distances between the test utterance and the reference utterance also seems to be a highly discriminative score.
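
If the G-score is a GOP-style posterior measure (my reading, which may differ from the actual definition), the core computation is the average log frame posterior of the intended phone over its aligned frames. A minimal sketch with fabricated posteriors and a fake alignment:

<pre>
import numpy as np

rng = np.random.default_rng(0)
n_phones, n_frames = 40, 120
post = rng.dirichlet(np.ones(n_phones), size=n_frames)  # fake DNN posteriors

def phone_score(post, seg, phone_id):
    """Mean log posterior of phone_id over frames [seg[0], seg[1])."""
    return float(np.mean(np.log(post[seg[0]:seg[1], phone_id] + 1e-10)))

# a fake alignment: (phone_id, start_frame, end_frame)
alignment = [(3, 0, 40), (17, 40, 90), (8, 90, 120)]
scores = [phone_score(post, (s, e), p) for p, s, e in alignment]
print([round(s, 2) for s in scores])   # higher (closer to 0) = better
</pre>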

==QA==

* FST-based matching (a toy union/determinization demo follows this list):
:* code done; a simple test is done
:* ready for a large-scale test
:* investigating why the OpenFst union does not lead to a determinizable graph
:* test the pattern label
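
A self-contained toy of the union-then-determinize step under discussion, in pure Python standing in for OpenFst: two keyword acceptors are unioned with epsilon arcs and determinized by subset construction. Unweighted and tiny, so it sidesteps the weighted-determinizability issue noted above; it only illustrates the operations.

<pre>
from itertools import count

def acceptor(word, ids):
    """Chain automaton accepting exactly `word`."""
    states = [next(ids) for _ in range(len(word) + 1)]
    arcs = {(states[i], ch): {states[i + 1]} for i, ch in enumerate(word)}
    return states[0], arcs, {states[-1]}

ids = count()
s1, a1, f1 = acceptor("song", ids)
s2, a2, f2 = acceptor("singer", ids)
start = next(ids)                        # union: new start, eps arcs to both
arcs = {**a1, **a2, (start, ""): {s1, s2}}
finals = f1 | f2

def eps_closure(states):
    out, frontier = set(states), set(states)
    while frontier:
        nxt = set().union(*(arcs.get((q, ""), set()) for q in frontier)) - out
        out |= nxt
        frontier = nxt
    return frozenset(out)

def determinize(start):
    """Subset construction over the epsilon-free alphabet."""
    alphabet = {ch for (_, ch) in arcs if ch}
    init = eps_closure({start})
    dfa, todo = {}, [init]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for ch in alphabet:
            T = eps_closure(set().union(*(arcs.get((q, ch), set()) for q in S)))
            if T:
                dfa[S][ch] = T
                todo.append(T)
    return init, dfa

init, dfa = determinize(start)

def accepts(word):
    S = init
    for ch in word:
        S = dfa.get(S, {}).get(ch)
        if S is None:
            return False
    return bool(S & finals)

print(accepts("song"), accepts("singer"), accepts("sing"))  # True True False
</pre>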
 
* TF/IDF weight (a minimal sketch follows):
:* code is done; the TF/IDF weights can be used right now
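
A minimal TF-IDF weighting sketch, assuming (as is typical) that the weights score word overlap between a query and the candidate questions; the example questions are made up.

<pre>
import math
from collections import Counter

questions = ["who sings this song", "play a song", "who is the singer"]
docs = [q.split() for q in questions]
df = Counter(w for d in docs for w in set(d))   # document frequency per word
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log((1 + N) / (1 + df[w]))
            for w in tf}

def score(query, doc_vec):
    return sum(doc_vec.get(w, 0.0) for w in query.split())

vecs = [tfidf(d) for d in docs]
q = "who sings the song"
best = max(range(N), key=lambda i: score(q, vecs[i]))
print(questions[best])
</pre>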
 

==Embedded development==

* The CLG embedded decoder is almost done; the online compiler is in progress.
* English scoring is under way.

==Speech QA==

* N-best with entity LM was analyzed:
:* the WER vs. QA-accuracy analysis is done
:* the figure shows that WER and QA accuracy are positively related
:* adding song names and singer names improves performance in most cases
:* there are exceptions in the figure: (a) higher WER does not necessarily reduce QA accuracy, and (b) adding entity names does not always improve QA
:* results: [[媒体文件:Music_QA_wer.pdf]]
  
  
* Class LM QA:
:* use the QA LM as the baseline; tag singer and song names, build a tag LM, and resolve the tags by graph integration
:* with a smaller weight on the class FST, better performance is obtained
:* it is very difficult to retrieve words that cannot be found by the original FST
:* test negative weights
:* adjusting the in-tag weight: a smaller weight produces more entity recognition
:* check whether the recognized songs/singers are correct or wrong (a toy cost sketch follows the table below)
 
<pre>
1. non-merge
   Baseline:   qa-singer-song
   songs      41
   singers    23

2. HCLG-merge
   ("weight" is the multiplier applied to the sub-graph entry)
   (1) LM: 1e-5
   weight   0.00000001  0.0001  0.001  0.01    1   10
   songs        20        20     21     19     9    4
   singers      13        13     13     13     2    2
</pre>
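
A toy of the class-LM decoding idea behind the table: singer/song names live in a tagged sub-graph, and the "weight" multiplies the sub-graph entry cost (tropical semiring, so lower cost wins). It mirrors the finding that a smaller entry weight yields more entity recognition; all probabilities are made up.

<pre>
import math

# negative log-probabilities (costs) for two competing arcs after "play"
cost_plain = -math.log(0.05)          # "play music" via the plain word LM
cost_entry = -math.log(0.01)          # entering the song-tag sub-graph
cost_name  = -math.log(0.10)          # a song name inside the sub-graph

for w in (10, 1, 0.01, 0.0001):
    entity_cost = w * cost_entry + cost_name
    print(f"entry multiplier {w:>7}: entity path wins -> "
          f"{entity_cost < cost_plain}")
</pre>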
