2014-03-14
Latest revision as of 02:38, 14 March 2014 (Friday)

Resource Building

  • Current text resource has been re-arranged and listed

AM development

Sparse DNN

  • Optimal Brain Damage (OBD); see the pruning sketch after this list.
  1. GA-based block sparsity
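
A minimal numpy sketch of OBD-style pruning, assuming a weight matrix and a diagonal-Hessian estimate are already available; the function name and the random stand-in data are illustrative, not the project code.

 import numpy as np

 def obd_prune(W, H_diag, sparsity=0.5):
     """Zero the weights with the smallest OBD saliency 0.5 * h_ii * w_i^2."""
     saliency = 0.5 * H_diag * W ** 2
     k = int(sparsity * W.size)
     idx = np.argsort(saliency, axis=None)[:k]        # the k least salient weights
     mask = np.ones(W.size, dtype=bool)
     mask[idx] = False
     return W * mask.reshape(W.shape)

 W = np.random.randn(256, 256)
 H = np.abs(np.random.randn(256, 256))                # stand-in for a real Hessian estimate
 print((obd_prune(W, H, sparsity=0.8) == 0).mean())   # ~80% of the weights are pruned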

Efficient DNN training

  1. Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting? (A splicing sketch follows.)
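
A small sketch of what an asymmetric splicing window could look like; the left/right context sizes of 10/5 frames are assumptions for illustration.

 import numpy as np

 def splice(feats, left=10, right=5):
     """Stack each frame with `left` past and `right` future frames (edges padded)."""
     T = feats.shape[0]
     padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
     return np.hstack([padded[t:t + T] for t in range(left + right + 1)])

 feats = np.random.randn(300, 40)               # e.g. 300 frames of 40-dim fbank
 print(splice(feats, left=10, right=5).shape)   # (300, 40 * 16)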

Multi GPU training

  • Error encountered

GMM - DNN co-training

  • Initial DNN test done
  • tri4b -> DNN (org)
  • DNN alignment -> tri4b
  • tri4b alignment -> DNN (re-train)
  model/testcase (WER%)       |  test_dev93(cv)       |     test_eval92
    --------------------------------------------------------------
    8400-80000(org)           |    7.41               |      4.13
    --------------------------------------------------------------
    re-train (Keep state #)   |    7.20               |      4.24
    --------------------------------------------------------------
    re-train (Free state #)   |    7.29               |      4.31
    --------------------------------------------------------------

Multilanguage training

  1. Pure Chinese training reached 4.9%
  2. Chinese + English training: reduced to 7.9%
  3. The English phone set should distinguish word-beginning and word-ending phones
  4. Should set up a multilingual network structure that shares the lower layers but separates the languages at the higher layers (see the sketch below)
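
A hedged PyTorch sketch of such a structure: shared lower layers plus one language-specific upper layer and output per language. Layer sizes and target counts are illustrative assumptions, not the actual configuration.

 import torch
 import torch.nn as nn

 class MultilingualDNN(nn.Module):
     def __init__(self, feat_dim, shared_dims, head_dim, num_targets):
         super().__init__()
         layers, prev = [], feat_dim
         for d in shared_dims:                    # language-independent lower layers
             layers += [nn.Linear(prev, d), nn.Sigmoid()]
             prev = d
         self.shared = nn.Sequential(*layers)
         self.heads = nn.ModuleDict({             # language-specific upper layers
             lang: nn.Sequential(nn.Linear(prev, head_dim), nn.Sigmoid(),
                                 nn.Linear(head_dim, n_tgt))
             for lang, n_tgt in num_targets.items()})

     def forward(self, x, lang):
         return self.heads[lang](self.shared(x))

 net = MultilingualDNN(feat_dim=440, shared_dims=[1200, 1200, 1200],
                       head_dim=1200, num_targets={"zh": 8400, "en": 3000})
 logits = net(torch.randn(8, 440), lang="zh")     # a batch of spliced frames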

Noise training

  • Train with the WSJ database, corrupting the data with various noise types (see the sketch after this list)
  • Almost all training conditions are completed
  • Interesting results for multi-condition training (white + cafe noise) tested on park/station noise
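
A rough sketch of the corruption step, assuming clean and noise waveforms as numpy arrays; the SNR values and stand-in signals are illustrative.

 import numpy as np

 def add_noise(clean, noise, snr_db):
     """Mix `noise` into `clean` so the result has roughly `snr_db` dB SNR."""
     reps = int(np.ceil(len(clean) / len(noise)))
     noise = np.tile(noise, reps)[:len(clean)]    # tile/crop noise to utterance length
     p_clean = np.mean(clean ** 2)
     p_noise = np.mean(noise ** 2) + 1e-12
     scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
     return clean + scale * noise

 clean = np.random.randn(16000)           # stand-in for one second of clean speech
 cafe = np.random.randn(8000)             # stand-in for a cafe-noise recording
 noisy = {snr: add_noise(clean, cafe, snr) for snr in (5, 10, 15, 20, 25)}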

AMR compression re-training

  • WeChat uses AMR compression, which requires adapting our AM (a data-preparation sketch follows at the end of this section)
  • Tested AMR & non-AMR models
model			wav	amr

xent baseline		4.47	
wav_mpe baseline        4.20	36.77

amr_mpe_lr_1e-5		6.27	8.95
amr_mpe_lr_1e-4		7.58	8.68

amr_xEnt_lr_1e-5	6.89	7.99
amr_xEnt_lr_1e-4	6.61	7.28
amr_xEnt_lr_0.08	5.72	6.20


  • Preparing to do adaptation on the 1700h set
  • Preparing to run the mixed xEnt test
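
A hedged sketch of how AMR-compressed adaptation data could be produced by round-tripping wav files through ffmpeg; it assumes an ffmpeg build with the libopencore_amrnb encoder, and the paths and bitrate are illustrative.

 import subprocess

 def amr_roundtrip(wav_in, wav_out, bitrate="12.2k"):
     """Encode a wav to AMR-NB (8 kHz mono) and decode it back to 16 kHz wav."""
     amr_tmp = wav_out + ".amr"
     subprocess.check_call(["ffmpeg", "-y", "-i", wav_in, "-ar", "8000", "-ac", "1",
                            "-c:a", "libopencore_amrnb", "-b:a", bitrate, amr_tmp])
     subprocess.check_call(["ffmpeg", "-y", "-i", amr_tmp, "-ar", "16000", wav_out])

 amr_roundtrip("train/utt001.wav", "train_amr/utt001.wav")   # illustrative paths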

GFbank

  • Finished the first round of gfbank training & test
  • The same GMM model (MFCC features) was used to get the alignment
  • Training fbank & gfbank systems based on the MFCC alignment (see the fbank sketch at the end of this section)
  • Clean training and noise test
		clean	5dB	10dB	15dB	20dB	25dB
gfbank		4.22	73.03	39.20	16.41	8.36	5.60
gfbank_80	4.36	74.41	42.94	18.13	8.59	5.85
fbank_zmy	3.97	74.78	44.57	18.80	8.54	5.30
  • gfbank + fbank 80 dim training/test
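
For reference, a compact sketch of plain log Mel-filterbank (fbank) extraction; gfbank differs mainly in using gammatone-shaped filters instead of the triangular Mel filters. The 25 ms / 10 ms framing and 40 filters are assumptions.

 import numpy as np

 def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
 def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

 def fbank(signal, sr=16000, n_filt=40, win=0.025, hop=0.010, n_fft=512):
     frame_len, frame_hop = int(win * sr), int(hop * sr)
     n_frames = 1 + (len(signal) - frame_len) // frame_hop
     idx = np.arange(frame_len) + frame_hop * np.arange(n_frames)[:, None]
     frames = signal[idx] * np.hamming(frame_len)
     power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
     # triangular filters spaced evenly on the Mel scale
     mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2)
     bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
     fb = np.zeros((n_filt, n_fft // 2 + 1))
     for i in range(1, n_filt + 1):
         fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
         fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
     return np.log(power @ fb.T + 1e-10)

 feats = fbank(np.random.randn(16000))    # (frames, 40) log-fbank features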


Engine optimization

  • Investigating LOUDS FST.
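
For background, a toy sketch of the LOUDS idea (Level-Order Unary Degree Sequence): the tree is written as a bit string, one '1' per child plus a closing '0' per node in breadth-first order, and navigation then only needs rank/select over the bits. The example tree is illustrative.

 from collections import deque

 def louds_encode(children, root):
     """children: dict mapping node -> list of child nodes."""
     bits, order, q = "10", [], deque([root])     # "10" is the conventional super-root
     while q:
         node = q.popleft()
         order.append(node)
         kids = children.get(node, [])
         bits += "1" * len(kids) + "0"
         q.extend(kids)
     return bits, order

 tree = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
 print(louds_encode(tree, "r"))   # ('101101000', ['r', 'a', 'b', 'c'])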


Word to Vector

  • Improve word vectors with multiple senses
  • Almost impossible with the existing toolkit
  • Could pre-train vectors and then do clustering (see the sketch after this list)
  • Word-vector-based keyword extraction
  • Prepared 7 categories, 500+ articles in total
  • Fixed a problem in retrieving article words
  • Word-vector-based classification
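
A small sketch of the "pre-train vectors, then cluster" idea for multi-sense words: each occurrence of a word is represented by the average vector of its context words, and the occurrences are clustered into senses. The toy vectors, window size and K=2 senses are assumptions.

 import numpy as np
 from sklearn.cluster import KMeans

 def sense_labels(corpus, target, vectors, window=5, k=2):
     """Cluster occurrences of `target` by their averaged context vectors."""
     ctx_vecs = []
     for sent in corpus:
         for i, w in enumerate(sent):
             if w != target:
                 continue
             ctx = sent[max(0, i - window): i] + sent[i + 1: i + 1 + window]
             vecs = [vectors[c] for c in ctx if c in vectors]
             if vecs:
                 ctx_vecs.append(np.mean(vecs, axis=0))
     return KMeans(n_clusters=k, n_init=10).fit_predict(np.array(ctx_vecs))

 vocab = ["apple", "fruit", "eat", "phone", "launch", "store"]
 vectors = {w: np.random.randn(50) for w in vocab}     # stand-in for pre-trained vectors
 corpus = [["eat", "apple", "fruit"], ["apple", "launch", "phone", "store"]]
 print(sense_labels(corpus, "apple", vectors))         # one sense id per occurrence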


LM development

NN LM

  • Character-based NNLM (6,700 chars, 7-gram); training on 500M of data done (a toy model sketch follows this list)
  • Boundary-involved char NNLM training done
  • Test on rescoring
  • Investigate MS RNN LM training
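
A minimal PyTorch sketch of a feed-forward character NNLM of the kind described (a 7-gram model: 6 history characters predict the next one); the embedding and hidden sizes are illustrative, not the actual 6,700-character configuration.

 import torch
 import torch.nn as nn

 class CharNNLM(nn.Module):
     def __init__(self, vocab=6700, order=7, emb=64, hidden=512):
         super().__init__()
         self.emb = nn.Embedding(vocab, emb)
         self.hid = nn.Sequential(nn.Linear((order - 1) * emb, hidden), nn.Tanh())
         self.out = nn.Linear(hidden, vocab)

     def forward(self, history):                  # history: (batch, order-1) char ids
         h = self.emb(history).flatten(1)         # concatenate the 6 history embeddings
         return self.out(self.hid(h))             # logits over the next character

 lm = CharNNLM(vocab=100, order=7, emb=32, hidden=128)       # toy sizes
 logits = lm(torch.randint(0, 100, (8, 6)))
 loss = nn.functional.cross_entropy(logits, torch.randint(0, 100, (8,)))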

3T Sogou LM

  • 3T + Tencent LM combination:
  • Combine the 3T vocabulary (110k) and the Tencent 80k vocabulary
  • Re-segmentation
  • Compute PPL with the 3T and Tencent LMs
  • Compute the best mixing weights (see the sketch after this list)
  • The computed mixing weight is wrong ....
  • If we mix the two with equal weights (0.5/0.5), performance is better than either individual LM
  • 3T + QA model combination
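
A sketch of finding the best interpolation weight between the two LMs (the same idea as SRILM's compute-best-mix): run EM on per-token held-out probabilities. The probability arrays here are random stand-ins.

 import numpy as np

 def best_mix(p1, p2, iters=50):
     """p1, p2: per-token probabilities of a held-out text under LM1 and LM2."""
     w = 0.5
     for _ in range(iters):
         post = w * p1 / (w * p1 + (1 - w) * p2)   # responsibility of LM1 per token
         w = post.mean()
     return w

 p1 = np.random.uniform(1e-6, 1e-2, 10000)   # stand-in: 3T LM token probabilities
 p2 = np.random.uniform(1e-6, 1e-2, 10000)   # stand-in: Tencent LM token probabilities
 w = best_mix(p1, p2)
 ppl = np.exp(-np.mean(np.log(w * p1 + (1 - w) * p2)))
 print(w, ppl)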

QA Matching

  • FST-based matching
  • Investigating why the OpenFst union does not lead to a determinizable graph
  • Test the pattern label
  • TF/IDF weight
  • The code is done; TF/IDF weighting can be used right now (see the sketch below).
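
A small sketch of TF/IDF weighting over word-segmented questions; the toy corpus is illustrative.

 import math
 from collections import Counter

 def tfidf_weights(docs):
     """docs: list of token lists; returns one {word: tf*idf} dict per doc."""
     df = Counter(w for doc in docs for w in set(doc))
     n = len(docs)
     out = []
     for doc in docs:
         tf = Counter(doc)
         out.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
     return out

 docs = [["play", "a", "song", "by", "faye", "wong"],
         ["who", "sings", "this", "song"],
         ["play", "red", "bean"]]
 print(tfidf_weights(docs)[0])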

Embedded development

  • The CLG embedded decoder is almost done. The online compiler is in progress.
  • English scoring is underway


Speech QA

  • N-best with entity LM was analyzed
  • WER vs QA accuracy is done
  • The figure shows that WER and QA accuracy are clearly related (lower WER generally gives higher QA accuracy)
  • Adding song names and singer names improves performance in most cases
  • There are indeed some exceptions in the figure: (a) higher WER does not necessarily reduce QA accuracy; (b) adding entity names does not always improve QA
  • The results are in 媒体文件:Music_QA_wer.pdf


  • Class LM QA
  • Use the QA LM as the baseline
  • Tag singer names and song names
  • Build the tag LM
  • Use graph integration to resolve the tags
  • Adjust the in-tag weight
  • A smaller weight produces more entity recognitions
  • Check whether the recognized songs/singers are correct or wrong (a tagging sketch appears after the results below)
1. Non-merge
    BaseLine:
           qa-singer-song
    songs       41
    singers     23

2. HCLG-merge
    Weight means the multiplier of the sub-graph entry. 
  (1) LM:1e-5
    weight  0.00000001  0.0001   0.001  0.01    1   10
    songs      20        20      21    19       9    4
    singers    13        13      13    13       2    2
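
A sketch of the tagging step referenced above: singer and song names in the training text are replaced by class tags before the tag LM is built, and the tags are later expanded into entity sub-graphs whose entry weight is the multiplier shown in the tables. The entity lists and tag names are illustrative.

 SINGERS = {"周杰伦", "王菲"}                 # illustrative singer list
 SONGS = {"红豆", "青花瓷"}                   # illustrative song list

 def tag_line(tokens):
     """Replace entity tokens with <singer>/<song> class tags."""
     out = []
     for tok in tokens:
         if tok in SINGERS:
             out.append("<singer>")
         elif tok in SONGS:
             out.append("<song>")
         else:
             out.append(tok)
     return out

 print(tag_line(["播放", "王菲", "的", "红豆"]))    # ['播放', '<singer>', '的', '<song>']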