2014-03-14
Resource Building
- The current text resources have been re-arranged and listed
AM development
Sparse DNN
- Optimal Brain Damage (OBD); see the pruning sketch below.
- GA-based block sparsity
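OBD prunes weights by the diagonal-Hessian saliency 0.5·h_ii·w_i^2. A minimal numpy sketch of the pruning step, assuming the Hessian diagonal has already been estimated (here it is a random stand-in, and the sparsity level is illustrative):

 import numpy as np
 
 def obd_prune(weights, hessian_diag, sparsity=0.5):
     """Zero out the lowest-saliency weights, OBD-style.
 
     OBD saliency of weight w_i with diagonal Hessian term h_ii:
         s_i = 0.5 * h_ii * w_i**2
     hessian_diag must be estimated from the training criterion
     (here it is simply passed in).
     """
     saliency = 0.5 * hessian_diag * weights ** 2
     k = int(sparsity * weights.size)
     threshold = np.partition(saliency.ravel(), k)[k]
     mask = saliency >= threshold
     return weights * mask, mask
 
 rng = np.random.default_rng(0)
 w = rng.standard_normal((512, 512))
 h = np.abs(rng.standard_normal((512, 512)))   # stand-in Hessian diagonal
 pruned, mask = obd_prune(w, h, sparsity=0.8)
 print(mask.mean())                            # roughly 0.2 of the weights kept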
Efficient DNN training
- Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting? (A splicing sketch follows below.)
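Here "asymmetric window" presumably means splicing more left context than right context at the DNN input. A minimal numpy sketch, where the 10/5 frame split is an assumed example rather than the configuration actually used:

 import numpy as np
 
 def splice(frames, left=10, right=5):
     """Stack an asymmetric context window around every frame.
 
     frames: (num_frames, feat_dim) array of acoustic features.
     Returns (num_frames, (left + 1 + right) * feat_dim).
     """
     num_frames, feat_dim = frames.shape
     # Repeat edge frames so the first/last frames still get a full window.
     padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                              frames,
                              np.repeat(frames[-1:], right, axis=0)])
     windows = [padded[i:i + num_frames] for i in range(left + 1 + right)]
     return np.concatenate(windows, axis=1)
 
 rng = np.random.default_rng(0)
 feats = rng.standard_normal((100, 40))   # 100 frames of 40-dim fbank (toy data)
 print(splice(feats).shape)               # (100, 640) with left=10, right=5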
Multi-GPU training
- Error encountered
GMM - DNN co-training
- Initial DNN test done:
  - tri4b -> DNN (org)
  - DNN alignment -> tri4b
  - tri4b alignment -> DNN (re-train)
 model/testcase           | test_dev93 (cv) | test_eval92
 -------------------------+-----------------+------------
 8400-80000 (org)         | 7.41            | 4.13
 re-train (Keep state #)  | 7.20            | 4.24
 re-train (Free state #)  | 7.29            | 4.31
Multilingual training
- Pure Chinese training reached 4.9%
- Chinese + English training reached 7.9%
- The English phone set should discriminate between word-beginning and word-ending phones
- Should set up a multilingual network structure that shares the lower layers but separates the languages at the higher layers (see the sketch below)
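A minimal numpy sketch of the shared-low-layer idea: one trunk of shared hidden layers feeding a separate softmax head per language. The layer sizes, the two-language split and the senone counts are illustrative assumptions:

 import numpy as np
 
 def relu(x):
     return np.maximum(0.0, x)
 
 def softmax(x):
     e = np.exp(x - x.max(axis=-1, keepdims=True))
     return e / e.sum(axis=-1, keepdims=True)
 
 rng = np.random.default_rng(0)
 feat_dim, hidden_dim = 40, 512          # illustrative sizes
 out_dim = {"zh": 3000, "en": 3000}      # assumed senone counts per language
 
 # Shared lower layers: one weight set used for both languages.
 shared = [(rng.standard_normal((feat_dim, hidden_dim)) * 0.01, np.zeros(hidden_dim)),
           (rng.standard_normal((hidden_dim, hidden_dim)) * 0.01, np.zeros(hidden_dim))]
 
 # Language-specific upper layers: a separate softmax head per language.
 heads = {lang: (rng.standard_normal((hidden_dim, n)) * 0.01, np.zeros(n))
          for lang, n in out_dim.items()}
 
 def forward(features, lang):
     """Run a frame batch through the shared trunk, then the head for `lang`."""
     h = features
     for w, b in shared:
         h = relu(h @ w + b)
     w, b = heads[lang]
     return softmax(h @ w + b)
 
 batch = rng.standard_normal((8, feat_dim))   # 8 random frames
 print(forward(batch, "zh").shape)            # (8, 3000)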
Noise training
- Train on the WSJ database, corrupting the data with various noise types (a mixing sketch follows this list)
- Almost all training conditions are completed
- Interesting results for multi-condition training (white + cafe noise) tested on park/station noise
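The corruption step is presumably standard additive mixing at a target SNR. A minimal numpy sketch (the function name and the synthetic signals are illustrative):

 import numpy as np
 
 def add_noise(clean, noise, snr_db):
     """Corrupt a clean waveform with a noise waveform at a target SNR (dB)."""
     # Tile or trim the noise so it covers the clean signal.
     reps = int(np.ceil(len(clean) / len(noise)))
     noise = np.tile(noise, reps)[:len(clean)]
     p_clean = np.mean(clean ** 2)
     p_noise = np.mean(noise ** 2) + 1e-12
     # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db.
     scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
     return clean + scale * noise
 
 # Example: white noise at 10 dB SNR over a synthetic "clean" signal.
 rng = np.random.default_rng(0)
 clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
 noisy = add_noise(clean, rng.standard_normal(8000), snr_db=10)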
AMR compression re-training
- WeChat uses AMR compression, which requires adapting our AM (an AMR round-trip sketch follows this list)
- Test the AMR & non-AMR models
 model            | wav  | amr
 -----------------+------+------
 xent baseline    | 4.47 |
 wav_mpe baseline | 4.20 | 36.77
 amr_mpe_lr_1e-5  | 6.27 | 8.95
 amr_mpe_lr_1e-4  | 7.58 | 8.68
 amr_xEnt_lr_1e-5 | 6.89 | 7.99
 amr_xEnt_lr_1e-4 | 6.61 | 7.28
 amr_xEnt_lr_0.08 | 5.72 | 6.20
- Preparing to do adaptation on 1700h of data
- Preparing to run the mixed xEnt test
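One way to produce AMR-matched training data is to round-trip the clean waveforms through the codec. A hedged sketch, assuming an ffmpeg build that includes the libopencore_amrnb encoder; the paths and bitrate are illustrative:

 import subprocess
 
 def amr_roundtrip(wav_in, wav_out, bitrate="12.2k"):
     """Encode a wav file to AMR-NB and decode it back to wav.
 
     Assumes ffmpeg is on PATH and was built with libopencore_amrnb.
     AMR-NB operates on 8 kHz mono audio, hence the resampling flags.
     """
     amr_tmp = wav_out + ".amr"
     subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-ar", "8000", "-ac", "1",
                     "-c:a", "libopencore_amrnb", "-b:a", bitrate, amr_tmp],
                    check=True)
     subprocess.run(["ffmpeg", "-y", "-i", amr_tmp, "-ar", "8000", "-ac", "1",
                     wav_out], check=True)
 
 # amr_roundtrip("clean/utt001.wav", "amr/utt001.wav")  # hypothetical paths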
GFbank
- Finished the first round of gfbank training & test
  - The same GMM model (MFCC features) was used to get the alignment
  - Training fbank & gfbank based on the MFCC alignment
  - Clean training and noise test
           | clean | 5dB   | 10dB  | 15dB  | 20dB | 25dB
 gfbank    | 4.22  | 73.03 | 39.20 | 16.41 | 8.36 | 5.60
 gfbank_80 | 4.36  | 74.41 | 42.94 | 18.13 | 8.59 | 5.85
 fbank_zmy | 3.97  | 74.78 | 44.57 | 18.80 | 8.54 | 5.30
- gfbank + fbank 80 dim training/test
Engine optimization
- Investigating LOUDS FST.
Word to Vector
- Improved word vectors with multiple senses
  - Almost impossible with the toolkit
  - One option is to pre-train the vectors and then do clustering (see the sketch after this list)
- Word-vector-based keyword extraction
  - Prepared 7 categories with 500+ articles in total
  - Fixed a problem in retrieving article words
- Word-vector-based classification
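A minimal sketch of the "pre-train vectors, then cluster" idea: cluster the context vectors of a target word's occurrences with k-means and treat each cluster as one sense. The stand-in vectors, toy contexts and cluster count are all illustrative:

 import numpy as np
 from sklearn.cluster import KMeans
 
 rng = np.random.default_rng(0)
 
 # Stand-in for pre-trained word vectors (in practice, e.g. word2vec output).
 vocab = ["apple", "pie", "juice", "iphone", "stock", "released", "eat", "buy"]
 vectors = {w: rng.standard_normal(50) for w in vocab}
 
 # Occurrences of the ambiguous word, each with its context words.
 contexts = [["eat", "pie"], ["juice", "eat"],          # fruit sense
             ["iphone", "released"], ["stock", "buy"]]  # company sense
 
 # Represent each occurrence by the mean vector of its context words.
 X = np.array([np.mean([vectors[w] for w in ctx], axis=0) for ctx in contexts])
 
 # Cluster the occurrences; each cluster is treated as one sense of the word.
 senses = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
 print(senses)   # e.g. [0 0 1 1] -> two induced senses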
LM development
NN LM
- Character-based NNLM (6700 chars, 7-gram), training on 500M data done
- Boundary-involved character NNLM training done
- Testing on rescoring (see the rescoring sketch below)
- Investigate MS RNN LM training
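For the rescoring test, the usual recipe is to re-rank each n-best hypothesis with the new LM score combined with the acoustic score. A minimal sketch; the score fields, LM weight and dummy LM are illustrative:

 def rescore_nbest(nbest, char_lm_score, lm_weight=10.0):
     """Re-rank n-best hypotheses with a new (e.g. character NNLM) score.
 
     nbest: list of (hypothesis, acoustic_score, old_lm_score) tuples,
            scores given as log-probabilities.
     char_lm_score: function mapping a hypothesis string to a log-probability.
     """
     rescored = []
     for hyp, am, _old_lm in nbest:
         total = am + lm_weight * char_lm_score(hyp)
         rescored.append((total, hyp))
     rescored.sort(reverse=True)          # highest combined score first
     return [hyp for _score, hyp in rescored]
 
 # Toy usage with a dummy LM that prefers shorter hypotheses.
 nbest = [("今天 天气 很 好", -120.0, -30.0), ("今天 天气 很好", -121.0, -32.0)]
 best = rescore_nbest(nbest, lambda h: -0.5 * len(h))
 print(best[0])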
3T Sogou LM
- 3T + Tencent LM combination:
  - Combine the 3T vocabulary (110k) and the Tencent 80k vocabulary
  - Re-segmentation
  - Compute PPL with the 3T and Tencent LMs
  - Compute the best mixing weights (see the interpolation sketch after this list)
    - The computed mixing weight turned out to be wrong ...
    - If we mix the two with equal weights (0.5/0.5), performance is better than either individual LM
- 3T + QA model combination
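The mixing weight for two LMs is normally chosen to maximize dev-set likelihood of the linear interpolation λ·p1(w) + (1−λ)·p2(w). A minimal EM sketch over per-token probabilities; the numbers below are toys standing in for the 3T and Tencent probabilities:

 import numpy as np
 
 def estimate_lambda(p1, p2, iters=50):
     """EM for the weight of p1 in the mixture lam*p1 + (1-lam)*p2.
 
     p1, p2: per-token probabilities of the same dev text under the two LMs.
     """
     p1, p2 = np.asarray(p1), np.asarray(p2)
     lam = 0.5
     for _ in range(iters):
         post = lam * p1 / (lam * p1 + (1 - lam) * p2)  # responsibility of LM1
         lam = post.mean()
     return lam
 
 def mixture_ppl(p1, p2, lam):
     mix = lam * np.asarray(p1) + (1 - lam) * np.asarray(p2)
     return np.exp(-np.mean(np.log(mix)))
 
 # Toy per-token probabilities standing in for the 3T and Tencent LMs.
 p_3t      = [0.01, 0.002, 0.05, 0.001]
 p_tencent = [0.004, 0.01, 0.02, 0.003]
 lam = estimate_lambda(p_3t, p_tencent)
 print(lam, mixture_ppl(p_3t, p_tencent, lam), mixture_ppl(p_3t, p_tencent, 0.5))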
QA Matching
- FST-based matching
  - Investigating why the OpenFst union does not lead to a determinizable graph
  - Test the pattern label
- TF/IDF weight
  - The code is done; TF/IDF weighting can be used right now (see the sketch below)
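The TF/IDF weighting can be reproduced with scikit-learn on pre-segmented text. A minimal sketch; the toy documents are illustrative:

 from sklearn.feature_extraction.text import TfidfVectorizer
 
 # Pre-segmented documents (words separated by spaces, as after word segmentation).
 docs = ["我 想 听 周杰伦 的 歌",
         "播放 一 首 周杰伦 的 青花瓷",
         "今天 天气 怎么样"]
 
 # token_pattern accepts single-character tokens, which the default would drop.
 vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
 tfidf = vectorizer.fit_transform(docs)          # sparse (n_docs, n_terms) matrix
 
 # Top-weighted terms of the first document, e.g. for matching or keyword ranking.
 terms = vectorizer.get_feature_names_out()
 row = tfidf[0].toarray().ravel()
 print(sorted(zip(row, terms), reverse=True)[:3])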
Embedded development
- The CLG embedded decoder is almost done; the online compiler is in progress
- English scoring is underway
Speech QA
- N-best with entity LM was analyzed
  - WER vs QA accuracy analysis is done
    - The figure shows that WER and QA accuracy are positively related
    - Adding song names and singer names improves performance in most cases
    - There are indeed some exceptions in the figure: (a) higher WER does not necessarily reduce QA accuracy; (b) adding entity names does not always improve QA
    - The results are in Media:Music_QA_wer.pdf
- Class LM QA
  - Use the QA LM as the baseline
  - Tag singer names and song names
  - Build the tag LM
  - Use graph integration to resolve the tags
  - Adjusting the in-tag weight
    - A smaller weight produces more entity recognition
    - Check whether the recognized songs/singers are correct or wrong
1. Non-merge baseline (qa-singer-song): songs 41, singers 23
2. HCLG-merge ("weight" is the multiplier of the sub-graph entry):
   (1) LM: 1e-5
        weight  | 0.00000001 | 0.0001 | 0.001 | 0.01 | 1 | 10
        songs   | 20         | 20     | 21    | 19   | 9 | 4
        singers | 13         | 13     | 13    | 13   | 2 | 2