Differences between revisions of "Sinovoice-2016-4-21"
From cslt Wiki
Latest revision as of 08:40, 21 April 2016 (Thursday)
Data
- 16K LingYun
- 2000h data ready
- 4300h real-env data to label
- YueYu
- Total 250h(190h-YueYu + 60h-English)
- Add 60h YueYu
- CER: 75%->76%
- WeiYu
- 50h for training
- 120h labeled ready
Model training
Deletion Error Problem
- Add one noise phone to alleviate the silence over-training
- Omit sil accuracy in discriminative training
- Testdata: test_1000ju from 8000ju
---------------------------------------------------
model                  | ins | del | sub | wer
---------------------------------------------------
baseMPE 3.mdl          | 25  | 68  | 468 | 9.50
---------------------------------------------------
MPE omit_sil_acc 3.mdl | 26  | 72  | 453 | 9.33
---------------------------------------------------
- Testdata: test_2000ju from 10000ju
----------------------------------------------------
model                  | ins | del | sub  | wer
----------------------------------------------------
baseMPE 3.mdl          | 96  | 768 | 1590 | 19.39
----------------------------------------------------
MPE omit_sil_acc 3.mdl | 165 | 627 | 1685 | 19.58
----------------------------------------------------
- H smoothing of XEnt and MPE
- Add one silence arc from start-state to end-state
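The two result tables above can be compared more directly by looking at how each error type moves when sil accuracy is omitted. The following is a minimal sketch, not the scoring script used for these experiments: the function names are hypothetical, the counts are copied from the tables, and the reference word count `n_ref` is not given in these notes, so `wer_percent` takes it as an argument.

```python
# Error counts (ins, del, sub) copied from the two tables above.
RESULTS = {
    "test_1000ju": {"baseMPE": (25, 68, 468), "omit_sil_acc": (26, 72, 453)},
    "test_2000ju": {"baseMPE": (96, 768, 1590), "omit_sil_acc": (165, 627, 1685)},
}


def wer_percent(ins, dels, subs, n_ref):
    """Standard WER definition: all errors divided by reference word count."""
    return 100.0 * (ins + dels + subs) / n_ref


def relative_change(before, after):
    """Relative change in percent; negative means fewer errors."""
    return 100.0 * (after - before) / before


def compare(test_name):
    """Return relative change per error type for one test set."""
    base = RESULTS[test_name]["baseMPE"]
    new = RESULTS[test_name]["omit_sil_acc"]
    names = ("ins", "del", "sub")
    return {n: relative_change(b, a) for n, b, a in zip(names, base, new)}


if __name__ == "__main__":
    for test in RESULTS:
        changes = compare(test)
        line = ", ".join(f"{k}: {v:+.1f}%" for k, v in changes.items())
        print(f"{test}: {line}")
```

On test_2000ju this view makes the trade-off visible: deletions drop while insertions and substitutions rise, which is why the overall WER barely moves.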
Big-Model Training
- 7*2048-10000h net weight-matrix factoring, to improve the decoding speed --SVD
- SVD looks OK, but fine-tuning still didn't work.
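Base WER:
relu_2000_mpe_1000H: 17.72
relu_1200_mpe_1000H: 18.60
--------------------------------------------------------------------------------
layer / nodes retained | 200 | 400   | 600   | 800   | 1000  | 1200 | 1400 | 1600
--------------------------------------------------------------------------------
hidden 2               |     |       | 22.53 | 20.30 | 19.01 |      |      |
--------------------------------------------------------------------------------
hidden 7               |     | 18.92 | 18.30 | 17.92 |       |      |      |
--------------------------------------------------------------------------------
final                  |     |       | 18.32 | 18.00 | 17.83 |      |      |
--------------------------------------------------------------------------------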
- 7*1024: total cross-entropy training, then MPE, gives a 0.2 improvement
- 7*1024 SVD factoring, to speed up decoding
- 8k
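For the weight-matrix factoring (SVD) items above, the speed-up comes from replacing one dense layer with two thinner ones, so the cost per frame scales with the retained rank. The sketch below only does the parameter arithmetic. It assumes a square 2048x2048 hidden layer (read off the "7*2048" notation), ignores biases, and uses hypothetical function names; the final layer is left out because its output dimension is not stated in these notes.

```python
def dense_params(n_in, n_out):
    """Weights in one dense layer (bias ignored)."""
    return n_in * n_out


def factored_params(n_in, n_out, rank):
    """Weights after factoring W (n_out x n_in) into two matrices of rank `rank`."""
    return rank * (n_in + n_out)


def break_even_rank(n_in, n_out):
    """Largest rank for which the factored layer is strictly smaller."""
    return (n_in * n_out - 1) // (n_in + n_out)


if __name__ == "__main__":
    n = 2048  # assumed hidden-layer width
    full = dense_params(n, n)
    # Ranks taken from the "nodes retained" columns that have results.
    for rank in (400, 600, 800, 1000):
        small = factored_params(n, n, rank)
        print(f"rank {rank}: {small} weights, {small / full:.2f} of dense")
    print("break-even rank:", break_even_rank(n, n))
```

Under these assumptions a rank of 1000 saves almost nothing on a 2048-wide layer, while the lower ranks that do save computation are the ones with the larger WER loss in the table.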
Embedding
- 10000h-chain 5*400+800 DONE.
- Beam affects the performance of the chain model significantly; needs more investigation.
- 5*576-2400 TDNN model
SinSong Robot
- Test based on 10000h(7*2048-xent) model
------------------------------------------------
condition | clean | replay(0.5m) | real-env
------------------------------------------------
wer       | 3     | 18(mpe-14)   | too-bad
------------------------------------------------
- Plan to record in restaurant on April 10.
Character LM
- 9-gram has been done for everything except Sogou-2T.
- Worse than word-lm(9%->6%)
- Add word boundary tag to Character-LM training
- Merge Character-LM & word-LM
- Union
- Compose, success.
- 2-step decoding: first, character-based LM. Then, word-based LM.
Project
- Pingan & Yueyu: too many deletion errors
- TDNN deletion error rate > DNN deletion error rate
- TDNN Silence scale is too sensitive for different test cases.
SID
Digit
- Same Channel test EER: 100%
- Speaker confirm
- phone channel
- Cross Channel
- Mic-wav PLDA adaptation EER from 9% to 7% (20-30 persons)