Sinovoice-2016-4-28

From cslt Wiki
Revision as of 01:47, 28 April 2016 (Thursday) by Zhangzy


Data

  • 16K LingYun
  • 2000h data ready
  • 4300h real-env data to label
  • YueYu
  • Total 250h(190h-YueYu + 60h-English)
  • Add 60h YueYu
  • CER: 75%->76%
  • WeiYu
  • 50h for training
  • 120h labeled ready

Model training

Deletion Error Problem

  • Add one noise phone to alleviate the silence over-training
  • Omit sil accuracy in discriminative training
  • H smoothing of XEnt and MPE
  • Testdata: test_1000ju from 8000ju
  -----------------------------------------------------------------------------
                 model                    | ins  |  del  | sub | wer/tot-err  
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix         |  24  |  56   | 408 | 8.26/488
  -----------------------------------------------------------------------------
svd600_lr2e-5_1000H_mpe_uv-fix_omitsilacc |  32  |  48   | 409 | 8.28/489
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1 |  24  |  57   | 406 | 8.24/487
  -----------------------------------------------------------------------------
  • Testdata: test_2000ju from 10000ju
  -----------------------------------------------------------------------------
                 model                    | ins  |  del  |  sub | wer/tot-err  
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix         |  86  |  790  | 1471 | 18.55/2347
  -----------------------------------------------------------------------------
svd600_lr2e-5_1000H_mpe_uv-fix_omitsilacc |  256 |  473  | 1669 | 18.95/2398
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1 |  95  |  704  | 1548 | 18.55/2347
  -----------------------------------------------------------------------------


  • Add one silence arc from start-state to end-state
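
The error counts in the two tables above can be cross-checked with a small script. This is only a sketch: the reference word count is not given in the notes, so `n_ref_est` is an assumption back-computed from the reported 18.55 WER of the baseline row, and the short row names are abbreviations introduced here.

```python
def total_errors(ins, dele, sub):
    """Total errors = insertions + deletions + substitutions."""
    return ins + dele + sub


def wer_percent(ins, dele, sub, n_ref):
    """Word error rate in percent for a reference of n_ref words."""
    return 100.0 * total_errors(ins, dele, sub) / n_ref


# (ins, del, sub) copied from the test_2000ju table; keys are shortened model names.
rows = {
    "uv-fix": (86, 790, 1471),
    "uv-fix_omitsilacc": (256, 473, 1669),
    "uv-fix_xent0.1": (95, 704, 1548),
}

# ASSUMPTION: reference length estimated from the baseline row (2347 errors at 18.55%).
n_ref_est = round(100.0 * total_errors(*rows["uv-fix"]) / 18.55)

for name, (ins, dele, sub) in rows.items():
    tot = total_errors(ins, dele, sub)
    del_share = 100.0 * dele / tot  # share of all errors that are deletions
    print(f"{name}: tot={tot} wer~{wer_percent(ins, dele, sub, n_ref_est):.2f} "
          f"del_share={del_share:.1f}%")
```

Looking at the deletion share rather than WER alone shows how omitting silence accuracy trades deletions for insertions and substitutions while the total stays almost flat.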

Big-Model Training

  • 7*2048-10000h net weight-matrix factoring, to improve the decoding speed --SVD
  • SVD looks OK, but fine-tuning still didn't work.
 Base WER:
 relu_2000_mpe_1000H: 17.72
 relu_1200_mpe_1000H: 18.60
 |layer / nodes retained|  200  |  400  |  600  |  800  | 1000  | 1200  | 1400  | 1600  |
 |       hidden 2       |       |       | 22.53 | 20.30 | 19.01 |       |       |       |
 |       hidden 7       |       | 18.92 | 18.30 | 17.92 |       |       |       |       |
 |        final         |       |       | 18.32 | 18.00 | 17.83 |       |       |       |
  • 7*1024: full cross-entropy training, then MPE; 0.2 improvement
  • 7*1024 SVD factoring, to speed up decoding
  • 8k
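
The reason SVD factoring speeds up decoding can be seen from a simple multiply count. The sketch below assumes each hidden-to-hidden weight matrix is square, 2048 x 2048 (the notes only say 7*2048), and that a matrix is replaced by two thinner matrices of the retained rank; the function names are made up for illustration.

```python
def dense_cost(rows, cols):
    """Multiply-adds per frame for one full weight matrix."""
    return rows * cols


def factored_cost(rows, cols, rank):
    """Multiply-adds per frame after factoring into (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)


def break_even_rank(rows, cols):
    """Rank at which the factored form costs the same as the dense one."""
    return rows * cols // (rows + cols)


# ASSUMPTION: square 2048 x 2048 hidden layers.
full = dense_cost(2048, 2048)
for rank in (400, 600, 800, 1000):
    low = factored_cost(2048, 2048, rank)
    print(f"rank {rank}: {low} multiply-adds, {low / full:.3f} of dense")
```

Under this assumption the factored layer is cheaper only below rank 1024, so retaining 1000 nodes saves almost nothing while 600 cuts the cost of that layer to a bit under 60%.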

Embedding

  • 10000h-chain 5*400+800 DONE.
  • Beam affects the performance of the chain model significantly; needs more investigation.
  • 5*576-2400 TDNN model

SinSong Robot

  • Test based on 10000h(7*2048-xent) model
 ------------------------------------------------
   condition | clean  | replay(0.5m) | real-env
 ------------------------------------------------
     wer     |   3    |  18(mpe-14)  | too-bad
 ------------------------------------------------
  • Plan to record in restaurant on April 10.

Character LM

  • Except Sogou-2T, 9-gram has been done.
  • Worse than word-lm(9%->6%)
  • Add word boundary tag to Character-LM training
  • Merge Character-LM & word-LM
  • Union
  • Compose, success.
  • 2-step decoding: first, character-based LM. Then, word-based LM.
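
One possible way to add word boundary tags to character-LM training text is sketched below. The notes do not say which tagging scheme is used, so the `<w>` boundary token and both function names are hypothetical; the input is assumed to be already word-segmented.

```python
def to_char_tokens(words, boundary="<w>"):
    """Split segmented words into characters, with a boundary token between words.

    ASSUMPTION: "<w>" is a hypothetical tag, not the scheme from the notes.
    """
    tokens = []
    for i, word in enumerate(words):
        if i > 0:
            tokens.append(boundary)
        tokens.extend(word)  # a str extends the list character by character
    return tokens


def to_words(tokens, boundary="<w>"):
    """Inverse mapping: rebuild words from a tagged character sequence."""
    words, current = [], []
    for tok in tokens:
        if tok == boundary:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        words.append("".join(current))
    return words


print(" ".join(to_char_tokens(["语音", "识别", "系统"])))
```

Keeping the mapping invertible means character-level output can be turned back into word sequences, which is what a second, word-based LM pass would need.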

Project

  • Pingan & Yueyu: too many deletion errors
  • TDNN deletion error rate > DNN deletion error rate
  • TDNN silence scale is too sensitive across different test cases.

SID

Digit

  • Same Channel test EER: 100%
  • Speaker confirm
  • phone channel
  • Cross Channel
  • Mic-wav PLDA adaptation EER from 9% to 7% (20-30 persons)
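
For reference, the EER figures above can be computed from trial scores roughly as follows. This is a minimal sketch with toy scores, not the evaluation script used for these results; it sweeps every observed score as a threshold and reports the point where false acceptance and false rejection are closest.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where FAR and FRR are closest.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {100 * equal_error_rate(targets, impostors):.1f}%")
```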