Sinovoice-2016-4-28

Latest revision as of 06:28, 28 April 2016

Data

  • 16K LingYun
      • 2000h data ready
      • 4300h real-env data to label
  • YueYu
      • Total 250h (190h YueYu + 60h English)
      • Add 60h YueYu
      • CER: 75%->76%
  • WeiYu
      • 50h for training
      • 120h labeled ready

Model Training

Deletion Error Problem

  • Add one noise phone to alleviate silence over-training
  • Omit sil accuracy in discriminative training
  • H smoothing of XEnt and MPE
  • Testdata: test_1000ju from 8000ju
  ---------------------------------------------------------------------------------
                     model                     | ins  |  del  | sub | wer/tot-err
  ---------------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix              |  24  |  56   | 408 | 8.26/488
  ---------------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_omitsilacc   |  32  |  48   | 409 | 8.28/489
  ---------------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1      |  24  |  57   | 406 | 8.24/487
  ---------------------------------------------------------------------------------
  • Testdata: test_8000ju
  -----------------------------------------------------------------------------
                 model                    | ins  |  del  | sub  | wer/tot-err  
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix         |  140 |  562  | 3686 | 9.19/4388     | 47753-total-word
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1 |  146 |  510  | 3705 | 9.13/4361
  -----------------------------------------------------------------------------
  • Testdata: test_2000ju from 10000ju
  ---------------------------------------------------------------------------------
                     model                     | ins  |  del  |  sub | wer/tot-err
  ---------------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix              |  86  |  790  | 1471 | 18.55/2347
  ---------------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_omitsilacc   |  256 |  473  | 1669 | 18.95/2398
  ---------------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1      |  95  |  704  | 1548 | 18.55/2347
  ---------------------------------------------------------------------------------
  • Testdata: test_10000ju
  -----------------------------------------------------------------------------
                 model                    | ins  |  del  | sub  | wer/tot-err  
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix         |  478 | 3905  | 7698 | 18.31/12081  | 65989-total-word
  -----------------------------------------------------------------------------
   svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1 |  481 | 3741  | 7773 | 18.18/11995
  -----------------------------------------------------------------------------
  • Add one silence arc from start-state to end-state
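
For anyone re-deriving the wer/tot-err columns above: tot-err is simply ins + del + sub, and WER is tot-err divided by the total word count of the test set. A minimal Python sketch (the helper is ours, not from any toolkit), using the numbers copied from the test_8000ju table:

 def wer(ins: int, deletions: int, sub: int, total_words: int) -> float:
     """Word error rate in percent: (ins + del + sub) / total_words * 100."""
     return 100.0 * (ins + deletions + sub) / total_words

 # Numbers copied from the test_8000ju table above:
 tot_err = 140 + 562 + 3686
 print(tot_err)                                # 4388
 print(round(wer(140, 562, 3686, 47753), 2))   # 9.19 -> "9.19/4388"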

Big-Model Training

  • 16k
 ================================================================================================
 |                      |   TDNN 7-1200   | TDNN 7-1200 enhance | TDNN 7-1200 svd600 |
 ------------------------------------------------------------------------------------------------
 |8000ju frame_skip=1   |                 |   0.0556 / 0.349    |  0.0559 / 0.306    |
 |8000ju frame_skip=2   |  0.059 / 0.243  |   0.0591 / 0.231    |  0.0589 / 0.228    |
 ------------------------------------------------------------------------------------------------
 |10000ju frame_skip=1  |                 |   0.1241 / 0.341    |  0.1244 / 0.358    |
 |10000ju frame_skip=2  |  0.1348 / 0.234 |   0.1315 / 0.245    |  0.1311 / 0.204    |
 ------------------------------------------------------------------------------------------------
 |English frame_skip=1  |                 |   0.3897 / 0.370    |  0.4062 / 0.353    |
 |English frame_skip=2  |  0.4296         |   0.4237 / 0.276    |  0.4306 / 0.252    |
 ================================================================================================
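
The svd600 models above presumably come from SVD weight-matrix factoring: each large weight matrix is replaced by two low-rank factors, cutting multiplies per frame to speed up decoding. A minimal numpy sketch of the idea; the 2048x2048 shape and rank k=600 are illustrative, not the exact layer sizes used in these experiments:

 import numpy as np

 def svd_factor(W: np.ndarray, k: int):
     """Replace W (m x n) by rank-k factors A (m x k) and B (k x n),
     keeping the top k singular values, so one big matrix multiply
     becomes two thin ones."""
     U, s, Vt = np.linalg.svd(W, full_matrices=False)
     A = U[:, :k] * s[:k]     # fold singular values into the left factor
     B = Vt[:k, :]
     return A, B

 rng = np.random.default_rng(0)
 W = rng.standard_normal((2048, 2048))   # illustrative layer size
 A, B = svd_factor(W, 600)               # "600" as in svd600

 print(W.size)             # 4194304 multiplies per frame before
 print(A.size + B.size)    # 2457600 after (~41% fewer)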


  • 8k
 PingAn:
 ===============================================================================
 |     AM / config            |   all beam9   |all beam9 biglm||  KeHu beam9   |
 -------------------------------------------------------------------------------
 | tdnn 7-2048 xEnt           |     16.45     |     16.22     || 36.49 / 25.18 |
 | tdnn 7-2048 MPE            |     15.22     |     14.87     || 32.77 / 23.48 |
 | tdnn 7-2048 MPE adapt-PABX |     14.67     |     14.63     || 31.33 / 22.76 |
 -------------------------------------------------------------------------------
 | tdnn 7-1024 xEnt           |     16.60     |     16.25     || 35.91 / 25.58 |
 | tdnn 7-1024 MPE            |     15.67     |     15.61     || 32.77 / 26.09 |
 | tdnn 7-1024 MPE adapt-PABX |     14.80     |     14.76     || 30.48 / 22.56 |
 ===============================================================================


 LiaoNingYiDong:
 ==============================================================================
 |     AM / config            |     beam9     |   beam9 biglm |     beam13    |
 ------------------------------------------------------------------------------
 | tdnn 7-2048 xEnt           |     21.51     |     21.05     |     21.17     |
 | tdnn 7-2048 MPE            |     20.09     |     19.74     |     19.74     |
 | tdnn 7-2048 MPE adapt-LNYD |     17.92     |     17.87     |     17.58     |
 ------------------------------------------------------------------------------
 | tdnn 7-1024 xEnt           |     21.72     |     22.74     |     21.64     |
 | tdnn 7-1024 MPE            |     20.99     |     20.77     |     20.74     |
 | tdnn 7-1024 MPE adapt-LNYD |               |               |               |
 ==============================================================================


Embedding

  • The size of the nnet1 AM is 6.4M (3M after decomposition), so we need to keep the AM size within 10M.
  • 5*576-2400 TDNN model training done; AM size is about 17M.
  • 5*500-2400 TDNN model in training.
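
As a rough way to sanity-check AM sizes like those above: count the weights and multiply by 4 bytes for float32. A sketch under assumed configs; the input dimension (120) and splice width (3) below are placeholders, since the actual TDNN configuration is not given on this page:

 def tdnn_bytes(in_dim: int, hidden: int, layers: int, out_dim: int,
                splice: int = 3) -> int:
     """Approximate byte size of a TDNN with `layers` hidden layers of
     width `hidden`, each splicing `splice` frames of its input."""
     params = splice * in_dim * hidden                    # first layer
     params += (layers - 1) * splice * hidden * hidden    # other hidden layers
     params += hidden * out_dim                           # output layer
     params += layers * hidden + out_dim                  # biases
     return params * 4                                    # float32 bytes

 for width in (576, 500):
     mb = tdnn_bytes(120, width, 5, 2400) / 2**20
     print(f"5*{width}-2400: ~{mb:.1f} MB")   # shrinking width shrinks the AM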

SinSong Robot

  • Test based on the 10000h (7*2048-xent) model
 ------------------------------------------------
   condition | clean  | replay(0.5m) | real-env
 ------------------------------------------------
     wer     |   3    |  18(mpe-14)  | too-bad
 ------------------------------------------------
  • Plan to record in a restaurant on April 10.

Character LM

  • Except for Sogou-2T, the 9-gram has been done.
  • Worse than the word-LM (9% -> 6%).
  • Add word boundary tags to Character-LM training (see the sketch after this list).
  • Merge Character-LM & word-LM:
      • Union
      • Compose: success.
  • 2-step decoding: first a character-based LM, then a word-based LM.
  • Word boundary character training.
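
As referenced in the list above, a minimal sketch of adding word-boundary tags to character-LM training text: word-segmented input becomes a character stream whose tags preserve the word boundaries. The B/I/E/S tag scheme is an assumption for illustration; the page does not specify a tag set:

 def tag_characters(sentence: str) -> str:
     """Turn word-segmented text into boundary-tagged characters."""
     out = []
     for word in sentence.split():
         chars = list(word)
         if len(chars) == 1:
             out.append(chars[0] + "_S")                # single-character word
         else:
             out.append(chars[0] + "_B")                # word begin
             out.extend(c + "_I" for c in chars[1:-1])  # word inside
             out.append(chars[-1] + "_E")               # word end
     return " ".join(out)

 print(tag_characters("今天 天气 很 好"))
 # -> 今_B 天_E 天_B 气_E 很_S 好_S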

Project

  • PingAn & YueYu: too many deletion errors.
  • TDNN deletion error rate > DNN deletion error rate.
  • TDNN silence scale is too sensitive across different test cases.

SID

Digit

  • DNN-PLDA gets better performance than i-Vector.

 DNN:
   cosine:
     10.4167%, at threshold 89.3973
     9.72222%, at threshold 87.8146
     8.68056%, at threshold 84.2021
     3.47222%, at threshold 11.5852
   lda:
     3.125%, at threshold 54.1172
     2.77778%, at threshold 50.1447
     2.43056%, at threshold 48.6887
     1.73611%, at threshold 14.5075
   plda:
     2.43056%, at threshold -23.954
     2.08333%, at threshold -24.6051
     2.08333%, at threshold -21.0524
     1.73611%, at threshold 4.83949

 ivector:
   plda:
     3.15789%, at threshold 0.563044
     3.85965%, at threshold 0.525273
     3.85965%, at threshold 0.502531
     2.80702%, at threshold 0.429186
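
For reference, each "%, at threshold" pair above is an EER-style operating point: the error rate at the threshold where false acceptance and false rejection meet. A minimal sketch with stand-in scores (the Gaussian scores below are random placeholders, not the actual trials):

 import numpy as np

 def eer(target_scores: np.ndarray, nontarget_scores: np.ndarray):
     """Return (EER in percent, threshold) where the false-acceptance
     and false-rejection rates are closest to equal."""
     best_gap, best = float("inf"), (None, None)
     for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
         frr = np.mean(target_scores < t) * 100       # targets rejected
         far = np.mean(nontarget_scores >= t) * 100   # impostors accepted
         if abs(far - frr) < best_gap:
             best_gap, best = abs(far - frr), ((far + frr) / 2, t)
     return best

 rng = np.random.default_rng(0)
 targets = rng.normal(2.0, 1.0, 500)      # stand-in target-trial scores
 impostors = rng.normal(0.0, 1.0, 5000)   # stand-in impostor-trial scores
 rate, threshold = eer(targets, impostors)
 print(f"{rate:.4f}%, at threshold {threshold:.4f}")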