Sinovoice-2014-04-22


h1. Environment setting

  • Sinovoice internal server deployment: the usage standard draft has been released.
  • Email notification is problematic; need to obtain an SMTP server.
  • Will train a Redmine administrator for Sinovoice.

h1. Corpora

  • 300h Guangxi telecom text transcription is in progress; 180h completed.
  • In total 1338h of telephone speech is now ready (470 + 346 + 105 BJ mobile + 200 PICC + 108 HBTc + 109 new BJ mobile).
  • 16k 6000h data: 978h online data from DataTang + 656h online mobile data + 4300h recording data.
  • Standard established for LM-speech-text labeling (speech data transcription for LM enhancement)
  • Xiaona is preparing a noise database, extracting noise segments from the original wav files (see the sketch below).
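
A minimal sketch of the extraction step, assuming the noise segments are taken to be the low-energy frames of mono 16-bit wav files; file names and the energy threshold are illustrative, not the actual procedure:

<pre>
# Assumption: noise = low-energy frames of a mono 16-bit PCM wav file.
import wave

import numpy as np


def extract_noise(wav_path, out_path, frame_len=400, threshold_db=-40.0):
    """Copy the low-energy frames of a wav file to a separate noise file."""
    with wave.open(wav_path, "rb") as w:
        params = w.getparams()
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    n = len(pcm) // frame_len
    frames = pcm[: n * frame_len].reshape(n, frame_len).astype(np.float64)
    # Frame energy in dB relative to int16 full scale.
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) / 32768.0 ** 2 + 1e-12)

    noise = frames[energy_db < threshold_db].astype(np.int16).ravel()
    with wave.open(out_path, "wb") as w:
        w.setparams(params)
        w.writeframes(noise.tobytes())


extract_noise("orig_0001.wav", "noise_0001.wav")  # hypothetical file names
</pre>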

h1. Acoustic modeling

h2. Telephone model training

h3. 1000h Training

  • Baseline: 8k states, 470+300 MPE4, 20.29
  • Jietong phone set, 200-hour seed, 10k states training:
      • XEnt, 16 iterations: 22.90
      • MPE1: 20.89
      • MPE2: 20.68
      • MPE3: 20.61
      • MPE4: 20.56
  • CSLT phone set, 8k states training:
      • MPE1: 20.60
      • MPE2: 20.37
      • MPE3: 20.37
      • MPE4: 20.37
  • Found a problem in data processing: some data were cut off incorrectly (a sanity check is sketched below). Retraining the model.
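
A minimal sanity check for the cut-off problem, assuming Kaldi-style segments and wav.scp files with plain paths (no command pipes); file names are illustrative:

<pre>
# Assumption: "segments" has lines "utt-id wav-id start end" and
# "wav.scp" has lines "wav-id path"; both names are hypothetical.
import wave


def wav_duration(path):
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())


def check_segments(segments_path, wav_scp_path):
    durations = {}
    with open(wav_scp_path) as f:
        for line in f:
            wav_id, path = line.split(None, 1)
            durations[wav_id] = wav_duration(path.strip())

    with open(segments_path) as f:
        for line in f:
            utt, wav_id, start, end = line.split()
            start, end = float(start), float(end)
            # Flag segments that run past the audio or are empty.
            if end > durations[wav_id] or start >= end:
                print("bad segment:", utt, start, end, durations[wav_id])


check_segments("segments", "wav.scp")
</pre>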


h2. 6000 hour 16k training

h3. Training progress

  • Baseline: 1700h, MPE5, JT phone set: 9.91
  • 6000h, CSLT phone set training:
      • XEnt: 12.83
      • MPE1: 9.21
      • MPE2: 9.13
      • MPE3: 9.10
  • 6000h, JT phone set training:
      • MPE1: 10.63

h3. Training Analysis

  • The Qihang model used a subset of the 6k data:
  • 2500 + 950h + tang500h* + 20131220, approximately 1700 + 2400 hours
  • GMM training on this subset achieved 22.47%, while Xiaoming's result is 16.1%.
  • It seems the database is still not very consistent.
  • Xiaoming kicked off a job to reproduce the Qihang training with this subset.

h3. Multilingual Training

  • Preparing Chinglish data: will first select 100h to train a baseline model.
  • The AMIDA database is downloading.
  • Preparing the shared DNN structure for multilingual training (see the sketch after this list).
  • The baseline Chinese-English system is done.
  • Need to tune the hidden-layer sizes and introduce more sharing into the structure.
  • Need to investigate knowledge-based phone sharing.
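
A minimal sketch of the shared structure, assuming shared hidden layers with one softmax output layer per language; the PyTorch framing and all layer sizes are illustrative, not the actual configuration:

<pre>
# Assumption: all languages share the hidden stack; each language has
# its own output layer over its senone set. Sizes are made up.
import torch
import torch.nn as nn


class SharedDNN(nn.Module):
    def __init__(self, feat_dim, hidden_dim, n_hidden, states_per_lang):
        super().__init__()
        layers, d = [], feat_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden_dim), nn.Sigmoid()]
            d = hidden_dim
        self.shared = nn.Sequential(*layers)      # shared across languages
        self.heads = nn.ModuleDict(               # language-specific outputs
            {lang: nn.Linear(hidden_dim, n) for lang, n in states_per_lang.items()})

    def forward(self, x, lang):
        return self.heads[lang](self.shared(x))


net = SharedDNN(feat_dim=440, hidden_dim=1200, n_hidden=4,
                states_per_lang={"zh": 10000, "en": 6000})
frames = torch.randn(8, 440)   # a batch of spliced feature vectors
logits = net(frames, "zh")     # Chinese senone scores
</pre>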

h3. Noise robust feature

  • GFbank can be propagated to Sinovoice (a feature-extraction sketch follows this list).
  • 1700h JT phone set, MPE3: Fbank 10.48 vs. GFbank 10.23.
  • Preparing to train on the 1000h telephone speech.
  • Liuchao will prepare fast-computation code.
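
A minimal sketch of GFbank-style feature extraction, assuming it differs from Fbank only in replacing the mel triangles with gammatone power responses on an ERB-spaced axis (constants follow the standard Glasberg-Moore ERB fit); the filter count and FFT size are illustrative:

<pre>
import numpy as np


def erb(f):
    # Equivalent rectangular bandwidth at frequency f (Hz).
    return 24.7 * (4.37 * f / 1000.0 + 1.0)


def erb_rate(f):
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)


def erb_rate_inv(e):
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def gfbank_weights(n_filt=40, n_fft=512, sr=8000, fmin=100.0):
    # Center frequencies equally spaced on the ERB-rate scale.
    centers = erb_rate_inv(np.linspace(erb_rate(fmin), erb_rate(sr / 2.0), n_filt))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    b = 1.019 * erb(centers)[:, None]
    # Power response of a 4th-order gammatone filter around each center.
    return (1.0 + ((freqs[None, :] - centers[:, None]) / b) ** 2) ** -4


def gfbank(frames, weights, n_fft=512):
    # frames: (T, frame_len) windowed frames -> (T, n_filt) log energies.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power @ weights.T + 1e-10)
</pre>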


h1. Language modeling

h2. Domain specific atom-LM construction

h3. Some potential problems

  • The domain definition is unclear.
  • Using the same development set (the 8k transcriptions) for every domain is not very appropriate.
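
The development set's role can be made explicit: assuming the atom LMs are combined by linear interpolation, the mixture weights are tuned on the dev set by EM. A minimal sketch, where the per-word probability streams are assumed to have already been produced by scoring the dev text with each atom LM:

<pre>
# Assumption: p[m][i] is atom LM m's probability for dev word i.
import numpy as np


def em_weights(p, n_iter=20):
    """p: (n_models, n_dev_words) array of per-word probabilities."""
    m = p.shape[0]
    w = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        post = w[:, None] * p                  # E-step: model posteriors
        post /= post.sum(axis=0, keepdims=True)
        w = post.mean(axis=1)                  # M-step: re-estimate weights
    return w


# Toy example: three atom LMs scored on a 5-word dev stream.
p = np.array([[0.01, 0.20, 0.05, 0.10, 0.02],
              [0.02, 0.10, 0.10, 0.20, 0.01],
              [0.05, 0.05, 0.20, 0.05, 0.10]])
print(em_weights(p))
</pre>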

h3. Text data filtering

  • A telecom-specific word list is ready. Will work with Xiaona to prepare a new version of the lexicon.
  • A comparison of document classification performance (per-domain and overall) was done by LiuRong; a reproduction sketch follows the table:

            Finance  IT     Health  Sports  Travel  Education  Recruiting  Culture  Military  Overall
vsm         0.92     0.906  0.921   0.983   0.954   0.916      0.953       0.996    0.9339    0.94
lda(50)     0.84     0.39   0.79    0.85    0.60    0.368      0.61        0.31     0.86      0.62
w2v(50)     0.69     0.77   0.67    0.59    0.70    0.62       0.74        0.79     0.88      0.73
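
A minimal sketch of how such a comparison can be reproduced with a linear classifier; load_corpus is a hypothetical loader returning texts and domain labels, and the w2v variant (averaged 50-dim word vectors) is omitted for brevity:

<pre>
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs, labels = load_corpus()   # hypothetical: texts + domain labels

# VSM: tf-idf vectors fed directly to the classifier.
X_vsm = TfidfVectorizer(max_features=50000).fit_transform(docs)
print("vsm:", cross_val_score(LogisticRegression(max_iter=1000),
                              X_vsm, labels).mean())

# LDA(50): 50-topic posteriors as a dense 50-dim feature.
counts = CountVectorizer(max_features=50000).fit_transform(docs)
X_lda = LatentDirichletAllocation(n_components=50).fit_transform(counts)
print("lda(50):", cross_val_score(LogisticRegression(max_iter=1000),
                                  X_lda, labels).mean())
</pre>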


h1. DNN Decoder

h2. Decoder optimization

  • Measured the computation cost of each step (a measurement sketch follows this list):
      • beam 9/5000: net forward takes 65% of decoding time
      • beam 13/7000: net forward takes 28%
  • With a wider beam the search step dominates, so the net-forward share shrinks.
  • This has been verified by Liuchao with the CSLT engine.
  • The acceleration code was checked into Git, with a small modification to heap management.
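
A minimal sketch of the measurement itself, assuming the decoder exposes separate net-forward and search steps; the decoder interface is a hypothetical stand-in, not the actual engine API:

<pre>
import time


def profile(decoder, utterances, beam, max_active):
    """Time DNN forward vs. beam search across a set of utterances."""
    t_net = t_search = 0.0
    for feats in utterances:
        for frame in feats:
            t0 = time.perf_counter()
            post = decoder.net_forward(frame)            # DNN posteriors
            t1 = time.perf_counter()
            decoder.search_step(post, beam, max_active)  # token passing
            t_net += t1 - t0
            t_search += time.perf_counter() - t1
    total = t_net + t_search
    print("beam %g/%d: netforward %.0f%%"
          % (beam, max_active, 100.0 * t_net / total))
</pre>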

h2. Frame-skipping

  • Zhiyong & Liuchao will deliver the frame-skipping approach (one variant is sketched below).
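
A minimal sketch of one frame-skipping variant, assuming the net runs on every second frame and its posteriors are reused for the skipped frames; the delivered approach may differ (skip rate, interpolation):

<pre>
import numpy as np


def skipped_posteriors(net_forward, feats, skip=2):
    """feats: (T, dim). Evaluate the net every `skip` frames, copy between."""
    T = feats.shape[0]
    post = [None] * T
    for t in range(0, T, skip):
        p = net_forward(feats[t])
        for k in range(t, min(t + skip, T)):
            post[k] = p                 # reuse posterior for skipped frames
    return np.stack(post)
</pre>

The DNN cost drops roughly by the skip factor, at the price of coarser posterior timing in the search.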

h2. BigLM optimization

  • Investigate BigLM retrieval optimization (one candidate is sketched below).
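
A minimal sketch of one candidate optimization: memoizing big-LM lookups, since decoding queries the same (state, word) pairs repeatedly. The BigLM interface is a hypothetical stand-in, and LM states must be hashable:

<pre>
from functools import lru_cache


class CachedBigLM:
    """Wrap a big LM with a lookup cache for on-the-fly rescoring."""

    def __init__(self, biglm):
        self.biglm = biglm
        self._score = lru_cache(maxsize=1 << 20)(self._score_uncached)

    def _score_uncached(self, state, word):
        # (log-prob, next-state) from the underlying big LM.
        return self.biglm.score(state, word)

    def score(self, state, word):
        return self._score(state, word)
</pre>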