2014-10-20 (revision as of 07:12, 20 October 2014, Monday)

Speech Processing

AM development

Contour

  • NaN problem
  • NaN recurrence test on different machines:
  ------------------------------------------------------------
   machine    |   reproducible  |   notes
  ------------------------------------------------------------
   grid-10    |     yes         |
  ------------------------------------------------------------
   grid-12    |     no          | "nan" appears at a different position
  ------------------------------------------------------------
   grid-14    |     yes         |
  ------------------------------------------------------------
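
To localize where NaNs first enter during training, a check along these lines can be run after each update; a minimal numpy sketch, where `params` and the layer name are placeholders for whatever the trainer exposes:

    import numpy as np

    def report_nans(params):
        """Print the count and first (row, col) of non-finite entries per
        weight matrix, so runs on different machines can be compared."""
        for name, w in params.items():
            bad = ~np.isfinite(w)
            if bad.any():
                first = tuple(int(i) for i in np.argwhere(bad)[0])
                print(f"{name}: {int(bad.sum())} bad entries, first at {first}")

    # hypothetical usage: call after every mini-batch update
    params = {"hidden1": np.random.default_rng(0).normal(size=(4, 4))}
    params["hidden1"][2, 3] = np.nan
    report_nans(params)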

Sparse DNN

  • Experiments show a performance improvement when the network is pruned slightly (see the pruning sketch below).
  • Suggest using TIMIT / AURORA 4 for training.
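
A minimal sketch of the magnitude pruning referred to above, assuming the network weights are available as numpy matrices (the matrix size and the 5% fraction are illustrative):

    import numpy as np

    def magnitude_prune(w, fraction=0.05):
        """Zero the smallest-magnitude `fraction` of the weights."""
        pruned = w.copy()
        threshold = np.quantile(np.abs(w), fraction)
        pruned[np.abs(pruned) < threshold] = 0.0
        return pruned

    rng = np.random.default_rng(0)
    w = rng.normal(size=(1024, 1024))
    w_sparse = magnitude_prune(w, fraction=0.05)  # "pruned slightly"
    print("sparsity:", float((w_sparse == 0.0).mean()))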

RNN AM

  • Initial test on WSJ ran out of memory.
  • Now using AURORA 4 short sentences with a smaller number of targets.

Noise training

  • First draft of the noisy training journal paper
  • Paper correction (Yinshi, Liuchao, Lin Yiye) is ongoing.

Drop out & Rectification & convolutive network

  • Drop out (a dropout sketch follows this list)
  • dataset: WSJ, test set: eval92, WER (%)
       std |  dropout0.4 | dropout0.5 | dropout0.7 | dropout0.8
    -------------------------------------------------------------
       4.5 |     5.39    |    4.80    |   4.36     |    -
  • Test on the noisy AURORA4 dataset, WER (%)
       std |  dropout0.4 | dropout0.5 | dropout0.7 | dropout0.8
    -------------------------------------------------------------
      6.05 |     -       |    -       |   -        |   -
  • Continue dropout on a normally trained XEnt NNET, e.g. WSJ. (+)
  • Draft the dropout-DNN weight distribution. (+)
  • Rectification
  • Still NaN error, need to debug. (+)
  • MaxOut (+)
  • Convolutive network
  • Test more configurations
  • Yiye will work on CNN
  • Reading the CNN tutorial
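
For reference, a minimal sketch of inverted dropout at training time. Whether the column labels above (dropout0.4 etc.) denote the drop probability or the keep probability is not stated in the report, so this assumes drop probability:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_drop, train=True):
        """Inverted dropout: rescale at training time so the test-time
        forward pass needs no change."""
        if not train or p_drop == 0.0:
            return h
        mask = rng.random(h.shape) >= p_drop   # keep units with prob 1 - p_drop
        return h * mask / (1.0 - p_drop)

    h = rng.standard_normal((8, 100))          # a batch of hidden activations
    h_train = dropout_forward(h, p_drop=0.4)   # cf. the dropout0.4 column
    h_test = dropout_forward(h, p_drop=0.4, train=False)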

Denoising & Farfield ASR

  • ICASSP paper submitted.

VAD

  • Added more silence tags "#" to the pure-silence utterance transcripts (training).
  • The xEntropy model is training.
  • Need to test the baseline.
  • Sum all sil-pdf posteriors to get the silence posterior probability (see the sketch below).
  • Program done; the threshold remains to be tuned.
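
A sketch of the sil-pdf summing step described above, assuming a frame-by-pdf posterior matrix from the network and a known list of silence pdf-ids (both hypothetical here); the threshold is the one still to be tuned:

    import numpy as np

    def silence_posterior(posteriors, sil_pdf_ids):
        """Sum the posteriors of all silence pdfs, frame by frame."""
        return posteriors[:, sil_pdf_ids].sum(axis=1)

    def vad_decide(posteriors, sil_pdf_ids, threshold=0.5):
        """A frame counts as speech when the summed silence posterior
        is below the threshold."""
        return silence_posterior(posteriors, sil_pdf_ids) < threshold

    rng = np.random.default_rng(0)
    frames = rng.dirichlet(np.ones(100), size=50)   # 50 frames over 100 pdfs
    speech_mask = vad_decide(frames, sil_pdf_ids=[0, 1, 2], threshold=0.5)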

Speech rate training

  • The ROS model seems superior to the normal one on faster speech.
  • Suggest extracting speech data with different ROS to construct a new test set. (+)
  • Suggest using the Tencent training data. (+)

Low-resource language AM training

  • Use the Chinese NN as the initial NN and replace the last layer.
  • Vary the number of reused Chinese-trained DNN hidden layers (a layer-transfer sketch follows the tables below).
    • feature_transform = 6000h_transform + 6000_N*hidden-layers
 nnet.init = random (4-N)*hidden-layers + output-layer
 | N / learn_rate | 0.008         | 0.001 | 0.0001 |
 |   baseline     | 17.00(14*2h)  |       |        |
 |       4        | 17.75(9*0.6h) | 18.64 |        |
 |       3        | 16.85         |       |        |
 |       2        | 16.69         |       |        |
 |       1        | 16.87         |       |        |
 |       0        | 16.88         |       |        |  
    • feature_transform = uyghur_transform + 6000_N*hidden-layers
 nnet.init = random (4-N)*hidden-layers + output-layer
 Note: this reproduces Yinshi's experiment
 | N / learn_rate | 0.008 | 0.001 | 0.0001 |
 |   baseline     | 17.00 |       |        |
 |       4        | 28.23 | 30.72 | 37.32  |
 |       3        | 22.40 |       |        |
 |       2        | 19.76 |       |        |
 |       1        | 17.41 |       |        |
 |       0        |       |       |        |
    • feature_transform = 6000_transform + 6000_N*hidden-layers
 nnet.init = uyghur (4-N)*hidden-layers + output-layer
 | N / learn_rate | 0.008 | 0.001 | 0.0001 |
 |   baseline     | 17.00 |       |        |
 |       4        | 17.80 | 18.55 | 21.06  |
 |       3        | 16.89 | 17.64 |        |
 |       2        |       |       |        |
 |       1        |       |       |        |
 |       0        |       |       |        |
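
A sketch of the layer-borrowing scheme in the tables above, assuming each net is a list of (weight, bias) pairs: the first N hidden layers come from the 6000h Chinese DNN, and the remaining (4-N) hidden layers plus the output layer are freshly initialized. The layer sizes and target count are made up:

    import numpy as np

    rng = np.random.default_rng(0)

    def random_layer(n_in, n_out):
        """A freshly initialized (weight, bias) pair."""
        return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

    def init_net(chinese_layers, n, dims, n_targets):
        """Reuse the first `n` hidden layers of the Chinese DNN; randomly
        initialize the rest and the output layer (new-language targets)."""
        layers = list(chinese_layers[:n])
        for i in range(n, len(dims) - 1):
            layers.append(random_layer(dims[i], dims[i + 1]))
        layers.append(random_layer(dims[-1], n_targets))
        return layers

    dims = [429, 1200, 1200, 1200, 1200]   # input + 4 hidden layers (assumed)
    chinese = [random_layer(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
    uyghur_net = init_net(chinese, n=2, dims=dims, n_targets=3000)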

Scoring

  • Global scoring done.
  • Pitch & rhythm done; needs testing.
  • Harmonics program done; experiments still to be run.

Confidence

  • Reproduce the experiments on the Fisher dataset.
  • Use the Fisher DNN model to decode the all-WSJ dataset.


Speaker ID

  • Preparing GMM-based server.

Emotion detection

  • Sinovoice is implementing the server


Text Processing

LM development

Domain specific LM

  • The LM based on baidu_hi and baiduzhidao is done; it was tested on the shujutang test set.
  • The weibo LM was trained with pruning on counts (5,10,10,20,20) because the corpus is too large; its perplexity is about twice that of the baidu_hi and baidu_zhidao LMs.
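
A sketch of what pruning on counts means here, assuming (5,10,10,20,20) are the minimum counts for 1- through 5-grams (that order assignment is an assumption):

    from collections import Counter

    # assumed mapping of (5,10,10,20,20) to minimum counts per n-gram order
    MIN_COUNT = {1: 5, 2: 10, 3: 10, 4: 20, 5: 20}

    def count_ngrams(sentences, max_order=5):
        counts = Counter()
        for sent in sentences:
            words = sent.split()
            for n in range(1, max_order + 1):
                for i in range(len(words) - n + 1):
                    counts[tuple(words[i:i + n])] += 1
        return counts

    def prune_counts(counts):
        """Drop every n-gram seen fewer times than the minimum for its order."""
        return {g: c for g, c in counts.items() if c >= MIN_COUNT[len(g)]}

    # on a toy corpus almost everything is pruned; on web-scale text the
    # survivors are exactly the frequent n-grams worth keeping
    kept = prune_counts(count_ngrams(["a b c a b", "a b a b"]))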

NUM tag LM

  • Use HIT's LTP tool for segmentation, POS tagging, and NER; the program is running (about 3 days) on baiduHi and baiduzhidao (365G in total).
  • Will use the small test set from xiaoxi for the address tag.
  • There are now more than 1M addresses; they will be pruned by frequency (see the tagging sketch below).
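
A sketch of the tag-substitution step for the tag LM: digit strings and recognized addresses are replaced by class tags before n-gram counting, so a single tag stands in for the open set. The tag names and the toy address list are hypothetical:

    import re

    # toy address list; the real one (>1M entries) would be pruned by frequency
    ADDRESSES = {"中关村", "五道口"}

    def tag_tokens(tokens):
        """Replace digit strings with <NUM> and known addresses with <ADDR>."""
        tagged = []
        for tok in tokens:
            if re.fullmatch(r"[0-9]+", tok):
                tagged.append("<NUM>")
            elif tok in ADDRESSES:
                tagged.append("<ADDR>")
            else:
                tagged.append(tok)
        return tagged

    print(tag_tokens(["我", "在", "五道口", "等", "了", "30", "分钟"]))
    # ['我', '在', '<ADDR>', '等', '了', '<NUM>', '分钟']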


Word2Vector

W2V based doc classification

  • Initial results with the variational Bayesian GMM obtained. Performance is not as good as the conventional GMM.
  • Non-linear inter-language transform (English-Spanish-Czech): wv model training done; the transform model is under investigation.
  • SSA-based local linear mapping is still running.
  • The number of k-means classes was changed to 2.
  • Knowledge vector started
    • format the data
    • yuanbin will continue this work with the help of xingchao.
  • Character to word conversion
    • prepare the task: word similarity (see the evaluation sketch below)
    • prepare the dict.
  • Google word vector training
    • some ideas will be discussed in the weekly report.
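
For the word-similarity task, a common evaluation is cosine similarity between word vectors scored against human ratings with Spearman correlation; a sketch under those assumptions, with toy vectors and pairs:

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def word_similarity_eval(vectors, pairs):
        """vectors: word -> vector; pairs: (w1, w2, human_score)."""
        model, human = [], []
        for w1, w2, gold in pairs:
            if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
                model.append(cosine(vectors[w1], vectors[w2]))
                human.append(gold)
        return spearmanr(model, human).correlation

    rng = np.random.default_rng(0)
    vectors = {w: rng.normal(size=100) for w in ["king", "queen", "apple", "pear"]}
    pairs = [("king", "queen", 8.5), ("apple", "pear", 7.6), ("king", "apple", 1.2)]
    print(word_similarity_eval(vectors, pairs))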

RNN LM

  • rnn
    • got the baseline on n-best rescoring WER (see the rescoring sketch below).
  • lstm+rnn
    • trained the RNN+LSTM LM on wsj_np_data (about 200M). The network is 100*100 (LSTM cells)*10000 with 100 classes; each epoch takes about 200 minutes.
    • get the baseline on n-best rescoring WER.
    • more detail on LSTM
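
A sketch of the n-best rescoring baseline mentioned above: keep the acoustic score, swap in the RNN LM score for each hypothesis, and take the best-scoring one per utterance. The score fields and the LM weight are assumptions:

    def rescore_nbest(nbest, rnnlm_logprob, lm_weight=10.0):
        """nbest: list of (words, acoustic_logprob) for one utterance;
        rnnlm_logprob: words -> log probability under the RNN LM."""
        def total_score(hyp):
            words, am_logprob = hyp
            return am_logprob + lm_weight * rnnlm_logprob(words)
        return max(nbest, key=total_score)[0]

    def fake_rnnlm(words):
        # toy stand-in for the trained RNN/LSTM LM
        return -1.0 if words == ["the", "cat", "sat"] else -3.0

    nbest = [(["the", "cat", "sat"], -120.0), (["the", "cats", "at"], -119.0)]
    print(rescore_nbest(nbest, fake_rnnlm))
    # the LM overrides the second hypothesis's small acoustic advantage

WER is then computed over the selected hypotheses against the references.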

Translation

  • v3.0 demo released
    • still slow
    • re-segment the words using the new dictionary; will use the tencent-dic (about 110k entries; a segmentation sketch follows this list).
    • check new data.
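
Dictionary-driven re-segmentation is often done with forward maximum matching; a sketch under that assumption (the real dictionary would be the ~110k-entry tencent-dic):

    def max_match(text, dictionary, max_len=6):
        """Forward maximum matching: take the longest dictionary word at
        each position, falling back to a single character."""
        words, i = [], 0
        while i < len(text):
            for n in range(min(max_len, len(text) - i), 0, -1):
                if n == 1 or text[i:i + n] in dictionary:
                    words.append(text[i:i + n])
                    i += n
                    break
        return words

    dictionary = {"研究生", "研究", "生命", "起源"}
    print(max_match("研究生命起源", dictionary))
    # ['研究生', '命', '起源'] -- the classic greedy failure case;
    # the intended reading is 研究/生命/起源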

QA

  • search method:
    • add VSM and BM25 to improve the search, plus a strategy for selecting the answer (a BM25 sketch follows at the end of this section).
  • spell check
    • got the ngram tool and made a simple demo.
    • got the domain word list and pinyin tool from huilan.
  • the new intern will install SEMPRE
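
To illustrate the BM25 part of the search-method item above, a minimal self-contained scorer; k1 and b use their common defaults, and the tokenized corpus is a toy example:

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.2, b=0.75):
        """Score each tokenized doc against the query with BM25."""
        n = len(docs)
        avgdl = sum(len(d) for d in docs) / n
        df = Counter(t for d in docs for t in set(d))   # document frequencies
        scores = []
        for doc in docs:
            tf = Counter(doc)
            s = 0.0
            for term in query:
                if term not in tf:
                    continue
                idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
                s += idf * tf[term] * (k1 + 1) / (
                    tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            scores.append(s)
        return scores

    docs = [["how", "to", "reset", "wifi"], ["weather", "today"]]
    print(bm25_scores(["reset", "wifi"], docs))   # the first doc should win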