2014-11-10

From cslt Wiki
Revision as of 22:37, 9 November 2014 (Sun) by Lr

Text Processing

LM development

Domain specific LM

  • domain lm
  • weibo LM with pruning 0 10 10 20 20: testing done. weibo LM with pruning 0 10 8 8 8: under testing. weibo LM without pruning: 4/8 done.
  • merge the weibo, baiduhi and baiduzhidao LMs and test (this week)
  • confirm the size of the ARPA file with xiaomin for the business application (like e-13).
  • get the general test data from miaomin; this test set may be collected online.
  • new dict.
  • Tested the vocabularies on 6000.txt, measured by perplexity (ppl):
               old150K      new166K      new150K
   baiduzhidao     394          369          333
   baiduhi         217          190          188
  • Built new 100K, 150K and 200K vocabularies.
  • Fixed some bugs in the Sogou dict spider.
  • new toolkit: find a method to update the new dict. It can get a new word list from Sogou and word information from Baidu. (two weeks)
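
For reference, the ppl figures in the table above follow the usual definition of perplexity: the exponential of the average negative log-probability per token. The sketch below is only an illustration of that definition, assuming per-token natural-log probabilities are already available from the LM; the function name and the sample values are hypothetical and are not taken from the toolkit used for these tests.

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of per-token natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token, then exponentiate.
    avg_neg_log_prob = -sum(log_probs) / len(log_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical example: four tokens, each assigned probability 0.25.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

A lower value means the LM (with a given vocabulary) predicts the test text better, which is how the old150K / new166K / new150K columns are compared.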

tag LM

  • set new test
  • result


RNN LM

  • rnn
  • RNNLM => ARPA: make a report
  • test RNNLM on Chinese data from jietong-data
  • check the rnnlm code.
  • lstm+rnn
  • check the lstm-rnnlm code

Word2Vector

W2V based doc classification

  • Initial results with the variational Bayesian GMM obtained. Performance is not as good as the conventional GMM.
  • Non-linear inter-language transform, English-Spanish-Czech: word vector model training done, transform model under investigation.
  • SSA-based local linear mapping still running.
  • k-means classes changed to 2.
  • Knowledge vector started
  • format the data
  • yuanbin will continue this work with help of xingchao.
  • Character to word conversion
  • prepare the task: word similarity
  • prepare the dict.
  • Google word vector training
  • some ideas will be discussed in the weekly report.
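
As a small illustration for the word similarity task listed above: similarity between two word vectors is commonly scored with the cosine of the angle between them. This is a minimal sketch under that assumption; the report does not say which measure will be used, and the function name and toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
vectors = {
    "cat": [0.9, 0.1, 0.2],
    "dog": [0.8, 0.2, 0.3],
    "car": [0.1, 0.9, 0.7],
}
print(cosine_similarity(vectors["cat"], vectors["dog"]))
print(cosine_similarity(vectors["cat"], vectors["car"]))
```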

Translation

  • v4.0 demo released
  • cut the dict and use the new segmentation tool

QA

  • lucene Optimization
  • rewrite the method to select the 50 standard questions so that they do not share the same template. (this week)
  • test boosting the keyword weights and extracting synonyms. (this week)
  • check the word segmentation for templates. (this week)
  • min-segment method improves the accuracy (0.61 -> 0.66).
  • check the query method for getting Lucene information and rewrite the scoring method, e.g. based on the idf value.
  • test
  • test the different idf values from Baidu and Sogou in fuzzymatch. (this week)
  • need to check the other 10% of errors. (this week)
  • spell check
  • simple demo done.
  • new intern will install SEMPRE
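
The QA items above mention rewriting the scoring method around idf values. As a rough illustration of the idea, the sketch below computes a smoothed inverse document frequency over a tokenized collection and scores a candidate question by the summed idf of the tokens it shares with the query. The formula is one common smoothed variant, assumed here for illustration; it is not necessarily the one used by Lucene or by fuzzymatch, and all names and sample data are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """Smoothed idf per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}

def idf_overlap_score(query, candidate, idf):
    """Sum of idf weights over tokens shared by query and candidate."""
    shared = set(query) & set(candidate)
    return sum(idf.get(t, 0.0) for t in shared)

# Hypothetical, already-segmented standard questions.
questions = [
    ["how", "reset", "password"],
    ["how", "change", "email"],
    ["how", "close", "account"],
]
idf = idf_table(questions)
query = ["how", "reset", "my", "password"]
best = max(questions, key=lambda q: idf_overlap_score(query, q, idf))
print(best)
```

Rare tokens such as "password" carry more weight than tokens that appear in every question, which is the motivation for comparing idf values estimated from different sources.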