Dongxu Zhang 14-11-03
From cslt Wiki
Accomplished this week
- Created vocabularies of 100k, 200k, and 150,576 words. Using the 150,576-word vocabulary to build language models on baiduhi and baiduzhidao (still running, at the preprocessing stage).
- Using the 166k vocabulary to train LMs on baiduhi and baiduzhidao separately (still running, at the pruning stage).
- Extracted sentences that contain English and numbers from the weibo corpus.
- Ran BPTT training with rwthlm. The results are still not normal: high perplexity (ppl), low WER. Within rwthlm itself, however, the LSTM does seem to be better than standard BPTT.
- Found a tool called Shenlan that can parse Sogou cell vocabularies. By combining its code with a crawler, we can update our vocabulary with new words.
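The extraction of sentences containing English and numbers from the weibo corpus could be sketched roughly as below. This is only an illustration, not the script actually used: it assumes "English" means ASCII letters and "numbers" means ASCII digits, and the function names and the require_both switch are hypothetical, since the report does not say whether a sentence must contain both or just one of them.

```python
import re

# Assumption: "English" = ASCII letters, "numbers" = ASCII digits.
# Full-width digits and letters are deliberately not matched here.
LATIN = re.compile(r"[A-Za-z]")
DIGIT = re.compile(r"[0-9]")


def has_english_and_numbers(sentence, require_both=True):
    """Return True if the sentence contains ASCII letters and/or digits.

    require_both is a hypothetical switch: True keeps only sentences with
    both letters and digits, False keeps sentences with either one.
    """
    has_latin = LATIN.search(sentence) is not None
    has_digit = DIGIT.search(sentence) is not None
    if require_both:
        return has_latin and has_digit
    return has_latin or has_digit


def extract_sentences(lines, require_both=True):
    """Filter an iterable of sentences, one sentence per item."""
    return [
        line.strip()
        for line in lines
        if has_english_and_numbers(line, require_both)
    ]
```

For a large corpus the same predicate can be applied line by line while streaming the file, so the whole corpus does not need to be held in memory.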
Planned for next week
- Continue building the LMs and comparing the vocabularies.
- Continue working on rwthlm.