LM optimization with annealing in Chinese
There is a problem particular to Chinese when building an LM.
It is known that a word-based LM is better than a character-based LM, so we choose a word list, for example the top 20k words. The problem is that the set of Chinese words is open while the set of characters is closed. If we simply delete the words outside the 20k list from the training data, we lose information.
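To make the problem concrete, the sketch below counts how many running tokens a top-20k list fails to cover. It is a minimal sketch, assuming the training text has already been word-segmented, one sentence per line with words separated by spaces; the file name corpus.seg and the 20k cutoff are illustrative, not prescribed here.

 from collections import Counter
 
 def build_vocab(seg_path, size=20000):
     # Count word frequencies over the segmented corpus and keep the
     # top-`size` words as the vocabulary.
     counts = Counter()
     with open(seg_path, encoding="utf-8") as f:
         for line in f:
             counts.update(line.split())
     vocab = set(w for w, _ in counts.most_common(size))
     return vocab, counts
 
 def oov_rate(counts, vocab):
     # Fraction of running tokens outside the vocabulary; these are the
     # words we would lose if we simply deleted them from the data.
     total = sum(counts.values())
     oov = sum(c for w, c in counts.items() if w not in vocab)
     return oov / total
 
 vocab, counts = build_vocab("corpus.seg")
 print("OOV token rate: %.2f%%" % (100 * oov_rate(counts, vocab)))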
A possible solution is:
1. Segment the text into words, and choose a 20k list by frequency (with some additional tips, e.g., substituting numbers).
2. For each word outside the 20k list, split it into a sequence of shorter words (or even characters), and then amend the word frequencies accordingly (a sketch of steps 2-4 follows this list).
3. Double-check whether the 20k word list changed. Since words beyond the top 20k usually account for few counts, this should not change things significantly.
4. Use the splitting rules to replace the corresponding words in the training data with their short-word sequences.
5. Re-train the model.
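Below is a minimal sketch of steps 2-4, assuming greedy longest-prefix matching as the splitting rule; the page does not prescribe a particular rule, and the names split_oov and amend_counts are hypothetical. Each out-of-list word is split into the longest in-vocabulary prefixes, falling back to single characters, and its count is moved onto the pieces.

 from collections import Counter
 
 def split_oov(word, vocab):
     # Split an out-of-list word into in-vocabulary short words by
     # greedy longest-prefix matching; single characters are the final
     # fallback, since the character set is closed.
     parts, i = [], 0
     while i < len(word):
         for j in range(len(word), i, -1):   # try the longest prefix first
             if word[i:j] in vocab or j - i == 1:
                 parts.append(word[i:j])
                 i = j
                 break
     return parts
 
 def amend_counts(counts, vocab):
     # Steps 2-3: move each OOV word's count onto its pieces, and keep
     # the splitting rules (step 4) so the training text can be rewritten.
     amended, rules = Counter(), {}
     for w, c in counts.items():
         if w in vocab:
             amended[w] += c
         else:
             rules[w] = split_oov(w, vocab)
             for piece in rules[w]:
                 amended[piece] += c
     return amended, rules

After amending, one can re-rank amended.most_common(20000) to verify the list is stable (step 3), rewrite the corpus with the stored rules (step 4), and re-train the model (step 5).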