LM optimization with annealing in Chinese


There is a problem particular to Chinese that arises when building an LM.

It is well known that a word-based LM performs better than a character-based LM, so we choose a word list of a fixed size, for example 20k. The problem is that the Chinese word vocabulary is open while the character set is closed. If we simply delete the words outside the 20k list from the training data, we lose information.
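As a toy illustration of the issue (a sketch only; the jieba segmenter and the three-word stand-in vocabulary are assumptions, not part of the original recipe):

    # Toy illustration of the open-vocabulary problem in Chinese.
    # Assumes the jieba segmenter; the real word list would hold 20k entries.
    import jieba

    vocab = {"我们", "今天", "开"}      # stand-in for the 20k word list
    sentence = "我们今天开研讨会"

    words = list(jieba.cut(sentence))  # e.g. ['我们', '今天', '开', '研讨会']
    oov = [w for w in words if w not in vocab]
    print(oov)  # '研讨会' falls outside the list; deleting it loses training data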

A possible solution is:

1. Segment the text into words and choose the 20k list by frequency (with some tricks as well, e.g., substituting numbers).
2. For words outside the 20k list, split them into sequences of shorter words (or even single characters), and then amend the word frequencies accordingly (see the sketch after this list).
3. Double-check whether the 20k word list has changed. Since words beyond the 20k cutoff usually account for few counts, this should not change things significantly.
4. Use the splitting rules to rewrite the corresponding words in the corpus as short word sequences.
5. Re-train the model.
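A minimal sketch of steps 1-3, plus the splitting rule used in step 4, assuming a pre-segmented, whitespace-delimited corpus file and a greedy longest-match splitting rule; the file name corpus.txt, the function names, and the splitting rule itself are illustrative assumptions, not a fixed part of the recipe:

    from collections import Counter

    VOCAB_SIZE = 20000

    def build_vocab(corpus_path):
        # Step 1: count word frequencies and keep the top-N words as the list.
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
        vocab = {w for w, _ in counts.most_common(VOCAB_SIZE)}
        return counts, vocab

    def split_oov(word, vocab):
        # Steps 2/4: greedy longest-match split of an OOV word into shorter
        # in-vocabulary words; single characters are the final fallback.
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces

    def amend_counts(counts, vocab):
        # Step 2: move each OOV word's counts onto its split pieces.
        amended = Counter()
        for w, c in counts.items():
            if w in vocab:
                amended[w] += c
            else:
                for piece in split_oov(w, vocab):
                    amended[piece] += c
        return amended

    counts, vocab = build_vocab("corpus.txt")
    amended = amend_counts(counts, vocab)
    # Step 3: check that the 20k list is (almost) unchanged under amended counts.
    new_vocab = {w for w, _ in amended.most_common(VOCAB_SIZE)}
    print(len(vocab ^ new_vocab), "words differ between the old and new lists")

Because every single character can serve as a fallback piece, the splitting always terminates, which is exactly the closed-character-set property noted above. For step 5, the rewritten corpus can then be fed to any standard n-gram toolkit, e.g., SRILM's ngram-count.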