Dealing with numbers in LM training

From cslt Wiki
Revision as of 01:39, 13 September 2012 (Thu) by 166.111.134.19 (talk)

Numbers are not simple to handle in any language. The basic problem is that numbers form an open set, so the contexts in which they occur are hard to model. Our approach is to substitute every number with a single token "NUM". By building an LM that contains the NUM token and a separate graph for NUM, and then composing these two graphs, we hope to train a robust model.

The first step, hence, is to substitute numbers with NUM. The following steps are taken:

1. Find all words containing the digits 0-9 and replace them with NUM directly.
2. Find all words containing the Chinese digits '零'-'九' to form a candidate number word list L0.
3. Some of these words are not actually numbers, such as the idiom '三纲五常', so remove every word that appears in a pre-defined lexicon V from L0, giving L = L0 - V.
4. The pre-defined lexicon V is derived from a general lexicon V0 by removing pure numbers, such as '一', '二', and '一九一九'.
5. Define the mapping M: L -> NUM.
6. Use M to substitute the numbers in the training text with 'NUM'.
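
The six steps can be sketched in Python as follows. This is only a minimal illustration, not the script actually used: it assumes the training text is already word-segmented (each sentence is a list of words), that "pure number" means a word made up only of the digits '零'-'九', and that only ASCII digits count in step 1. The names build_filter_lexicon, build_mapping and substitute, as well as the toy lexicon and corpus, are hypothetical.

```python
import re

NUM = "NUM"
CHINESE_DIGITS = "零一二三四五六七八九"

HAS_ARABIC_DIGIT = re.compile(r"[0-9]")
HAS_CHINESE_DIGIT = re.compile("[" + CHINESE_DIGITS + "]")
# Assumption: a "pure number" consists only of the digits 零-九.
PURE_CHINESE_NUMBER = re.compile("[" + CHINESE_DIGITS + "]+")


def build_filter_lexicon(general_lexicon):
    """Step 4: V is V0 with the pure numbers removed."""
    return {w for w in general_lexicon if not PURE_CHINESE_NUMBER.fullmatch(w)}


def build_mapping(corpus, general_lexicon):
    """Steps 2, 3 and 5: build L0, filter it with V, and map L to NUM."""
    v = build_filter_lexicon(general_lexicon)
    # Step 2: every word that contains at least one Chinese digit.
    l0 = {w for sentence in corpus for w in sentence if HAS_CHINESE_DIGIT.search(w)}
    # Step 3: drop known non-number words, L = L0 - V.
    l = l0 - v
    # Step 5: the mapping M sends every word in L to NUM.
    return {w: NUM for w in l}


def substitute(sentence, mapping):
    """Steps 1 and 6: replace number words in one segmented sentence."""
    result = []
    for w in sentence:
        if HAS_ARABIC_DIGIT.search(w):
            result.append(NUM)  # step 1: any word containing 0-9
        else:
            result.append(mapping.get(w, w))  # step 6: apply M
    return result


# Toy data, invented for illustration only.
V0 = {"三纲五常", "一", "二", "一九一九", "书"}
CORPUS = [
    ["他", "有", "三十五", "本", "书"],
    ["三纲五常", "是", "一九一九", "年", "的", "话题"],
    ["价格", "是", "35", "元"],
]

M = build_mapping(CORPUS, V0)
NORMALIZED = [substitute(sentence, M) for sentence in CORPUS]
```

With this toy data, '三十五', '一九一九' and '35' become NUM, while '三纲五常' is kept because it survives in V. Removing pure numbers from V0 in step 4 is what stops V from protecting genuine numbers such as '一九一九' in step 3.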