Difference between revisions of "Deal with numbers in LM training"
Latest revision as of 01:39, 13 September 2012 (Thursday)
Numbers are hard to handle in every language. The basic problem is that numbers form an open class, so the contexts in which they occur are hard to model. Our approach is to substitute every number with a single token "NUM". By building an LM that contains NUM, building a separate graph for NUM, and composing the two graphs, we hope to train a robust model.
The first step, therefore, is to substitute numbers with NUM. The following steps are taken:
1. Find all words containing a digit 0-9 and replace each of them with NUM directly.
2. Find all words containing a Chinese numeral '零'-'九' and collect them into a number word list L0.
3. Some of these words are not actually numbers, such as the idiom '三纲五常', so remove the words that appear in a pre-defined lexicon V from L0 to get L = L0 - V.
4. The pre-defined lexicon V is derived from a general lexicon V0 by removing pure numbers, such as '一', '二', '一九一九'.
5. Design the mapping M = L -> num.
6. Use M to substitute the numbers in the training text with 'NUM'.
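The steps above can be sketched in Python as follows. This is only a minimal illustration, not the actual scripts used: it assumes the training text is already segmented into words, treats exactly the ten characters '零' to '九' as Chinese numerals, and maps every word in L straight to 'NUM'. The function names (`build_mapping`, `substitute`) and the toy lexicon are hypothetical.

```python
import re

# Step 1 helper: any word containing an Arabic digit 0-9.
ARABIC_DIGIT = re.compile(r"[0-9]")

# Assumption: the range '零'-'九' means exactly these ten characters.
CHINESE_DIGITS = set("零一二三四五六七八九")


def is_pure_number(word):
    """True if the word consists only of Chinese numerals, e.g. '一九一九'."""
    return len(word) > 0 and all(ch in CHINESE_DIGITS for ch in word)


def build_mapping(words, general_lexicon):
    """Build the mapping M from number words to 'NUM' (steps 2-5)."""
    # Step 4: V is the general lexicon V0 with pure numbers removed.
    v = {w for w in general_lexicon if not is_pure_number(w)}
    # Step 2: L0 is every word containing at least one Chinese numeral.
    l0 = {w for w in words if any(ch in CHINESE_DIGITS for ch in w)}
    # Step 3: drop lexicon words such as '三纲五常' that are not numbers.
    l = l0 - v
    # Step 5: here M simply sends every remaining word to 'NUM'.
    return {w: "NUM" for w in l}


def substitute(words, mapping):
    """Replace number words with 'NUM' (steps 1 and 6)."""
    out = []
    for w in words:
        if ARABIC_DIGIT.search(w):
            out.append("NUM")              # step 1
        else:
            out.append(mapping.get(w, w))  # step 6
    return out


# Toy example with a hypothetical general lexicon V0.
v0 = {"三纲五常", "一", "二", "一九一九", "苹果"}
text = ["我", "有", "3个", "苹果", "一九一九", "年", "三纲五常"]
m = build_mapping(text, v0)
result = substitute(text, m)
```

In this toy run, '3个' and '一九一九' become NUM, while '三纲五常' is kept because it survives in V. Numerals outside '零'-'九' (for example '十' or '百') are not covered by this sketch.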