Deal with numbers in LM training

Numbers are hard to handle in any language. The basic problem is that numbers form an open set, so the contexts in which they appear are hard to model. Our approach is to replace every number with a single token "NUM". By building an LM over the NUM-substituted text and a separate graph for NUM, and then composing the two graphs, we hope to train a robust model.

The first step, therefore, is to substitute numbers with NUM. The following steps are taken:

1. Find all words containing the digits 0-9 and replace them with NUM directly.
2. Find all words containing the Chinese numerals '零'-'九' and collect them into a number word list L0.
3. Since some of these words are not actually numbers, such as the idiom '三纲五常', remove the words in a pre-defined lexicon V from L0 to get L = L0 - V.
4. The pre-defined lexicon V is derived from a general lexicon V0 by removing pure numbers, such as '一', '二', '一九一九'.
5. Design the mapping M: L -> NUM.
6. Use M to substitute the numbers in the training text with 'NUM'.
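The substitution steps can be sketched in Python as follows. This is only a minimal illustration, not the actual scripts used here: it assumes the training text is already word-segmented into tokens, that the numeral range '零'-'九' means the ten characters 零一二三四五六七八九, and that a "pure number" is a word made up only of those characters. The function names `build_mapping` and `substitute` are hypothetical.

```python
import re

# Assumption: '零'-'九' means these ten numerals (they are not a
# contiguous Unicode range, so they are listed explicitly).
CN_DIGITS = set("零一二三四五六七八九")
ASCII_DIGIT = re.compile(r"[0-9]")


def is_pure_number(word):
    """True if the word consists only of Chinese numerals (assumed definition)."""
    return len(word) > 0 and all(ch in CN_DIGITS for ch in word)


def build_mapping(corpus_words, general_lexicon):
    """Build the mapping M: L -> NUM from corpus words and a general lexicon V0."""
    # L0: all words that contain at least one Chinese numeral.
    l0 = {w for w in corpus_words if any(ch in CN_DIGITS for ch in w)}
    # V: the general lexicon V0 with pure numbers removed.
    v = {w for w in general_lexicon if not is_pure_number(w)}
    # L = L0 - V: drop ordinary words that merely contain a numeral.
    number_words = l0 - v
    # M maps every remaining word to the NUM token.
    return {w: "NUM" for w in number_words}


def substitute(tokens, mapping):
    """Replace number tokens with NUM in a segmented sentence."""
    out = []
    for tok in tokens:
        if ASCII_DIGIT.search(tok):
            # Any word containing 0-9 becomes NUM directly.
            out.append("NUM")
        else:
            # Otherwise apply M, leaving other words unchanged.
            out.append(mapping.get(tok, tok))
    return out


# Toy data, made up for illustration only.
tokens = ["他", "在", "一九一九", "年", "学习", "三纲五常", "共", "3", "次"]
general_lexicon = {"三纲五常", "一", "二", "一九一九", "学习"}

mapping = build_mapping(tokens, general_lexicon)
print(" ".join(substitute(tokens, mapping)))
# prints: 他 在 NUM 年 学习 三纲五常 共 NUM 次
```

In this toy run '一九一九' is replaced because it is removed from V as a pure number, while '三纲五常' stays in V and is therefore left untouched.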
