NLP-based class LM

From cslt Wiki
Version as of 00:35, 13 September 2012 (Thu) by 115.170.223.144 (talk)


A particular problem of LMs is that some words occur only a few times, yet the context of these words should not be estimated as if it were equally rare. For example, the number 12537 may occur only once in the training text, but its context (the kind of context that contains numbers) is quite stable. This motivates the class LM.

In class LMs, words that share the same context are grouped into a class, and the context is estimated by replacing all the words in the class with the class label. Within a class, words may be selected at random or with some probability. This idea is somewhat similar to a decision tree (by the way, can we introduce a tree LM?)
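The grouping idea above can be sketched as a tiny class-based bigram model, where P(word | prev) is factored into P(class | prev class) * P(word | class). This is a minimal illustration, not the wiki's actual system: the corpus, the word-to-class map, and the class labels (`<NUM>`, `<CITY>`) are all assumptions made up for the example.

```python
from collections import Counter

# Assumed toy word->class map; real systems would induce or hand-build this.
word2class = {"12537": "<NUM>", "42": "<NUM>",
              "boston": "<CITY>", "denver": "<CITY>"}

# Assumed toy training corpus.
corpus = "flight 42 to boston flight 12537 to denver".split()

# Map every word to its class (words outside the map are their own class).
classes = [word2class.get(w, w) for w in corpus]

# Count class bigrams/unigrams and word-within-class frequencies.
class_bigrams = Counter(zip(classes, classes[1:]))
class_unigrams = Counter(classes)
word_in_class = Counter(zip(classes, corpus))

def prob(prev_word, word):
    """P(word | prev_word) = P(class | prev class) * P(word | class)."""
    c_prev = word2class.get(prev_word, prev_word)
    c = word2class.get(word, word)
    p_class = class_bigrams[(c_prev, c)] / class_unigrams[c_prev]
    p_word = word_in_class[(c, word)] / class_unigrams[c]
    return p_class * p_word

# "12537" occurs only once, but its class context <NUM> is seen
# after "flight" twice, so the context estimate is not starved.
print(prob("flight", "12537"))  # → 0.5
```

A plain bigram model would give "flight 12537" a count of 1 out of all "flight" bigrams; the class factorization pools the evidence from "flight 42" into the same `<NUM>` context.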

A class should (1) share the same context linguistically, and (2) be large enough, possibly infinite (open), so that token-based context estimation would be inaccurate.

There are at least two such classes: numbers and named entities. Numbers are relatively simple, while named entities are not trivial. An interesting research direction is to apply NLP approaches to find named entities first, and then group them into one or a few classes, if such a grouping can be obtained, for example address, name, city...
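The two cases above differ in how tokens would be assigned to classes. As a minimal sketch: numbers can be caught with a regular expression, while named entities would normally need an NER tagger; here a hypothetical toy gazetteer stands in for the tagger, and the class labels and word lists are assumptions for illustration only.

```python
import re

# Assumed toy gazetteers; a real system would use an NLP NER component.
CITY_GAZETTEER = {"beijing", "boston", "denver"}
NAME_GAZETTEER = {"alice", "bob"}

def token_class(token: str) -> str:
    """Assign a token to a class label, or leave it as itself."""
    if re.fullmatch(r"\d+", token):       # numbers: the simple case
        return "<NUM>"
    if token.lower() in CITY_GAZETTEER:   # named entities: gazetteer stand-in
        return "<CITY>"
    if token.lower() in NAME_GAZETTEER:
        return "<NAME>"
    return token                          # ordinary words stay as themselves

print([token_class(t) for t in "alice flew 12537 miles to beijing".split()])
# → ['<NAME>', 'flew', '<NUM>', 'miles', 'to', '<CITY>']
```

Once tokens are normalized this way, the class-replaced text can be fed to a standard n-gram trainer, with per-class word distributions estimated separately.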