Gigaword LM
1. Very initial models, without any pruning, character-based. Sizes and perplexities below.
Training uses the Gigaword corpus excluding the cna data; ppl testing uses a subset of the cna data (big52gb conversion applied).
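(Aside: a minimal Python sketch of the Big5-to-GB re-encoding step, as one way a big52gb-style conversion could be done. This is an illustration only; a real big52gb tool likely also maps traditional to simplified characters, which a bare codec round-trip does not.)

    # Minimal Big5 -> GB re-encoding sketch (illustrative assumption, not
    # the actual big52gb tool used in these notes).
    import sys

    def big5_to_gb(in_path: str, out_path: str) -> None:
        with open(in_path, "rb") as f:
            text = f.read().decode("big5", errors="replace")
        # gb18030 is a superset of GB2312 and can represent traditional
        # characters that plain gb2312 cannot encode.
        with open(out_path, "wb") as f:
            f.write(text.encode("gb18030"))

    if __name__ == "__main__":
        big5_to_gb(sys.argv[1], sys.argv[2])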
2gram:
25M 2gram.4000.gz: 0 zeroprobs, logprob= -9.39983e+06 ppl= 161.965 ppl1= 177.141
3gram:
47M 3gram.500.gz: 0 zeroprobs, logprob= -6.34868e+06 ppl= 85.1361 ppl1= 94.2525
117M 3gram.1000.gz: 0 zeroprobs, logprob= -7.43809e+06 ppl= 80.6408 ppl1= 87.7439
195M 3gram.2000.gz: 0 zeroprobs, logprob= -7.95872e+06 ppl= 79.9875 ppl1= 86.5196
221M 3gram.3000.gz: 0 zeroprobs, logprob= -8.04799e+06 ppl= 80.2418 ppl1= 86.7277
229M 3gram.4000.gz: 0 zeroprobs, logprob= -8.15697e+06 ppl= 82.6585 ppl1= 89.3392
4gram:
205M 4gram.500.gz: 0 zeroprobs, logprob= -6.25395e+06 ppl= 79.6739 ppl1= 88.0716
472M 4gram.1000.gz: 0 zeroprobs, logprob= -7.21607e+06 ppl= 70.737 ppl1= 76.774
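(Aside: assuming SRILM's usual definitions, ppl = 10^(-logprob / (words - OOVs + sentences)) and ppl1 = 10^(-logprob / (words - OOVs)), the test-set token and sentence counts can be recovered from any line above. A quick sketch:)

    # Recover test-set token/sentence counts from a SRILM ppl line,
    # assuming SRILM's definitions:
    #   ppl  = 10 ** (-logprob / (tokens + sentences))
    #   ppl1 = 10 ** (-logprob / tokens)
    # where tokens = scored words (here characters), i.e. words minus OOVs.
    import math

    def counts_from_srilm(logprob: float, ppl: float, ppl1: float):
        tokens = -logprob / math.log10(ppl1)             # ppl1 denominator
        tokens_plus_sents = -logprob / math.log10(ppl)   # ppl denominator
        return round(tokens), round(tokens_plus_sents - tokens)

    # Example with the 2gram.4000.gz line above:
    tokens, sentences = counts_from_srilm(-9.39983e+06, 161.965, 177.141)
    print(tokens, sentences)  # roughly 4.18M characters, ~74k sentences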
2. Pruning the 4k 3-gram LM (3gram.4000.gz above).
Per-order pruning thresholds, resulting size, and ppl on the same test set:

Model | 1gram | 2gram  | 3gram  | size | logprob      | ppl     | ppl1
1     | 1e-7  | 1e-7   | 1e-7   | 30M  | -8.55982e+06 | 102.796 | 111.532
2     | 1e-6  | 1e-6   | 1e-6   | 5M   | -9.26982e+06 | 150.96  | 164.9
3     | 1e-7  | 1e-6.5 | 1e-6.5 | 11M  | -9.09681e+06 | 137.467 | 149.913
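(Model 1, a single 1e-7 threshold across orders, corresponds to a plain SRILM entropy-pruning run. A sketch of driving it from Python follows; file names are taken from the notes above. Note that ngram's -prune flag takes a single threshold, so the per-order settings of models 2 and 3 would need separate handling.)

    # Sketch: entropy-prune the 4k 3-gram with SRILM's ngram tool.
    # Assumes SRILM is on PATH; output file name is an assumption.
    import subprocess

    subprocess.run(
        [
            "ngram",
            "-order", "3",
            "-lm", "3gram.4000.gz",
            "-prune", "1e-7",
            "-write-lm", "3gram.4000.pruned-1e-7.gz",
        ],
        check=True,
    )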
3. Word-based 3-gram.
tri-gram size (th-N = pruning threshold 1e-N):

vocab | org | th-7 | th-7/6 | th-6
10k   | 52M | 23M  | 8M     | 4M
20k   | 57M | 24M  | 9M     | 4M
final FST size:

vocab | org | th-7 | th-7/6 | th-6
10k   | -   | 770M | 193M   | 135M
20k   | -   | -    | -      | -
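(The notes do not record which toolchain compiled the FSTs; one plausible route is OpenGRM NGram's ngramread, sketched below. The file names are assumptions for illustration.)

    # Sketch: compile an ARPA-format LM into an FST with OpenGRM NGram's
    # ngramread (assumption: the notes don't say which tool built the FSTs).
    import gzip
    import shutil
    import subprocess

    # Unpack the gzipped ARPA model first (hypothetical file names).
    with gzip.open("3gram.10k.th-7.gz", "rb") as src, \
            open("3gram.10k.arpa", "wb") as dst:
        shutil.copyfileobj(src, dst)

    # --ARPA tells ngramread to parse ARPA format instead of n-gram counts.
    subprocess.run(
        ["ngramread", "--ARPA", "3gram.10k.arpa", "3gram.10k.fst"],
        check=True,
    )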