Gigabyte LM
1. Initial models: character-based, without any pruning. Sizes and perplexities are listed below.
Training uses the Gigabytes data except the CNA portion; perplexity is tested on a subset of the CNA data (with big52gb applied).
size | model | zeroprobs | logprob | ppl | ppl1 |
---|---|---|---|---|---|
25M | 2gram.4000.gz | 0 | -9.39983e+06 | 161.965 | 177.141 |
47M | 3gram.500.gz | 0 | -6.34868e+06 | 85.1361 | 94.2525 |
117M | 3gram.1000.gz | 0 | -7.43809e+06 | 80.6408 | 87.7439 |
195M | 3gram.2000.gz | 0 | -7.95872e+06 | 79.9875 | 86.5196 |
221M | 3gram.3000.gz | 0 | -8.04799e+06 | 80.2418 | 86.7277 |
229M | 3gram.4000.gz | 0 | -8.15697e+06 | 82.6585 | 89.3392 |
205M | 4gram.500.gz | 0 | -6.25395e+06 | 79.6739 | 88.0716 |
472M | 4gram.1000.gz | 0 | -7.21607e+06 | 70.737 | 76.774 |
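The output format above matches what SRILM's `ngram -ppl` prints, although the notes do not name the tool. Assuming that convention, ppl = 10^(-logprob / (words - OOVs + sentences)) and ppl1 = 10^(-logprob / (words - OOVs)). The notes do not give the token counts, so the sketch below inverts the formula to recover them from the 2gram.4000.gz entry. The function names are illustrative only.

```python
import math

def perplexity(logprob10, n_tokens):
    """Perplexity from a total base-10 log probability over n_tokens scored tokens."""
    return 10 ** (-logprob10 / n_tokens)

def implied_tokens(logprob10, ppl):
    """Invert the formula: number of scored tokens implied by a logprob/ppl pair."""
    return -logprob10 / math.log10(ppl)

# Figures copied from the 2gram.4000.gz entry.
logprob, ppl, ppl1 = -9.39983e6, 161.965, 177.141

n_with_eos = implied_tokens(logprob, ppl)      # words - OOVs + sentences (assumed)
n_without_eos = implied_tokens(logprob, ppl1)  # words - OOVs (assumed)

print(f"scored tokens incl. sentence ends: {n_with_eos:.0f}")
print(f"scored tokens excl. sentence ends: {n_without_eos:.0f}")
print(f"implied sentence count:            {n_with_eos - n_without_eos:.0f}")
```

The implied counts differ from model to model because logprob also changes with the number of out-of-vocabulary tokens, so perplexities of models with different vocabularies are not strictly comparable.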
2. Pruning the 4k 3gram LM.
Model | 2gram threshold | 3gram threshold | size | ppl | fst size |
---|---|---|---|---|---|
1 | 1e-7 | 1e-7 | 30M | 102.796 | 860M |
2 | 1e-6 | 1e-6 | 5M | 150.96 | 152M |
3 | 1e-7 | 1e-6 | 11M | 137.467 | 224M |
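To make the size/perplexity trade-off easier to read, the short sketch below compares each pruned model with the unpruned one. It assumes the "4k 3gram LM" is 3gram.4000.gz from section 1 (229M, ppl 82.6585); the notes do not state this explicitly, and the helper name is illustrative.

```python
# Unpruned baseline: ASSUMED to be 3gram.4000.gz from section 1.
BASE_SIZE_M, BASE_PPL = 229, 82.6585

# model id -> (2gram threshold, 3gram threshold, size in M, ppl), copied from the table.
pruned = {
    1: (1e-7, 1e-7, 30, 102.796),
    2: (1e-6, 1e-6, 5, 150.96),
    3: (1e-7, 1e-6, 11, 137.467),
}

def tradeoff(size_m, ppl):
    """Return (size reduction factor, relative ppl increase in percent) vs. the baseline."""
    return BASE_SIZE_M / size_m, 100.0 * (ppl - BASE_PPL) / BASE_PPL

for model, (t2, t3, size_m, ppl) in pruned.items():
    shrink, ppl_up = tradeoff(size_m, ppl)
    print(f"model {model} ({t2:g}/{t3:g}): {shrink:.1f}x smaller, ppl +{ppl_up:.1f}%")
```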
3. Word-based 3-gram
Tri-gram size:
- | org | th-7 | th-7/6 | th-6 |
---|---|---|---|---|
10k | 52M | 23M | 8M | 4M |
20k | 57M | 24M | 9M | 4M |
Final fst size:
- | org | th-7 | th-7/6 | th-6 |
---|---|---|---|---|
10k | - | 770M | 193M | 135M |
20k | - | - | 217M | 142M |
Tests are performed on 863 M49 with the LDA+MLLT model (tri2b), measured in character error rate (CER). The NUM part is deleted from the decoding result. The pair after each CER is (1/acweight, t/utt).
- | th-6 | th-7/6 | th-7 |
---|---|---|---|
10k | 23.77(13,0.92) | 22.41(11,0.93) | 21.96(11,0.93) |
20k | 21.92(13,0.99) | 20.33(12,0.97) | 19.38(12,0.96) |
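For reference, CER is usually computed as the character-level edit distance between reference and hypothesis, divided by the number of reference characters. The sketch below is a minimal illustration of that definition, not the scoring script used for these results. The sample sentences are made up, and steps such as removing the NUM part are assumed to have been done beforehand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Corpus-level CER in percent: total character edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total

# Made-up example: one substitution in the first pair, one deletion in the second.
refs = ["今天天气很好", "我们去公园"]
hyps = ["今天天气真好", "我们公园"]
print(f"CER = {cer(refs, hyps):.2f}%")
```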
Results with LDA+MLLT+MMI
- | th-6 | th-7/6 | th-7 |
---|---|---|---|
10k | 22.95(13,1.0) | 21.83(13,1.0) | 21.41(10,0.98) |
20k | 20.71(11,1.1) | 19.26(11,1.1) | 18.44(10,1.1) |
Results with LDA+MLLT+bMMI
- | th-6 | th-7/6 | th-7 |
---|---|---|---|
10k | 22.68(10,1.0) | 21.46(10,1.0) | 20.96(10,1.0) |
20k | 20.39(12,1.1) | 18.97(11,1.1) | 18.23(10,1.1) |
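The gain from discriminative training can be summarised as a relative CER reduction over the tri2b baseline. The sketch below does this for the th-7 column only, using the numbers from the three tables above; the dictionary layout and function name are illustrative.

```python
# CER (%) with the th-7 LM, copied from the three result tables.
cer_th7 = {
    "tri2b":      {"10k": 21.96, "20k": 19.38},
    "tri2b+MMI":  {"10k": 21.41, "20k": 18.44},
    "tri2b+bMMI": {"10k": 20.96, "20k": 18.23},
}

def relative_reduction(baseline, new):
    """Relative CER reduction in percent, with respect to the baseline."""
    return 100.0 * (baseline - new) / baseline

for system in ("tri2b+MMI", "tri2b+bMMI"):
    for vocab in ("10k", "20k"):
        rel = relative_reduction(cer_th7["tri2b"][vocab], cer_th7[system][vocab])
        print(f"{system:11s} {vocab}: {rel:.1f}% relative CER reduction")
```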