Gigabyte LM


== 1. Very initial, without any pruning, character-based. Here are the sizes and perplexities. ==

Training uses the Gigabyte data except the cna data; ppl testing is based on a subset of the cna data (big52gb applied).
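
The perplexity lines below are SRILM's standard <code>ngram -ppl</code> output. A minimal sketch of how one such line is produced, assuming SRILM's <code>ngram</code> tool is on the PATH; the file names are hypothetical stand-ins for the LMs listed below and the cna test subset:

<source lang="python">
import subprocess

# Hypothetical file names. SRILM's `ngram -ppl` prints the
# "N zeroprobs, logprob= ... ppl= ... ppl1= ..." summary lines
# quoted in the results below.
subprocess.run(
    ["ngram",
     "-order", "3",             # match the model order
     "-lm", "3gram.500.gz",     # one of the LMs listed below
     "-ppl", "cna.test.txt"],   # held-out cna subset (hypothetical name)
    check=True,
)
</source>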

'''2-gram:'''

25M 2gram.4000.gz: 0 zeroprobs, logprob= -9.39983e+06 ppl= 161.965 ppl1= 177.141


'''3-gram:'''

47M   3gram.500.gz:  0 zeroprobs, logprob= -6.34868e+06 ppl= 85.1361 ppl1= 94.2525

117M  3gram.1000.gz: 0 zeroprobs, logprob= -7.43809e+06 ppl= 80.6408 ppl1= 87.7439

195M  3gram.2000.gz: 0 zeroprobs, logprob= -7.95872e+06 ppl= 79.9875 ppl1= 86.5196

221M  3gram.3000.gz: 0 zeroprobs, logprob= -8.04799e+06 ppl= 80.2418 ppl1= 86.7277

229M  3gram.4000.gz: 0 zeroprobs, logprob= -8.15697e+06 ppl= 82.6585 ppl1= 89.3392

'''4-gram:'''

205M  4gram.500.gz:  0 zeroprobs, logprob= -6.25395e+06 ppl= 79.6739 ppl1= 88.0716

472M  4gram.1000.gz: 0 zeroprobs, logprob= -7.21607e+06 ppl= 70.737 ppl1= 76.774
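
For reference, ppl and ppl1 are both derived from the same base-10 logprob: ppl normalizes by words plus sentence-end tokens, ppl1 by words only. A small sketch that inverts these formulas to recover the test-set size from the 2-gram line above (assuming no OOV tokens were excluded from scoring):

<source lang="python">
import math

def counts_from_srilm(logprob, ppl, ppl1):
    """Invert SRILM's formulas ppl = 10**(-logprob/(words + sents))
    and ppl1 = 10**(-logprob/words) to recover the token counts."""
    words = -logprob / math.log10(ppl1)
    words_plus_sents = -logprob / math.log10(ppl)
    return words, words_plus_sents - words

# Figures from the 2-gram line above; the tokens are characters here,
# since this LM is character-based.
words, sents = counts_from_srilm(-9.39983e6, 161.965, 177.141)
print(f"~{words:.3g} characters, ~{sents:.3g} sentences")
# -> roughly 4.18e6 characters and 7.3e4 sentences
</source>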


----

== 2. Pruning the 4k 3-gram LM ==

{| class="wikitable"
! Model !! 2-gram threshold !! 3-gram threshold !! size !! ppl !! fst size
|-
| 1 || 1e-7 || 1e-7 || 30M || 102.796 || 860M
|-
| 2 || 1e-6 || 1e-6 || 5M || 150.96 || 152M
|-
| 3 || 1e-7 || 1e-6 || 11M || 137.467 || 224M
|}
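
Models 1 and 2 apply a single threshold to both orders, which matches SRILM's standard entropy pruning. A minimal sketch, again assuming <code>ngram</code> is on the PATH (how model 3's mixed per-order thresholds were produced is not recorded here):

<source lang="python">
import subprocess

# Entropy-prune the 229M 3gram.4000.gz model from Section 1.
# `-prune T` drops n-grams whose removal increases the model's
# perplexity by less than the relative threshold T, then
# renormalizes the backoff weights.
for name, threshold in [("model1", "1e-7"), ("model2", "1e-6")]:
    subprocess.run(
        ["ngram",
         "-order", "3",
         "-lm", "3gram.4000.gz",
         "-prune", threshold,
         "-write-lm", f"3gram.4000.{name}.gz"],  # hypothetical output names
        check=True,
    )
</source>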

== 3. Word-based 3-gram ==

tri-gram size (org = unpruned; th-7, th-7/6, and th-6 denote pruning thresholds of 1e-7, mixed 1e-7/1e-6, and 1e-6, as in Section 2):

{| class="wikitable"
! vocab !! org !! th-7 !! th-7/6 !! th-6
|-
| 10k || 52M || 23M || 8M || 4M
|-
| 20k || 57M || 24M || 9M || 4M
|}

final FST size:

{| class="wikitable"
! vocab !! org !! th-7 !! th-7/6 !! th-6
|-
| 10k || - || 770M || 193M || 135M
|-
| 20k || - || - || 217M || 142M
|}

Testing is performed on 863 M49 with LDA+MLLT (tri2b), measured in character error rate (CER); see the CER sketch after the tables. The NUM part is deleted from the decoding result. The pair after each CER is (1/acweight, t/utt), i.e. the inverse acoustic weight and the decoding time per utterance.

{| class="wikitable"
! vocab !! th-6 !! th-7/6 !! th-7
|-
| 10k || 23.77 (13, 0.92) || 22.41 (11, 0.93) || 21.96 (11, 0.93)
|-
| 20k || 21.92 (13, 0.99) || 20.33 (12, 0.97) || 19.38 (12, 0.96)
|}

Results with LDA+MLLT+MMI:

{| class="wikitable"
! vocab !! th-6 !! th-7/6 !! th-7
|-
| 10k || 22.95 (13, 1.0) || 21.83 (13, 1.0) || 21.41 (10, 0.98)
|-
| 20k || 20.71 (11, 1.1) || 19.26 (11, 1.1) || 18.44 (10, 1.1)
|}


Results with LDA+MLLT+bMMI:

{| class="wikitable"
! vocab !! th-6 !! th-7/6 !! th-7
|-
| 10k || 22.68 (10, 1.0) || 21.46 (10, 1.0) || 20.96 (10, 1.0)
|-
| 20k || 20.39 (12, 1.1) || 18.97 (11, 1.1) || 18.23 (10, 1.1)
|}
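
For reference, the CER above is the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch (not the actual scoring script used for these numbers):

<source lang="python">
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # One row of the standard dynamic-programming edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)

# One deleted character out of six -> CER = 1/6.
print(cer("今天天气很好", "今天气很好"))
</source>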