(5 intermediate revisions by the same user not shown)

Line 2:
* build a new jsgf file
* construct a test set for address tag language model
− * conduct a new experiment, result in
+ * conduct a new experiment; the result is below

=== Planned for next week ===
+ * check the relation between the merge weight and the size of the dict.
+ * short terms should be penalized.
+ * write a summary of tag-lm.
+ * read some papers about knowledge vectors.

− === Result ===
− 1. experiment 1
− 1.1 baseline
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
− am: mdl_v3.0.S
− test set: test_BJYD
− result:
− %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
− %SER 93.20 [ 1096 / 1176 ]
− BeiJing: 6 / 10 (the BJYD reference text contains 10 occurrences of "BeiJing"; 6 of the 10 were decoded; the counting is sketched below)
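
The "BeiJing: x / 10" numbers count how many of the 10 reference occurrences of the keyword also appear in the decoded hypotheses. A minimal counting sketch, not the actual scoring script; the Kaldi-style "utt-id word word ..." text files and the paths are assumptions:

<pre>
# Count how many reference occurrences of a keyword were decoded.
# Assumes Kaldi-style "utt-id word word ..." files; paths are hypothetical.

def load_text(path):
    """Map utterance id -> list of words."""
    utts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                utts[parts[0]] = parts[1:]
    return utts

def keyword_recall(ref_path, hyp_path, keyword="BeiJing"):
    ref, hyp = load_text(ref_path), load_text(hyp_path)
    decoded = total = 0
    for utt, words in ref.items():
        n_ref = words.count(keyword)
        n_hyp = hyp.get(utt, []).count(keyword)
        total += n_ref
        decoded += min(n_ref, n_hyp)  # count at most as many as the reference has
    return decoded, total

if __name__ == "__main__":
    hit, total = keyword_recall("test_BJYD/text", "decode_test_BJYD/hyp.txt")
    print(f"BeiJing: {hit} / {total}")
</pre>
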
− 1.2 use address tag:
− jsgf: extract the top 500 most frequent addresses (including "BeiJing") from the corpus
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with sentences containing "BeiJing" removed;
− tagged sentences are then added to the corpus (e.g. if "清华大学" is in the jsgf and a corpus sentence is
− "我 在 清华大学 上课", the sentence "我 在 <address> 上课" is also added; this substitution is sketched after this block)
− am: mdl_v3.0.S
− test set: test_BJYD
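
The tagging step in 1.2 is roughly the substitution below. This is a minimal sketch for illustration, not the actual preprocessing script; the file names and the address list passed in are assumptions, and the removal of the "BeiJing" sentences is not shown.

<pre>
# For every corpus sentence containing an address from the jsgf list,
# append an extra copy with the address replaced by the "<address>" tag.
# File names and the address list are assumptions for illustration.

def add_tagged_sentences(corpus_path, out_path, addresses, tag="<address>"):
    addr_set = set(addresses)
    with open(corpus_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            words = line.split()
            fout.write(line)                        # keep the original sentence
            if any(w in addr_set for w in words):   # sentence mentions a jsgf address
                tagged = [tag if w in addr_set else w for w in words]
                fout.write(" ".join(tagged) + "\n")

if __name__ == "__main__":
    # e.g. "我 在 清华大学 上课" -> also writes "我 在 <address> 上课"
    add_tagged_sentences("corpus.txt", "corpus_tagged.txt", ["清华大学", "BeiJing"])
</pre>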

− Different merge weights were tried; the results are as follows:
− weight: 0.1
− %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]
− %SER 94.98 [ 1117 / 1176 ]
− BeiJing: 4 / 10

− weight: 0.5
− %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]
− %SER 93.88 [ 1104 / 1176 ]
− BeiJing: 4 / 10

− weight: 1
− %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]
− %SER 93.28 [ 1097 / 1176 ]
− BeiJing: 2 / 10

− weight: 2
− %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]
− %SER 93.71 [ 1102 / 1176 ]
− BeiJing: 1 / 10

− weight: 3
− can't decode "BeiJing"

− ----
− This weekend I found two mistakes in experiment 1:
− 1. I used run_decode.sh incorrectly: I copied the script from xiaoxi's directory to my own directory and
− ran it there, which led to a higher WER.
− 2. One step of building the merged lexicon FST was wrong (in experiment 1.2). Merging grammar_G.fst and
− lm_G.fst generates a new sym.txt and a new lexicon; the new sym.txt already contains a "#0" at the end of
− the file, and format_lm.sh then uses this sym.txt to generate words.txt and appends another "#0" to its end,
− so words.txt ends up with two "#0" entries, which leads to wrong results. Under this condition, whenever the
− decoding result contains the TAG it is always truncated, which explains why the deletion errors are so high
− when the merge weight is small in experiment 1.2 (a quick check for this is sketched below).
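
A quick consistency check for the second mistake is to scan words.txt for duplicated symbols (such as a second "#0") before building the decoding graph. A minimal sketch; the words.txt path is an assumption:

<pre>
# A Kaldi words.txt symbol table must not contain duplicate symbols,
# e.g. two "#0" entries. The path below is hypothetical.
from collections import Counter

def find_duplicate_symbols(words_txt):
    """Return symbols that occur more than once in a 'symbol id' table."""
    counts = Counter()
    with open(words_txt, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return [sym for sym, n in counts.items() if n > 1]

if __name__ == "__main__":
    dups = find_duplicate_symbols("data/lang_merged/words.txt")
    print("duplicate symbols:", dups if dups else "none")  # e.g. ['#0']
</pre>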

− 2. experiment 2
− 2.1 pre-work:
− 2.1.1 build the jsgf file
− Extract an address list from the corpus, sort and count it, uniformly sample 490 addresses from those that
− appear no more than 10 times in the corpus, and finally add 10 addresses that do not appear in the corpus
− at all (a sampling sketch follows the examples below).

− some samples of the 490 addresses:
− 黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县
− some samples of the 10 addresses:
− 上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、BeiJing市 海淀区 清华大学、明斯克、摩纳哥
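
The word-list construction in 2.1.1 can be sketched as below. This is an illustration under assumed file names, not the actual script, and the equal-interval sampling is simplified to a fixed stride over the frequency-sorted list.

<pre>
# Build the <address> entry list: 490 rare corpus addresses sampled at
# roughly equal intervals plus 10 addresses unseen in the corpus.
# File names and the unseen-address list are assumptions.
from collections import Counter

def build_jsgf_addresses(address_file, unseen_addresses, n_sample=490, max_count=10):
    # one extracted address per line
    with open(address_file, encoding="utf-8") as f:
        counts = Counter(line.strip() for line in f if line.strip())
    # keep only addresses that appear no more than max_count times, sorted by frequency
    rare = sorted((a for a, c in counts.items() if c <= max_count),
                  key=lambda a: counts[a])
    # equal-interval sampling over the sorted list
    step = max(len(rare) // n_sample, 1)
    sampled = rare[::step][:n_sample]
    # finally add the addresses that never appear in the corpus
    return sampled + list(unseen_addresses)

if __name__ == "__main__":
    entries = build_jsgf_addresses("address_list.txt", ["布鲁塞尔", "圣马力诺", "明斯克"])
    print(len(entries), "entries for the <address> rule")
</pre>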

− 2.1.2 construct a new test set named "test_address_tag"; its composition is as follows:
− The 120 test texts contain place names of three kinds:
− place names that are frequent in the training corpus (more than 10 occurrences) but are not in the jsgf
− (30 texts, sampled at equal intervals by the place name's frequency in the training corpus)
− the first kind of jsgf place names: fewer than 10 occurrences in the training corpus (40 texts, sampled at
− equal intervals by frequency in the training corpus)
− the second kind of jsgf place names: not seen in the training corpus at all (50 texts, 5 test texts per place name)
− Each of the 120 texts is recorded twice (by different speakers), 240 recordings in total; 12 speakers, 20 recordings each.

− 2.2 baseline
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
− am: mdl_1400
− test set: test_address_tag
− result:
− %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]
− %SER 73.33 [ 176 / 240 ]

− 2.3 address tag
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with address-tagged sentences added
− am: mdl_1400
− test set: test_address_tag
− weight: 1
− %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]
− %SER 69.17 [ 166 / 240 ]
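
For reference, the %WER figures above follow the usual definition WER = (ins + del + sub) / (number of reference words). A quick arithmetic check of the reported numbers and of the relative improvement of the tag LM over the baseline:

<pre>
# Recompute the reported %WER values from their insertion/deletion/substitution
# counts and derive the relative reduction (numbers taken from the results above).

def wer(ins, dele, sub, n_ref):
    return 100.0 * (ins + dele + sub) / n_ref

baseline = wer(189, 354, 305, 4104)   # -> 20.66  (848 / 4104)
with_tag = wer(169, 291, 196, 4104)   # -> 15.98  (656 / 4104)
rel_reduction = 100.0 * (baseline - with_tag) / baseline

print(f"baseline WER = {baseline:.2f}%, tag-LM WER = {with_tag:.2f}%")
print(f"relative WER reduction = {rel_reduction:.1f}%")  # about 22.6%
</pre>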