(5 intermediate revisions by the same user not shown)

Line 2:
* build a new jsgf file
* construct a test set for address tag language model
− * conduct a new experiment, result in
+ * conduct a new experiment; the result is below

=== Planned for next week ===
+ * check the relation between the merge weight and the size of the dict.
+ * short terms should be penalized.
+ * write a summary of tag-lm.
+ * read some papers about knowledge vectors.

− === Result ===
− 1. experiment 1
− 1.1 baseline
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
− am: mdl_v3.0.S
− test set: test_BJYD
− result:
− %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
− %SER 93.20 [ 1096 / 1176 ]
− BeiJing: 6 / 10 (the BJYD reference text contains 10 occurrences of "BeiJing"; 6 of the 10 were decoded; the counting is sketched below)
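
The "BeiJing: x / 10" numbers count how many of the 10 reference occurrences of the keyword also appear in the decoded hypotheses. A minimal counting sketch, not the actual scoring script; the Kaldi-style "utt-id word word ..." text files and the paths are assumptions:

<pre>
# Count how many reference occurrences of a keyword were decoded.
# Assumes Kaldi-style "utt-id word word ..." files; paths are hypothetical.

def load_text(path):
    """Map utterance id -> list of words."""
    utts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                utts[parts[0]] = parts[1:]
    return utts

def keyword_recall(ref_path, hyp_path, keyword="BeiJing"):
    ref, hyp = load_text(ref_path), load_text(hyp_path)
    decoded = total = 0
    for utt, words in ref.items():
        n_ref = words.count(keyword)
        n_hyp = hyp.get(utt, []).count(keyword)
        total += n_ref
        decoded += min(n_ref, n_hyp)  # count at most as many as the reference has
    return decoded, total

if __name__ == "__main__":
    hit, total = keyword_recall("test_BJYD/text", "decode_test_BJYD/hyp.txt")
    print(f"BeiJing: {hit} / {total}")
</pre>
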
− 1.2 use address tag:
− jsgf: extract the top 500 most frequent addresses (including "BeiJing") from the corpus
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with sentences containing "BeiJing" removed;
− tagged sentences are then added to the corpus (e.g. if "清华大学" is in the jsgf and a corpus sentence is
− "我 在 清华大学 上课", the sentence "我 在 <address> 上课" is also added; this substitution is sketched after this block)
− am: mdl_v3.0.S
− test set: test_BJYD
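
The tagging step in 1.2 is roughly the substitution below. This is a minimal sketch for illustration, not the actual preprocessing script; the file names and the address list passed in are assumptions, and the removal of the "BeiJing" sentences is not shown.

<pre>
# For every corpus sentence containing an address from the jsgf list,
# append an extra copy with the address replaced by the "<address>" tag.
# File names and the address list are assumptions for illustration.

def add_tagged_sentences(corpus_path, out_path, addresses, tag="<address>"):
    addr_set = set(addresses)
    with open(corpus_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            words = line.split()
            fout.write(line)                        # keep the original sentence
            if any(w in addr_set for w in words):   # sentence mentions a jsgf address
                tagged = [tag if w in addr_set else w for w in words]
                fout.write(" ".join(tagged) + "\n")

if __name__ == "__main__":
    # e.g. "我 在 清华大学 上课" -> also writes "我 在 <address> 上课"
    add_tagged_sentences("corpus.txt", "corpus_tagged.txt", ["清华大学", "BeiJing"])
</pre>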

− Different merge weights were tried; the results are as follows:
− weight: 0.1
− %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]
− %SER 94.98 [ 1117 / 1176 ]
− BeiJing: 4 / 10

− weight: 0.5
− %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]
− %SER 93.88 [ 1104 / 1176 ]
− BeiJing: 4 / 10

− weight: 1
− %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]
− %SER 93.28 [ 1097 / 1176 ]
− BeiJing: 2 / 10

− weight: 2
− %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]
− %SER 93.71 [ 1102 / 1176 ]
− BeiJing: 1 / 10

− weight: 3
− can't decode "BeiJing"

− ----
− This weekend I found two mistakes in experiment 1:
− 1. I used run_decode.sh incorrectly: I copied the script from xiaoxi's directory to my own directory and
− ran it there, which led to a higher WER.
− 2. One step of building the merged lexicon FST was wrong (in experiment 1.2). Merging grammar_G.fst and
− lm_G.fst generates a new sym.txt and a new lexicon; the new sym.txt already contains a "#0" at the end of
− the file, and format_lm.sh then uses this sym.txt to generate words.txt and appends another "#0" to its end,
− so words.txt ends up with two "#0" entries, which leads to wrong results. Under this condition, whenever the
− decoding result contains the TAG it is always truncated, which explains why the deletion errors are so high
− when the merge weight is small in experiment 1.2 (a quick check for this is sketched below).
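
A quick consistency check for the second mistake is to scan words.txt for duplicated symbols (such as a second "#0") before building the decoding graph. A minimal sketch; the words.txt path is an assumption:

<pre>
# A Kaldi words.txt symbol table must not contain duplicate symbols,
# e.g. two "#0" entries. The path below is hypothetical.
from collections import Counter

def find_duplicate_symbols(words_txt):
    """Return symbols that occur more than once in a 'symbol id' table."""
    counts = Counter()
    with open(words_txt, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return [sym for sym, n in counts.items() if n > 1]

if __name__ == "__main__":
    dups = find_duplicate_symbols("data/lang_merged/words.txt")
    print("duplicate symbols:", dups if dups else "none")  # e.g. ['#0']
</pre>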

− 2. experiment 2
− 2.1 pre-work:
− 2.1.1 build the jsgf file
− Extract an address list from the corpus, sort and count it, uniformly sample 490 addresses from those that
− appear no more than 10 times in the corpus, and finally add 10 addresses that do not appear in the corpus
− at all (a sampling sketch follows the examples below).

− some samples of the 490 addresses:
− 黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县
− some samples of the 10 addresses:
− 上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、BeiJing市 海淀区 清华大学、明斯克、摩纳哥
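
The word-list construction in 2.1.1 can be sketched as below. This is an illustration under assumed file names, not the actual script, and the equal-interval sampling is simplified to a fixed stride over the frequency-sorted list.

<pre>
# Build the <address> entry list: 490 rare corpus addresses sampled at
# roughly equal intervals plus 10 addresses unseen in the corpus.
# File names and the unseen-address list are assumptions.
from collections import Counter

def build_jsgf_addresses(address_file, unseen_addresses, n_sample=490, max_count=10):
    # one extracted address per line
    with open(address_file, encoding="utf-8") as f:
        counts = Counter(line.strip() for line in f if line.strip())
    # keep only addresses that appear no more than max_count times, sorted by frequency
    rare = sorted((a for a, c in counts.items() if c <= max_count),
                  key=lambda a: counts[a])
    # equal-interval sampling over the sorted list
    step = max(len(rare) // n_sample, 1)
    sampled = rare[::step][:n_sample]
    # finally add the addresses that never appear in the corpus
    return sampled + list(unseen_addresses)

if __name__ == "__main__":
    entries = build_jsgf_addresses("address_list.txt", ["布鲁塞尔", "圣马力诺", "明斯克"])
    print(len(entries), "entries for the <address> rule")
</pre>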

− 2.1.2 construct a new test set named "test_address_tag"; its composition is as follows:
− The 120 test texts contain place names of three kinds:
− place names that are frequent in the training corpus (more than 10 occurrences) but are not in the jsgf
− (30 texts, sampled at equal intervals by the place name's frequency in the training corpus)
− the first kind of jsgf place names: fewer than 10 occurrences in the training corpus (40 texts, sampled at
− equal intervals by frequency in the training corpus)
− the second kind of jsgf place names: not seen in the training corpus at all (50 texts, 5 test texts per place name)
− Each of the 120 texts is recorded twice (by different speakers), 240 recordings in total; 12 speakers, 20 recordings each.

− 2.2 baseline
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
− am: mdl_1400
− test set: test_address_tag
− result:
− %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]
− %SER 73.33 [ 176 / 240 ]

− 2.3 address tag
− corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with address-tagged sentences added
− am: mdl_1400
− test set: test_address_tag
− weight: 1
− %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]
− %SER 69.17 [ 166 / 240 ]
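
For reference, the %WER figures above follow the usual definition WER = (ins + del + sub) / (number of reference words). A quick arithmetic check of the reported numbers and of the relative improvement of the tag LM over the baseline:

<pre>
# Recompute the reported %WER values from their insertion/deletion/substitution
# counts and derive the relative reduction (numbers taken from the results above).

def wer(ins, dele, sub, n_ref):
    return 100.0 * (ins + dele + sub) / n_ref

baseline = wer(189, 354, 305, 4104)   # -> 20.66  (848 / 4104)
with_tag = wer(169, 291, 196, 4104)   # -> 15.98  (656 / 4104)
rel_reduction = 100.0 * (baseline - with_tag) / baseline

print(f"baseline WER = {baseline:.2f}%, tag-LM WER = {with_tag:.2f}%")
print(f"relative WER reduction = {rel_reduction:.1f}%")  # about 22.6%
</pre>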