11-16 Bin Yuan
Accomplished this week
- built a new JSGF file
- constructed a test set for the address-tag language model
- conducted a new experiment; results are given below
Planned for next week
Result
1. experiment 1
1.1 baseline
corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
am: mdl_v3.0.S
test set: test_BJYD
result: %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ] %SER 93.20 [ 1096 / 1176 ]
BeiJing: 6 / 10 (the test_BJYD transcripts contain 10 occurrences of "BeiJing"; 6 of the 10 are decoded correctly)
1.2 address tag
jsgf: extract the top 500 most frequent addresses (including "BeiJing") from the corpus
corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt; remove the sentences containing "BeiJing", then add tagged copies to the corpus (e.g. if "清华大学" is in the JSGF and the corpus contains the sentence "我 在 清华大学 上课", add the sentence "我 在 <address> 上课" to the corpus; a tagging sketch is given after the results below)
am: mdl_v3.0.S
test set: test_BJYD
Tried different merge weights; the results are as follows:
weight: 0.1 %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ] %SER 94.98 [ 1117 / 1176 ] BeiJing: 4 / 10
weight: 0.5 %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ] %SER 93.88 [ 1104 / 1176 ] BeiJing: 4 / 10
weight: 1 %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ] %SER 93.28 [ 1097 / 1176 ] BeiJing: 2 / 10
weight: 2 %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ] %SER 93.71 [ 1102 / 1176 ] BeiJing: 1 / 10
weight: 3 "BeiJing" could not be decoded
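To make the tagging step in 1.2 concrete, below is a minimal sketch of how the tagged copies could be generated. The file names and the assumption that each JSGF address is a single corpus token are mine, not the actual scripts; removing the "BeiJing" sentences is a separate step and is not shown.

# Minimal sketch (my own, not the actual script): add <address>-tagged copies
# of corpus sentences that contain a JSGF address. Assumes the corpus is
# whitespace-tokenized and every JSGF address is a single token; multi-word
# addresses would need phrase matching instead.

def load_addresses(path):
    # hypothetical file: one address per line, taken from the JSGF rule body
    with open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())

def tag_corpus(corpus_path, addresses, out_path, tag="<address>"):
    with open(corpus_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            words = line.split()
            fout.write(line)                                   # keep the original sentence
            if any(w in addresses for w in words):
                tagged = [tag if w in addresses else w for w in words]
                fout.write(" ".join(tagged) + "\n")            # add the tagged copy

if __name__ == "__main__":
    addrs = load_addresses("top500_address.txt")               # hypothetical file name
    tag_corpus("corpus_no_BeiJing.txt", addrs, "corpus_tagged.txt")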
This weekend I found two mistakes in experiment 1:
1. run_decode.sh was used incorrectly. I copied the script from xiaoxi's directory to my own directory and ran it there, which led to a higher WER.
2. One step of building the merged lexicon FST was wrong (in experiment 1.2). Merging grammar_G.fst and lm_G.fst generates a new sym.txt and a new lexicon; the new sym.txt already contains a "#0" at the end of the file, and format_lm.sh then uses this sym.txt to generate words.txt and appends another "#0", so words.txt ends up with two "#0" entries and produces wrong results. Under this condition, whenever the decoding result contained the TAG it was truncated, which explains why the deletion error is high when the merge weight is small in experiment 1.2.
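As a quick guard against the second mistake, a small check like the one below (my own helper, not a Kaldi tool; the words.txt path is an assumption) can verify that no symbol, in particular "#0", occurs twice in words.txt.

# Sanity check for duplicate symbols in words.txt (e.g. a second "#0"
# appended by format_lm.sh on top of the one already in sym.txt).
from collections import Counter

def find_duplicate_symbols(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                counts[parts[0]] += 1
    return {sym: n for sym, n in counts.items() if n > 1}

dups = find_duplicate_symbols("data/lang/words.txt")   # path is an assumption
print("duplicate symbols:", dups if dups else "none")  # {'#0': 2} reproduces the bug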
2. experiment 2
2.1 pre-work
2.1.1 build the JSGF file
Extract an address list from the corpus, count and sort the addresses by frequency, uniformly sample 490 addresses from those that appear no more than 10 times in the corpus, and finally add 10 addresses that do not appear in the corpus at all (a selection sketch is given after the samples below).
some samples of the 490 addresses: 黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县
some samples of the 10 addresses: 上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、BeiJing市 海淀区 清华大学、明斯克、摩纳哥
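A minimal sketch of this selection step follows. The input format (one "count address" pair per line), the file names, and the exact JSGF layout are assumptions; only the selection logic (490 uniformly sampled rare addresses plus 10 unseen ones) follows the description above.

# Hypothetical sketch of building the JSGF address rule.
def load_counts(path):
    # assumed format: "<count> <address>" per line; the address may contain spaces
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cnt, addr = line.strip().split(maxsplit=1)
            counts[addr] = int(cnt)
    return counts

def uniform_sample(items, k):
    # equal-interval sampling from an ordered list
    step = len(items) / float(k)
    return [items[int(i * step)] for i in range(k)]

def write_jsgf(addresses, path):
    with open(path, "w", encoding="utf-8") as f:
        f.write("#JSGF V1.0;\n")
        f.write("grammar address;\n")
        f.write("public <address> = " + " | ".join(addresses) + ";\n")

counts = load_counts("address_count.txt")                               # hypothetical file
rare = [a for a, c in sorted(counts.items(), key=lambda kv: kv[1]) if c <= 10]
unseen = ["布鲁塞尔", "阿姆斯特丹", "BeiJing市 海淀区 清华大学"]          # 10 addresses in the real list
write_jsgf(uniform_sample(rare, 490) + unseen, "address.jsgf")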
2.1.2 construct a new test set named "test_address_tag"; its composition is as follows. The addresses contained in the 120 test texts fall into three categories:
- addresses that are frequent in the training corpus (appearing more than 10 times) and not in the JSGF (30 texts, sampled at equal intervals over the addresses' frequency in the training corpus)
- addresses of the first kind in the JSGF: appearing fewer than 10 times in the training corpus (40 texts, sampled at equal intervals over the addresses' frequency in the training corpus)
- addresses of the second kind in the JSGF: not appearing in the training corpus at all (50 texts, 5 test texts per address)
Each of the 120 texts is recorded twice, by two different speakers, giving 240 recordings in total; 12 speakers recorded 20 utterances each.
2.2 baseline
corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
am: mdl_1400
test set: test_address_tag
result: %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ] %SER 73.33 [ 176 / 240 ]
2.3 address tag
corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with tagged copies added to the corpus
am: mdl_1400
test set: test_address_tag
weight: 1
result: %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ] %SER 69.17 [ 166 / 240 ]