11-16 Bin Yuan

From cslt Wiki
Revision as of 05:11, 18 November 2014 (Tuesday) by Yuanb

Accomplished this week

  • Built a new JSGF file.
  • Constructed a test set for the address-tag language model.
  • Ran new experiments; the results are below.

Planned for next week

  • Check the relation between the merge weight and the size of the dictionary.
  • Short terms should be penalized.
  • Write a summary of the tag-lm work.
  • Read some papers on knowledge vectors.

Result

1. experiment 1

 1.1 baseline
   corpus:BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
   am: mdl_v3.0.S
   test set: test_BJYD
   result:
     %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
     %SER 93.20 [ 1096 / 1176 ]
     北京: 6 / 10 (the text of the BJYD test set contains 10 occurrences of "北京"; 6 of the 10 were decoded)
 1.2 use address tag:
   jsgf: the 500 most frequent addresses (including "北京") extracted from the corpus
   corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with sentences containing "北京" removed
     and tagged sentences added (e.g. if "清华大学" is in the jsgf and the corpus contains the sentence
     "我 在 清华大学 上课", then the sentence "我 在 <address> 上课" is added to the corpus)
   am: mdl_v3.0.S
   test set: test_BJYD
   results with different merge weights:
     weight: 0.1 
       %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]
       %SER 94.98 [ 1117 / 1176 ]
       北京: 4 / 10
     weight: 0.5
       %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]
       %SER 93.88 [ 1104 / 1176 ]
       北京: 4 / 10
     weight: 1
       %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]
       %SER 93.28 [ 1097 / 1176 ]
       北京: 2 / 10
     weight: 2
       %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]
       %SER 93.71 [ 1102 / 1176 ]
       北京: 1 / 10
     weight: 3
       "北京" could not be decoded
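
The tagging step described in 1.2 can be sketched roughly as follows. This is a minimal Python illustration, not the script actually used: the function name add_tagged_sentences is hypothetical, it assumes every address is a single space-separated token (multi-word addresses would need phrase matching), and it does not cover the removal of sentences containing "北京".

```python
def add_tagged_sentences(sentences, addresses, tag="<address>"):
    """Keep every sentence and, for each sentence that contains an address
    from the list, also add a copy with the address replaced by the tag.
    Hypothetical helper; assumes single-token addresses."""
    address_set = set(addresses)
    augmented = []
    for sentence in sentences:
        augmented.append(sentence)
        words = sentence.split()
        tagged = [tag if word in address_set else word for word in words]
        if tagged != words:  # at least one address was replaced
            augmented.append(" ".join(tagged))
    return augmented


# Example from 1.2; the second sentence is made up for illustration.
corpus = ["我 在 清华大学 上课", "今天 天气 很 好"]
augmented = add_tagged_sentences(corpus, ["清华大学"])
for line in augmented:
    print(line)
```

The original sentence is kept next to its tagged copy, which matches the "add a sentence" wording in 1.2.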

Over the weekend I found two mistakes in experiment 1:

   1. run_decode.sh was used incorrectly. I copied the script from xiaoxi's directory into my own directory
      and ran it there, which led to a higher WER.
   2. One step in building the merged lexicon FST was wrong (in experiment 1.2). Merging grammar_G.fst and lm_G.fst
      generates a new sym.txt and a new lexicon. The new sym.txt already ends with "#0", and format_lm.sh
      uses this sym.txt to generate words.txt and appends another "#0" to the end of it, so words.txt
      contains "#0" twice, which leads to wrong results. In this condition, whenever the decoding result
      contained a TAG, it was truncated. This explains why the deletion error is high when the merge
      weight is small in experiment 1.2.
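
A quick check for the duplicated "#0" problem could look like the sketch below. It assumes words.txt uses the usual one "symbol id" pair per line layout; the function name duplicate_symbols and the sample table are made up for illustration and are not part of the actual scripts.

```python
from collections import Counter


def duplicate_symbols(lines):
    """Return the symbols that occur more than once in a 'symbol id' table,
    in first-seen order. Blank lines are ignored."""
    symbols = [line.split()[0] for line in lines if line.strip()]
    counts = Counter(symbols)
    return [symbol for symbol in counts if counts[symbol] > 1]


# Made-up table with "#0" appended twice, as described in mistake 2.
words_txt = ["<eps> 0", "我 1", "<address> 2", "#0 3", "#0 4"]
dups = duplicate_symbols(words_txt)
print(dups)
```

Running such a check on words.txt before compiling the graph would flag the second "#0" early.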

2. experiment 2

 2.1 pre-work:
   2.1.1 build jsgf file
     Extract an address list from the corpus, then sort and count it. Uniformly sample 490 addresses
     from those that appear no more than 10 times in the corpus, and finally add 10 addresses that do not
     appear in the corpus.
     some samples of the 490 addresses:
       黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县
     some samples of the 10 addresses:
       上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、北京市 海淀区 清华大学、明斯克、摩纳哥
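
The selection in 2.1.1 can be sketched as below. This is only one possible reading of "uniformly sample": rare addresses are sorted by occurrence count and picked at evenly spaced positions. The function name sample_rare_addresses and the counts in the example are hypothetical.

```python
from collections import Counter


def sample_rare_addresses(address_occurrences, max_count, n):
    """Count addresses, keep those seen at most max_count times, sort them by
    count (ties broken by name), and pick n of them at evenly spaced positions."""
    counts = Counter(address_occurrences)
    rare = sorted(
        (addr for addr, c in counts.items() if c <= max_count),
        key=lambda addr: (counts[addr], addr),
    )
    if n >= len(rare):
        return rare
    step = len(rare) / n  # step > 1 here, so the picked indices are distinct
    return [rare[int(i * step)] for i in range(n)]


# Made-up occurrence counts: "a" x1, "b" x2, "c" x3, "d" x4, "e" x20.
occurrences = ["a"] * 1 + ["b"] * 2 + ["c"] * 3 + ["d"] * 4 + ["e"] * 20
picked = sample_rare_addresses(occurrences, max_count=10, n=2)
print(picked)
```

In the setting above this would be called with max_count=10 and n=490, and the 10 unseen addresses would be appended by hand afterwards.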
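   2.1.2 construct a new test set named "test_address_tag", composed as follows:
     The 120 texts in the test set contain three kinds of addresses:
       addresses that are frequent in the training corpus (more than 10 occurrences) and not in the jsgf (30 texts, sampled at equal intervals by the address's occurrence count in the training corpus)
       first kind of jsgf address: appears fewer than 10 times in the training corpus (40 texts, sampled at equal intervals by the address's occurrence count in the training corpus)
       second kind of jsgf address: never appears in the training corpus (50 texts, 5 test samples per address)
     Each of the 120 texts was recorded twice (by different speakers), giving 240 audio files in total; 12 speakers, 20 recordings each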
 2.2 baseline
   corpus:BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
   am: mdl_1400
   test set: test_address_tag
   result:
     %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]
     %SER 73.33 [ 176 / 240 ]
     %ADD_ER[ 6 / 30, 16 / 40, 32 / 50 ]
 2.3 address tag
   corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with address tags added to the corpus
   am: mdl_1400
   test set: test_address_tag
   weight: 1
     %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]
     %SER 69.17 [ 166 / 240 ]
     %ADD_ER[ 6 / 30, 8 / 40, 12 / 50 ]
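
To make the comparison between 2.2 and 2.3 explicit, the short Python sketch below recomputes the WER from the reported error counts and derives the relative WER reduction and the per-category address error rates. It assumes that the three ADD_ER fractions correspond, in order, to the three address categories of 2.1.2; the helper name wer is hypothetical.

```python
def wer(ins, dele, sub, ref_words):
    """Word error rate in percent: (insertions + deletions + substitutions) / reference words."""
    return 100.0 * (ins + dele + sub) / ref_words


# Counts copied from the result lines of 2.2 (baseline) and 2.3 (address tag, weight 1).
baseline_wer = wer(189, 354, 305, 4104)
tagged_wer = wer(169, 291, 196, 4104)
relative_reduction = 100.0 * (baseline_wer - tagged_wer) / baseline_wer

# ADD_ER fractions as (errors, total) per address category.
baseline_add = [(6, 30), (16, 40), (32, 50)]
tagged_add = [(6, 30), (8, 40), (12, 50)]
baseline_add_er = [100.0 * e / n for e, n in baseline_add]
tagged_add_er = [100.0 * e / n for e, n in tagged_add]

print(f"WER: {baseline_wer:.2f} -> {tagged_wer:.2f} ({relative_reduction:.2f}% relative reduction)")
print("ADD_ER baseline:", [round(r, 1) for r in baseline_add_er])
print("ADD_ER tagged:  ", [round(r, 1) for r in tagged_add_er])
```

With these counts the WER drops from 20.66 to 15.98 (about 22.6% relative), and the address error rate is unchanged for the first category while falling from 40% to 20% and from 64% to 24% for the two jsgf categories.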