<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://index.cslt.org/mediawiki/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="zh-cn">
		<id>http://index.cslt.org/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Yuanb</id>
		<title>cslt Wiki - User contributions [zh-cn]</title>
		<link rel="self" type="application/atom+xml" href="http://index.cslt.org/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Yuanb"/>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E7%89%B9%E6%AE%8A:%E7%94%A8%E6%88%B7%E8%B4%A1%E7%8C%AE/Yuanb"/>
		<updated>2026-04-07T01:41:32Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.23.3</generator>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2015-spring-time-table</id>
		<title>2015-spring-time-table</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2015-spring-time-table"/>
				<updated>2015-02-05T06:40:23Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb:&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;2015 Spring Festival schedule&lt;br /&gt;
&lt;br /&gt;
{|class='wikitable'&lt;br /&gt;
!Person !! Spring Festival departure-return dates !! Days&lt;br /&gt;
|-&lt;br /&gt;
|张之勇  || Feb 13 - Feb 26  || 14&lt;br /&gt;
|-&lt;br /&gt;
|赵梦原  || Feb 14 - Feb 27  || 14&lt;br /&gt;
|-&lt;br /&gt;
|张雪薇  || Feb 18 - Feb 28  || 11&lt;br /&gt;
|-&lt;br /&gt;
|王晓曦  || Feb 15 - Feb 28  || 14&lt;br /&gt;
|-&lt;br /&gt;
|刘荣    || Feb 14 - Feb 26  || 13&lt;br /&gt;
|-&lt;br /&gt;
|骆天一  || Feb 12 - Feb 25  || 14&lt;br /&gt;
|-&lt;br /&gt;
|刘超    || Feb 6 - Feb 11, Spring Festival dates undecided  || ?&lt;br /&gt;
|-&lt;br /&gt;
|殷实    || Feb 12 - Mar 1  || 18&lt;br /&gt;
|-&lt;br /&gt;
|林一叶  || Feb 14 - undecided  || ?&lt;br /&gt;
|-&lt;br /&gt;
|王冕    || Feb 10 - Feb 28  || 19&lt;br /&gt;
|-&lt;br /&gt;
|邢超    || Feb 9 - Feb 27  || 19&lt;br /&gt;
|-&lt;br /&gt;
|袁彬    || Feb 10 - Feb 28  || 19&lt;br /&gt;
|-&lt;br /&gt;
|张东旭  || Feb 10 - Feb 28  || 19&lt;br /&gt;
|-&lt;br /&gt;
|马习    || Feb 12 - Feb 26  || 15&lt;br /&gt;
|-&lt;br /&gt;
|曹立    || - Mar 3  || ?&lt;br /&gt;
|-&lt;br /&gt;
|曾翔宇  || Feb 11 - Feb 28  || 18&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note: standard holiday length: engineers 14 days, students 17 days&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-02-03</id>
		<title>Bin Yuan 2015-02-03</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-02-03"/>
				<updated>2015-02-03T03:41:36Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "last week done:     1. construct the large-scale training data for knowledge vector     2. prepare to submit the MTAP paper     3. finish the tag lm toolkit  plan fo..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last week done:&lt;br /&gt;
    1. construct the large-scale training data for knowledge vector&lt;br /&gt;
    2. prepare to submit the MTAP paper&lt;br /&gt;
    3. finish the tag lm toolkit&lt;br /&gt;
&lt;br /&gt;
plan for this week:&lt;br /&gt;
    1. conduct paragraph vector experiment&lt;br /&gt;
    2. review papers for knowledge vector&lt;br /&gt;
    3. start knowledge vector draft&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2015-02-02</id>
		<title>2015-02-02</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2015-02-02"/>
				<updated>2015-02-03T03:35:13Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Mengyuan Zhao 2015-02-02]]&lt;br /&gt;
&lt;br /&gt;
[[Dongxu Zhang 2014-02-02]]&lt;br /&gt;
&lt;br /&gt;
[[Chao Liu 2014-02-02]]&lt;br /&gt;
&lt;br /&gt;
[[Miao Fan 2014-02-02]]&lt;br /&gt;
&lt;br /&gt;
[[Fanhu Bie 2014-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Lantian Li 2015-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Yiye Lin 2015-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Tianyi Luo 2015-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Xiangyu Zeng 2015-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Xuewei Zhang 2015-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Xiaoxi Wang 2015-02-03]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 2015-02-03]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-01-26</id>
		<title>Bin Yuan 2015-01-26</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-01-26"/>
				<updated>2015-01-26T07:20:23Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * try the hinge loss function for tree-based knowledge vector learning, and improve the correlation score from 0.5 to 0.8 * try the si..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* try the hinge loss function for tree-based knowledge vector learning, and improve the correlation score from 0.5 to 0.8&lt;br /&gt;
* try the sigmoid of the inner product as the similarity function; the result is comparable with the plain inner product&lt;br /&gt;
* finish the taglm paper&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* make a larger test set&lt;br /&gt;
* use gensim toolkit to implement paragraph vector as a baseline&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2015-01-26</id>
		<title>2015-01-26</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2015-01-26"/>
				<updated>2015-01-26T07:06:38Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Lantian Li 2015-01-26]]&lt;br /&gt;
&lt;br /&gt;
[[Rong Liu 2015-01-26]]&lt;br /&gt;
&lt;br /&gt;
[[Zhongda Xie 2015-01-26]]&lt;br /&gt;
&lt;br /&gt;
[[Miao Fan 2015-01-26]]&lt;br /&gt;
&lt;br /&gt;
[[Xiaoxi Wang 2015-01-26]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 2015-01-26]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-01-19</id>
		<title>Bin Yuan 2015-01-19</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-01-19"/>
				<updated>2015-01-19T02:21:57Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * replace wikipedia's taxonomy with Yago's taxonomy. === Plan for next week === * alter the objective function."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* replace wikipedia's taxonomy with Yago's taxonomy.&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* alter the objective function.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2015-01-19</id>
		<title>2015-01-19</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2015-01-19"/>
				<updated>2015-01-19T02:15:08Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Miao Fan 2015-01-17]]&lt;br /&gt;
&lt;br /&gt;
[[2015-01-19 Rong Liu]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 2015-01-19]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-01-12</id>
		<title>Bin Yuan 2015-01-12</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2015-01-12"/>
				<updated>2015-01-12T06:20:18Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * implement multi-thread training for large wiki graph === Plan for next week === * continue knowledge vector."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* implement multi-thread training for large wiki graph&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* continue knowledge vector.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2015-01-12</id>
		<title>2015-01-12</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2015-01-12"/>
				<updated>2015-01-12T06:15:26Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Miao Fan 2015-01-12]]&lt;br /&gt;
&lt;br /&gt;
[[Xiaoxi Wang 2015-01-12]]&lt;br /&gt;
&lt;br /&gt;
[[Rong Liu 2015-01-12]]&lt;br /&gt;
&lt;br /&gt;
[[Mengyuan Zhao 2015-01-12]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 2015-01-12]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bi-monthly-2015-01</id>
		<title>Bi-monthly-2015-01</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bi-monthly-2015-01"/>
				<updated>2015-01-05T06:04:59Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[bi-monthly-result-2015-01|Result]]&lt;br /&gt;
&lt;br /&gt;
[[媒体文件:Bi-Monthly report.pdf|Dong Wang]]&lt;br /&gt;
&lt;br /&gt;
[[媒体文件:季度总结-刘荣.pdf|Rong Liu]]&lt;br /&gt;
&lt;br /&gt;
[[媒体文件:个人双月总结.pdf|Shi Yin]]&lt;br /&gt;
&lt;br /&gt;
[[媒体文件:张东旭10-11月总结.pdf|Dongxu Zhang]]&lt;br /&gt;
&lt;br /&gt;
[[媒体文件:曾翔宇双月总结.pdf|Xiangyu Zeng]]&lt;br /&gt;
&lt;br /&gt;
[[媒体文件:Bi-month report yuanbin pdf.pdf|Bin Yuan]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Bi-month_report_yuanbin_pdf.pdf</id>
		<title>文件:Bi-month report yuanbin pdf.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Bi-month_report_yuanbin_pdf.pdf"/>
				<updated>2015-01-05T06:02:41Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2014-12-28</id>
		<title>Bin Yuan 2014-12-28</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_2014-12-28"/>
				<updated>2014-12-29T01:09:30Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * separate different types of link, and assign different weight for training. * use entry abstract words for training. === Plan for ne..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* separate different types of links, and assign different weights for training.&lt;br /&gt;
* use entry abstract words for training.&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* continue knowledge vector, try more reasonable objective function.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-12-28</id>
		<title>2014-12-28</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-12-28"/>
				<updated>2014-12-29T00:59:24Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Miao Fan 2014-12-28]]&lt;br /&gt;
&lt;br /&gt;
[[Mengyuan Zhao 2014-12-28]]&lt;br /&gt;
&lt;br /&gt;
[[Rong Liu 2014-12-28]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 2014-12-28]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-21</id>
		<title>Bin Yuan 14-12-21</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-21"/>
				<updated>2014-12-22T01:32:24Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * change tag lm merge method to add, and conduct a experiment about hclg merge, result see http://cslt.riit.tsinghua.edu.cn/cgi-bin/..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* change the tag lm merge method to "add", and conduct an experiment on hclg merge; results: [[http://cslt.riit.tsinghua.edu.cn/cgi-bin/cvss/cvss_request.pl?account=yuanb&amp;amp;step=view_request&amp;amp;cvssid=304]]&lt;br /&gt;
* help zhiyong fix the merge bug.&lt;br /&gt;
* knowledge vector baseline done.&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* continue knowledge vector, use more information to train.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-12-21</id>
		<title>2014-12-21</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-12-21"/>
				<updated>2014-12-22T01:25:39Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[2014-12-19 Miao Fan]]&lt;br /&gt;
&lt;br /&gt;
[[Dongxu Zhang 14-12-21]]&lt;br /&gt;
&lt;br /&gt;
[[Rong Liu 14-12-21]]&lt;br /&gt;
&lt;br /&gt;
[[Xiaoxi Wang 14-12-21]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 14-12-21]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Taglm_draft.pdf</id>
		<title>文件:Taglm draft.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Taglm_draft.pdf"/>
				<updated>2014-12-22T01:24:07Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: Yuanb uploaded a new version of "文件:Taglm draft.pdf"&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;tag lm draft&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf</id>
		<title>文件:Tag lm report.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf"/>
				<updated>2014-12-15T01:18:34Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: Yuanb uploaded a new version of "文件:Tag lm report.pdf"&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-14</id>
		<title>Bin Yuan 14-12-14</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-14"/>
				<updated>2014-12-14T19:36:54Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：/* Accomplished this week */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* finish the tag lm technical report, see [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/d/d6/Taglm_draft.pdf]].&lt;br /&gt;
* find a bug in our code for merging the grammar with the lm; the cause is clear and a fix is pending.&lt;br /&gt;
* knowledge vector baseline setup mostly done; still need a task to evaluate the result.&lt;br /&gt;
&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* continue knowledge vector.&lt;br /&gt;
* fix the bug mentioned above.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-14</id>
		<title>Bin Yuan 14-12-14</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-14"/>
				<updated>2014-12-14T19:35:47Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * finish the tag lm technical report. * find a bug in our code for merging the grammar with lm, the reason is clear and to be fixed. *..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* finish the tag lm technical report.&lt;br /&gt;
* find a bug in our code for merging the grammar with the lm; the cause is clear and a fix is pending.&lt;br /&gt;
* knowledge vector baseline setup mostly done; still need a task to evaluate the result.&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* continue knowledge vector.&lt;br /&gt;
* fix the bug mentioned above.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-12-14</id>
		<title>2014-12-14</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-12-14"/>
				<updated>2014-12-14T19:26:01Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Fanhu bie 14-12-14]]&lt;br /&gt;
&lt;br /&gt;
[[Rong Liu 14-12-14]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 14-12-14]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Taglm_draft.pdf</id>
		<title>文件:Taglm draft.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Taglm_draft.pdf"/>
				<updated>2014-12-14T19:24:09Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：tag lm draft&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;tag lm draft&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf</id>
		<title>文件:Tag lm report.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf"/>
				<updated>2014-12-14T19:21:12Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: Yuanb uploaded a new version of "文件:Tag lm report.pdf"&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf</id>
		<title>文件:Tag lm report.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf"/>
				<updated>2014-12-14T19:19:35Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: Yuanb uploaded a new version of "文件:Tag lm report.pdf"&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf</id>
		<title>文件:Tag lm report.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf"/>
				<updated>2014-12-08T00:41:19Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: Yuanb uploaded a new version of "文件:Tag lm report.pdf"&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-07</id>
		<title>Bin Yuan 14-12-07</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-07"/>
				<updated>2014-12-08T00:37:06Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * knowledge vector build graph done, extract one specific domain entries(such as Category:Animals), now we can extract a subgraph give..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* knowledge vector graph building done; extracted the entries of one specific domain (such as Category:Animals); we can now extract a subgraph given a root entry, and find all paths from the root entry to a given leaf entry.&lt;br /&gt;
* update tag language model technical report, see [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f/f5/Tag_lm_report.pdf]]&lt;br /&gt;
* read one paper about merging both FSG and N-gram into a single decoding graph, see [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/3/3c/CustomizedASR_ICSR2012_CameraReady.pdf]]&lt;br /&gt;
=== Plan for next week ===&lt;br /&gt;
* integrate the experiment part with Xiaoxi's part.&lt;br /&gt;
* finish knowledge vector baseline set up.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:CustomizedASR_ICSR2012_CameraReady.pdf</id>
		<title>文件:CustomizedASR ICSR2012 CameraReady.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:CustomizedASR_ICSR2012_CameraReady.pdf"/>
				<updated>2014-12-08T00:36:27Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: tag-based language model using WFST&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;tag-based language model using WFST&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-12-07</id>
		<title>2014-12-07</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-12-07"/>
				<updated>2014-12-08T00:22:59Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Xiaoxi Wang 14-12-07]]&lt;br /&gt;
&lt;br /&gt;
[[Dongxu Zhang 14-12-07]]&lt;br /&gt;
&lt;br /&gt;
[[Xiangyu Zeng 14-12-07]]&lt;br /&gt;
&lt;br /&gt;
[[Miao Fan 14-12-07]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 14-12-07]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf</id>
		<title>文件:Tag lm report.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf"/>
				<updated>2014-12-01T01:31:34Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: Yuanb uploaded a new version of "文件:Tag lm report.pdf"&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-01</id>
		<title>Bin Yuan 14-12-01</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-01"/>
				<updated>2014-12-01T01:21:43Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：/* Accomplished this week */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* write the tag language model technical report, almost done; result: [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f/f5/Tag_lm_report.pdf]]&lt;br /&gt;
* knowledge vector tree building algorithm done.&lt;br /&gt;
&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
* write the experiment part of tag language model draft.&lt;br /&gt;
* set up knowledge vector baseline.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf</id>
		<title>文件:Tag lm report.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Tag_lm_report.pdf"/>
				<updated>2014-12-01T01:20:55Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-01</id>
		<title>Bin Yuan 14-12-01</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-12-01"/>
				<updated>2014-12-01T00:49:40Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * write tag language model technical report,almost done, result see [[]] * knowledge vector tree building algorithm done. === Planned..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* write the tag language model technical report, almost done; result: [[]]&lt;br /&gt;
* knowledge vector tree building algorithm done.&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
* write the experiment part of tag language model draft.&lt;br /&gt;
* set up knowledge vector baseline.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-12-01</id>
		<title>2014-12-01</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-12-01"/>
				<updated>2014-12-01T00:45:05Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Miao Fan 14-12-01]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 14-12-01]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-11-24</id>
		<title>Bin Yuan 14-11-24</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Bin_Yuan_14-11-24"/>
				<updated>2014-11-23T14:56:47Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "=== Accomplished this week === * do some experiments and find the relationship between optimal merge weight and jsgf address number. result see http://cslt.riit.ts..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* do some experiments and find the relationship between the optimal merge weight and the number of jsgf addresses; results: [[http://cslt.riit.tsinghua.edu.cn/cgi-bin/cvss/cvss_request.pl?account=yuanb&amp;amp;step=view_request&amp;amp;cvssid=304]]&lt;br /&gt;
* read some papers about wiki related information extraction and make a report.&lt;br /&gt;
* walk Zhenlong Han through the tag lm workflow.&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
* make a summary about tag-lm.&lt;br /&gt;
* read some paper about knowledge vector.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan</id>
		<title>11-16 Bin Yuan</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan"/>
				<updated>2014-11-23T14:51:20Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: replaced the content with "=== Accomplished this week === * build a new jsgf file * construct a test set for address tag language model * conduct a new experiment, result is as below  === Planned for n..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* build a new jsgf file&lt;br /&gt;
* construct a test set for address tag language model&lt;br /&gt;
* conduct a new experiment; the result is shown below&lt;br /&gt;
&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
* check the relation between the merge weight and the size of the dict.&lt;br /&gt;
* short terms should be penalized.&lt;br /&gt;
* make a summary about tag-lm.&lt;br /&gt;
* read some paper about knowledge vector.&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-11-24</id>
		<title>2014-11-24</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-11-24"/>
				<updated>2014-11-23T14:49:10Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Miao Fan 14-11-24]]&lt;br /&gt;
&lt;br /&gt;
[[Jun Wang 14-11-24]]&lt;br /&gt;
&lt;br /&gt;
[[Rong Liu 14-11-24]]&lt;br /&gt;
&lt;br /&gt;
[[Fanhu Bie 14-11-23]]&lt;br /&gt;
&lt;br /&gt;
[[Bin Yuan 14-11-24]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Text-2014-11-19</id>
		<title>Text-2014-11-19</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Text-2014-11-19"/>
				<updated>2014-11-20T04:03:30Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Abstract==&lt;br /&gt;
==Resources==&lt;br /&gt;
*ppt&lt;br /&gt;
:*about the knowledge vector discussion [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/c/c7/Knowledge_vector.ppt]]&lt;br /&gt;
:*about information extraction using Wikipedia [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/8/8c/Open_Information_Extraction_using_Wikipedia.pptx]]&lt;br /&gt;
*related paper&lt;br /&gt;
:*Open Information Extraction using Wikipedia [[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/6/6e/Open_information_extraction_using_Wikipedia.pdf]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Text-2014-11-19</id>
		<title>Text-2014-11-19</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Text-2014-11-19"/>
				<updated>2014-11-20T03:59:55Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: created the page with content "*ppt1http://cslt.riit.tsinghua.edu.cn/mediawiki/images/c/c7/Knowledge_vector.ppt *ppt2http://cslt.riit.tsinghua.edu.cn/mediawiki/images/8/8c/Open_Information_E..."&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;*ppt1[[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/c/c7/Knowledge_vector.ppt]]&lt;br /&gt;
*ppt2[[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/8/8c/Open_Information_Extraction_using_Wikipedia.pptx]]&lt;br /&gt;
*paper[[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/6/6e/Open_information_extraction_using_Wikipedia.pdf]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Open_information_extraction_using_Wikipedia.pdf</id>
		<title>文件:Open information extraction using Wikipedia.pdf</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Open_information_extraction_using_Wikipedia.pdf"/>
				<updated>2014-11-20T03:57:58Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Knowledge_vector.ppt</id>
		<title>文件:Knowledge vector.ppt</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Knowledge_vector.ppt"/>
				<updated>2014-11-20T03:53:08Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: knowledge vector discussion&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;knowledge vector discussion&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Open_Information_Extraction_using_Wikipedia.pptx</id>
		<title>文件:Open Information Extraction using Wikipedia.pptx</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/%E6%96%87%E4%BB%B6:Open_Information_Extraction_using_Wikipedia.pptx"/>
				<updated>2014-11-20T03:50:04Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb: PPT for open IE using Wikipedia&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PPT for open IE using Wikipedia&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/Meeting_minutes</id>
		<title>Meeting minutes</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/Meeting_minutes"/>
				<updated>2014-11-20T02:42:26Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[rules of report]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-08-19|2014-08-19]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-08-21|2014-08-21]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-08-22|2014-08-22]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-08-28|2014-08-28]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-09-25|2014-09-25]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-10-09|2014-10-09]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-10-18|2014-10-18]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-10-22|2014-10-22]]&lt;br /&gt;
&lt;br /&gt;
[[text-2014-11-19|2014-11-19]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan</id>
		<title>11-16 Bin Yuan</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan"/>
				<updated>2014-11-18T05:16:58Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：/* Result */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* build a new jsgf file&lt;br /&gt;
* construct a test set for the address-tag language model&lt;br /&gt;
* conduct a new experiment; the results are below&lt;br /&gt;
&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
* check the relation between the merge weight and the size of the dict&lt;br /&gt;
* add a penalty for short terms&lt;br /&gt;
* write a summary of the tag-LM approach&lt;br /&gt;
* read some papers about knowledge vectors&lt;br /&gt;
&lt;br /&gt;
=== Result ===&lt;br /&gt;
1. experiment 1&lt;br /&gt;
&lt;br /&gt;
  1.1 baseline&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt&lt;br /&gt;
    am: mdl_v3.0.S&lt;br /&gt;
    test set: test_BJYD&lt;br /&gt;
    result:&lt;br /&gt;
      %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]&lt;br /&gt;
      %SER 93.20 [ 1096 / 1176 ]&lt;br /&gt;
      北京: 6 / 10 (the test_BJYD transcripts contain 10 occurrences of &amp;quot;北京&amp;quot;; 6 of the 10 were decoded)&lt;br /&gt;
&lt;br /&gt;
  1.2 use address tag:&lt;br /&gt;
    jsgf: the 500 most frequent addresses (including &amp;quot;北京&amp;quot;) extracted from the corpus&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with sentences containing &amp;quot;北京&amp;quot; removed &lt;br /&gt;
      and tagged sentences added (e.g., if &amp;quot;清华大学&amp;quot; is in the jsgf and the corpus contains &amp;quot;我 在 清华大学 上课&amp;quot;, &lt;br /&gt;
      then the sentence &amp;quot;我 在 &amp;lt;address&amp;gt; 上课&amp;quot; is added to the corpus)&lt;br /&gt;
    am: mdl_v3.0.S&lt;br /&gt;
    test set: test_BJYD&lt;br /&gt;
&lt;br /&gt;
    Results for different merge weights:&lt;br /&gt;
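The corpus-tagging rule above can be sketched in a few lines of Python. This is a minimal illustration, not the actual preprocessing script: the longest-match ordering, the whitespace-tokenized sentences, and the function name are assumptions. The tag string is built from unicode escapes ("\u003c" and "\u003e" are the angle brackets) so the example stays plain text here.

```python
# Sketch of the corpus-tagging step: for each corpus sentence that
# contains a jsgf address, emit an extra copy with the address
# replaced by the tag (a hypothetical helper, not the real script).
TAG = "\u003caddress\u003e"  # decodes to the address tag used in the report

def tag_corpus(sentences, addresses):
    """Return tagged variants of sentences that contain a known address."""
    # Try longer addresses first, so a multi-word address wins over its prefix.
    addrs = sorted(addresses, key=len, reverse=True)
    tagged = []
    for sent in sentences:
        for addr in addrs:
            if addr in sent:
                tagged.append(sent.replace(addr, TAG))
                break  # one tagged copy per sentence is enough for this sketch
    return tagged

print(tag_corpus(["我 在 清华大学 上课", "今天 天气 不错"], ["清华大学", "北京"]))
```

In the real pipeline the tagged sentences are appended to the LM training text before estimation, alongside the untagged originals.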
      weight: 0.1 &lt;br /&gt;
        %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]&lt;br /&gt;
        %SER 94.98 [ 1117 / 1176 ]&lt;br /&gt;
        北京: 4 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 0.5&lt;br /&gt;
        %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]&lt;br /&gt;
        %SER 93.88 [ 1104 / 1176 ]&lt;br /&gt;
        北京: 4 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 1&lt;br /&gt;
        %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]&lt;br /&gt;
        %SER 93.28 [ 1097 / 1176 ]&lt;br /&gt;
        北京: 2 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 2&lt;br /&gt;
        %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]&lt;br /&gt;
        %SER 93.71 [ 1102 / 1176 ]&lt;br /&gt;
        北京: 1 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 3&lt;br /&gt;
        can't decode &amp;quot;北京&amp;quot;&lt;br /&gt;
&lt;br /&gt;
-------------------------------------------------------------------------------&lt;br /&gt;
This weekend I found two mistakes in experiment 1:&lt;br /&gt;
    1. run_decode.sh was used incorrectly: I copied the script from xiaoxi's directory to my own directory &lt;br /&gt;
      and ran it there, which led to a higher WER.&lt;br /&gt;
    2. One step of building the merged lexicon FST was wrong (in experiment 1.2). Merging grammar_G.fst and lm_G.fst &lt;br /&gt;
      generates a new sym.txt and a new lexicon; the new sym.txt contains a &amp;quot;#0&amp;quot; at the end of the file, &lt;br /&gt;
      and format_lm.sh uses this sym.txt to generate words.txt and appends another &amp;quot;#0&amp;quot; to it,&lt;br /&gt;
      so words.txt contains two &amp;quot;#0&amp;quot; entries, which leads to wrong results. Under this condition, whenever &lt;br /&gt;
      a decoding result contained the TAG it was truncated, which explains why the deletion error is&lt;br /&gt;
      so high when the merge weight is small in experiment 1.2.&lt;br /&gt;
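The duplicated-&quot;#0&quot; mistake can be caught with a quick sanity check on the symbol table before decoding. A minimal sketch, assuming the standard Kaldi words.txt layout of one &quot;symbol id&quot; pair per line; the helper name and example data are made up:

```python
# Sanity check for mistake 2: a Kaldi words.txt must map each symbol to
# exactly one id, so a duplicated "#0" line means the merge step went wrong.
from collections import Counter

def duplicated_symbols(lines):
    """Return symbols that appear more than once in a words.txt listing."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return [sym for sym, n in counts.items() if n > 1]

# Example: a broken table with "#0" listed twice, as described above.
broken = ["我 1", "在 2", "#0 3", "#0 4"]
print(duplicated_symbols(broken))
# prints ['#0']
```

Running a check like this on the merged words.txt right after format_lm.sh would have flagged the problem before any decoding run.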
&lt;br /&gt;
2. experiment 2&lt;br /&gt;
  2.1 pre-work:&lt;br /&gt;
    2.1.1 build the jsgf file&lt;br /&gt;
      Extract an address list from the corpus, sort and count it, then uniformly sample 490 addresses &lt;br /&gt;
      from those that appear no more than 10 times in the corpus, and finally add 10 addresses that do not&lt;br /&gt;
      appear in the corpus.&lt;br /&gt;
&lt;br /&gt;
      some samples of the 490 addresses:&lt;br /&gt;
        黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县&lt;br /&gt;
      some samples of the 10 addresses:&lt;br /&gt;
        上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、北京市 海淀区 清华大学、明斯克、摩纳哥&lt;br /&gt;
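The 2.1.1 selection procedure, counting addresses and sampling the rare ones at equal intervals along the frequency-sorted list, can be sketched as follows. The function name and example counts are hypothetical; only the procedure itself comes from the report:

```python
# Sketch of the jsgf address selection in 2.1.1: keep addresses that occur
# no more than 10 times in the corpus, sort them by count, and pick k of
# them at equal intervals along the sorted list.

def sample_rare_addresses(address_counts, k):
    """Pick k addresses, equally spaced over the count-sorted rare ones."""
    # "rare" here means: occurs no more than 10 times in the corpus
    rare = sorted((a for a, n in address_counts.items() if not n > 10),
                  key=lambda a: address_counts[a])
    if k >= len(rare):
        return rare
    step = len(rare) / k
    return [rare[int(i * step)] for i in range(k)]

counts = {"黑龙江省": 9, "宿迁市": 3, "安定门": 7, "石门县": 1, "北京": 500}
print(sample_rare_addresses(counts, 2))
# prints ['石门县', '安定门']
```

The 10 out-of-corpus addresses are then appended by hand, since by definition they cannot be sampled from the corpus counts.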
&lt;br /&gt;
    2.1.2 construct a new test set named &amp;quot;test_address_tag&amp;quot;; its composition is as follows:&lt;br /&gt;
      The place names in the 120 test-set sentences fall into three categories:&lt;br /&gt;
        place names frequent in the training corpus (more than 10 occurrences) that are not in the jsgf (30 sentences, sampled at equal intervals by occurrence count in the training corpus)&lt;br /&gt;
        the first kind of jsgf place names: fewer than 10 occurrences in the training corpus (40 sentences, sampled at equal intervals by occurrence count in the training corpus)&lt;br /&gt;
        the second kind of jsgf place names: not present in the training corpus (50 sentences, 5 test samples per place name)&lt;br /&gt;
      Each of the 120 sentences was recorded twice, by different speakers, for 240 recordings in total: 12 speakers, 20 recordings each&lt;br /&gt;
&lt;br /&gt;
  2.2 baseline&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt&lt;br /&gt;
    am: mdl_1400&lt;br /&gt;
    test set: test_address_tag&lt;br /&gt;
    result:&lt;br /&gt;
    %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]&lt;br /&gt;
    %SER 73.33 [ 176 / 240 ]&lt;br /&gt;
    %ADD_ER[ 6 / 30, 16 / 40, 32 / 50 ]&lt;br /&gt;
&lt;br /&gt;
  2.3 address tag&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with tagged sentences added&lt;br /&gt;
    am: mdl_1400&lt;br /&gt;
    test set: test_address_tag&lt;br /&gt;
    weight: 1&lt;br /&gt;
      %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]&lt;br /&gt;
      %SER 69.17 [ 166 / 240 ]&lt;br /&gt;
      %ADD_ER[ 6 / 30, 8 / 40, 12 / 50 ]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>





	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan</id>
		<title>11-16 Bin Yuan</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan"/>
				<updated>2014-11-17T01:19:48Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：/* Result */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* build a new jsgf file&lt;br /&gt;
* construct a test set for address tag language model&lt;br /&gt;
* conduct a new experiment, result in &lt;br /&gt;
&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
&lt;br /&gt;
=== Result===&lt;br /&gt;
1. experiment 1&lt;br /&gt;
&lt;br /&gt;
  1.1 baseline&lt;br /&gt;
    corpus：BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt&lt;br /&gt;
    am: mdl_v3.0.S&lt;br /&gt;
    test set: test_BJYD&lt;br /&gt;
    result:&lt;br /&gt;
      %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]&lt;br /&gt;
      %SER 93.20 [ 1096 / 1176 ]&lt;br /&gt;
      BeiJing: 6 / 10 (BJYD test set's text contains 10 &amp;quot;BeiJing&amp;quot;, decode 6 of 10)&lt;br /&gt;
&lt;br /&gt;
  1.2 use address tag:&lt;br /&gt;
    jsgf: extract top 500 frequent address(include &amp;quot;BeiJing&amp;quot;) from corpus&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt，remove sentences containing &amp;quot;BeiJing&amp;quot;, &lt;br /&gt;
      add tag to corpus(e.g. if &amp;quot;清华大学&amp;quot; is in jsgf and a sentence in corpus is &amp;quot;我 在 清华大学 上课&amp;quot;, &lt;br /&gt;
      then add a sentence &amp;quot;我 在 &amp;lt;address&amp;gt; 上课&amp;quot; to corpus) &lt;br /&gt;
    am: mdl_v3.0.S&lt;br /&gt;
    test set: test_BJYD&lt;br /&gt;
&lt;br /&gt;
    try different merge weight, the result is as follow:&lt;br /&gt;
      weight: 0.1 &lt;br /&gt;
        %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]&lt;br /&gt;
        %SER 94.98 [ 1117 / 1176 ]&lt;br /&gt;
        BeiJing: 4 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 0.5&lt;br /&gt;
        %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]&lt;br /&gt;
        %SER 93.88 [ 1104 / 1176 ]&lt;br /&gt;
        BeiJing: 4 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 1&lt;br /&gt;
        %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]&lt;br /&gt;
        %SER 93.28 [ 1097 / 1176 ]&lt;br /&gt;
        BeiJing: 2 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 2&lt;br /&gt;
        %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]&lt;br /&gt;
        %SER 93.71 [ 1102 / 1176 ]&lt;br /&gt;
        BeiJing: 1 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 3&lt;br /&gt;
        can't decode &amp;quot;BeiJing&amp;quot;&lt;br /&gt;
&lt;br /&gt;
-------------------------------------------------------------------------------&lt;br /&gt;
This weekend I find two mistakes in experiment 1:&lt;br /&gt;
    1. use run_decode.sh incorrectly. I copy this script from xiaoxi's directory to my own directory &lt;br /&gt;
      and run this script under my directory, leading to higher WER.&lt;br /&gt;
    2. one step of making merged lexicon fst is wrong(in experiment 1.2). Merging grammar_G.fst and lm_G.fst &lt;br /&gt;
      generates a new sym.txt and a new lexicon, the new sym.txt contains a &amp;quot;#0&amp;quot; at the end of the file, &lt;br /&gt;
      and format_lm.sh will use this sym.txt to generate a words.txt and add another &amp;quot;#0&amp;quot; to the end of words.txt,&lt;br /&gt;
      so there are two &amp;quot;#0&amp;quot; in words.txt, leading to wrong result. Under this condition, I find out when &lt;br /&gt;
      the decode result contains TAG, it would always be truncated. This explains why the deletion error is&lt;br /&gt;
      high when merge weight is small in experiment 1.2.&lt;br /&gt;
&lt;br /&gt;
2. experiment 2&lt;br /&gt;
  2.1 pre-work:&lt;br /&gt;
    2.1.1 build jsgf file&lt;br /&gt;
      extracted an address list from the corpus, sorted and counted it, and uniformly sampled 490 addresses &lt;br /&gt;
      from those that appear no more than 10 times in the corpus; finally, added 10 addresses that do not&lt;br /&gt;
      appear in the corpus at all.&lt;br /&gt;
&lt;br /&gt;
      some samples of the 490 addresses:&lt;br /&gt;
        黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县&lt;br /&gt;
      some samples of the 10 addresses:&lt;br /&gt;
        上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、BeiJing市 海淀区 清华大学、明斯克、摩纳哥&lt;br /&gt;
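The equal-interval sampling of rare addresses described in 2.1.1 can be sketched as below; the counts map is a hypothetical toy frequency table, not the real corpus counts:&lt;br /&gt;

```python
def sample_addresses(counts, k, max_count=10):
    """Sort the rare addresses (corpus count at most max_count) and pick k
    of them at equal intervals, mirroring the 490-address selection."""
    rare = sorted(a for a, c in counts.items() if not c > max_count)
    step = len(rare) / float(k)
    return [rare[int(i * step)] for i in range(k)]

# toy counts; "朝阳" is too frequent and is excluded from the pool
demo = {"安定门": 3, "宿迁市": 7, "石门县": 1, "黑龙江省": 9, "朝阳": 40}
print(sample_addresses(demo, 2))
```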
&lt;br /&gt;
    2.1.2 constructed a new test set named &amp;quot;test_address_tag&amp;quot;; its composition is as follows:&lt;br /&gt;
      the 120 test texts cover three kinds of place names:&lt;br /&gt;
        place names frequent in the training corpus (more than 10 occurrences) but absent from the jsgf (30 texts, sampled at equal intervals by corpus count)&lt;br /&gt;
        the first kind of jsgf place names: fewer than 10 occurrences in the training corpus (40 texts, sampled at equal intervals by corpus count)&lt;br /&gt;
        the second kind of jsgf place names: never seen in the training corpus (50 texts, 5 test samples per place name)&lt;br /&gt;
      each of the 120 texts was recorded twice by two different speakers, for 240 recordings in total; 12 speakers recorded 20 utterances each&lt;br /&gt;
&lt;br /&gt;
  2.2 baseline&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt&lt;br /&gt;
    am: mdl_1400&lt;br /&gt;
    test set: test_address_tag&lt;br /&gt;
    result:&lt;br /&gt;
      %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]&lt;br /&gt;
      %SER 73.33 [ 176 / 240 ]&lt;br /&gt;
&lt;br /&gt;
  2.3 address tag&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with tagged copies added to the corpus&lt;br /&gt;
    am: mdl_1400&lt;br /&gt;
    test set: test_address_tag&lt;br /&gt;
    weight: 1&lt;br /&gt;
      %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]&lt;br /&gt;
      %SER 69.17 [ 166 / 240 ]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan</id>
		<title>11-16 Bin Yuan</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/11-16_Bin_Yuan"/>
				<updated>2014-11-17T01:17:43Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Accomplished this week ===&lt;br /&gt;
* built a new jsgf file&lt;br /&gt;
* constructed a test set for the address-tag language model&lt;br /&gt;
* conducted a new experiment; results are reported below&lt;br /&gt;
&lt;br /&gt;
=== Planned for next week ===&lt;br /&gt;
&lt;br /&gt;
=== Result ===&lt;br /&gt;
1. experiment 1&lt;br /&gt;
&lt;br /&gt;
  1.1 baseline&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt&lt;br /&gt;
    am: mdl_v3.0.S&lt;br /&gt;
    test set: test_BJYD&lt;br /&gt;
    result:&lt;br /&gt;
      %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]&lt;br /&gt;
      %SER 93.20 [ 1096 / 1176 ]&lt;br /&gt;
      BeiJing: 6 / 10 (the BJYD test transcripts contain 10 instances of &amp;quot;BeiJing&amp;quot;; 6 of the 10 were decoded correctly)&lt;br /&gt;
&lt;br /&gt;
  1.2 use address tag:&lt;br /&gt;
    jsgf: extracted the top 500 most frequent addresses (including &amp;quot;BeiJing&amp;quot;) from the corpus&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt; removed sentences containing &amp;quot;BeiJing&amp;quot; &lt;br /&gt;
      and added tagged copies to the corpus (e.g. if &amp;quot;清华大学&amp;quot; is in the jsgf and a corpus sentence is &amp;quot;我 在 清华大学 上课&amp;quot;, &lt;br /&gt;
      then the sentence &amp;quot;我 在 &amp;lt;address&amp;gt; 上课&amp;quot; is added to the corpus)&lt;br /&gt;
    am: mdl_v3.0.S&lt;br /&gt;
    test set: test_BJYD&lt;br /&gt;
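The tagging rule described for the corpus above can be sketched as follows; this minimal version only matches single-token addresses, so multi-word entries such as 吉林省 吉林市 would need phrase matching, and the tag string is the literal &amp;lt;address&amp;gt; symbol:&lt;br /&gt;

```python
def add_tagged_copies(sentences, jsgf_addresses, tag="<address>"):
    """For every sentence containing a jsgf address as a whitespace token,
    append a copy with that token replaced by the tag symbol."""
    addr = set(jsgf_addresses)
    extra = []
    for sent in sentences:
        toks = sent.split()
        if addr.intersection(toks):
            extra.append(" ".join(tag if t in addr else t for t in toks))
    return sentences + extra

corpus = ["我 在 清华大学 上课", "今天 天气 不错"]
print(add_tagged_copies(corpus, ["清华大学"]))
```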
&lt;br /&gt;
    tried different merge weights; the results are as follows:&lt;br /&gt;
      weight: 0.1 &lt;br /&gt;
        %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]&lt;br /&gt;
        %SER 94.98 [ 1117 / 1176 ]&lt;br /&gt;
        BeiJing: 4 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 0.5&lt;br /&gt;
        %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]&lt;br /&gt;
        %SER 93.88 [ 1104 / 1176 ]&lt;br /&gt;
        BeiJing: 4 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 1&lt;br /&gt;
        %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]&lt;br /&gt;
        %SER 93.28 [ 1097 / 1176 ]&lt;br /&gt;
        BeiJing: 2 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 2&lt;br /&gt;
        %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]&lt;br /&gt;
        %SER 93.71 [ 1102 / 1176 ]&lt;br /&gt;
        BeiJing: 1 / 10&lt;br /&gt;
&lt;br /&gt;
      weight: 3&lt;br /&gt;
        can't decode &amp;quot;BeiJing&amp;quot;&lt;br /&gt;
&lt;br /&gt;
-------------------------------------------------------------------------------&lt;br /&gt;
This weekend I found two mistakes in experiment 1:&lt;br /&gt;
    1. run_decode.sh was used incorrectly: I copied the script from xiaoxi's directory to my own &lt;br /&gt;
      and ran it there, which led to a higher WER.&lt;br /&gt;
    2. one step of building the merged lexicon fst was wrong (in experiment 1.2). Merging grammar_G.fst and lm_G.fst &lt;br /&gt;
      generates a new sym.txt and a new lexicon; the new sym.txt already has a &amp;quot;#0&amp;quot; at the end of the file, &lt;br /&gt;
      and format_lm.sh then uses this sym.txt to generate words.txt and appends another &amp;quot;#0&amp;quot;, &lt;br /&gt;
      so words.txt ends up with two &amp;quot;#0&amp;quot; entries, which corrupts the result. Under this condition, whenever &lt;br /&gt;
      the decoding output contained the TAG, it was truncated, which explains why the deletion errors are&lt;br /&gt;
      so high at small merge weights in experiment 1.2.&lt;br /&gt;
&lt;br /&gt;
2. experiment 2&lt;br /&gt;
  2.1 pre-work:&lt;br /&gt;
    2.1.1 build jsgf file&lt;br /&gt;
      extracted an address list from the corpus, sorted and counted it, and uniformly sampled 490 addresses &lt;br /&gt;
      from those that appear no more than 10 times in the corpus; finally, added 10 addresses that do not&lt;br /&gt;
      appear in the corpus at all.&lt;br /&gt;
&lt;br /&gt;
      some samples of the 490 addresses:&lt;br /&gt;
        黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县&lt;br /&gt;
      some samples of the 10 addresses:&lt;br /&gt;
        上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、BeiJing市 海淀区 清华大学、明斯克、摩纳哥&lt;br /&gt;
&lt;br /&gt;
    2.1.2 constructed a new test set named &amp;quot;test_address_tag&amp;quot;; its composition is as follows:&lt;br /&gt;
      the 120 test texts cover three kinds of place names:&lt;br /&gt;
        place names frequent in the training corpus (more than 10 occurrences) but absent from the jsgf (30 texts, sampled at equal intervals by corpus count)&lt;br /&gt;
        the first kind of jsgf place names: fewer than 10 occurrences in the training corpus (40 texts, sampled at equal intervals by corpus count)&lt;br /&gt;
        the second kind of jsgf place names: never seen in the training corpus (50 texts, 5 test samples per place name)&lt;br /&gt;
      each of the 120 texts was recorded twice by two different speakers, for 240 recordings in total; 12 speakers recorded 20 utterances each&lt;br /&gt;
&lt;br /&gt;
  2.2 baseline&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt&lt;br /&gt;
    am: mdl_1400&lt;br /&gt;
    test set: test_address_tag&lt;br /&gt;
    result:&lt;br /&gt;
      %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]&lt;br /&gt;
      %SER 73.33 [ 176 / 240 ]&lt;br /&gt;
&lt;br /&gt;
  2.3 address tag&lt;br /&gt;
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with tagged copies added to the corpus&lt;br /&gt;
    am: mdl_1400&lt;br /&gt;
    test set: test_address_tag&lt;br /&gt;
    weight: 1&lt;br /&gt;
      %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]&lt;br /&gt;
      %SER 69.17 [ 166 / 240 ]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php/2014-11-16</id>
		<title>2014-11-16</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php/2014-11-16"/>
				<updated>2014-11-17T00:56:30Z</updated>
		
		<summary type="html">&lt;p&gt;Yuanb：&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[11-16 Zhiyong Zhang]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Xiangyu Zeng]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Fanhu Bie]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Guoyu Tang]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Yiye Lin]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Dongxu Zhang]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Miao Fan]]&lt;br /&gt;
&lt;br /&gt;
[[11-16 Bin Yuan]]&lt;/div&gt;</summary>
		<author><name>Yuanb</name></author>	</entry>

	</feed>