LM-release-v0.2.2

From cslt Wiki
RELEASE TITLE: LM RELEASE
RELEASE VERSION: v0.2.2
RELEASE TYPE: STEP RELEASE
RELEASE LOCATION: /work5/release/weiy/project/myhexin/lm/v0.2.2
RELATED BUGDB: 

1. BACKGROUND:

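This release is an internal milestone (STEP RELEASE) of the Tonghuashun (同花顺) speech recognition project, version v0.2.2.
Its purpose is to verify whether Tonghuashun's targets can be met with current technology, to provide a selectable baseline version, and to serve as a reference for summarizing problems and validating performance.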

2. TECHNOLOGY SUMMARY:
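Starting from two models trained on 49G of finance text and 85G of general-domain text, more text was added step by step to obtain language models trained on three different corpus sets, and their results were compared.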

3. RELEASE COMPONENT:

LM:  LM RELEASE v0.2.1
         LM RELEASE v0.2.2

4. TEST RESULT:
========================================================================================================================
Corpus  LM                                |  ppl / OOV / wer
                                          |  myhexin                 |  2000ju                  |  recheck
========================================================================================================================
1       lm_fin_3gram_1e-7                 |  763.126 / 2787 / 5.99   |  856.218 / 558 / 40.32   |  136.406 / 183 / 15.93
        lm_non_3gram_1e-7                 |  1490.42 / 2790 / 7.89   |  579.682 / 557 / 38.61   |  224.078 / 183 / 18.40
        lm_hybrid_3gram_1e-7              |  788.452 / 2785 / 6.23   |  598.666 / 557 / 38.73   |  135.459 / 183 / 15.99
========================================================================================================================
2       lm_fin_all_3gram_1e-7             |  571.8 / 2268 / 5.72     |  1150.24 / 508 / 41.35   |  163.308 / 21 / 15.84
        lm_hybrid_all_3gram_1e-7          |  607.18 / 2266 / 5.95    |  686.223 / 507 / 39.18   |  149.625 / 21 / 15.71
        lm_hybrid_all_3gram_1e-7_myhexin  |  563.656 / 2266 / 5.75   |  918.887 / 507 / 40.60   |  153.768 / 21 / 15.59
        lm_hybrid_all_3gram_1e-7_2000ju   |  985.705 / 2266 / 6.73   |  602.71 / 507 / 38.61    |  190.716 / 21 / 16.98
        lm_hybrid_all_3gram_1e-7_recheck  |  580.343 / 2266 / 5.88   |  738.939 / 507 / 39.48   |  148.432 / 21 / 15.63
        lm_hybrid_all_5gram_1e-9          |  315.16 / 2266 / 4.86    |  476.592 / 507 / 36.66   |  94.8788 / 21 / 14.81
========================================================================================================================
3       lm_hybrid_all_100h_3gram_1e-7     |  720.404 / 2266 / 6.31   |  376.68 / 507 / 36.64    |  101.352 / 21 / 14.87
========================================================================================================================
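
As a reading aid for the ppl and wer columns, the following is a minimal Python sketch of how the two metrics are commonly computed. It is not the scoring tool used for this release, which the page does not name. It assumes that perplexity is derived from per-token log10 probabilities and that wer is an edit-distance error rate reported as a percentage; how OOV tokens and sentence-end tokens enter the perplexity count varies by toolkit and is left out. The function names are hypothetical.

```python
def perplexity(log10_probs):
    """Perplexity from the log10 probability an LM assigned to each scored token.

    Assumption: every entry belongs to a token that is counted in the
    denominator; OOV and sentence-end handling differ between toolkits.
    """
    if not log10_probs:
        raise ValueError("need at least one scored token")
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg_log10)


def word_error_rate(reference, hypothesis):
    """Error rate in percent: edit distance between token lists / reference length.

    Tokens may be words or characters, depending on how the test set is scored.
    """
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    # Levenshtein distance with a rolling row: prev[j] is the distance
    # between the first i-1 reference tokens and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(perplexity([-1.0, -2.0, -3.0]))
    print(word_error_rate("a b c d".split(), "a x c".split()))
```

Lower is better for both metrics, which is how the rows in the table above should be compared.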


=====================================================================
Corpus  |  Composition         |  Source
        |  Finance   General   |
=====================================================================
104G    |  40G       64G       |  Text crawled before 2016-11-08
=====================================================================
30G     |  9.1G      21G       |  Text crawled before 2016-12-23
=====================================================================
10G     |  10G       0         |  Text provided by Tonghuashun (同花顺)
=====================================================================
100h    |       6.2M           |
=====================================================================


==========================================================================================
output                            input1                    input2              weight
==========================================================================================
lm_hybrid_3gram_1e-7              lm_fin_3gram_1e-7         lm_non_3gram_1e-7   0.5
lm_fin_all_3gram_1e-7             lm_fin_3gram_1e-7         lm_ft_3gram_1e-7    0.5
lm_hybrid_all_3gram_1e-7          lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7   0.5
lm_hybrid_all_3gram_1e-7_myhexin  lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7   0.900766
lm_hybrid_all_3gram_1e-7_2000ju   lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7   0.0547782
lm_hybrid_all_3gram_1e-7_recheck  lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7   0.638826
lm_hybrid_all_5gram_1e-9          lm_fin_all_5gram_1e-9     lm_non_5gram_1e-9   0.5
lm_hybrid_all_100h_3gram_1e-7     lm_hybrid_all_3gram_1e-7  lm_100h_3gram_1e-7  0.5
==========================================================================================
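
The table above lists which two LMs were merged and with what weight. A minimal sketch of linear interpolation, the usual way two n-gram LMs are mixed, is given here for orientation. It assumes that weight applies to input1 and 1 - weight to input2; this reading is consistent with the finance-tuned model having weight 0.900766 on the finance LM, but the page does not state it. The toy distributions and function names are hypothetical, and a real merge operates on full n-gram models rather than on a single distribution.

```python
def interpolate(p_input1, p_input2, weight):
    """Linear interpolation of two probabilities for the same word and history.

    Assumption: `weight` is the share given to input1, (1 - weight) to input2.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_input1 + (1.0 - weight) * p_input2


def interpolate_distributions(dist1, dist2, weight):
    """Mix two next-word distributions (dicts word -> probability)."""
    words = set(dist1) | set(dist2)
    return {
        word: interpolate(dist1.get(word, 0.0), dist2.get(word, 0.0), weight)
        for word in words
    }


if __name__ == "__main__":
    # Toy next-word distributions standing in for a finance LM and a general LM.
    finance = {"stock": 0.6, "weather": 0.1, "the": 0.3}
    general = {"stock": 0.1, "weather": 0.5, "the": 0.4}
    for weight in (0.5, 0.900766, 0.0547782):
        mixed = interpolate_distributions(finance, general, weight)
        print(weight, {word: round(p, 4) for word, p in sorted(mixed.items())})
```

Because both inputs are proper distributions and the weights sum to 1, the mixture is again a proper distribution.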

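Note:
 1.	Corpus 1 is the 104G corpus + the 30G corpus.
 2.	Corpus 2 is the 104G corpus + the 30G corpus + the 10G corpus.
 3.	Corpus 3 is the 104G corpus + the 30G corpus + the 10G corpus + the 100h corpus.
 4.	myhexin is test_myhexin_20161019, a finance-domain test set.
 5.	2000ju is test_2000ju, a general-domain test set.
 6.	recheck is test_myhexin_finance_recheck, a mixed test set dominated by finance-domain content.
 7.	lm_ft (intermediate product, not tested) is trained on the 10G corpus.
 8.	lm_100h (intermediate product, not tested) is trained on the 100h text.
 9.	lm_hybrid_all_3gram_1e-7_myhexin is built by interpolating the two LMs with the weight that gives the best result on the test_myhexin_20161019 test set; lm_hybrid_all_3gram_1e-7_2000ju and
        lm_hybrid_all_3gram_1e-7_recheck are built the same way on their respective test sets.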
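
The page does not say how the per-test-set optimal weights were computed. A common approach, sketched below in Python as an assumption rather than as the procedure actually used, is expectation-maximization over the probabilities that the two LMs assign to each token of the tuning set; the resulting weight maximizes the likelihood (minimizes the perplexity) of the mixture on that set. The function name and the toy probabilities are hypothetical.

```python
def estimate_best_mix(probs1, probs2, init_weight=0.5, max_iterations=100, tol=1e-9):
    """EM estimate of the interpolation weight for LM 1.

    probs1[i] and probs2[i] are the probabilities the two LMs assign to the
    i-th token of the tuning set. Assumption: for every token at least one of
    the two probabilities is positive.
    """
    if len(probs1) != len(probs2) or not probs1:
        raise ValueError("need two non-empty probability lists of equal length")
    weight = init_weight
    for _ in range(max_iterations):
        posterior_sum = 0.0
        for p1, p2 in zip(probs1, probs2):
            mixture = weight * p1 + (1.0 - weight) * p2
            if mixture <= 0.0:
                raise ValueError("token has zero probability under the mixture")
            # E-step: posterior probability that LM 1 generated this token.
            posterior_sum += weight * p1 / mixture
        # M-step: the new weight is the average posterior.
        new_weight = posterior_sum / len(probs1)
        converged = abs(new_weight - weight) < tol
        weight = new_weight
        if converged:
            break
    return weight


if __name__ == "__main__":
    # Toy tuning set on which LM 1 is clearly the better model.
    lm1_probs = [0.5, 0.4, 0.3, 0.05]
    lm2_probs = [0.01, 0.02, 0.01, 0.2]
    print(estimate_best_mix(lm1_probs, lm2_probs))
```

Tuning on a finance test set pushes the weight toward the finance LM and tuning on a general test set pushes it the other way, which matches the pattern of the weights 0.900766, 0.0547782 and 0.638826 listed in the interpolation table.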
Test environment:
  AM = /work5/release/project/myhexin/am/v0.1
  Vocabulary: vocab = /work5/release/project/myhexin/vocab/vocab.v0.2
  Word segmentation dictionary: dict = /nfs/disk/work/users/zhaomy/soft/jieba/jieba-0.38/jieba/dict.txt.myhexin.v0.2
  Pronunciation lexicon: lexicon = /work4/singular/public/release/lexicon/lexicon.v0.2
  beam = 13
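Conclusions:
 1. An LM trained on finance-domain data does well on the finance-domain test set and poorly on the general-domain test set; an LM trained on general-domain data behaves the opposite way. The hybrid LM obtained by interpolation comes close to the better single-domain LM on both test sets.
 2. For the same LM, the 5gram version (lm_hybrid_all_5gram_1e-9) beats the 3gram version (lm_hybrid_all_3gram_1e-7) on every test set, but testing it is very resource-intensive, so only one 5gram model was tested, as a point of comparison.
 3. Comparing lm_hybrid_all with lm_hybrid: after adding the 10G of finance data, finance-domain results improve and general-domain results get worse.
 4. The LM interpolated with the best_mix weight computed on a given test set achieves the lowest perplexity on that test set among the 3gram interpolations of the same two LMs. WER mostly follows, with a small exception on recheck, where lm_hybrid_all_3gram_1e-7_myhexin reaches 15.59 against 15.63 for lm_hybrid_all_3gram_1e-7_recheck.
 5. Following the principle of prioritizing the finance domain while still covering the general domain, lm_hybrid_all_3gram_1e-7_myhexin is chosen as the final result.
 6. OOV depends only on the training corpus and the test set; whether the LM is a finance-domain or a general-domain model has little effect on it.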
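
The last conclusion can be illustrated with a small sketch: an OOV count is a function of the test tokens and the vocabulary the LM covers, and nothing else. This is a hypothetical helper with toy data, not the script used to produce the table; it assumes the test set is already segmented into tokens.

```python
def count_oov(test_tokens, vocabulary):
    """Number of test tokens (counted with repetition) missing from the vocabulary."""
    vocab = set(vocabulary)
    return sum(1 for token in test_tokens if token not in vocab)


if __name__ == "__main__":
    test_tokens = ["buy", "stock", "today", "kdj", "stock"]
    # Toy vocabularies standing in for a smaller and a larger training corpus.
    vocab_small = {"buy", "today"}
    vocab_large = {"buy", "today", "stock"}
    print(count_oov(test_tokens, vocab_small))
    print(count_oov(test_tokens, vocab_large))
```

Under this view, models built from the same corpus report nearly the same OOV count on a given test set, and the count drops when the corpus grows, as between corpus 1 and corpus 2 in the results table.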
5. RELEASE TEAM:

Author: 魏扬
Contributor: 白子薇
Monitor: 赵梦原