LM-release-v0.2.2
From cslt Wiki
RELEASE TITLE: LM RELEASE
RELEASE VERSION: v0.2.2
RELEASE TYPE: STEP RELEASE
RELEASE LOCATION: /work5/release/weiy/project/myhexin/lm/v0.2.2
RELATED BUGDB:
1. BACKGROUND:
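This release is an internal milestone (STEP RELEASE) of the Tonghuashun (同花顺) speech recognition project, version v0.2.2.
Its purpose is to verify whether Tonghuashun's targets are achievable with current technology, to provide a selectable baseline version, and to serve as a reference for summarizing problems and validating performance.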
2. TECHNOLOGY SUMMARY:
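Starting from two models trained on 49G of financial text and 85G of general text, more text is added step by step to obtain three groups of language models trained on different corpora, and their results are compared.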
3. RELEASE COMPONENT:
LM: LM RELEASE v0.2.1
LM RELEASE v0.2.2
4. TEST RESULT:
===========================================================================================================================
Corpus  LM | ppl / OOV / WER on each test set
           | myhexin | 2000ju | recheck
===========================================================================================================================
1 lm_fin_3gram_1e-7 | 763.126 / 2787 / 5.99 | 856.218 / 558 / 40.32 | 136.406 / 183 / 15.93
lm_non_3gram_1e-7 | 1490.42 / 2790 / 7.89 | 579.682 / 557 / 38.61 | 224.078 / 183 / 18.40
lm_hybrid_3gram_1e-7 | 788.452 / 2785 / 6.23 | 598.666 / 557 / 38.73 | 135.459 / 183 / 15.99
===========================================================================================================================
2 lm_fin_all_3gram_1e-7 | 571.8 / 2268 / 5.72 | 1150.24 / 508 / 41.35 | 163.308 / 21 / 15.84
lm_hybrid_all_3gram_1e-7 | 607.18 / 2266 / 5.95 | 686.223 / 507 / 39.18 | 149.625 / 21 / 15.71
lm_hybrid_all_3gram_1e-7_myhexin | 563.656 / 2266 / 5.75 | 918.887 / 507 / 40.60 | 153.768 / 21 / 15.59
lm_hybrid_all_3gram_1e-7_2000ju | 985.705 / 2266 / 6.73 | 602.71 / 507 / 38.61 | 190.716 / 21 / 16.98
lm_hybrid_all_3gram_1e-7_recheck | 580.343 / 2266 / 5.88 | 738.939 / 507 / 39.48 | 148.432 / 21 / 15.63
lm_hybrid_all_5gram_1e-9 | 315.16 / 2266 / 4.86 | 476.592 / 507 / 36.66 | 94.8788 / 21 / 14.81
===========================================================================================================================
3 lm_hybrid_all_100h_3gram_1e-7 | 720.404 / 2266 / 6.31 | 376.68 / 507 / 36.64 | 101.352 / 21 / 14.87
===========================================================================================================================
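The ppl column is a perplexity. The release note does not say how it was computed, so the following is only a minimal sketch of the standard definition, assuming per-token log10 probabilities are available and that OOV tokens have already been dropped from the list. The function name and the toy inputs are hypothetical.

```python
import math


def perplexity(log10_probs):
    """Perplexity of a test set from per-token log10 probabilities.

    Assumption: OOV tokens were skipped before building the list, so
    every entry is a finite log probability.
    """
    if not log10_probs:
        raise ValueError("need at least one scored token")
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg_log10)


# Toy example: four tokens, each assigned probability 0.25.
toy = [math.log10(0.25)] * 4
print(perplexity(toy))
```

A lower value means the model assigns higher average probability to the test text, which is why ppl and WER tend to move together in the table above.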
===============================================
Total | Composition       | Source
===============================================
      | Finance  General  |
===============================================
104G  | 40G      64G      | Text crawled before 2016-11-08
===============================================
30G   | 9.1G     21G      | Text crawled before 2016-12-23
===============================================
10G   | 10G      0        | Text provided by Tonghuashun
===============================================
100h  | 6.2M              |
===============================================
===================================================================================================
output input1 input2 weight
===================================================================================================
lm_hybrid_3gram_1e-7 lm_fin_3gram_1e-7 lm_non_3gram_1e-7 0.5
lm_fin_all_3gram_1e-7 lm_fin_3gram_1e-7 lm_ft_3gram_1e-7 0.5
lm_hybrid_all_3gram_1e-7 lm_fin_all_3gram_1e-7 lm_non_3gram_1e-7 0.5
lm_hybrid_all_3gram_1e-7_myhexin lm_fin_all_3gram_1e-7 lm_non_3gram_1e-7 0.900766
lm_hybrid_all_3gram_1e-7_2000ju lm_fin_all_3gram_1e-7 lm_non_3gram_1e-7 0.0547782
lm_hybrid_all_3gram_1e-7_recheck lm_fin_all_3gram_1e-7 lm_non_3gram_1e-7 0.638826
lm_hybrid_all_5gram_1e-9 lm_fin_all_5gram_1e-9 lm_non_5gram_1e-9 0.5
lm_hybrid_all_100h_3gram_1e-7 lm_hybrid_all_3gram_1e-7 lm_100h_3gram_1e-7 0.5
===================================================================================================
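Each row of the table above merges two LMs by linear interpolation. The table does not state which input the weight applies to; the sketch below assumes it is the weight of input1, which fits the pattern that the weight tuned on the financial test set (0.900766) favors the financial model. The toy distributions and the function name are hypothetical, and a real merge operates on full n-gram models rather than on a single distribution.

```python
def interpolate(dist1, dist2, weight):
    """Linearly interpolate two word distributions.

    weight is the share given to dist1 (assumed to correspond to input1
    in the table); dist2 gets 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    words = set(dist1) | set(dist2)
    return {
        w: weight * dist1.get(w, 0.0) + (1.0 - weight) * dist2.get(w, 0.0)
        for w in words
    }


# Hypothetical next-word distributions from a financial and a general LM.
p_fin = {"stock": 0.6, "fund": 0.3, "weather": 0.1}
p_non = {"stock": 0.1, "fund": 0.1, "weather": 0.8}

mixed = interpolate(p_fin, p_non, 0.5)
print(mixed)
```

Because both inputs sum to 1 and the weights sum to 1, the mixture is again a valid distribution.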
Note:
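1. Corpus 1 is the 104G corpus + the 30G corpus.
2. Corpus 2 is the 104G corpus + the 30G corpus + the 10G corpus.
3. Corpus 3 is the 104G corpus + the 30G corpus + the 10G corpus + the 100h corpus.
4. myhexin stands for test_myhexin_20161019, a financial-domain test set.
5. 2000ju stands for test_2000ju, a general-domain test set.
6. recheck stands for test_myhexin_finance_recheck, a mixed test set dominated by the financial domain.
7. lm_ft (intermediate product, not tested) is trained on the 10G corpus.
8. lm_100h (intermediate product, not tested) is trained on the 100h text.
9. lm_hybrid_all_3gram_1e-7_myhexin is obtained by interpolating two LMs with the weight that gives the best result on the test_myhexin_20161019 test set; lm_hybrid_all_3gram_1e-7_2000ju and lm_hybrid_all_3gram_1e-7_recheck are built the same way on their respective test sets.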
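Note 9 relies on a per-test-set optimal interpolation weight. The release note does not describe how that weight was estimated. A common approach is expectation-maximization over the per-token probabilities that the two LMs assign to the tuning text, and the following is a minimal sketch of that approach under the assumption that all probabilities are strictly positive. The function name and the toy probabilities are hypothetical.

```python
def best_mix_weight(probs1, probs2, init=0.5, tol=1e-6, max_iter=1000):
    """Estimate the weight of model 1 that maximizes mixture likelihood.

    probs1[i] and probs2[i] are the probabilities the two models assign
    to token i of the tuning text (assumed strictly positive).
    """
    if len(probs1) != len(probs2) or not probs1:
        raise ValueError("need two non-empty lists of equal length")
    lam = init
    for _ in range(max_iter):
        # E-step: posterior probability that model 1 produced each token.
        post = [
            lam * p1 / (lam * p1 + (1.0 - lam) * p2)
            for p1, p2 in zip(probs1, probs2)
        ]
        # M-step: the new weight is the average posterior.
        new_lam = sum(post) / len(post)
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


# Toy tuning text on which model 1 is consistently the better predictor.
print(best_mix_weight([0.5] * 10, [0.1] * 10))
```

A weight tuned this way maximizes likelihood, and so minimizes perplexity, on the tuning text only, so it favors whichever domain that text comes from.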
Test environment:
AM = /work5/release/project/myhexin/am/v0.1
Vocabulary: vocab = /work5/release/project/myhexin/vocab/vocab.v0.2
Word segmentation dictionary: dict = /nfs/disk/work/users/zhaomy/soft/jieba/jieba-0.38/jieba/dict.txt.myhexin.v0.2
Pronunciation lexicon: lexicon = /work4/singular/public/release/lexicon/lexicon.v0.2
beam = 13
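Conclusions:
1. LMs trained on financial data perform well on the financial test set and worse on the general test set, while LMs trained on general data show the opposite. The hybrid LM obtained by interpolation comes close to the better result on both test sets.
2. For the same LM, the 5-gram version outperforms the 3-gram version on all test sets, but testing it is very resource-intensive, so only one 5-gram model was tested as a control.
3. Compared with lm_hybrid, lm_hybrid_all (with the 10G of financial data added) improves on the financial test set and degrades on the general test set.
4. An LM interpolated with the best_mix weight computed on a given test set performs best on that test set.
5. Following the principle of prioritizing the financial domain while still covering the general domain, lm_hybrid_all_myhexin is selected as the final result.
6. OOV depends only on the corpus and the test set, and is largely independent of whether the language model is financial or general.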
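To put the 3-gram versus 5-gram comparison in numbers, the sketch below computes the relative WER reduction of lm_hybrid_all_5gram_1e-9 over lm_hybrid_all_3gram_1e-7 on each test set. The WER values are copied from the TEST RESULT table; the helper name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# WER values from the TEST RESULT table (corpus 2).
wer_3gram = {"myhexin": 5.95, "2000ju": 39.18, "recheck": 15.71}
wer_5gram = {"myhexin": 4.86, "2000ju": 36.66, "recheck": 14.81}

for test_set in wer_3gram:
    gain = relative_reduction(wer_3gram[test_set], wer_5gram[test_set])
    print(f"{test_set}: {gain:.1%} relative WER reduction")
```

The gain is largest on the financial test set, roughly 18% relative, and about 6% relative on the other two.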
5. RELEASE TEAM:
Author: 魏扬
Contributor: 白子薇
Monitor: 赵梦原