Differences between revisions of "LM-release-v0.2"
From cslt Wiki
Revision as of 03:33, 7 December 2016 (Wed)
RELEASE TITLE: LM RELEASE
RELEASE VERSION: v0.2
RELEASE TYPE: STEP RELEASE
RELEASE LOCATION: /work4/singular/public/release/lm/v0.2
RELATED BUGDB: 11

1. BACKGROUND:

2. TECHNOLOGY SUMMARY:
   The 40G financial corpus and the 64G general-domain corpus are split into 3G
   chunks, cleaned with cleaning v0.1, and segmented with MultilingualSegmenter.jar
   using vocab v0.2 as the vocabulary. One LM is trained per chunk, giving 36 LMs.
   The LMs of each domain are merged by 1:1 interpolation and pruned at several
   pruning thresholds, giving 2 groups of LMs. (The general-domain LM is too large
   to merge in one pass: its 22 LMs are first merged into 2, each pruned at
   5gram_1e-9, and the 2 are then merged and pruned again.) Finally, the 2 groups
   are merged 1:1 to give one group of hybrid LMs.

3. RELEASE COMPONENT:
   LM: LM RELEASE v0.2

4. TEST RESULT:
   ==============================================================================
   | WER / PPL         | test_myhexin_20161019 | test_1000ju   | test_2000ju   |
   ==============================================================================
   | fin_3gram_1e-7    | 6.33/803.275          | 36.76/1623.97 | 45.17/1687.52 |
   | uni_3gram_1e-7    | 8.05/1511.17          | 27.78/379.833 | 38.08/454.279 |
   | hybrid_3gram_1e-7 | 6.45/799.42           | 28.89/503.623 | 39.66/590.483 |
   | given_finance     | 6.47/1104.9           | 37.51/2362.24 | 45.77/2474.66 |
   | given_universe    | 12.23/4277.15         | 20.79/319.08  | 31.77/351.087 |
   ==============================================================================

   Note:
   Beam=13
   max_active=7000
   fin_3gram_1e-7: LM trained on the 40G financial corpus
   uni_3gram_1e-7: LM trained on the 64G general-domain corpus
   hybrid_3gram_1e-7: the hybrid LM
   given_finance: LM trained on a 2.9G financial corpus
   given_universe: base LM trained on a massive general-domain corpus
   Pronunciation lexicon: lexicon v0.2
   AM: tonghuashun v0.1 am

5. RELEASE TEAM:
   Author: 魏扬
   Contributor: 白子薇
   Monitor: 赵梦原
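
As a rough illustration of the 1:1 interpolation mentioned in the TECHNOLOGY SUMMARY, the sketch below merges two toy word-probability tables with equal weights and computes perplexity on a word list. It is not the release's actual tooling: the function names `interpolate` and `perplexity` and all probabilities are hypothetical, and the unigram tables stand in for real n-gram LMs, which also carry back-off weights.

```python
import math


def interpolate(p_a, p_b, weight_a=0.5):
    """Linearly interpolate two word-probability tables.

    weight_a=0.5 corresponds to a 1:1 merge. Words missing from one
    table are treated as probability 0 in that table (a simplification;
    a real n-gram LM would back off instead).
    """
    vocab = set(p_a) | set(p_b)
    return {
        w: weight_a * p_a.get(w, 0.0) + (1.0 - weight_a) * p_b.get(w, 0.0)
        for w in vocab
    }


def perplexity(model, words):
    """Perplexity of a unigram table on a list of words."""
    log_sum = sum(math.log(model[w]) for w in words)
    return math.exp(-log_sum / len(words))


# Toy distributions with made-up numbers (not taken from the release).
fin = {"stock": 0.5, "fund": 0.3, "weather": 0.2}
uni = {"stock": 0.1, "fund": 0.1, "weather": 0.8}

hybrid = interpolate(fin, uni)  # 1:1 merge of the two tables

test_words = ["stock", "fund", "weather", "weather"]
for name, model in (("fin", fin), ("uni", uni), ("hybrid", hybrid)):
    print(name, round(perplexity(model, test_words), 3))
```

With equal weights, each merged probability lies halfway between the two source values. In this toy setup the merged table therefore cannot be worse than both sources on any single word.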