Difference between revisions of "LM-release-v0.2"

From cslt Wiki

Revision as of 03:33, 7 December 2016 (Wednesday)

RELEASE TITLE: LM RELEASE
RELEASE VERSION: v0.2
RELEASE TYPE: STEP RELEASE
RELEASE LOCATION: /work4/singular/public/release/lm/v0.2
RELATED BUGDB: 11

1. BACKGROUND:


2. TECHNOLOGY SUMMARY:

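The 40G financial corpus and the 64G general-domain corpus are each split into 3G chunks, cleaned with
cleaning v0.1, and segmented with MultilingualSegmenter.jar using vocab v0.2 as the word list. One LM is
trained per chunk, giving 36 LMs. Within each domain the LMs are merged by 1:1 interpolation and then pruned
at several pruning thresholds, giving 2 groups of LMs. (The general-domain LMs are too large to merge in one
step: the 22 LMs are first merged into 2, each is pruned at 5gram_1e-9, and the 2 LMs are then merged and
pruned again.) Finally, the 2 groups are merged 1:1 to give one group of hybrid LMs.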
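
As a rough illustration of the numbers and the merging step above, the following Python sketch checks the chunk counts and shows what a 1:1 (equal-weight) linear interpolation does to word probabilities. It is a toy model, not the release tooling: the helper names `num_chunks` and `interpolate`, the dictionary representation of an LM, and the example probabilities are all assumptions made for illustration, and it assumes the corpus sizes are exactly 40G and 64G with a partial last chunk.

```python
import math


def num_chunks(corpus_g: float, chunk_g: float = 3.0) -> int:
    """Number of pieces when a corpus is split into chunks of at most chunk_g."""
    return math.ceil(corpus_g / chunk_g)


def interpolate(lms, weights=None):
    """Linearly interpolate several toy LMs.

    Each LM is a dict mapping (context, word) -> probability.
    An entry missing from one LM counts as probability 0 here; a real
    toolkit would use the backed-off probability instead.
    """
    if weights is None:
        # "1:1" is read here as equal weights for all inputs (an assumption).
        weights = [1.0 / len(lms)] * len(lms)
    keys = set().union(*lms)
    return {
        key: sum(w * lm.get(key, 0.0) for w, lm in zip(weights, lms))
        for key in keys
    }


fin_chunks = num_chunks(40)      # financial corpus
uni_chunks = num_chunks(64)      # general-domain corpus
total = fin_chunks + uni_chunks  # one LM per chunk

# Hypothetical probabilities for words following the context "the".
lm_a = {("the", "bank"): 0.6, ("the", "stock"): 0.4}
lm_b = {("the", "bank"): 0.2, ("the", "river"): 0.8}
mixed = interpolate([lm_a, lm_b])

print(fin_chunks, uni_chunks, total)
print(sorted(mixed.items()))
```

Under these assumptions the general-domain corpus yields 22 chunks and the two corpora together yield 36, matching the counts in the summary. The interpolated distribution still sums to 1 because each input distribution does and the weights sum to 1.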

3. RELEASE COMPONENT:

LM:  LM RELEASE v0.2

4. TEST RESULT:

==============================================================================
|   WER(%) / PPL    | test_myhexin_20161019 |  test_1000ju  |  test_2000ju  |
==============================================================================
| fin_3gram_1e-7    |      6.33/803.275     | 36.76/1623.97 | 45.17/1687.52 |
| uni_3gram_1e-7    |      8.05/1511.17     | 27.78/379.833 | 38.08/454.279 |
| hybrid_3gram_1e-7 |      6.45/799.42      | 28.89/503.623 | 39.66/590.483 |
| given_finance     |      6.47/1104.9      | 37.51/2362.24 | 45.77/2474.66 |
| given_universe    |     12.23/4277.15     | 20.79/319.08  | 31.77/351.087 |
==============================================================================

Note:
Beam=13
max_active=7000
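fin_3gram_1e-7: LM trained on the 40G financial corpus
uni_3gram_1e-7: LM trained on the 64G general-domain corpus
hybrid_3gram_1e-7: the hybrid LM
given_finance: LM trained on a 2.9G financial corpus
given_universe: Base LM trained on a massive general-domain corpus
Pronunciation lexicon: lexicon v0.2
AM: tonghuashun v0.1 am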
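
To make the table easier to compare, the sketch below parses the result rows and reports, for each test set, the model with the lowest WER, plus a relative WER reduction between two rows. The figures are copied from the table above; the helper names `parse` and `rel_reduction` are hypothetical and not part of the release.

```python
TABLE = """\
| fin_3gram_1e-7    |      6.33/803.275     | 36.76/1623.97 | 45.17/1687.52 |
| uni_3gram_1e-7    |      8.05/1511.17     | 27.78/379.833 | 38.08/454.279 |
| hybrid_3gram_1e-7 |      6.45/799.42      | 28.89/503.623 | 39.66/590.483 |
| given_finance     |      6.47/1104.9      | 37.51/2362.24 | 45.77/2474.66 |
| given_universe    |     12.23/4277.15     | 20.79/319.08  | 31.77/351.087 |
"""

TEST_SETS = ["test_myhexin_20161019", "test_1000ju", "test_2000ju"]


def parse(table: str) -> dict:
    """Return {model: {test_set: (wer, ppl)}} from the pipe-delimited rows."""
    results = {}
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        model, scores = cells[0], cells[1:]
        results[model] = {
            ts: tuple(float(x) for x in score.split("/"))
            for ts, score in zip(TEST_SETS, scores)
        }
    return results


def rel_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline` (0.1 means 10%)."""
    return (baseline - new) / baseline


results = parse(TABLE)

# Model with the lowest WER on each test set.
best_wer = {
    ts: min(results, key=lambda m: results[m][ts][0]) for ts in TEST_SETS
}

# Hybrid LM versus the given finance LM on test_1000ju.
gain = rel_reduction(
    results["given_finance"]["test_1000ju"][0],
    results["hybrid_3gram_1e-7"]["test_1000ju"][0],
)

for ts, model in best_wer.items():
    print(ts, model, results[model][ts][0])
print(f"hybrid vs given_finance on test_1000ju: {gain:.1%} relative WER reduction")
```

Read this way, the financial LM gives the lowest WER on test_myhexin_20161019, while given_universe gives the lowest WER on test_1000ju and test_2000ju. The hybrid LM stays close to the financial LM on the first set and recovers most of the general-domain LM's advantage over the financial LM on the other two.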

5. RELEASE TEAM:

Author: 魏扬
Contributor: 白子薇
Monitor: 赵梦原