Revision as of 01:07, 21 November 2014 (Fri)

MERT-4 Method

different result in lucene
method	Default	BM25	LMDirichlet	DFR	LMJelinekMercer	IB
Accuracy	0.66228	0.66228	0.4091	0.65476	0.65476	0.6666

boost keyword in lucene
method	Default	idf_train	idf_train_norm	idf_baidu	idf_baidu_norm
Accuracy	0.66228	0.651629	0.57644	0.647869	0.65288

different result in lucene
method	lucene	vsm_idf(haiguan)	vsm_idf(baidu)	vsm_idf(train)	vsm_idf(calculate)
Accuracy	0.6628	0.6228	0.6197	0.5827	0.5426

Calculate the similarity value = 1/(5 - 5*av_value), where av_value = average(word2vec + Synonyms forest + hownet), i.e. the mean of the three similarity scores.
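
The formula above can be sketched in Python as follows. This is only an illustration: the function name, the argument names, and the assumption that each component similarity lies in [0, 1] are not from the original notes. The notes also do not say what happens when av_value = 1 (the denominator becomes zero), so returning infinity there is an assumption.

```python
import math


def fuzzy_similarity(word2vec_sim, cilin_sim, hownet_sim):
    """Return 1 / (5 - 5 * av_value) for three component similarities.

    av_value is the plain mean of the word2vec, Synonyms forest (cilin)
    and HowNet similarities, each assumed to lie in [0, 1].
    """
    av_value = (word2vec_sim + cilin_sim + hownet_sim) / 3.0
    denominator = 5.0 - 5.0 * av_value
    if denominator <= 0.0:
        # Assumption: treat a perfect match (av_value == 1) as unbounded.
        return math.inf
    return 1.0 / denominator
```

Under these assumptions the value is 0.2 when the three scores are all 0 and grows without bound as their mean approaches 1.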

Lucene 4.6 already provides a synonym method (org.apache.lucene.analysis.synonym[2]) that supports rules such as (a -> x), (a b -> y), (b c d -> z), and can also be used to expand the query.
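
To make the rule notation concrete, here is a small Python sketch of multi-word synonym rewriting over a token list. It is not Lucene's implementation and does not call the Lucene API; the function name, the rule dictionary, and the choice of greedy longest-match are assumptions made for illustration only.

```python
def apply_synonym_rules(tokens, rules):
    """Rewrite tokens with multi-word rules, trying the longest match first.

    rules maps a tuple of input tokens to a single replacement token,
    e.g. ("a", "b") -> "y". Tokens that match no rule are kept as-is.
    """
    max_len = max((len(key) for key in rules), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            # No rule matched here: keep the original token.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical rule set mirroring (a -> x) (a b -> y) (b c d -> z).
RULES = {("a",): "x", ("a", "b"): "y", ("b", "c", "d"): "z"}
```

With this rule set, the tokens `a c` are rewritten to `x c`, while `b c d` collapses to the single token `z`.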

@@ Line 1: / Line 1: @@
-==MERT-4 Method==
+=MERT-4 Method=
-==lucene method==
+=lucene method=
 *data set
 :* jiangkaipeng:
@@ Line 14: / Line 14: @@
 |-
 |}
-== boost keyword ==
+= boost keyword =
 * boost the query keyword using IDF
 {| border="2px"
@@ Line 30: / Line 30: @@
 * add the new keyword value from proMe method
-==our method==
+=our method=
 {| border="2px"
 |+ different result in lucene
@@ Line 41: / Line 41: @@
 |}
-==synonyms method==
+=synonyms method=
 * fuzzy match
 :* calculate the similarity value = 1/(5 - 5*av_value), where av_value = average(word2vec + Synonyms forest + hownet).
@@ Line 48: / Line 48: @@
 :*
-==find==
+=find=
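 * Using the finest-grained word segmentation when building the index for the standard questions (but not for the templates) improves accuracy: 61% => 66%.
 * Do not apply fine-grained segmentation to the input question (59% with fine-grained segmentation, 66% without).
 * Lucene 4.6 has already added synonym expansion [http://www.hankcs.com/program/java/lucene-synonymfilterfactory.html]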
-==bug fix==
+=bug fix=
 * vsm method
 :* the pattern was not cleared before each search