"Lucene": Difference between revisions
From cslt Wiki
Revision as of 08:39, 28 November 2014 (Friday)
test idf and tf
- data
d0 [{如何,怎么}} {办理,办} {户口,户口本} # 到当地派出所办理 # 如何办理户口
d1 {办理,办} {户口,户口本} [{流程,步骤}] # 到当地派出所办理 # 如何办理户口
d2 [{如何,怎么}} {办理,办} {身份证,身份} # 到当地派出所办理 # 如何办理身份证
d3 {办理,办} {身份证} [{流程,步骤}] # 到当地派出所办理 # 如何办理身份证
- search
query:"如何办理户口" => question:如何 question:办理户口
- result
 doc=0 score=0.114656925 shardIndex=-1
 0.114656925 = (MATCH) product of:
   0.22931385 = (MATCH) sum of:
     0.22931385 = (MATCH) weight(question:如何 in 0) [DefaultSimilarity], result of:
       0.22931385 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
         0.4748871 = queryWeight, product of:
           1.287682 = idf(docFreq=2, maxDocs=4)
           0.3687922 = queryNorm
         0.48288077 = fieldWeight in 0, product of:
           1.0 = tf(freq=1.0), with freq of:
             1.0 = termFreq=1.0
           1.287682 = idf(docFreq=2, maxDocs=4)
           0.375 = fieldNorm(doc=0)
   0.5 = coord(1/2)
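The explain output is a tree of products, so it can be checked by multiplying its leaves back together. The following is a minimal sketch of that check in Python; the variable names are illustrative, and every number is copied from the explain output rather than computed by Lucene.

```python
# Leaf values copied from the explain output for doc=0.
idf = 1.287682          # idf(docFreq=2, maxDocs=4)
query_norm = 0.3687922  # queryNorm
tf = 1.0                # tf(freq=1.0)
field_norm = 0.375      # fieldNorm(doc=0)
coord = 0.5             # coord(1/2)

# queryWeight = idf * queryNorm
query_weight = idf * query_norm

# fieldWeight = tf * idf * fieldNorm
field_weight = tf * idf * field_norm

# Score of the single matching term: queryWeight * fieldWeight
term_score = query_weight * field_weight

# Final document score: sum over matching terms (only one here) times coord
doc_score = term_score * coord

print(query_weight, field_weight, term_score, doc_score)
```

The recomputed values agree with the reported 0.4748871, 0.48288077, 0.22931385 and 0.114656925 to roughly seven decimal places; the small remaining differences are consistent with Lucene doing this arithmetic in 32-bit floats.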
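- Detailed calculation of score(query, d0)
- Reference formula: [http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html]
[[文件:QQ截图20141128164958.png]]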
- tf("如何" in d0) = sqrt(frequency) = sqrt(1) = 1
- idf("如何") = <math>1+\ln\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}+1}\right) = 1+\ln\left(\frac{4}{2+1}\right) \approx 1.287682</math>
- "如何".getBoost() = 1
- coord(q, d0) = 0.5 = coord(1/2)
coord(q, d) = overlap / maxOverlap
overlap: the number of query terms matched in the document
maxOverlap: the total number of terms in the query
Only 如何 of the two query terms matches d0, so overlap = 1 and maxOverlap = 2.
- queryNorm(q) = 1/sqrt(sumOfSquaredWeights) = 0.3687922 (the queryNorm value reported in the result above)
sumOfSquaredWeights = q.getBoost()^2 * ∑( idf(t) * t.getBoost() )^2, summed over the terms t in q
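The factors listed above can be recomputed from the formulas to see that they reproduce the reported score. This is a minimal Python sketch, not Lucene code: the function names are illustrative, all boosts are taken as 1, and fieldNorm is copied from the explain output because it depends on the field length and Lucene's lossy norm encoding. It also assumes that the second query term, 办理户口, has docFreq = 0 in this index; that assumption is not stated on this page, but it is the value that reproduces the reported queryNorm.

```python
import math

def tf(freq):
    # tf(t in d) = sqrt(frequency)
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # idf(t) = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    # Fraction of query terms that matched the document
    return overlap / max_overlap

def query_norm(idfs, query_boost=1.0, term_boost=1.0):
    # queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    sum_of_squared_weights = query_boost ** 2 * sum(
        (value * term_boost) ** 2 for value in idfs
    )
    return 1.0 / math.sqrt(sum_of_squared_weights)

NUM_DOCS = 4
FIELD_NORM = 0.375                  # copied from the explain output, not derived

idf_matched = idf(2, NUM_DOCS)      # 如何: docFreq=2, as in the explain output
idf_unmatched = idf(0, NUM_DOCS)    # 办理户口: docFreq=0 is an ASSUMPTION

qn = query_norm([idf_matched, idf_unmatched])

# score(q, d) = coord * queryNorm * sum over matching terms of
#               tf * idf^2 * boost * norm; only 如何 matches d0, boost = 1
score = coord(1, 2) * qn * (tf(1) * idf_matched ** 2 * 1.0 * FIELD_NORM)

print(idf_matched, qn, score)
```

Under these assumptions the sketch gives idf ≈ 1.287682, queryNorm ≈ 0.368792 and score ≈ 0.114657, matching the explain output for doc=0.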