Difference between revisions of "Lucene"

From cslt Wiki
Edit by Lr: test idf and tf

Revision as of 08:30, 28 November 2014 (Fri)

test idf and tf

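  • data
 d0 [{如何,怎么}] {办理,办} {户口,户口本} # 到当地派出所办理  # 如何办理户口
 d1 {办理,办} {户口,户口本} [{流程,步骤}] # 到当地派出所办理  # 如何办理户口
 d2 [{如何,怎么}] {办理,办} {身份证,身份} # 到当地派出所办理  # 如何办理身份证
 d3 {办理,办} {身份证} [{流程,步骤}] # 到当地派出所办理  # 如何办理身份证
Glosses for the Chinese terms: 如何/怎么 "how"; 办理/办 "to apply for, to handle"; 户口/户口本 "household registration / household register booklet"; 身份证/身份 "ID card / identity"; 流程/步骤 "procedure / steps"; 到当地派出所办理 "apply at the local police station"; 如何办理户口 "how to apply for household registration"; 如何办理身份证 "how to apply for an ID card".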
  • search
query:"如何办理户口"  => question:如何 question:办理户口
  • result
 doc=0 score=0.114656925 shardIndex=-1|0.114656925 = (MATCH) product of:
   0.22931385 = (MATCH) sum of:
     0.22931385 = (MATCH) weight(question:如何 in 0) [DefaultSimilarity], result of:
       0.22931385 = score(doc=0,freq=1.0 = termFreq=1.0
 ), product of:
       0.4748871 = queryWeight, product of:
         1.287682 = idf(docFreq=2, maxDocs=4)
         0.3687922 = queryNorm
       0.48288077 = fieldWeight in 0, product of:
         1.0 = tf(freq=1.0), with freq of:
           1.0 = termFreq=1.0
         1.287682 = idf(docFreq=2, maxDocs=4)
         0.375 = fieldNorm(doc=0)
   0.5 = coord(1/2)
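The explain tree above can be checked by multiplying its leaves back together. The following is a minimal sketch in Python, not part of the original page; the variable names are chosen here for illustration, and every number is copied from the explain output for doc=0. Lucene computes these values in single precision, so the last digits may differ slightly.

```python
# Leaf values copied from the explain output for doc=0.
idf = 1.287682          # idf(docFreq=2, maxDocs=4)
query_norm = 0.3687922  # queryNorm
tf = 1.0                # tf(freq=1.0)
field_norm = 0.375      # fieldNorm(doc=0)
coord = 0.5             # coord(1/2): one of the two query terms matched

# Rebuild the tree bottom-up, following the "product of" nodes.
query_weight = idf * query_norm            # explain reports 0.4748871
field_weight = tf * idf * field_norm       # explain reports 0.48288077
term_score = query_weight * field_weight   # explain reports 0.22931385
score = coord * term_score                 # explain reports 0.114656925

print(query_weight, field_weight, term_score, score)
```

Only the term question:如何 contributes to the sum, which is why the final score is halved by coord(1/2).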
  • Detailed calculation of score(query, d0)
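  • Reference formula: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html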
[Image: QQ截图20141128164958.png (screenshot)]
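  • <math>tf(\text{"如何" in } d_0) = \sqrt{\text{frequency}} = \sqrt{1} = 1</math>
  • <math>idf(\text{"如何"}) = 1 + \ln\left(\frac{numDocs}{docFreq + 1}\right) = 1 + \ln\left(\frac{4}{2 + 1}\right) \approx 1.287682</math>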
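The tf and idf formulas above can be evaluated directly. The sketch below is a hedged illustration in Python, not code from the original page; the function names are made up here. It also tries to reproduce queryNorm as 1/sqrt(sum of squared idf values) over the two query terms, under the assumption that the second term, 办理户口, occurs in no document (docFreq=0). The explain output does not show that term, so this docFreq is an inference, not a reported value.

```python
import math

def tf(freq):
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def query_norm(term_idfs):
    """1 / sqrt(sum of squared term weights), assuming all boosts are 1."""
    return 1.0 / math.sqrt(sum(w * w for w in term_idfs))

tf_ruhe = tf(1.0)        # "如何" occurs once in d0
idf_ruhe = idf(2, 4)     # docFreq=2, maxDocs=4, as in the explain output

# ASSUMPTION: "办理户口" matches no document, so docFreq=0.
idf_other = idf(0, 4)
qn = query_norm([idf_ruhe, idf_other])

print(tf_ruhe, idf_ruhe, qn)
```

Under that assumption the computed norm agrees with the reported queryNorm of 0.3687922 to about six decimal places, which suggests, but does not prove, that the unmatched term still contributes to the normalization.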