cslt Wiki - User contributions [zh-cn]

Li Cao 14-12-14

2014-12-15T01:09:59Z

Caoli: Created page with "=== Accomplished this week === *use the MERT method trained all kinds of argument .improved the test scores in lucene *Test the 'COORD' value set the rate of accura..."

=== Accomplished this week ===
* Used the MERT method to tune the various weight parameters, which improved the test scores in Lucene.
* Tested how the 'COORD' value setting affects accuracy on the test set.
=== Plan for next week ===
* Add 'SPELL CHECK' to the system.
* Learn the 'queryAnalysis' function.

2014-12-14

2014-12-15T00:57:20Z

Caoli：

[[Fanhu bie 14-12-14]]

[[Rong Liu 14-12-14]]

[[Bin Yuan 14-12-14]]

[[Li Cao 14-12-14]]

Z-MERT

2014-12-09T10:17:04Z

Caoli：

* Test conditions
:* Test set: ../res/corpus/20141016凉山州/3文本/testJ.txt (about 1596 questions)
:* Lucene only

=Test result=
{| border="2px"
|+ Accuracy of different Lucene configurations
|-
! Method !! baseline !! new_index_template (1.0, 1.0) !! Z-MERT (1.8411, 1.0)
|-
! Accuracy
| 0.662280 || 0.669799 || 0.678571
|}

Note: the results above use only the sq and pattern fields.

Z-MERT

2014-12-09T10:14:51Z

Caoli：/* Test result */


Z-MERT

2014-12-09T10:12:34Z

Caoli：

Z-MERT

2014-12-09T10:09:16Z

Caoli：

*Test conditions
:*(../res/corpus/20141016凉山州/3文本/testJ.txt)
:*Only Lucene

* Test result
{| border="2px"
|-
! !! baseline !! new_index_template !! change argument by Z-MERT
|-
! correct
| 0.6697994987468672 || 0.6697994987468672 || 0.6785714285714286
|}

Multi query in multi field

2014-12-09T07:27:46Z

Caoli：/* test result */

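=Checking the details of the Lucene score=
==Data==
Each document has three "#"-separated parts, which appear to be a question pattern, an answer, and a standard question. Glosses: 如何/怎么 "how"; 办理/办 "apply for"; 户口/户口本 "household registration"; 身份证/身份 "ID card"; 流程/步骤 "procedure"; 到当地派出所办理 "apply at the local police station".
 d0 [{如何，怎么}] {办理，办} {户口，户口本} # 到当地派出所办理 # 如何办理户口
 d1 {办理，办} {户口，户口本} [{流程，步骤}] # 到当地派出所办理 # 如何办理户口
 d2 [{如何，怎么}] {办理，办} {身份证，身份} # 到当地派出所办理 # 如何办理身份证
 d3 {办理，办} {身份证} [{流程，步骤}] # 到当地派出所办理 # 如何办理身份证

==Search==
 query:"如何办理户口" => question:如何 question:办理户口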
==result==
 doc=0 score=0.114656925 shardIndex=-1
 0.114656925 = (MATCH) product of:
   0.22931385 = (MATCH) sum of:
     0.22931385 = (MATCH) weight(question:如何 in 0) [DefaultSimilarity], result of:
       0.22931385 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
         0.4748871 = queryWeight, product of:
           1.287682 = idf(docFreq=2, maxDocs=4)
           0.3687922 = queryNorm
         0.48288077 = fieldWeight in 0, product of:
           1.0 = tf(freq=1.0), with freq of:
             1.0 = termFreq=1.0
           1.287682 = idf(docFreq=2, maxDocs=4)
           0.375 = fieldNorm(doc=0)
   0.5 = coord(1/2)
* Detailed calculation of score(query, d0)
* Reference formula: [http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html]
[[File:QQ截图20141128164958.png]]
:* tf("如何" in d0) = sqrt(frequency) = sqrt(1) = 1
:* idf("如何") = <math>1+\ln\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right) = 1+\ln\left(\frac{4}{2+1}\right) \approx 1.287682</math>
:* "如何".getBoost() = 1
:* coord(q, d0) = 0.5 = coord(1/2)
:: coord(q, d) = overlap / maxOverlap
:: overlap: the number of query terms matched in the document
:: maxOverlap: the total number of terms in the query
:* queryNorm(q) = 1/sqrt(sumOfSquaredWeights) = 1/sqrt(idf("如何")^2 + idf("办理户口")^2) = 1/sqrt(1*(1.287682*1.287682 + 2.386*2.386)) = 0.3687
:: sumOfSquaredWeights = q.getBoost()*q.getBoost()*∑( idf(t)*t.getBoost() )^2
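
The calculation above can be checked with a small stand-alone sketch. It does not call Lucene; it only re-implements the tf, idf and queryNorm formulas quoted from the TFIDFSimilarity documentation. The class and method names are hypothetical, fieldNorm (0.375) is copied from the explain output rather than recomputed, and docFreq = 0 for the second query term is an assumption inferred from the value 2.386 used in the queryNorm line.

```java
/** Recomputes the single-field score of d0 from the formulas above (no Lucene dependency). */
final class SingleFieldScoreSketch {

    /** idf(t) = 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** tf(t in d) = sqrt(frequency). */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** queryNorm = 1 / sqrt(queryBoost^2 * sum(idf^2)); term boosts are taken as 1. */
    static double queryNorm(double queryBoost, double... idfs) {
        double sum = 0.0;
        for (double v : idfs) {
            sum += v * v;
        }
        return 1.0 / Math.sqrt(queryBoost * queryBoost * sum);
    }

    /** Score of d0 for the two-term query on the question field. */
    static double scoreD0() {
        double idfFirst = idf(2, 4);   // first term: docFreq=2, maxDocs=4
        double idfSecond = idf(0, 4);  // second term: docFreq=0 is an assumption (yields 2.386)
        double norm = queryNorm(1.0, idfFirst, idfSecond);
        double fieldNorm = 0.375;      // copied from the explain output
        double queryWeight = idfFirst * norm;
        double fieldWeight = tf(1.0) * idfFirst * fieldNorm;
        double coord = 1.0 / 2.0;      // only one of the two query terms matches d0
        return coord * queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        System.out.println("idf       = " + idf(2, 4));
        System.out.println("queryNorm = " + queryNorm(1.0, idf(2, 4), idf(0, 4)));
        System.out.println("score     = " + scoreD0());
    }
}
```

Up to floating-point rounding, the three printed values should agree with the idf, queryNorm and final score in the explain output (1.287682, 0.3687922 and 0.114656925).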

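=Multi-field query=
==Data==
 d0 [{如何，怎么}] {办理，办} {户口，户口本} # 到当地派出所办理 # 如何办理户口
 d1 {办理，办} {户口，户口本} [{流程，步骤}] # 到当地派出所办理 # 如何办理户口
 d2 [{如何，怎么}] {办理，办} {身份证，身份} # 到当地派出所办理 # 如何办理身份证
 d3 {办理，办} {身份证} [{流程，步骤}] # 到当地派出所办理 # 如何办理身份证
==Search==
Code:
 import org.apache.lucene.search.BooleanClause.Occur;
 import org.apache.lucene.search.BooleanQuery;
 // paternQuery, ansQuery and sqQuery are the per-field queries built elsewhere; the boosts
 // shown in the query string below (^0.8, ^0.2) are assumed to be set on them beforehand.
 BooleanQuery query = new BooleanQuery();
 query.add(paternQuery, Occur.MUST);  // required clause (use Occur.SHOULD if it is optional)
 query.add(ansQuery, Occur.SHOULD);   // optional clause (use Occur.MUST if it is required)
 query.add(sqQuery, Occur.SHOULD);    // optional clause
Resulting query:
 +((question:如何 question:办理户口)^0.8) ((answer:如何 answer:办理户口)^0.2) ((standardq:如何 standardq:办理户口)^0.2)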

==result==
* Calculation formulas
:* score(Q) = score(q_PTN) + score(q_ANS) + score(q_STD)
:* queryNorm(Q), where Q = q_PTN + q_ANS + q_STD
::* sumOfSquaredWeights = ∑{ q.getBoost()*q.getBoost()*∑( idf(t)*t.getBoost() )^2 }, summed over q in {q_PTN, q_STD, q_ANS}
::* queryNorm(Q) = 1/sqrt(sumOfSquaredWeights)
:* Field pattern
::* tf("如何" in d0) = sqrt(frequency) = sqrt(1) = 1
::* idf("如何") = <math>1+\ln\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right) = 1+\ln\left(\frac{4}{2+1}\right) \approx 1.287682</math>
::* "如何".getBoost() = 1
::* coord(q_PTN, d0) = 0.5 = coord(1/2)
::: coord(q, d) = overlap / maxOverlap
::: overlap: the number of query terms matched in the document
::: maxOverlap: the total number of terms in the query
::* queryNorm(q_PTN) = queryNorm(Q) * boost(q_PTN)
::* Norm
* Detail
:* Fields: answer + pattern
:: score(q, field-pattern) + score(q, field-answer)
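
As a rough check of the queryNorm formulas above, the sketch below sums the squared weights of the three boosted sub-queries and scales the result by each boost. It is plain Java with hypothetical names, not Lucene code. The document frequencies for the question and answer fields are read from the explain output; docFreq = 0 for both terms in the standardq field is an assumption that is not stated in the notes, chosen because it reproduces the queryNorm values 0.3490943 and 0.087273575 reported by Lucene.

```java
/** Recomputes queryNorm(Q) for the three boosted sub-queries (no Lucene dependency). */
final class MultiFieldQueryNormSketch {

    /** idf(t) = 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** boost^2 * sum(idf^2) for one sub-query; term boosts are taken as 1. */
    static double squaredWeight(double boost, double... idfs) {
        double sum = 0.0;
        for (double v : idfs) {
            sum += v * v;
        }
        return boost * boost * sum;
    }

    /** queryNorm(Q) = 1 / sqrt(sum of the squared weights of all sub-queries). */
    static double queryNormQ() {
        int numDocs = 4;
        double ptn = squaredWeight(0.8, idf(2, numDocs), idf(0, numDocs)); // question field
        double ans = squaredWeight(0.2, idf(4, numDocs), idf(2, numDocs)); // answer field
        double std = squaredWeight(0.2, idf(0, numDocs), idf(0, numDocs)); // standardq: docFreq=0 assumed
        return 1.0 / Math.sqrt(ptn + ans + std);
    }

    public static void main(String[] args) {
        double norm = queryNormQ();
        System.out.println("queryNorm(Q)     = " + norm);
        System.out.println("queryNorm(q_PTN) = " + norm * 0.8); // queryNorm(Q) * boost(q_PTN)
        System.out.println("queryNorm(q_ANS) = " + norm * 0.2); // queryNorm(Q) * boost(q_ANS)
    }
}
```

The ratio between the two per-field values is exactly the ratio of the boosts (0.8 / 0.2 = 4), which is why the question clause dominates the combined score.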

 doc=0 score=0.15459718 shardIndex=-1
 0.1545972 = (MATCH) product of:
   0.23189577 = (MATCH) sum of: [all]
     0.108532876 = (MATCH) product of: [field: pattern]
       0.21706575 = (MATCH) sum of:
         0.21706575 = (MATCH) weight(question:如何 in 0) [DefaultSimilarity], result of:
           0.21706575 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
             0.44952247 = queryWeight, product of:
               1.287682 = idf(docFreq=2, maxDocs=4)
               0.3490943 = queryNorm
             0.48288077 = fieldWeight in 0, product of:
               1.0 = tf(freq=1.0), with freq of:
                 1.0 = termFreq=1.0
               1.287682 = idf(docFreq=2, maxDocs=4)
               0.375 = fieldNorm(doc=0)
       0.5 = coord(1/2)
     0.12336289 = (MATCH) sum of: [field: answer]
       0.032918826 = (MATCH) weight(answer:如何 in 0) [DefaultSimilarity], result of:
         0.032918826 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
           0.06779904 = queryWeight, product of:
             0.7768564 = idf(docFreq=4, maxDocs=4)
             0.087273575 = queryNorm
           0.48553526 = fieldWeight in 0, product of:
             1.0 = tf(freq=1.0), with freq of:
               1.0 = termFreq=1.0
             0.7768564 = idf(docFreq=4, maxDocs=4)
             0.625 = fieldNorm(doc=0)
       0.090444066 = (MATCH) weight(answer:办理户口 in 0) [DefaultSimilarity], result of:
         0.090444066 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
           0.11238062 = queryWeight, product of:
             1.287682 = idf(docFreq=2, maxDocs=4)
             0.087273575 = queryNorm
           0.8048013 = fieldWeight in 0, product of:
             1.0 = tf(freq=1.0), with freq of:
               1.0 = termFreq=1.0
             1.287682 = idf(docFreq=2, maxDocs=4)
             0.625 = fieldNorm(doc=0)
   0.6666667 = coord(2/3)
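
To make the structure of this explanation easier to follow, here is a minimal sketch that recombines the leaf values into the final score. The idf, queryNorm and fieldNorm numbers are copied from the output above; the class and method names are hypothetical, and nothing here calls Lucene.

```java
/** Recombines the leaf values of the explain output into the document score. */
final class MultiFieldScoreSketch {

    /** One term's score: queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm). */
    static double termScore(double idf, double queryNorm, double tf, double fieldNorm) {
        return (idf * queryNorm) * (tf * idf * fieldNorm);
    }

    /** Pattern (question) clause: one matching term, scaled by its own coord(1/2). */
    static double patternClause() {
        return 0.5 * termScore(1.287682, 0.3490943, 1.0, 0.375);
    }

    /** Answer clause: both query terms match, so no inner coord factor is applied. */
    static double answerClause() {
        return termScore(0.7768564, 0.087273575, 1.0, 0.625)
             + termScore(1.287682, 0.087273575, 1.0, 0.625);
    }

    /** Two of the three top-level clauses match, hence the outer coord(2/3). */
    static double scoreD0() {
        return (2.0 / 3.0) * (patternClause() + answerClause());
    }

    public static void main(String[] args) {
        System.out.println("pattern clause = " + patternClause());
        System.out.println("answer clause  = " + answerClause());
        System.out.println("score(d0)      = " + scoreD0());
    }
}
```

The standardq clause contributes nothing for d0, which is why only two of the three clauses count towards coord(2/3).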

=Z-MERT test result=
[[Z-MERT]]
* Test conditions
:* Test set: ../res/corpus/20141016凉山州/3文本/testJ.txt
:* Lucene only

* Default weights (pattern: 1.0, sq: 1.0)
:* Test result: 0.6697994987468672
* Weights obtained with the MERT method (pattern: 1.811676798378926, sq: 1.0)
:* Test result: 0.6779448621553885

Multi query in multi field

2014-12-09T07:17:50Z

Caoli：/* test result */


Multi query in multi field

2014-12-09T07:13:39Z

Caoli：/* test result */


Multi query in multi field

2014-12-09T07:05:38Z

Caoli：/* test result */

Li Cao 14-12-07

2014-12-08T02:11:18Z

Caoli: Created page with "=== Accomplished this week === * Understand the Minimum Error Rate Training in Lucene. * Read several paper about MERT === Plan for next week === * according to the..."

=== Accomplished this week ===
* Understood how Minimum Error Rate Training (MERT) applies to Lucene.
* Read several papers about MERT.
=== Plan for next week ===
* Run tests according to the MERT method and record the results.
* Read more papers about MERT.

2014-12-07

2014-12-08T01:26:07Z

Caoli：

[[Xiaoxi Wang 14-12-07]]

[[Dongxu Zhang 14-12-07]]

[[Xiangyu Zeng 14-12-07]]

[[Miao Fan 14-12-07]]

[[Bin Yuan 14-12-07]]

[[Mengyuan Zhao 14-12-07]]

[[Li Cao 14-12-07]]

2014-11-19

2014-11-19T12:06:51Z

Caoli: /* Cause */

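The test report for the spell-check module is as follows:

Author: CaoLi. Date: 2014-11-19.
=Building the test set=
Business terms in the test set were first corrupted by hand; the sentences were then segmented automatically and tested. Size: 200 entries.

For example, after manually corrupting the business terms:
 申请班里高领老人紧贴变更和终止的实现
The same sentence after automatic segmentation:
 申请班里高领老人紧贴变更和终止的实现
The test set is the first 200 entries of .\corpus\20141016凉山州\3文本\testJ.txt; note that only the question of each entry is used.
=Evaluation=
Evaluation criteria for the test results:
* Precision = number of items correctly identified as needing correction / number of items identified as needing correction
* Recall = number of items correctly identified as needing correction / number of items in the test set that actually need correction
* Accuracy = number of items whose output is correct / total number of items
For example:
* Reference: 我真想办理身份证呀. ("I really want to apply for an ID card.")
* Test case: 我挣像办理神风证压.
* Result: 我证想班里身份证压.

Actions:
 我->我 (correct), 像->想 (correct), 办理->班里 (false), 神风证->身份证 (correct), 挣->证 (false), 压->压 (false)
Evaluation:
* Needs correction: precision = 3/4, recall = 3/4.
* No correction needed: precision = 1/2, recall = 1/2.
* Accuracy: 3/6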
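
The worked example can be reproduced with a short sketch. It follows one reading of the definitions above, namely that an item counts as "identified as needing correction" whenever the system changed it, even if it changed it to the wrong word; that reading is an interpretation, although it matches the 3/4, 1/2 and 3/6 figures. The class, field and method names are hypothetical. Tokens are listed in sentence order (我, 挣, 像, 办理, 神风证, 压) and described only by three flags.

```java
import java.util.Arrays;
import java.util.List;

/** Computes the spell-check metrics for the six-token example (hypothetical names). */
final class SpellCheckMetricsSketch {

    /** One token: did it need a fix, did the system change it, is the output right? */
    static final class Item {
        final boolean needsFix;
        final boolean wasChanged;
        final boolean outputCorrect;

        Item(boolean needsFix, boolean wasChanged, boolean outputCorrect) {
            this.needsFix = needsFix;
            this.wasChanged = wasChanged;
            this.outputCorrect = outputCorrect;
        }
    }

    /** Returns {precision, recall} for the class "needs a fix" (true) or "needs no fix" (false). */
    static double[] precisionRecall(List<Item> items, boolean needsFixClass) {
        int predicted = 0;
        int actual = 0;
        int hits = 0;
        for (Item it : items) {
            boolean p = it.wasChanged == needsFixClass; // what the system decided
            boolean a = it.needsFix == needsFixClass;   // what the reference says
            if (p) predicted++;
            if (a) actual++;
            if (p && a) hits++;
        }
        // Yields NaN when a denominator is zero; a real evaluation would handle that case.
        return new double[] {(double) hits / predicted, (double) hits / actual};
    }

    /** Fraction of tokens whose final output equals the reference. */
    static double accuracy(List<Item> items) {
        int correct = 0;
        for (Item it : items) {
            if (it.outputCorrect) correct++;
        }
        return (double) correct / items.size();
    }

    /** The six tokens of the example sentence, in sentence order. */
    static List<Item> example() {
        return Arrays.asList(
                new Item(false, false, true),   // token 1: fine and left unchanged
                new Item(true, true, false),    // token 2: changed, but to the wrong word
                new Item(true, true, true),     // token 3: changed to the right word
                new Item(false, true, false),   // token 4: fine, but changed anyway
                new Item(true, true, true),     // token 5: changed to the right word
                new Item(true, false, false));  // token 6: wrong, but left unchanged
    }

    public static void main(String[] args) {
        List<Item> items = example();
        System.out.println(Arrays.toString(precisionRecall(items, true)));
        System.out.println(Arrays.toString(precisionRecall(items, false)));
        System.out.println(accuracy(items));
    }
}
```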

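=Test results=
1. Language model used: a 3-gram language model trained on the <标准问题答案> (standard questions and answers) data in the training set <凉山州政务知识训练集1016.xls>. (See test-model-RESULT.txt for detailed results.)

RESULT:
* Needs correction: precision 498/498 = 1.0, recall 498/881 = 0.565266
* No correction needed: precision 2228/2611 = 0.853313, recall 2228/2228 = 1.0
* Accuracy: 2678/3109 = 0.861370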

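=Result analysis=

The results above show that recall for items needing correction is low (0.565266).

==Cause==

A possible cause: the business terms are corrupted by hand first, and the sentences are then segmented automatically against the vocabulary, so the system may split a single business term into several words.

For example, the input:
 [汝, 河, 进行, 开发商, 新建, 房产, 权, 等级]
Process:
 [汝, 河, 进行, 开发商, 新建, 房产, 权, 登机] score is: 29.822336867451668
 [汝, 河, 进行, 开发商, 新建, 房产, 权, 等级] score is: 29.208215907216072
 [汝, 河, 进行, 开发商, 新建, 房产, 权, 登记] score is: 27.493204072117805
 [汝, 河, 进行, 开发商, 新建, 房产, 权, 登基] score is: 29.822336867451668
Test result:
 汝河进行开发商新建房产权登记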
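
In this trace the chosen output ends in 登记, which is the candidate with the lowest score, so the score appears to act as a cost where lower is better. That reading is an inference from this one example, not something the report states. Under that assumption the selection step is a plain arg-min, sketched below with hypothetical names; the four candidates are labelled by English glosses of their last word (登机 "boarding", 等级 "grade", 登记 "registration", 登基 "enthronement").

```java
/** Picks the candidate with the lowest score, assuming lower scores are better. */
final class CandidateSelectionSketch {

    /** Index of the smallest score; ties go to the earliest candidate. */
    static int argMin(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] < scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Glosses of the last word of each candidate, in the order listed in the trace.
        String[] lastWord = {"boarding", "grade", "registration", "enthronement"};
        double[] scores = {
            29.822336867451668, 29.208215907216072, 27.493204072117805, 29.822336867451668
        };
        System.out.println(lastWord[argMin(scores)]);
    }
}
```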

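Analysis:
Because “汝河” (presumably a misspelling of 如何, "how") was split into the two words “汝” and “河”, the system does not recombine them into the word “汝河” and score it.

Proportion of corrupted business terms that were split apart by segmentation: 44/98 = 0.448979

For example:
 架势证 ------架势证

==Improvement==

A possible improvement:

We could segment by pinyin, but this has not been done yet.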

2014-11-19

2014-11-19T12:06:02Z

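Caoli: Created page with "The test report for the spell-check module is as follows: author CaoLi date:2014 11.19 =Building the test set= Business terms in the test set were first corrupted by hand and then segmented automatically,..."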

Spell check

2014-11-19T11:00:15Z

Caoli：/* result */

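==Evaluation criteria==
Evaluation criteria for spell checking:

Precision = number of items correctly identified as needing correction / number of items identified as needing correction.

Recall = number of items correctly identified as needing correction / number of items in the test set that actually need correction.

Accuracy = number of items whose output is correct / total number of items.

Note: the correctly identified items are those the spell checker handled correctly; the total number of identified items is the number of all spell-check actions performed.

Example:

Reference: 我真想办理身份证呀. Test case: 我挣像办理神风证压. Result: 我证想班里身份证压.

Actions: 我->我 (correct), 像->想 (correct), 办理->班里 (false), 神风证->身份证 (correct), 挣->证 (false), 压->压 (false)

Needs correction: precision = 3/4, recall = 3/4.

No correction needed: precision = 1/2, recall = 1/2.

Accuracy: 3/6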
==Some sources==
* Some algorithms of spelling correction [http://www.quora.com/What-are-some-algorithms-of-spelling-correction-that-were-used-by-search-engine] [https://documentation.devexpress.com/#WindowsForms/CustomDocument2989]
* How to Write a Spelling Corrector [http://norvig.com/spell-correct.html]
==result==
[[2014-11-18]]

[[2014-11-19]]