Difference between revisions of “Document classification test”

Document classification of Sougou data
Lr (talk | contribs)
Document classification of Sougou data
 
(One intermediate revision by the same user not shown)
Line 2: Line 2:
 
*[[How to import the sparse data of vsm to weka]]
 
*[[How to import the sparse data of vsm to weka]]
  
==Document classification of Sougou data ==
+
==Test==
* DATA
+
[[Sougou data]]
:* Data from SougouLab [http://www.sogou.com/labs/dl/c.html], using SogouC.reduced (30M)
+
:* 9 classes: finance (财经), IT, health (健康), sports (体育), travel (旅游), education (教育), recruitment (招聘), culture (文化), military (军事)
+
:* Data splits: train(), test(), dev()
+
*Text preprocessing
+
:* Segment words using a word list of 9W (about 90,000) entries (Tencent).
+
:* Remove stop words. stop_wordlist is
+
:*
+
*Some Tools
+
:* weka
+
:* scw
+
:* google word2vec
+
:* LDA
+
*class map
+
C000007 Automobiles (汽车)
+
C000008 Finance (财经)
+
C000010 IT
+
C000013 Health (健康)
+
C000014 Sports (体育)
+
C000016 Travel (旅游)
+
C000020 Education (教育)
+
C000022 Recruitment (招聘)
+
C000023 Culture (文化)
+
C000024 Military (军事)
+
* result data [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/4/4e/Document_classification.xlsx]
+
* paper [http://cslt.org:8081/homepages/wangd/public/pdf/w2v.pdf Document classification based on word vectors]
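The class map and the stop word step listed above can be sketched as follows. This is a minimal illustration, not the scripts used for the experiments: the English labels are translations, the helper name <code>remove_stop_words</code> is hypothetical, and the stop words shown are placeholders because the actual stop word list is not given here.

```python
# Class codes copied from the class map; the English labels are translations.
CLASS_MAP = {
    "C000007": "automobiles",
    "C000008": "finance",
    "C000010": "IT",
    "C000013": "health",
    "C000014": "sports",
    "C000016": "travel",
    "C000020": "education",
    "C000022": "recruitment",
    "C000023": "culture",
    "C000024": "military",
}

# The nine-class list does not include C000007, so it is dropped here.
USED_CLASSES = {code: label for code, label in CLASS_MAP.items()
                if code != "C000007"}


def remove_stop_words(tokens, stop_words):
    """Drop tokens found in the stop word set (hypothetical helper)."""
    return [tok for tok in tokens if tok not in stop_words]


# Toy input: tokens are assumed to be already segmented.
stop_words = {"的", "了", "是"}  # placeholder entries, not the real list
tokens = ["股票", "的", "价格", "上涨", "了"]
filtered = remove_stop_words(tokens, stop_words)
print(len(USED_CLASSES), filtered)
```

Keeping the stop words in a set makes each membership check constant time, which matters when every token of every document is tested.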
+
===VSM Test===
+
*Data
+
:* dimension:9402
+
*Method
+
:* document representation: each word is weighted by its tf-idf value
+
:* classifier: Naive Bayes
+
*Result
+
 
+
{| border="2px"
+
|+ classification result
+
|-
+
!  !! Finance !! IT !! Health !! Sports !! Travel !! Education !! Recruitment !! Culture !! Military !! Overall
+
|-
+
! ACC-test
+
|  0.72139 || 0.72139 || 0.75124 || 0.82089 || 0.79602 || 0.61194 || 0.70647 || 0.64179|| 0.79104 || 0.72913
+
|-
+
! ACC-train
+
| 0.678 || 0.718 || 0.708 || 0.708 || 0.73
+
|-
+
|}
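The VSM setup described above (tf-idf document vectors fed to a Naive Bayes classifier) can be sketched in plain Python. Several details are assumptions, since the page does not specify them: the tf-idf variant (raw term frequency times log(N/df)), the multinomial form of Naive Bayes with add-one smoothing, and the toy English tokens. The experiments list weka as a tool, whose classifier may differ from this sketch.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Inverse document frequency, assumed variant: log(N / df)."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """tf-idf vector as a dict; terms unseen in training are ignored."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes over tf-idf weights with add-alpha smoothing."""
    vocab = {t for vec in vectors for t in vec}
    weights = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            weights[label][term] += w
    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(weights[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {t: math.log((weights[label].get(t, 0.0) + alpha) / total)
                    for t in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict_nb(model, vec):
    """Return the label with the highest posterior log score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * log_like[t]
                               for t, w in vec.items() if t in log_like)
    return max(model, key=score)


# Toy training data (hypothetical, already segmented).
train_docs = [["stock", "market", "price"], ["stock", "bank", "fund"],
              ["match", "team", "goal"], ["team", "player", "goal"]]
train_labels = ["finance", "finance", "sports", "sports"]

idf = fit_idf(train_docs)
model = train_nb([tfidf(d, idf) for d in train_docs], train_labels)
pred_finance = predict_nb(model, tfidf(["bank", "stock", "price"], idf))
pred_sports = predict_nb(model, tfidf(["goal", "player"], idf))
print(pred_finance, pred_sports)
```

The idf values are fitted on the training documents only and reused for test documents, so no statistics leak from the test set into the features.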
+
 
+
===LDA Test===
+
* LDA test
+
:* result
+
{| border="2px"
+
|+ Classification accuracy (ACC) by dimension
+
|-
+
! Dimension !! Finance !! IT !! Health !! Sports !! Travel !! Education !! Recruitment !! Culture !! Military !! Overall
+
|-
+
!10
+
|0.76119403|| 0.308457711|| 0.68159204|| 0.885572139|| 0.686567164|| 0.179104478|| 0.656716418|| 0.36318408|| 0.915422886|| 0.604201216
+
|-
+
!20
+
|0.845771144|| 0.308457711|| 0.686567164|| 0.835820896|| 0.497512438|| 0.223880597|| 0.572139303|| 0.328358209|| 0.815920398|| 0.568269762
+
|-
+
!30
+
|0.810945274|| 0.164179104|| 0.71641791|| 0.820895522|| 0.52238806|| 0.288557214|| 0.587064677|| 0.378109453|| 0.815920398|| 0.567164179
+
|-
+
!40
+
|0.815920398|| 0.368159204|| 0.711442786|| 0.850746269|| 0.621890547|| 0.36318408|| 0.825870647|| 0.313432836|| 0.870646766|| 0.637921504
+
|-
+
!50
+
|0.845771144|| 0.39800995|| 0.7960199|| 0.850746269|| 0.606965174|| 0.368159204|| 0.611940299|| 0.313432836|| 0.860696517|| 0.627971255
+
|-
+
!60
+
|0.860696517|| 0.293532338|| 0.776119403|| 0.800995025|| 0.47761194|| 0.338308458|| 0.810945274|| 0.298507463|| 0.850746269|| 0.611940299
+
|-
+
!70
+
|0.84079602|| 0.338308458|| 0.781094527|| 0.736318408|| 0.447761194|| 0.417910448|| 0.686567164|| 0.373134328|| 0.885572139|| 0.611940299
+
|-
+
!80
+
|0.860696517|| 0.39800995|| 0.676616915|| 0.751243781|| 0.621890547|| 0.402985075|| 0.820895522|| 0.308457711|| 0.850746269|| 0.632393588
+
|-
+
!90
+
|0.875621891|| 0.338308458|| 0.746268657|| 0.756218905|| 0.507462687|| 0.368159204|| 0.597014925|| 0.407960199|| 0.865671642|| 0.606965174
+
|-
+
!100
+
|0.805970149|| 0.527363184|| 0.736318408|| 0.741293532|| 0.646766169|| 0.417910448|| 0.76119403|| 0.31840796|| 0.885572139|| 0.648977336
+
|-
+
!110
+
|0.860696517|| 0.373134328|| 0.726368159|| 0.711442786|| 0.611940299|| 0.517412935|| 0.771144279|| 0.368159204|| 0.865671642|| 0.645107794
+
|-
+
!120
+
|0.865671642|| 0.427860697|| 0.686567164|| 0.736318408|| 0.656716418|| 0.47761194|| 0.781094527|| 0.378109453|| 0.855721393|| 0.651741294
+
|-
+
!130
+
|0.850746269|| 0.462686567|| 0.751243781|| 0.661691542|| 0.606965174|| 0.487562189|| 0.791044776|| 0.36318408|| 0.835820896|| 0.645660586
+
|-
+
!140
+
|0.7960199|| 0.507462687|| 0.666666667|| 0.731343284|| 0.587064677|| 0.482587065|| 0.756218905|| 0.393034826|| 0.875621891|| 0.644002211
+
|-
+
!150
+
|0.830845771|| 0.432835821|| 0.706467662|| 0.686567164|| 0.492537313|| 0.452736318|| 0.741293532|| 0.417910448|| 0.880597015|| 0.626865672
+
|-
+
!160
+
|0.805970149|| 0.437810945|| 0.676616915|| 0.711442786|| 0.641791045|| 0.47761194|| 0.815920398|| 0.422885572|| 0.870646766|| 0.651188502
+
|-
+
!170
+
|0.825870647|| 0.393034826|| 0.71641791|| 0.736318408|| 0.621890547|| 0.517412935|| 0.587064677|| 0.402985075|| 0.900497512|| 0.633499171
+
|-
+
!180
+
|0.7960199|| 0.502487562|| 0.781094527|| 0.691542289|| 0.552238806|| 0.55721393|| 0.592039801|| 0.452736318|| 0.830845771|| 0.639579878
+
|-
+
!190
+
|0.855721393|| 0.462686567|| 0.766169154|| 0.71641791|| 0.562189055|| 0.507462687|| 0.656716418|| 0.472636816|| 0.865671642|| 0.651741294
+
|-
+
!200
+
|0.835820896|| 0.412935323|| 0.781094527|| 0.706467662|| 0.577114428|| 0.482587065|| 0.641791045|| 0.432835821|| 0.875621891|| 0.638474295
+
|-
+
|}
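In these tables the last column equals the unweighted mean of the nine per-class accuracies, which is what one would expect if every class contributes the same number of test documents (an inference from the numbers, not something the page states). A quick check on the dimension-10 row of the LDA table:

```python
# Per-class accuracies copied from the dimension-10 row of the LDA table.
per_class = [0.76119403, 0.308457711, 0.68159204, 0.885572139, 0.686567164,
             0.179104478, 0.656716418, 0.36318408, 0.915422886]
reported_overall = 0.604201216  # last column of the same row

# Unweighted mean over the nine classes.
overall = sum(per_class) / len(per_class)
print(round(overall, 6), abs(overall - reported_overall) < 1e-6)
```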
+
 
+
===Word2vec Test===
+
*Word2vec result
+
:* Dimension
+
 
+
{| border="2px"
+
|+ Classification accuracy (ACC) by dimension
+
|-
+
! Dimension !! Finance !! IT !! Health !! Sports !! Travel !! Education !! Recruitment !! Culture !! Military !! Overall
+
|-
+
! 10
+
| 0.766169154|| 0.383084577|| 0.52238806|| 0.820895522|| 0.666666667|| 0.44278607|| 0.567164179|| 0.721393035|| 0.850746269|| 0.637921504
+
|-
+
!20
+
|0.781094527|| 0.537313433|| 0.572139303|| 0.830845771|| 0.76119403|| 0.452736318|| 0.611940299|| 0.646766169|| 0.860696517|| 0.672747374
+
|-
+
!30
+
|0.815920398|| 0.671641791|| 0.606965174|| 0.835820896|| 0.766169154|| 0.552238806|| 0.577114428|| 0.68159204|| 0.885572139|| 0.710337203
+
|-
+
!40
+
|0.7960199|| 0.68159204|| 0.631840796|| 0.805970149|| 0.756218905|| 0.572139303|| 0.577114428|| 0.701492537|| 0.905472637|| 0.714206744
+
|-
+
!50
+
|0.805970149|| 0.691542289|| 0.641791045|| 0.800995025|| 0.751243781|| 0.552238806|| 0.651741294|| 0.656716418|| 0.910447761|| 0.718076285
+
|-
+
!60
+
|0.7960199|| 0.68159204|| 0.626865672|| 0.776119403|| 0.736318408|| 0.572139303|| 0.626865672|| 0.651741294|| 0.895522388|| 0.707020453
+
|-
+
!70
+
|0.7960199|| 0.701492537|| 0.621890547|| 0.781094527|| 0.771144279|| 0.572139303|| 0.631840796|| 0.656716418|| 0.905472637|| 0.715312327
+
|-
+
!80
+
|0.7960199|| 0.686567164|| 0.626865672|| 0.805970149|| 0.776119403|| 0.582089552|| 0.631840796|| 0.676616915|| 0.905472637|| 0.720840243
+
|-
+
!90
+
|0.805970149|| 0.71641791|| 0.621890547|| 0.776119403|| 0.766169154|| 0.572139303|| 0.646766169|| 0.666666667|| 0.915422886|| 0.720840243
+
|-
+
!100
+
|0.776119403|| 0.706467662|| 0.631840796|| 0.751243781|| 0.786069652|| 0.577114428|| 0.646766169|| 0.666666667|| 0.910447761|| 0.716970702
+
|-
+
!110
+
|0.771144279|| 0.71641791|| 0.656716418|| 0.741293532|| 0.76119403|| 0.597014925|| 0.606965174|| 0.691542289|| 0.910447761|| 0.716970702
+
|-
+
!120
+
|0.76119403|| 0.71641791|| 0.646766169|| 0.756218905|| 0.766169154|| 0.60199005|| 0.661691542|| 0.686567164|| 0.915422886|| 0.723604201
+
|-
+
!130
+
|0.776119403|| 0.731343284|| 0.631840796|| 0.76119403|| 0.771144279|| 0.577114428|| 0.626865672|| 0.701492537|| 0.905472637|| 0.720287452
+
|-
+
!140
+
|0.76119403|| 0.746268657|| 0.63681592|| 0.736318408|| 0.786069652|| 0.587064677|| 0.651741294|| 0.68159204|| 0.900497512|| 0.720840243
+
|-
+
!150
+
|0.756218905|| 0.726368159|| 0.63681592|| 0.736318408|| 0.771144279|| 0.611940299|| 0.651741294|| 0.686567164|| 0.910447761|| 0.720840243
+
|-
+
!160
+
|0.751243781|| 0.71641791|| 0.646766169|| 0.731343284|| 0.776119403|| 0.597014925|| 0.651741294|| 0.696517413|| 0.895522388|| 0.718076285
+
|-
+
!170
+
|0.756218905|| 0.741293532|| 0.661691542|| 0.731343284|| 0.766169154|| 0.60199005|| 0.651741294|| 0.666666667|| 0.900497512|| 0.71973466
+
|-
+
!180
+
|0.781094527|| 0.731343284|| 0.651741294|| 0.736318408|| 0.781094527|| 0.606965174|| 0.631840796|| 0.676616915|| 0.895522388|| 0.721393035
+
|-
+
!190
+
|0.771144279|| 0.726368159|| 0.661691542|| 0.731343284|| 0.766169154|| 0.60199005|| 0.631840796|| 0.706467662|| 0.900497512|| 0.721945826
+
|-
+
!200
+
|0.771144279|| 0.736318408|| 0.641791045|| 0.706467662|| 0.771144279|| 0.606965174|| 0.611940299|| 0.71641791|| 0.900497512|| 0.718076285
+
|-
+
|}
+
:* Window
+
{| border="2px"
+
|+ Classification accuracy (ACC) by window size
+
|-
+
! Window !! Finance !! IT !! Health !! Sports !! Travel !! Education !! Recruitment !! Culture !! Military !! Overall
+
|-
+
!3
+
|0.805970149|| 0.666666667|| 0.621890547|| 0.766169154|| 0.76119403|| 0.542288557|| 0.646766169|| 0.641791045|| 0.900497512|| 0.70591487
+
|-
+
!4
+
|0.756218905|| 0.686567164|| 0.646766169|| 0.776119403|| 0.776119403|| 0.567164179|| 0.631840796|| 0.651741294|| 0.905472637|| 0.710889994
+
|-
+
!5
+
|0.791044776|| 0.711442786|| 0.641791045|| 0.800995025|| 0.76119403|| 0.567164179|| 0.68159204|| 0.641791045|| 0.895522388|| 0.721393035
+
|-
+
!6
+
|0.820895522|| 0.68159204|| 0.626865672|| 0.771144279|| 0.76119403|| 0.537313433|| 0.656716418|| 0.656716418|| 0.900497512|| 0.712548369
+
|-
+
!7
+
|0.7960199|| 0.656716418|| 0.656716418|| 0.800995025|| 0.756218905|| 0.562189055|| 0.661691542|| 0.621890547|| 0.900497512|| 0.712548369
+
|-
+
!8
+
|0.786069652|| 0.68159204|| 0.631840796|| 0.7960199|| 0.766169154|| 0.552238806|| 0.592039801|| 0.696517413|| 0.910447761|| 0.712548369
+
|-
+
!9
+
|0.786069652|| 0.666666667|| 0.606965174|| 0.860696517|| 0.771144279|| 0.532338308|| 0.582089552|| 0.686567164|| 0.900497512|| 0.710337203
+
|-
+
!10
+
|0.805970149|| 0.671641791|| 0.616915423|| 0.835820896|| 0.771144279|| 0.606965174|| 0.651741294|| 0.666666667|| 0.910447761|| 0.726368159
+
|-
+
!11
+
|0.800995025|| 0.696517413|| 0.631840796|| 0.771144279|| 0.751243781|| 0.587064677|| 0.597014925|| 0.671641791|| 0.885572139|| 0.710337203
+
|-
+
!12
+
|0.7960199|| 0.671641791|| 0.626865672|| 0.7960199|| 0.76119403|| 0.542288557|| 0.606965174|| 0.706467662|| 0.900497512|| 0.711995578
+
|-
+
!13
+
|0.791044776|| 0.661691542|| 0.641791045|| 0.830845771|| 0.766169154|| 0.592039801|| 0.552238806|| 0.71641791|| 0.905472637|| 0.717523494
+
|-
+
!14
+
|0.781094527|| 0.701492537|| 0.676616915|| 0.791044776|| 0.741293532|| 0.587064677|| 0.671641791|| 0.621890547|| 0.900497512|| 0.719181868
+
|-
+
!15
+
|0.810945274|| 0.696517413|| 0.63681592|| 0.815920398|| 0.771144279|| 0.55721393|| 0.55721393|| 0.711442786|| 0.905472637|| 0.718076285
+
|-
+
|}
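The window parameter varied in the table above controls how many neighbouring words on each side of a word count as its context during word2vec training. The sketch below enumerates (center, context) pairs for a fixed window; it is a simplification with a hypothetical helper name, and actual implementations may vary the effective window per position.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs within `window` positions on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d", "e"]  # toy sentence, already segmented
pairs_w1 = context_pairs(tokens, 1)
pairs_w2 = context_pairs(tokens, 2)
print(len(pairs_w1), len(pairs_w2))
```

A larger window yields more training pairs per sentence and pulls in more distant words, which is one reason accuracy can shift with this setting.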
+
 
+
*train-word2vec result
+
:* Dimension
+
{| border="2px"
+
|+ Classification accuracy (ACC) by dimension
+
|-
+
! Dimension !! Finance !! IT !! Health !! Sports !! Travel !! Education !! Recruitment !! Culture !! Military !! Overall
+
|-
+
!10
+
|0.641791045|| 0.701492537|| 0.671641791|| 0.711442786|| 0.651741294|| 0.606965174|| 0.71641791|| 0.736318408|| 0.885572139|| 0.702598121
+
|-
+
!20
+
|0.656716418|| 0.771144279|| 0.656716418|| 0.691542289|| 0.711442786|| 0.60199005|| 0.68159204|| 0.810945274|| 0.890547264|| 0.719181868
+
|-
+
!30
+
|0.686567164|| 0.771144279|| 0.68159204|| 0.666666667|| 0.741293532|| 0.631840796|| 0.771144279|| 0.746268657|| 0.910447761|| 0.734107242
+
|-
+
!40
+
|0.68159204|| 0.791044776|| 0.686567164|| 0.671641791|| 0.726368159|| 0.63681592|| 0.76119403|| 0.781094527|| 0.885572139|| 0.735765616
+
|-
+
!50
+
|0.696517413|| 0.771144279|| 0.676616915|| 0.597014925|| 0.706467662|| 0.621890547|| 0.741293532|| 0.7960199|| 0.885572139|| 0.721393035
+
|-
+
!60
+
|0.68159204|| 0.786069652|| 0.68159204|| 0.592039801|| 0.731343284|| 0.606965174|| 0.741293532|| 0.805970149|| 0.885572139|| 0.723604201
+
|-
+
!70
+
|0.686567164|| 0.781094527|| 0.686567164|| 0.592039801|| 0.746268657|| 0.611940299|| 0.741293532|| 0.805970149|| 0.900497512|| 0.728026534
+
|-
+
!80
+
|0.676616915|| 0.766169154|| 0.676616915|| 0.592039801|| 0.741293532|| 0.606965174|| 0.746268657|| 0.810945274|| 0.890547264|| 0.72305141
+
|-
+
!90
+
|0.666666667|| 0.781094527|| 0.676616915|| 0.60199005|| 0.726368159|| 0.592039801|| 0.751243781|| 0.805970149|| 0.910447761|| 0.723604201
+
|-
+
!100
+
|0.651741294|| 0.776119403|| 0.68159204|| 0.60199005|| 0.736318408|| 0.616915423|| 0.756218905|| 0.815920398|| 0.895522388|| 0.725815368
+
|-
+
|}
+
:* Window
+
{| border="2px"
+
|+ Classification accuracy (ACC) by window size
+
|-
+
! Window !! Finance !! IT !! Health !! Sports !! Travel !! Education !! Recruitment !! Culture !! Military !! Overall
+
|-
+
!3
+
|0.656716418|| 0.751243781|| 0.656716418|| 0.582089552|| 0.706467662|| 0.597014925|| 0.726368159|| 0.815920398|| 0.890547264|| 0.70923162
+
|-
+
!4
+
|0.671641791|| 0.776119403|| 0.676616915|| 0.666666667|| 0.736318408|| 0.631840796|| 0.726368159|| 0.825870647|| 0.885572139|| 0.733001658
+
|-
+
!5
+
|0.686567164|| 0.771144279|| 0.701492537|| 0.661691542|| 0.76119403|| 0.582089552|| 0.741293532|| 0.810945274|| 0.885572139|| 0.73355445
+
|-
+
!6
+
|0.696517413|| 0.810945274|| 0.671641791|| 0.711442786|| 0.751243781|| 0.63681592|| 0.746268657|| 0.791044776|| 0.885572139|| 0.744610282
+
|-
+
!7
+
|0.661691542|| 0.7960199|| 0.686567164|| 0.661691542|| 0.726368159|| 0.621890547|| 0.711442786|| 0.810945274|| 0.895522388|| 0.7302377
+
|-
+
!8
+
|0.666666667|| 0.771144279|| 0.701492537|| 0.597014925|| 0.751243781|| 0.651741294|| 0.815920398|| 0.76119403|| 0.900497512|| 0.735212825
+
|-
+
!9
+
|0.706467662|| 0.621890547|| 0.611940299|| 0.388059701|| 0.691542289|| 0.606965174|| 0.60199005|| 0.771144279|| 0.870646766|| 0.652294085
+
|-
+
!10
+
|0.711442786|| 0.766169154|| 0.656716418|| 0.606965174|| 0.746268657|| 0.626865672|| 0.776119403|| 0.800995025|| 0.910447761|| 0.73355445
+
|-
+
!11
+
|0.701492537|| 0.791044776|| 0.701492537|| 0.63681592|| 0.781094527|| 0.651741294|| 0.76119403|| 0.820895522|| 0.92039801|| 0.751796573
+
|-
+
!12
+
|0.701492537|| 0.810945274|| 0.671641791|| 0.641791045|| 0.756218905|| 0.63681592|| 0.786069652|| 0.771144279|| 0.905472637|| 0.742399116
+
|-
+
!13
+
|0.711442786|| 0.781094527|| 0.706467662|| 0.656716418|| 0.771144279|| 0.63681592|| 0.791044776|| 0.805970149|| 0.915422886|| 0.752902156
+
|-
+
!14
+
|0.671641791|| 0.805970149|| 0.676616915|| 0.611940299|| 0.76119403|| 0.641791045|| 0.731343284|| 0.7960199|| 0.915422886|| 0.734660033
+
|-
+
!15
+
|0.671641791|| 0.776119403|| 0.701492537|| 0.626865672|| 0.781094527|| 0.666666667|| 0.741293532|| 0.800995025|| 0.910447761|| 0.741846324
+
|-
+
|}
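This page does not say how a document vector is derived from the word vectors before classification. One common choice, shown here purely as an assumption, is to average the vectors of the words in the document; the helper name and the toy two-dimensional vectors are hypothetical.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # out-of-vocabulary tokens are skipped
        for k in range(dim):
            total[k] += vec[k]
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]


# Toy word vectors (hypothetical values, dimension 2).
word_vectors = {"stock": [1.0, 0.0], "price": [0.0, 1.0]}
doc_vec = average_vector(["stock", "price", "unknown"], word_vectors, 2)
empty_vec = average_vector(["unknown"], word_vectors, 2)
print(doc_vec, empty_vec)
```

With this representation the document vector has the same dimension as the word vectors, so the dimension column in the tables would directly set the classifier's input size.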
+

Latest revision as of 11:33, 28 September 2014 (Sunday)

Problem And Solve

Test

Sougou data