Sougou data

From cslt Wiki

Document classification of Sougou data

  • Data
    • Data from SougouLab [1], using SogouC.reduced (30M)
    • 9 classes: Finance (财经), IT, Health (健康), Sports (体育), Travel (旅游), Education (教育), Recruitment (招聘), Culture (文化), Military (军事)
    • Train and test: train(), test(), dev()
  • Text preprocessing
    • Segment words using a 9W (about 90,000-entry) word list (Tencent)
    • Remove stop words. stop_wordlist is
  • Some tools
    • weka
    • scw
    • Google word2vec
    • LDA
  • Class map

C000007 Automobile (汽车)
C000008 Finance (财经)
C000010 IT
C000013 Health (健康)
C000014 Sports (体育)
C000016 Travel (旅游)
C000020 Education (教育)
C000022 Recruitment (招聘)
C000023 Culture (文化)
C000024 Military (军事)

  • result data [2]
  • Paper: Document classification based on word vectors [3]
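
The preprocessing and class map listed above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the function names `remove_stop_words` and `label_for`, the toy tokens, and the two-entry stop-word set are assumptions, and the input is assumed to be already segmented. Only the class codes come from the class map; the English labels are translations.

```python
# Class codes from the class map above; English labels are translations.
CLASS_MAP = {
    "C000007": "Automobile",   # 汽车
    "C000008": "Finance",      # 财经
    "C000010": "IT",
    "C000013": "Health",       # 健康
    "C000014": "Sports",       # 体育
    "C000016": "Travel",       # 旅游
    "C000020": "Education",    # 教育
    "C000022": "Recruitment",  # 招聘
    "C000023": "Culture",      # 文化
    "C000024": "Military",     # 军事
}


def remove_stop_words(tokens, stop_words):
    """Drop stop words and empty tokens from an already segmented document."""
    return [t for t in tokens if t and t not in stop_words]


def label_for(class_code):
    """Translate a corpus class code such as 'C000008' into a label."""
    return CLASS_MAP[class_code]


# Assumed toy input: one segmented sentence and a tiny stop-word set.
tokens = ["股市", "的", "行情", "了", "上涨"]
stop_words = {"的", "了"}
print(remove_stop_words(tokens, stop_words))  # ['股市', '行情', '上涨']
print(label_for("C000008"))                   # Finance
```

Note that the class map has ten codes while the experiments use nine classes; C000007 (Automobile) is not among the nine.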

VSM Test

  • Data
    • Dimension: 9402
  • Method
    • Document representation: tf-idf weights are used as word weights
    • Classifier: Naive Bayes
  • Result
Classification result (ACC)
Finance IT Health Sports Travel Education Recruitment Culture Military Overall
ACC-test 0.72139 0.72139 0.75124 0.82089 0.79602 0.61194 0.70647 0.64179 0.79104 0.72913
ACC-train 0.678 0.718 0.708 0.708 0.73
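
As a rough illustration of the VSM setup (tf-idf document vectors fed to a Naive Bayes classifier), here is a small pure-Python sketch. It is not the configuration used for the numbers above: the page does not say which tf-idf variant or Naive Bayes implementation was used, so the smoothed idf formula, the multinomial model with add-one smoothing, the names `fit_idf`, `tfidf` and `NaiveBayes`, and the toy English tokens are all assumptions.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency learned from tokenised training docs."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf(doc, idf):
    """Map a tokenised document to {word: tf * idf}; unseen words are ignored."""
    counts = Counter(w for w in doc if w in idf)
    total = sum(counts.values())
    return {w: (c / total) * idf[w] for w, c in counts.items()}


class NaiveBayes:
    """Multinomial Naive Bayes that accepts fractional (tf-idf) feature weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant

    def fit(self, vectors, labels):
        self.vocab = {w for vec in vectors for w in vec}
        self.doc_counts = Counter(labels)
        self.word_weight = {label: defaultdict(float) for label in self.doc_counts}
        for vec, label in zip(vectors, labels):
            for word, weight in vec.items():
                self.word_weight[label][word] += weight
        self.total_weight = {
            label: sum(w.values()) for label, w in self.word_weight.items()
        }
        return self

    def predict(self, vec):
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.doc_counts.items():
            score = math.log(n / n_docs)  # log prior
            denom = self.total_weight[label] + self.alpha * len(self.vocab)
            for word, weight in vec.items():
                if word in self.vocab:
                    num = self.word_weight[label].get(word, 0.0) + self.alpha
                    score += weight * math.log(num / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (made up for illustration).
train_docs = [
    ["stock", "bank", "rate"],
    ["bank", "loan", "stock"],
    ["football", "match", "goal"],
    ["basketball", "match", "team"],
]
train_labels = ["Finance", "Finance", "Sports", "Sports"]

idf = fit_idf(train_docs)
model = NaiveBayes().fit([tfidf(d, idf) for d in train_docs], train_labels)
print(model.predict(tfidf(["bank", "stock", "match"], idf)))  # Finance
print(model.predict(tfidf(["match", "goal"], idf)))           # Sports
```

The idf is learned on the training documents only and reused for test documents, so words that never occur in training contribute nothing at prediction time.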

LDA Test

  • LDA test
  • result
Classification accuracy (ACC) by dimension
Dimension Finance IT Health Sports Travel Education Recruitment Culture Military Overall
10 0.76119403 0.308457711 0.68159204 0.885572139 0.686567164 0.179104478 0.656716418 0.36318408 0.915422886 0.604201216
20 0.845771144 0.308457711 0.686567164 0.835820896 0.497512438 0.223880597 0.572139303 0.328358209 0.815920398 0.568269762
30 0.810945274 0.164179104 0.71641791 0.820895522 0.52238806 0.288557214 0.587064677 0.378109453 0.815920398 0.567164179
40 0.815920398 0.368159204 0.711442786 0.850746269 0.621890547 0.36318408 0.825870647 0.313432836 0.870646766 0.637921504
50 0.845771144 0.39800995 0.7960199 0.850746269 0.606965174 0.368159204 0.611940299 0.313432836 0.860696517 0.627971255
60 0.860696517 0.293532338 0.776119403 0.800995025 0.47761194 0.338308458 0.810945274 0.298507463 0.850746269 0.611940299
70 0.84079602 0.338308458 0.781094527 0.736318408 0.447761194 0.417910448 0.686567164 0.373134328 0.885572139 0.611940299
80 0.860696517 0.39800995 0.676616915 0.751243781 0.621890547 0.402985075 0.820895522 0.308457711 0.850746269 0.632393588
90 0.875621891 0.338308458 0.746268657 0.756218905 0.507462687 0.368159204 0.597014925 0.407960199 0.865671642 0.606965174
100 0.805970149 0.527363184 0.736318408 0.741293532 0.646766169 0.417910448 0.76119403 0.31840796 0.885572139 0.648977336
110 0.860696517 0.373134328 0.726368159 0.711442786 0.611940299 0.517412935 0.771144279 0.368159204 0.865671642 0.645107794
120 0.865671642 0.427860697 0.686567164 0.736318408 0.656716418 0.47761194 0.781094527 0.378109453 0.855721393 0.651741294
130 0.850746269 0.462686567 0.751243781 0.661691542 0.606965174 0.487562189 0.791044776 0.36318408 0.835820896 0.645660586
140 0.7960199 0.507462687 0.666666667 0.731343284 0.587064677 0.482587065 0.756218905 0.393034826 0.875621891 0.644002211
150 0.830845771 0.432835821 0.706467662 0.686567164 0.492537313 0.452736318 0.741293532 0.417910448 0.880597015 0.626865672
160 0.805970149 0.437810945 0.676616915 0.711442786 0.641791045 0.47761194 0.815920398 0.422885572 0.870646766 0.651188502
170 0.825870647 0.393034826 0.71641791 0.736318408 0.621890547 0.517412935 0.587064677 0.402985075 0.900497512 0.633499171
180 0.7960199 0.502487562 0.781094527 0.691542289 0.552238806 0.55721393 0.592039801 0.452736318 0.830845771 0.639579878
190 0.855721393 0.462686567 0.766169154 0.71641791 0.562189055 0.507462687 0.656716418 0.472636816 0.865671642 0.651741294
200 0.835820896 0.412935323 0.781094527 0.706467662 0.577114428 0.482587065 0.641791045 0.432835821 0.875621891 0.638474295
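
To make the table easier to read programmatically, the following sketch parses a few rows copied from it and compares the last column with the unweighted mean of the nine per-class accuracies. For the rows checked here the two agree, which is what the overall accuracy would be if every class had the same number of test documents; that interpretation is an inference, not something the page states. The helper names `parse_row` and `mean` are hypothetical.

```python
def parse_row(line):
    """Split one table row into (setting, per-class accuracies, last column)."""
    fields = line.split()
    values = [float(x) for x in fields[1:]]
    return int(fields[0]), values[:-1], values[-1]


def mean(values):
    return sum(values) / len(values)


# Rows copied from the table above (dimensions 10, 100 and 120).
ROWS = [
    "10 0.76119403 0.308457711 0.68159204 0.885572139 0.686567164 "
    "0.179104478 0.656716418 0.36318408 0.915422886 0.604201216",
    "100 0.805970149 0.527363184 0.736318408 0.741293532 0.646766169 "
    "0.417910448 0.76119403 0.31840796 0.885572139 0.648977336",
    "120 0.865671642 0.427860697 0.686567164 0.736318408 0.656716418 "
    "0.47761194 0.781094527 0.378109453 0.855721393 0.651741294",
]

parsed = [parse_row(line) for line in ROWS]
for dim, per_class, last in parsed:
    # The recomputed mean and the reported last column should match.
    print(dim, round(mean(per_class), 6), round(last, 6))

# Setting with the highest last-column value among the sampled rows.
best_dim = max(parsed, key=lambda row: row[2])[0]
print("best of sampled rows:", best_dim)
```

The same parser works for the Word2vec tables, since they share the row layout (setting, nine per-class values, one final value).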

Word2vec Test

  • Word2vec result
  • Dimension
Classification accuracy (ACC) by dimension
Dimension Finance IT Health Sports Travel Education Recruitment Culture Military Overall
10 0.766169154 0.383084577 0.52238806 0.820895522 0.666666667 0.44278607 0.567164179 0.721393035 0.850746269 0.637921504
20 0.781094527 0.537313433 0.572139303 0.830845771 0.76119403 0.452736318 0.611940299 0.646766169 0.860696517 0.672747374
30 0.815920398 0.671641791 0.606965174 0.835820896 0.766169154 0.552238806 0.577114428 0.68159204 0.885572139 0.710337203
40 0.7960199 0.68159204 0.631840796 0.805970149 0.756218905 0.572139303 0.577114428 0.701492537 0.905472637 0.714206744
50 0.805970149 0.691542289 0.641791045 0.800995025 0.751243781 0.552238806 0.651741294 0.656716418 0.910447761 0.718076285
60 0.7960199 0.68159204 0.626865672 0.776119403 0.736318408 0.572139303 0.626865672 0.651741294 0.895522388 0.707020453
70 0.7960199 0.701492537 0.621890547 0.781094527 0.771144279 0.572139303 0.631840796 0.656716418 0.905472637 0.715312327
80 0.7960199 0.686567164 0.626865672 0.805970149 0.776119403 0.582089552 0.631840796 0.676616915 0.905472637 0.720840243
90 0.805970149 0.71641791 0.621890547 0.776119403 0.766169154 0.572139303 0.646766169 0.666666667 0.915422886 0.720840243
100 0.776119403 0.706467662 0.631840796 0.751243781 0.786069652 0.577114428 0.646766169 0.666666667 0.910447761 0.716970702
110 0.771144279 0.71641791 0.656716418 0.741293532 0.76119403 0.597014925 0.606965174 0.691542289 0.910447761 0.716970702
120 0.76119403 0.71641791 0.646766169 0.756218905 0.766169154 0.60199005 0.661691542 0.686567164 0.915422886 0.723604201
130 0.776119403 0.731343284 0.631840796 0.76119403 0.771144279 0.577114428 0.626865672 0.701492537 0.905472637 0.720287452
140 0.76119403 0.746268657 0.63681592 0.736318408 0.786069652 0.587064677 0.651741294 0.68159204 0.900497512 0.720840243
150 0.756218905 0.726368159 0.63681592 0.736318408 0.771144279 0.611940299 0.651741294 0.686567164 0.910447761 0.720840243
160 0.751243781 0.71641791 0.646766169 0.731343284 0.776119403 0.597014925 0.651741294 0.696517413 0.895522388 0.718076285
170 0.756218905 0.741293532 0.661691542 0.731343284 0.766169154 0.60199005 0.651741294 0.666666667 0.900497512 0.71973466
180 0.781094527 0.731343284 0.651741294 0.736318408 0.781094527 0.606965174 0.631840796 0.676616915 0.895522388 0.721393035
190 0.771144279 0.726368159 0.661691542 0.731343284 0.766169154 0.60199005 0.631840796 0.706467662 0.900497512 0.721945826
200 0.771144279 0.736318408 0.641791045 0.706467662 0.771144279 0.606965174 0.611940299 0.71641791 0.900497512 0.718076285
  • Window
Classification accuracy (ACC) by window size
Window Finance IT Health Sports Travel Education Recruitment Culture Military Overall
3 0.805970149 0.666666667 0.621890547 0.766169154 0.76119403 0.542288557 0.646766169 0.641791045 0.900497512 0.70591487
4 0.756218905 0.686567164 0.646766169 0.776119403 0.776119403 0.567164179 0.631840796 0.651741294 0.905472637 0.710889994
5 0.791044776 0.711442786 0.641791045 0.800995025 0.76119403 0.567164179 0.68159204 0.641791045 0.895522388 0.721393035
6 0.820895522 0.68159204 0.626865672 0.771144279 0.76119403 0.537313433 0.656716418 0.656716418 0.900497512 0.712548369
7 0.7960199 0.656716418 0.656716418 0.800995025 0.756218905 0.562189055 0.661691542 0.621890547 0.900497512 0.712548369
8 0.786069652 0.68159204 0.631840796 0.7960199 0.766169154 0.552238806 0.592039801 0.696517413 0.910447761 0.712548369
9 0.786069652 0.666666667 0.606965174 0.860696517 0.771144279 0.532338308 0.582089552 0.686567164 0.900497512 0.710337203
10 0.805970149 0.671641791 0.616915423 0.835820896 0.771144279 0.606965174 0.651741294 0.666666667 0.910447761 0.726368159
11 0.800995025 0.696517413 0.631840796 0.771144279 0.751243781 0.587064677 0.597014925 0.671641791 0.885572139 0.710337203
12 0.7960199 0.671641791 0.626865672 0.7960199 0.76119403 0.542288557 0.606965174 0.706467662 0.900497512 0.711995578
13 0.791044776 0.661691542 0.641791045 0.830845771 0.766169154 0.592039801 0.552238806 0.71641791 0.905472637 0.717523494
14 0.781094527 0.701492537 0.676616915 0.791044776 0.741293532 0.587064677 0.671641791 0.621890547 0.900497512 0.719181868
15 0.810945274 0.696517413 0.63681592 0.815920398 0.771144279 0.55721393 0.55721393 0.711442786 0.905472637 0.718076285
  • train-word2vec result
  • Dimension
Classification accuracy (ACC) by dimension
Dimension Finance IT Health Sports Travel Education Recruitment Culture Military Overall
10 0.641791045 0.701492537 0.671641791 0.711442786 0.651741294 0.606965174 0.71641791 0.736318408 0.885572139 0.702598121
20 0.656716418 0.771144279 0.656716418 0.691542289 0.711442786 0.60199005 0.68159204 0.810945274 0.890547264 0.719181868
30 0.686567164 0.771144279 0.68159204 0.666666667 0.741293532 0.631840796 0.771144279 0.746268657 0.910447761 0.734107242
40 0.68159204 0.791044776 0.686567164 0.671641791 0.726368159 0.63681592 0.76119403 0.781094527 0.885572139 0.735765616
50 0.696517413 0.771144279 0.676616915 0.597014925 0.706467662 0.621890547 0.741293532 0.7960199 0.885572139 0.721393035
60 0.68159204 0.786069652 0.68159204 0.592039801 0.731343284 0.606965174 0.741293532 0.805970149 0.885572139 0.723604201
70 0.686567164 0.781094527 0.686567164 0.592039801 0.746268657 0.611940299 0.741293532 0.805970149 0.900497512 0.728026534
80 0.676616915 0.766169154 0.676616915 0.592039801 0.741293532 0.606965174 0.746268657 0.810945274 0.890547264 0.72305141
90 0.666666667 0.781094527 0.676616915 0.60199005 0.726368159 0.592039801 0.751243781 0.805970149 0.910447761 0.723604201
100 0.651741294 0.776119403 0.68159204 0.60199005 0.736318408 0.616915423 0.756218905 0.815920398 0.895522388 0.725815368
  • Window
Classification accuracy (ACC) by window size
Window Finance IT Health Sports Travel Education Recruitment Culture Military Overall
3 0.656716418 0.751243781 0.656716418 0.582089552 0.706467662 0.597014925 0.726368159 0.815920398 0.890547264 0.70923162
4 0.671641791 0.776119403 0.676616915 0.666666667 0.736318408 0.631840796 0.726368159 0.825870647 0.885572139 0.733001658
5 0.686567164 0.771144279 0.701492537 0.661691542 0.76119403 0.582089552 0.741293532 0.810945274 0.885572139 0.73355445
6 0.696517413 0.810945274 0.671641791 0.711442786 0.751243781 0.63681592 0.746268657 0.791044776 0.885572139 0.744610282
7 0.661691542 0.7960199 0.686567164 0.661691542 0.726368159 0.621890547 0.711442786 0.810945274 0.895522388 0.7302377
8 0.666666667 0.771144279 0.701492537 0.597014925 0.751243781 0.651741294 0.815920398 0.76119403 0.900497512 0.735212825
9 0.706467662 0.621890547 0.611940299 0.388059701 0.691542289 0.606965174 0.60199005 0.771144279 0.870646766 0.652294085
10 0.711442786 0.766169154 0.656716418 0.606965174 0.746268657 0.626865672 0.776119403 0.800995025 0.910447761 0.73355445
11 0.701492537 0.791044776 0.701492537 0.63681592 0.781094527 0.651741294 0.76119403 0.820895522 0.92039801 0.751796573
12 0.701492537 0.810945274 0.671641791 0.641791045 0.756218905 0.63681592 0.786069652 0.771144279 0.905472637 0.742399116
13 0.711442786 0.781094527 0.706467662 0.656716418 0.771144279 0.63681592 0.791044776 0.805970149 0.915422886 0.752902156
14 0.671641791 0.805970149 0.676616915 0.611940299 0.76119403 0.641791045 0.731343284 0.7960199 0.915422886 0.734660033
15 0.671641791 0.776119403 0.701492537 0.626865672 0.781094527 0.666666667 0.741293532 0.800995025 0.910447761 0.741846324
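
The page does not say how a document is turned into a single vector from its word2vec word vectors. One common baseline, shown here purely as an assumption, is to average the vectors of the words in the document. The function name `document_vector` and the 3-dimensional toy embeddings are made up for illustration; in the experiments above the dimension would be the value in the first column of the dimension tables.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding.

    Tokens without an embedding are skipped; a document with no known
    tokens maps to the zero vector.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue
        known += 1
        for i, x in enumerate(vec):
            total[i] += x
    if known == 0:
        return total
    return [x / known for x in total]


# Toy 3-dimensional embeddings (made up).
word_vectors = {
    "stock": [1.0, 0.0, 2.0],
    "bank": [3.0, 2.0, 0.0],
}
print(document_vector(["stock", "bank", "unknown"], word_vectors, 3))  # [2.0, 1.0, 1.0]
```

The resulting fixed-length vector can then be passed to any standard classifier, which keeps the feature dimension far below the 9402 used in the VSM test.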