| (25 intermediate revisions by the same user not shown) |
| Line 1: |
Line 1: |
| − | ==Problem An Solve== | + | ==Problem And Solve== |
| − | ==Document classification of Sougou data ==
| + | *[[How to import the sparse data of vsm to weka]] |
| − | * DATA
| + | |
| − | :* Data from SougouLab [http://www.sogou.com/labs/dl/c.html],using SogouC.reduced(30M)
| + | |
| − | :* 9 classes: Finance (财经), IT, Health (健康), Sports (体育), Travel (旅游), Education (教育), Recruitment (招聘), Culture (文化), Military (军事)
| + | |
| − | :* train and test: train(),test(),dev()
| + | |
| − | *Text preprocessing
| + | |
| − | :* Segment word using wordlist of 9W.(tencent)
| + | |
| − | :* Remove stop word.stop_wordlist is
| + | |
| − | :*
| + | |
| − | *Some Tools
| + | |
| − | :* weka
| + | |
| − | :* scw
| + | |
| − | :* google word2vec
| + | |
| − | :* LDA
| + | |
| − | ===VSM Test===
| + | |
| − | *Data
| + | |
| − | :* dimension:9402
| + | |
| − | *Method
| + | |
| − | :* document representation: use tf-idf as the word weight
| + | |
| − | :* classifier: Naive Bayes
| + | |
| − | *Result
| + | |
| | | | |
| − | ===LDA Test=== | + | ==Test== |
| − | ===Word2vec Test===
| + | [[Sougou data]] |