DataBase

From cslt Wiki
Revision as of 06:17, 26 February 2014 (Wed) by Lxs (talk | contribs)


lm

name | size | dir | description
SogouQ.full.train.3gram.gz | 132M | /work/lxs/nlphome/lm/SogouQ-500M | trainData=SogouQ(800M);dict=11w-tencent
SogouT-11w-merge2-1.3gram.gz | 4.1G | /work/lxs/nlphome/lm/SogouT-140G | trainData=SogouT(140G);dict=11w-tencent
SogouT-11w-merge2-2.3gram.gz | 3.9G | /work/lxs/nlphome/lm/SogouT-140G |
8w8.3gram.tencent.gz | 452M | /work/lxs/nlphome/lm/Tencent |
musicQuery-ltc.3gram.gz | 28M | /work/lxs/nlphome/lm/TencentQ/musicQuery | use qa15w-singer-songs.wordlist
TencentQ.3gram.gz | 1.4G | /work/lxs/nlphome/lm/TencentQ/qa15w | use qa15w.lexicion
mix-corp1-corp2.3gram.gz | 1.3G | /work/lxs/nlphome/lm/TencentQ/qa15w-nosinger-song | use qa15w-nosinger-song.wordlist
mix-corp1_0.5-corp2_0.5.3gram.gz | 1.4G | /work/lxs/nlphome/lm/TencentQ/qa15w-singer-song | use qa15w-singer-song.wordlist
11w_merge6_kn.3gram.gz | 4.3G | /work/lxs/nlphome/lm/TencentQA-100G | trainData=qa(100G),dict=11w-tencent
8w8_new_merge6_kn.3gram0.gz | 4.5G | /work/lxs/nlphome/lm/TencentQA-100G | trainData=qa(100G),dict=8w8-tencent
Hunhe_zhongzi_and_add_and_PPL_5yuan_3e9.lm.utf8.1e-5.3gram.gz | 1.4M | /work/lxs/nlphome/lm/jietong |
Hunhe_zhongzi_and_add_and_PPL_5yuan_3e9.lm.utf8.1e-9.5gram.gz | 389M | /work/lxs/nlphome/lm/jietong |
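
As a quick sanity check on the models listed above, the n-gram counts can be read from the file header without decompressing the whole archive. The sketch below assumes the .gz files are gzip-compressed ARPA-format text, which this page does not state; the function name read_arpa_counts is hypothetical.

```python
import gzip
import re

# Matches header lines such as "ngram 3=123456" (ARPA format is an assumption).
NGRAM_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_arpa_counts(path):
    """Return {order: count} parsed from the data header of a gzipped ARPA file."""
    counts = {}
    in_header = False
    # errors="replace" keeps the scan going if the model is not UTF-8 encoded;
    # the header lines themselves are plain ASCII.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for raw in handle:
            line = raw.strip()
            if line == "\\data\\":
                in_header = True
                continue
            if not in_header:
                continue
            match = NGRAM_LINE.match(line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                # Reached the first n-gram section, so the header is over.
                break
    return counts
```

Because the loop stops at the first n-gram section, only the first few lines of even a multi-gigabyte model are decompressed. If a file turns out to use a different format, the function simply returns an empty dictionary.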

lexicon wordlist

name | size | dir | description
singer.lexicion | 2060 | /work/lxs/nlphome/dict/lex-wordlist/music/lr |
singer.low.lexicion | 2060 | /work/lxs/nlphome/dict/lex-wordlist/music/lr |
singer.pinyin | 2104 | /work/lxs/nlphome/dict/lex-wordlist/music/lr |
song.lexicion | 4639 | /work/lxs/nlphome/dict/lex-wordlist/music/lr |
song.low.lexicion | 4639 | /work/lxs/nlphome/dict/lex-wordlist/music/lr |
song.pinyin | 4644 | /work/lxs/nlphome/dict/lex-wordlist/music/lr |
qa15w-ch-sinovoice.lexicion | 92469 | /work/lxs/nlphome/dict/lex-wordlist/qa-check |
qa15w-ch.pinyin | 92469 | /work/lxs/nlphome/dict/lex-wordlist/qa-check |
qa15w.lexicion | 158404 | /work/lxs/nlphome/dict/lex-wordlist/qa-check |
11w.lexicion | 122172 | /work/lxs/nlphome/dict/lex-wordlist/tencent |
8w8.lexicion | 90795 | /work/lxs/nlphome/dict/lex-wordlist/tencent |

no-lexicon wordlist

name | size | dir | description
singer.wordlist | 19k | /work/lxs/nlphome/dict/nolex-wordlist/music/lr |
song.wordlist | 68k | /work/lxs/nlphome/dict/nolex-wordlist/music/lr |
album.txt | 227k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
area.txt | 32B | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
chart.txt | 336B | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
drama.txt | 7.2k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
language.txt | 343B | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
singer.txt | 42k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
stopwords.txt | 6.1k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
song.txt | 408k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
style.txt | 6.6k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
type.txt | 18B | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc |
entity.txt | 590k | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | merge of album, area, chart, drama, language, singer, song, stopwords, style, and type
qa15w.wordlist | 1.2M | /work/lxs/nlphome/dict/nolex-wordlist/qa-check |
11w.wordlist | 888k | /work/lxs/nlphome/dict/nolex-wordlist/tencent |
8w8.wordlist | 666k | /work/lxs/nlphome/dict/nolex-wordlist/tencent |
scws20w-utf8.wordlist | 6.5M | /work/lxs/nlphome/dict/nolex-wordlist |
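
The entity.txt entry above is described as a merge of the other lists in music/ltc. A merge of that kind could be sketched as follows, assuming one entry per line, UTF-8 text, and removal of duplicates; none of these details are stated on this page, and the function name merge_wordlists is hypothetical.

```python
from pathlib import Path


def merge_wordlists(paths, out_path, encoding="utf-8"):
    """Merge one-entry-per-line wordlists into out_path and return the entry count.

    Duplicates and blank lines are dropped; first-seen order is kept.
    Both behaviours are assumptions, not documented properties of entity.txt.
    """
    seen = set()
    merged = []
    for path in paths:
        for line in Path(path).read_text(encoding=encoding).splitlines():
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    Path(out_path).write_text("".join(w + "\n" for w in merged), encoding=encoding)
    return len(merged)
```

For the music/ltc directory, paths would be the ten source files (album.txt through type.txt) and out_path would be entity.txt. Keeping first-seen order makes the output reproducible for a fixed input order.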

lenvxx

path: /nfs/corpus/data/corpora/lenvxx

description: The data is stored in /nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus

(This directory contains 4 subdirectories: ChinaDivision, dict, dict4VOD, and document Resource.)
1. Directory
/nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus/dict
1. Contains the directory
sogou-dict
  • 城市信息 (city information): city and place names for many provinces, some local dialect terms, and bus station and street names for some cities.
  • 电子游戏 (video games)
  • 单机游戏 (stand-alone games): names of stand-alone games from 2001 to 2011, plus wordlists for some individual games.
  • 网游 (online games): names of online games from 2008 to 2011, plus wordlists for some individual games.
  • 工程与应用科学 (engineering and applied science): specialized vocabulary wordlists for engineering.
  • 计算机 (computer): specialized vocabulary wordlists for computing, plus Alibaba product vocabulary across many fields.
  • 农林鱼畜 (agriculture, forestry, fishery, and livestock): wordlists about livestock and agriculture.
  • 人文科学 (humanities)
  • 文学 (literature): wordlists about ancient Chinese literature and masterworks, plus wordlists for some novels.
  • 语言 (language): wordlists of idioms, folklore, and Internet buzzwords.
  • 哲学 (philosophy): wordlists about philosophy, for instance Hegel and Marxism.
  • 宗教 (religion): wordlists about Taoism, Buddhism, and Islam.
  • 历史 (history): wordlists about Chinese history, Japan's Warring States period, and diplomacy.
  • 其他 (other): a wordlist about ancient Chinese numerology.
  • 社会科学 (social science)
  • 法律 (law): wordlists about law.
  • 教育 (education): wordlists about the buildings of some universities, some textbook wordlists, and lists of Chinese universities and famous American universities.
  • 金融 (finance): wordlists about finance.
  • 军事 (military): wordlists about the military.
  • 政治 (politics): wordlists about Party and government offices, politics, and the official institutions of ancient China.
  • 其他 (other): wordlists about public relations, ethics, and anthropology.
  • 生活 (daily life): wordlists about many areas of daily life.
  • 医学 (medicine): wordlists about medical science.
  • 艺术 (art)
  • 书法篆刻 (calligraphy and seal carving): wordlists about calligraphy and seal carving.
  • 舞蹈 (dance): wordlists about dance and rhythmic gymnastics.
  • 戏剧 (drama): wordlists about drama.
  • 音乐 (music): wordlists about music as a discipline, both Chinese and Western.
  • 其他 (other): wordlists of tea, sculpture, er ren zhuan, world heritage, and artists.
  • 娱乐 (entertainment)
  • 电影电视 (film and television): wordlists about science fiction films.
  • 动漫 (animation and comics): wordlists about some cartoons.
  • 流行音乐 (pop music): wordlists about the novel series A Song of Ice and Fire and about fashionable words and phrases.
  • 明星 (celebrities): wordlists about some famous people.
  • 汽车 (cars): wordlists about the automotive field.
  • 收藏 (collecting): wordlists about advertisements.
  • 时尚品牌 (fashion brands): the directory is empty.
  • 运动休闲 (sports and leisure)
  • F1赛车 (F1 racing): the directory is empty.
  • 奥运 (Olympics): wordlists about the Olympics.
  • 垂钓 (fishing): wordlists about fishing.
  • 轮滑 (roller skating): a wordlist about roller skating.
  • 棋牌 (board and card games): wordlists about mahjong, Go, Chinese chess, and San Guo Sha.
  • 气功 (qigong): wordlists about qigong.
  • 球类 (ball games): wordlists about football, basketball, table tennis, golf, and badminton.
  • 杀人游戏 (the "Killer" party game): the directory is empty.
  • 跆拳道 (taekwondo): wordlists about taekwondo.
  • 太极拳 (tai chi): wordlists about ba gua and tai ji quan.
  • 武术 (martial arts): wordlists about wu shu.
  • 自行车 (cycling): the directory is empty.
  • 其他 (other): wordlists about fencing, judo, wrestling, and yoga.
  • 自然科学 (natural science)
  • 化学 (chemistry): wordlists about chemistry.
  • 生物 (biology): wordlists about biology.
  • 数学 (mathematics): wordlists about mathematics.
  • 天文学 (astronomy): wordlists about astronomy.
  • 物理 (physics): wordlists about physics.
  • 其他 (other): wordlists about stones.
2. Contains the directory
movie (many wordlists about the film field)
  • 电影 (films): film wordlists for mainland China, Hong Kong and Taiwan, Europe and America, and Asia.
  • 明星 (stars): film star wordlists for mainland China, Hong Kong and Taiwan, Europe and America, and Asia.
3. Contains the directory
movie-dict (wordlists of actor, director, moviename, roles, style)
4. Contains the directory
name (wordlists of famous people in mainland China, Hong Kong and Taiwan, Europe and America, and Asia)
5. Contains the directory
NER (wordlists of English, Japanese, Korean, and Russian personal names)
6. Contains the directory
Pinyin (a wordlist of duo yin zi, i.e. characters with more than one pronunciation)
7. Contains the directory
VOD
  • 电视剧 (TV series): a wordlist of TV series.
  • 电影 (films): a wordlist of films.
  • 微电影 (micro films): a wordlist of micro films.
  • 音乐 (music): wordlists of famous songs from mainland China, Hong Kong and Taiwan, Europe and America, and Japan and South Korea.
  • 综艺 (variety shows): a wordlist of variety shows.
8. Contains the directory
领域术语 (domain terminology: wordlists about computers, the economy, travel, sports, and medicine)
9. Contains the directory
语言学词库 (linguistics lexicon)
  • 基础名词 (basic nouns): includes person, abstract noun, nature, man-made things, and fashion noun.
  • 语言学词汇类别 (linguistic vocabulary categories): includes all grammatical vocabulary.
2. Directory
/nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus/dict4VOD
This directory contains wordlists of film distribution companies, film awards, film festivals, and actors' names, plus a Chinese-English comparison table.
3. Directory
/nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus/ChinaDivision
This directory contains 4 wordlists, one for each of 4 levels (province name, city name, region name, street name).