OC17-data

From cslt Wiki

Latest revision as of 12:07, 20 April 2017 (Thursday)

Data allowed for use

The MixASR-CHEN 17 challenge permits the following data resources:

  • Training data: OC16-CE80 training/dev set + THCHS30
  • Development data: OC16-CE80 test set
  • Test data: OC17-CE10 test set
  • Lexicon: THCHS30 Chinese lexicon + CMU English lexicon
  • Additional word list: OC17-EnWord, an English word list that covers most of the English OOVs in the test set. No pronunciations are provided for these words.
  • LM: The THCHS30 LM can be used. In addition, all the transcriptions of OC16-CE80 training/dev/test and THCHS30 may be used to improve this basic LM.
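
Combining the two permitted lexicons can be sketched as follows. This is a minimal illustration, not the organizers' procedure. It assumes both lexicons are plain text with one entry per line in the form `WORD PH1 PH2 ...`, that comment lines start with `;;;`, and that alternate pronunciations are marked `WORD(1)`, `WORD(2)`, as in CMUdict. The function names and the sample entries in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Matches CMUdict-style alternate-pronunciation markers such as "HELLO(1)".
_ALT_MARKER = re.compile(r"^(.+)\(\d+\)$")


def parse_lexicon(lines, comment_prefix=";;;"):
    """Parse 'WORD PH1 PH2 ...' lines into {word: [[phone, ...], ...]}."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue
        word, *phones = line.split()
        if not phones:
            continue  # skip entries that carry no pronunciation
        match = _ALT_MARKER.match(word)
        if match:
            word = match.group(1)  # fold WORD(1) into WORD
        if phones not in lexicon[word]:
            lexicon[word].append(phones)
    return dict(lexicon)


def merge_lexicons(*lexicons):
    """Union several lexicons, keeping every distinct pronunciation once."""
    merged = defaultdict(list)
    for lexicon in lexicons:
        for word, pronunciations in lexicon.items():
            for pron in pronunciations:
                if pron not in merged[word]:
                    merged[word].append(pron)
    return dict(merged)
```

For example, `merge_lexicons(parse_lexicon(["你好 n i3 h ao3"]), parse_lexicon(["HELLO  HH AH0 L OW1"]))` yields one dictionary that holds both the Chinese and the English entry. Whether the two phone sets should be kept apart or mapped onto a shared set is a modelling decision that this sketch does not address.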

OC16-CE80

OC16-CE80 is a speech database provided by SpeechOcean (http://www.speechocean.com) for this challenge. Its main features are:

  • 1400+ speakers
  • Mobile channel
  • 80 hours of speech signals
  • Transcriptions are provided
  • The data is free for challenge participants
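  • The licence file is at http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OC16-CE80
  • The data profile is at http://cslt.riit.tsinghua.edu.cn/mediawiki/images/d/d5/OC16-CE80-profile.pdf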

OC17-CE10

OC17-CE10 is a speech database provided by SpeechOcean (http://www.speechocean.com) for this challenge. Its main features are:

  • 100+ speakers
  • Mobile channel
  • 10 hours of speech signals
  • Transcriptions are provided
  • The data is free for challenge participants
  • The licence file is on the OC17-CE10 page of this wiki

THCHS30

THCHS30 is a Chinese speech database provided by CSLT@Tsinghua University. All the resources of THCHS30 can be used to improve the system, especially the lexicon and LM. The data is available at:

http://www.openslr.org/18/

CMU English dictionary

To recognize English words, the CMU English dictionary (version 0.7b) may be used:

http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b

OC17-EnWord list

OC17-EnWord is a word list that covers most of the OOV words in the OC17-CE10 test set. It can be used to enhance your system. However, no pronunciations are provided for these words, so you may want to generate them with a grapheme-to-phoneme (G2P) tool.
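
Before running a G2P tool, it can help to check which words in the list really lack a pronunciation. The sketch below is one possible way to do this under stated assumptions: the word list is taken to hold one word per line (its actual format is not described here), and the dictionary lines follow the CMUdict layout of `WORD  PH1 PH2 ...` with `;;;` comments and `WORD(1)` alternates. The function names are hypothetical.

```python
import re

# Matches CMUdict-style alternate-pronunciation markers such as "READ(1)".
_ALT_MARKER = re.compile(r"^(.+)\(\d+\)$")


def cmudict_words(lines):
    """Collect the upper-cased headwords found in CMUdict-style lines."""
    words = set()
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue
        head = line.split()[0]
        match = _ALT_MARKER.match(head)
        words.add((match.group(1) if match else head).upper())
    return words


def words_needing_g2p(word_list, known_words):
    """Return words absent from known_words, in order, without duplicates."""
    seen = set()
    missing = []
    for word in word_list:
        word = word.strip()
        if not word:
            continue
        key = word.upper()  # CMUdict headwords are upper case
        if key not in known_words and key not in seen:
            seen.add(key)
            missing.append(word)
    return missing
```

Only the words returned by `words_needing_g2p` then need to be passed to the G2P tool. The rest can take their pronunciations directly from the dictionary.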