Difference between revisions of "OC17-data"
Latest revision as of 12:07, 20 April 2017 (Thursday)
Data allowed for use
MixASR-CHEN 17 allows the following data resources to be used:
- Training data: OC16-CE80 training/dev set + THCHS30
- Development data: OC16-CE80 test set
- Test data: OC17-CE10 test set
- Lexicon: THCHS30 Chinese lexicon + CMU English lexicon
- Additional word list: OC17-EnWord, an additional English word list that covers most of the English OOVs in the test set. No phone transcriptions are provided for these words.
- LM: The THCHS30 LM can be used as the basic LM, and all the transcriptions of OC16-CE80 training/dev/test and THCHS30 can be used to improve it.
OC16-CE80
OC16-CE80 is a speech database provided by SpeechOcean (http://www.speechocean.com) for this challenge. Its main features are:
- 1400+ speakers
- Mobile channel
- 80 hours of speech signals
- Transcriptions are provided
- The data is free for challenge participants
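- The licence file is at http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OC16-CE80
- The data profile is at http://cslt.riit.tsinghua.edu.cn/mediawiki/images/d/d5/OC16-CE80-profile.pdf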
OC17-CE10
OC17-CE10 is a speech database provided by SpeechOcean (http://www.speechocean.com) for this challenge. Its main features are:
- 100+ speakers
- Mobile channel
- 10 hours of speech signals
- Transcriptions are provided
- The data is free for challenge participants
- The licence file is on the OC17-CE10 wiki page
THCHS30
THCHS30 is a Chinese speech database provided by CSLT@Tsinghua University. All the resources of THCHS30 can be used to improve the system, especially the lexicon and LM. The data is available at:
CMU English dictionary
To recognize English words, the CMU English dictionary 0.7b may be used:
http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b
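As a rough illustration of how the dictionary might be loaded, the sketch below parses lines in the CMUdict text layout: comment lines start with `;;;`, each entry is a word followed by its phones, and alternative pronunciations carry a `(1)`, `(2)`, ... suffix on the word. The function name `parse_cmudict` and the sample lines are assumptions made for this example and are not part of the challenge materials.

```python
import re
from collections import defaultdict

# Matches the "(1)", "(2)", ... suffix that marks an alternative pronunciation.
_VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and ";;;" comment lines.
        if not line or line.startswith(";;;"):
            continue
        head, *phones = line.split()
        if not phones:
            continue  # malformed entry without a pronunciation
        word = _VARIANT.sub("", head)  # "HELLO(1)" -> "HELLO"
        lexicon[word].append(phones)
    return dict(lexicon)


# Illustrative lines in the CMUdict layout (not read from the real file).
sample = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(1)  HH EH0 L OW1",
    "WORLD  W ER1 L D",
]
lexicon = parse_cmudict(sample)
```

To parse the downloaded file, the same function can be given an open file handle. Opening it with `encoding="latin-1"` is a cautious choice, on the assumption that the 0.7b file contains a few non-ASCII bytes that would otherwise raise decode errors.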
OC17-EnWord list
OC17-EnWord is a word list that covers most of the OOV words in the OC17-CE10 test set. It can be used to enhance your system. However, no pronunciations are provided for these words, so you may want to generate them with a grapheme-to-phoneme (G2P) tool.
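Before running a G2P tool, it may help to check which words in the list already have a pronunciation in an existing lexicon, so that only the remainder needs generated pronunciations. The sketch below shows one way to do that split. The function name `split_by_coverage`, the toy lexicon, and the toy word list are all hypothetical, and the case-insensitive matching is an assumption about how the word list relates to the upper-case CMU entries.

```python
def split_by_coverage(words, lexicon):
    """Split words into (covered, missing) by case-insensitive lexicon lookup."""
    known = {entry.upper() for entry in lexicon}
    covered, missing = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue  # ignore blank lines in the word list
        if word.upper() in known:
            covered.append(word)
        else:
            missing.append(word)
    return covered, missing


# Toy data for illustration only; not taken from OC17-EnWord or CMUdict.
toy_lexicon = {"HELLO": [["HH", "AH0", "L", "OW1"]]}
toy_words = ["hello", "wechat", "", "Hello "]
covered, missing = split_by_coverage(toy_words, toy_lexicon)
```

Here `missing` holds the words that would still need pronunciations from a G2P tool, while `covered` holds the words that can reuse their existing lexicon entries.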