CN-Celeb

Latest revision as of 10:06, 6 January 2021

Introduction

  • CN-Celeb is a large-scale Chinese celebrity dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Yunqi Cai, Lantian Li, Yue Fan, Jiawen Kang
  • History: Ziya Zhou, Kaicheng Li, Haolin Chen, Sitong Cheng, Pengyuan Zhang

Description

  • Collect audio data of 1,000 Chinese celebrities.
  • Automatically clip videos through a pipeline including face detection, face recognition, speaker validation, and speaker diarization.
  • Create a benchmark database for the speaker recognition community.

Basic Methods

  • Environments: TensorFlow, PyTorch, Keras, MXNet
  • Face detection and tracking: RetinaFace and ArcFace models.
  • Active speaker verification: SyncNet model.
  • Speaker diarization: UIS-RNN model.
  • Double check by speaker recognition: VGG model.
  • Input: pictures and videos of POIs (Persons of Interest).
  • Output: well-labelled videos of POIs (Persons of Interest).
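The stages listed above can be sketched as a minimal pipeline. This is purely illustrative: the function names, signatures, and placeholder return values are assumptions, not the project's actual API; the real system wires up RetinaFace/ArcFace, SyncNet, UIS-RNN, and a VGG speaker model.

```python
# Illustrative sketch of the collection pipeline described above.
# All bodies are placeholders standing in for the real models.

def detect_and_track_faces(video):
    """Face detection and tracking (RetinaFace in the real pipeline)."""
    return [{"track_id": 0, "frames": list(range(len(video)))}]

def match_poi(face_track, poi_gallery):
    """Face recognition against POI pictures (ArcFace embeddings)."""
    return face_track["track_id"] in poi_gallery

def active_speaker_segments(face_track, audio):
    """Audio-visual sync check (SyncNet): is the tracked face speaking?"""
    return [(0.0, 3.2)]  # (start, end) in seconds, placeholder

def diarize(audio):
    """Speaker diarization (UIS-RNN), double-checked by a VGG speaker model."""
    return [(0.0, 3.2, "spk0")]

def label_video(video, audio, poi_gallery):
    """Input: pictures/videos of POIs; output: time-labelled speech segments."""
    segments = []
    for track in detect_and_track_faces(video):
        if not match_poi(track, poi_gallery):
            continue
        speaking = active_speaker_segments(track, audio)
        # Keep diarized segments that overlap an active-speaking span.
        for start, end, spk in diarize(audio):
            if any(s < end and start < e for s, e in speaking):
                segments.append((start, end, spk))
    return segments
```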

Reports

  • Stage report v1.0: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/%E6%96%87%E4%BB%B6:C-STAR.pdf

Publications

@misc{fan2019cnceleb,
  title={CN-CELEB: a challenging Chinese speaker recognition dataset},
  author={Yue Fan and Jiawen Kang and Lantian Li and Kaicheng Li and Haolin Chen and Sitong Cheng and Pengyuan Zhang and Ziya Zhou and Yunqi Cai and Dong Wang},
  year={2019},
  eprint={1911.01799},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}

@misc{li2020cn,
  title={CN-Celeb: multi-genre speaker recognition},
  author={Lantian Li and Ruiqi Liu and Jiawen Kang and Yue Fan and Hao Cui and Yunqi Cai and Ravichander Vipperla and Thomas Fang Zheng and Dong Wang},
  year={2020},
  eprint={2012.12468},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}

Source Code

  • Collection Pipeline: https://github.com/celebrity-audio-collection/videoprocess
  • Baseline Systems: https://github.com/csltstu/kaldi/tree/cnceleb/egs/cnceleb

Download

  • Public (recommended)

OpenSLR: http://www.openslr.org/82/

  • Local (not recommended)

CSLT@Tsinghua: http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/
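A minimal fetch-and-extract sketch for the public OpenSLR copy (resource 82). The archive filename below is an assumption; check http://www.openslr.org/82/ for the actual file list before downloading.

```python
# Sketch: download the CN-Celeb archive from OpenSLR and extract it.
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "http://www.openslr.org/resources/82"
ARCHIVE = "cn-celeb_v2.tar.gz"  # assumed name -- verify on the OpenSLR page

def fetch_cn_celeb(dest: str = "data") -> Path:
    """Download the archive (if not already present) and extract under dest."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / ARCHIVE
    if not archive_path.exists():
        urllib.request.urlretrieve(f"{BASE_URL}/{ARCHIVE}", archive_path)
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

Usage: `fetch_cn_celeb("data")` downloads once and re-extracts on later calls without re-downloading.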

Future Plans

  • Augment the database to 10,000 people.
  • Build an LSTM-based model that connects SyncNet and speaker diarization and learns the relationship between them.
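One way such a bridging model could look, purely as a sketch: an LSTM that maps per-frame SyncNet-style features to per-frame speaker-activity scores for the diarization stage. All dimensions and the architecture itself are assumptions for illustration, not the planned design.

```python
import torch
import torch.nn as nn

class SyncDiarBridge(nn.Module):
    """Illustrative LSTM mapping per-frame audio-visual sync features
    to per-frame speaking/not-speaking logits (dimensions assumed)."""

    def __init__(self, sync_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(sync_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # per-frame activity logit

    def forward(self, sync_feats):            # (batch, time, sync_dim)
        out, _ = self.lstm(sync_feats)        # (batch, time, 2 * hidden)
        return self.head(out).squeeze(-1)     # (batch, time)

feats = torch.randn(2, 50, 128)   # 2 clips, 50 frames each
logits = SyncDiarBridge()(feats)
print(logits.shape)               # torch.Size([2, 50])
```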

License

  • All the resources contained in the database are free for research institutes and individuals.
  • No commercial usage is permitted.

References

  • Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild", 2019. https://arxiv.org/pdf/1905.00641.pdf
  • Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 2018. https://arxiv.org/abs/1801.07698
  • Wang et al., "CosFace: Large Margin Cosine Loss for Deep Face Recognition", 2018. https://arxiv.org/pdf/1801.09414.pdf
  • Liu et al., "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017. https://arxiv.org/pdf/1704.08063.pdf
  • Zhong et al., "GhostVLAD for Set-based Face Recognition", 2018. http://www.robots.ox.ac.uk/~vgg/publications/2018/Zhong18b/zhong18b.pdf
  • Chung et al., "Out of Time: Automated Lip Sync in the Wild", 2016. http://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf
  • Xie et al., "Utterance-level Aggregation for Speaker Recognition in the Wild", 2019. https://arxiv.org/pdf/1902.10107.pdf
  • Zhang et al., "Fully Supervised Speaker Diarization", 2018. https://arxiv.org/pdf/1810.04719v1.pdf