CN-Celeb

From cslt Wiki
Revision as of 12:06, 29 October 2019 (Tue) by Lilt (talk | contribs)

Introduction

  • CN-Celeb is a large-scale dataset of Chinese celebrities' speech, collected "in the wild".

Members

  • Current: Dong Wang, Yunqi Cai, Lantian Li, Yue Fan, Jiawen Kang
  • History: Ziya Zhou, Kaicheng Li, Haolin Chen, Sitong Cheng, Pengyuan Zhang

Target

  • Collect audio data of 1,000 Chinese celebrities.
  • Automatically clip videos through a pipeline that includes face detection, face recognition, speaker validation and speaker diarization.
  • Create a benchmark database for the speaker recognition community.

Future Plans

  • Expand the database to 10,000 people.
  • Build an LSTM-based model that connects SyncNet and speaker diarization and learns the relationship between them.

Basic method

  • Environments: TensorFlow, PyTorch, Keras, MXNet
  • Face detection and tracking based on the RetinaFace and ArcFace models.
  • Active speaker verification based on the SyncNet model.
  • Speaker diarization based on the UIS-RNN model.
  • Double check by speaker recognition based on a VGG model.
  • Input: Pictures and videos of POIs (Persons of Interest).
  • Output: well-labelled videos of POIs (Persons of Interest).
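
The steps above form a chain of filters: each stage receives candidate segments of a POI and passes on only those it accepts. The following is a minimal sketch of that structure, not the project's actual code. The names `Segment`, `min_duration`, `speaker_double_check` and `run_pipeline` are hypothetical, the embeddings are toy two-dimensional vectors, and the real stages (RetinaFace/ArcFace, SyncNet, UIS-RNN, the VGG-based speaker model) are replaced by simple placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence
import math


@dataclass
class Segment:
    """A candidate clip of the POI, in seconds from the start of the video."""
    start: float
    end: float
    embedding: Sequence[float]  # hypothetical speaker embedding of the clip


# A stage takes candidate segments and returns the ones it accepts.
Stage = Callable[[List[Segment]], List[Segment]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_duration(seconds: float) -> Stage:
    """Placeholder stage: drop segments shorter than `seconds`."""
    return lambda segs: [s for s in segs if s.end - s.start >= seconds]


def speaker_double_check(reference: Sequence[float], threshold: float) -> Stage:
    """Placeholder for the final check: keep segments whose embedding is
    close enough to a reference embedding of the POI (assumed scoring rule)."""
    return lambda segs: [s for s in segs if cosine(s.embedding, reference) >= threshold]


def run_pipeline(segments: List[Segment], stages: List[Stage]) -> List[Segment]:
    """Apply each stage in order; later stages only see surviving segments."""
    for stage in stages:
        segments = stage(segments)
    return segments


candidates = [
    Segment(0.0, 3.0, [1.0, 0.0]),
    Segment(5.0, 5.5, [1.0, 0.0]),   # too short
    Segment(8.0, 12.0, [0.0, 1.0]),  # embedding does not match the POI
]
kept = run_pipeline(
    candidates,
    [min_duration(1.0), speaker_double_check([1.0, 0.0], threshold=0.8)],
)
print([(s.start, s.end) for s in kept])  # [(0.0, 3.0)]
```

Writing each stage as a function from segments to segments keeps the order of the checks explicit and lets a model-backed stage replace a placeholder without changing the rest of the chain.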

GitHub of our project

celebrity-audio-collection

Reports

Stage Report v1.0

References

  • Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild", 2019. [1]
  • Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 2018. [2]
  • Wang et al., "CosFace: Large Margin Cosine Loss for Deep Face Recognition", 2018. [3]
  • Liu et al., "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017. [4]
  • Zhong et al., "GhostVLAD for Set-based Face Recognition", 2018. link
  • Chung et al., "Out of Time: Automated Lip Sync in the Wild", 2016. link
  • Xie et al., "Utterance-level Aggregation for Speaker Recognition in the Wild", 2019. link
  • Zhang et al., "Fully Supervised Speaker Diarization", 2018. link