Introduction

Members

Collect audio data of 1,000 Chinese celebrities.
Automatically clip videos through a pipeline that includes face detection, face recognition, speaker validation, and speaker diarization.
Create a benchmark database for the speaker recognition community.
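
The clipping pipeline above can be sketched as a chain of filters followed by diarization. This is only an illustration under stated assumptions: the names `Stages` and `clip_segments`, the stage signatures, and the idea of passing candidate time ranges between stages are hypothetical and are not taken from this project's code. Real models would be plugged in behind each callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A candidate clip, expressed as (start_sec, end_sec) within one video.
Segment = Tuple[float, float]


@dataclass
class Stages:
    """Hypothetical stage interface; each field wraps one model."""
    detect_faces: Callable[[Segment], bool]       # is any face visible?
    recognize_target: Callable[[Segment], bool]   # is it the target celebrity?
    validate_speaker: Callable[[Segment], bool]   # is the visible face the one speaking?
    diarize: Callable[[Segment], List[Segment]]   # sub-ranges where the target speaks


def clip_segments(candidates: List[Segment], stages: Stages) -> List[Segment]:
    """Run the four stages in the order listed above and return kept clips."""
    kept: List[Segment] = []
    for seg in candidates:
        # The first three stages act as filters: drop the candidate early
        # as soon as one check fails (assumed ordering, cheapest first).
        if not stages.detect_faces(seg):
            continue
        if not stages.recognize_target(seg):
            continue
        if not stages.validate_speaker(seg):
            continue
        # Diarization may split one candidate into several speech ranges.
        kept.extend(stages.diarize(seg))
    return kept
```

Keeping each stage behind a plain callable means a detector or recognizer can be swapped without touching the clipping loop, and the loop can be exercised with stub functions before any model is available.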

Expand the database to 10,000 people.
Build an LSTM-based model that connects SyncNet and Speaker_Diarization and learns the relationship between them.

Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild", 2019.
Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 2018.
Wang et al., "CosFace: Large Margin Cosine Loss for Deep Face Recognition", 2018.
Liu et al., "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017.
Zhong et al., "GhostVLAD for Set-based Face Recognition", 2018.
Chung et al., "Out of Time: Automated Lip Sync in the Wild", 2016.
Xie et al., "Utterance-level Aggregation for Speaker Recognition in the Wild", 2019.
Zhang et al., "Fully Supervised Speaker Diarization", 2018.