- CN-CVS is a large-scale Chinese Mandarin audio-visual dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.
- Current: Dong Wang, Chen Chen
- Collect audio and video data from more than 2,500 Mandarin speakers.
- Automatically clip videos through a pipeline that includes shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
- Manually annotate speaker identity and check data quality.
- Create a benchmark database for the video-to-speech synthesis task.
- Environments: PyTorch, OpenCV, FFmpeg
- Shot detection: ffmpeg
- VAD: pydub
- Face detection and tracking: dlib.
- Audio-visual synchronization detection: SyncNet model.
- Input: JSON files of video information.
- Output: video clips and WAV files, as well as metadata JSON files.
- Collection Pipeline: https://github.com/sectum1919/cncvs_data_collector
- xTS: TODO
- VCA-GAN: TODO
- Public (recommended)
- Local (not recommended)
- Extract text transcriptions via OCR and ASR, followed by human checking.
- Extend the baseline into a benchmark.
- All the resources contained in the database are free for research institutes and individuals.
- No commercial use is permitted.
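
The shot detection stage listed above names ffmpeg but does not say how it is invoked. As a minimal sketch, assuming the common approach of ffmpeg's scene-score `select` filter combined with `showinfo`, the code below builds the command, parses the cut timestamps from the log, and turns them into shot intervals. The function names, the 0.4 threshold, and the requirement that the caller supplies the video duration are all assumptions made for illustration; the actual collector in the repository linked above may work differently.

```python
import re
import subprocess

# The showinfo filter logs one line per selected frame, including "pts_time:<seconds>".
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def build_scene_cmd(video_path, threshold=0.4):
    """Build an ffmpeg command that logs frames whose scene score exceeds threshold.

    The threshold value is a hypothetical default, not taken from the CN-CVS pipeline.
    """
    return [
        "ffmpeg", "-hide_banner", "-i", video_path,
        "-filter:v", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]


def parse_cut_times(showinfo_log):
    """Extract cut timestamps (in seconds) from ffmpeg's log output."""
    times = []
    for line in showinfo_log.splitlines():
        if "showinfo" not in line:
            continue  # skip progress lines and other filter output
        match = PTS_TIME.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def cuts_to_shots(cut_times, duration):
    """Convert cut timestamps into (start, end) shot intervals covering the video."""
    inner = sorted(t for t in cut_times if 0.0 < t < duration)
    bounds = [0.0] + inner + [duration]
    # Drop zero-length intervals that duplicate timestamps would create.
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if end > start]


def detect_shots(video_path, duration, threshold=0.4):
    """Run ffmpeg on a video and return its shot intervals (requires ffmpeg on PATH)."""
    proc = subprocess.run(
        build_scene_cmd(video_path, threshold),
        capture_output=True, text=True, check=True,
    )
    # ffmpeg writes filter logs to stderr.
    return cuts_to_shots(parse_cut_times(proc.stderr), duration)
```

Calling `detect_shots("talk.mp4", duration=120.0)` (a hypothetical file) would return a list of `(start, end)` pairs that later stages such as VAD and face tracking could process one shot at a time. Keeping the parsing separate from the subprocess call makes the logic testable without running ffmpeg.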