CN-CVS

From cslt Wiki

Revision as of 11:47, 30 October 2022 (Sunday) by Cchen

Contents

1 Introduction
2 Members
3 Description
4 Basic Methods
5 Reports
6 Publications
7 Source Code
8 Download
9 Future Plans
10 License
11 References

Introduction

CN-CVS is a large-scale Chinese Mandarin audio-visual dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

Current: Dong Wang, Chen Chen

Description

Collect audio and video data from more than 2,500 Mandarin speakers.
Automatically clip videos through a pipeline comprising shot detection, voice activity detection (VAD), face detection, face tracking, and audio-visual synchronization detection.
Manually annotate speaker identity and manually check data quality.
Create a benchmark database for the video-to-speech synthesis task.
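One way to picture the automatic clipping step is as an intersection of time intervals: a candidate clip must lie inside a single shot and inside a stretch of detected speech. The sketch below illustrates that idea only. The function names, the use of seconds as the time unit, and the `min_len` threshold are assumptions made for illustration, not details taken from the CN-CVS pipeline.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); unit is an assumption


def shots_from_cuts(cut_times: List[float], duration: float) -> List[Interval]:
    """Turn shot-change timestamps into per-shot intervals covering the video."""
    bounds = [0.0] + sorted(t for t in cut_times if 0.0 < t < duration) + [duration]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping intervals."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def candidate_clips(
    cut_times: List[float],
    speech: List[Interval],
    duration: float,
    min_len: float = 1.0,  # hypothetical minimum clip length in seconds
) -> List[Interval]:
    """Keep speech stretches that stay within one shot and are long enough."""
    shots = shots_from_cuts(cut_times, duration)
    return [(s, e) for s, e in intersect(shots, sorted(speech)) if e - s >= min_len]


if __name__ == "__main__":
    clips = candidate_clips(
        cut_times=[10.0, 20.0],
        speech=[(2.0, 12.5), (19.5, 25.0)],
        duration=30.0,
    )
    print(clips)
```

In a full pipeline, candidates like these would still have to pass face detection, face tracking, and synchronization checks before being kept.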

Basic Methods

Environment: PyTorch, OpenCV, FFmpeg.
Shot detection: FFmpeg.
VAD: pydub.
Face detection and tracking: dlib.
Audio-visual synchronization detection: SyncNet model.
Input: JSON files of video information.
Output: video clips and WAV files, along with metadata JSON files.
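As a rough illustration of the first two stages, the sketch below uses FFmpeg's `select` filter with the `scene` score plus `showinfo` to find shot changes, and pydub's `detect_nonsilent` to find non-silent stretches of audio. The threshold values, function names, and the choice of these particular FFmpeg and pydub features are assumptions; the actual collector may configure these tools differently.

```python
import re
import subprocess
from typing import List, Tuple

# showinfo writes one line per selected frame to stderr, including "pts_time:<seconds>".
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def build_scene_cmd(video_path: str, threshold: float = 0.4) -> List[str]:
    """Build an FFmpeg command that logs frames whose scene score exceeds threshold.

    The 0.4 default is an illustrative value, not one taken from CN-CVS.
    """
    return [
        "ffmpeg", "-hide_banner", "-i", video_path,
        "-filter:v", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]


def parse_cut_times(showinfo_log: str) -> List[float]:
    """Extract shot-change timestamps (seconds) from FFmpeg's showinfo output."""
    return [float(m.group(1)) for m in PTS_TIME.finditer(showinfo_log)]


def detect_cuts(video_path: str, threshold: float = 0.4) -> List[float]:
    """Run FFmpeg and return shot-change timestamps. Requires ffmpeg on PATH."""
    proc = subprocess.run(
        build_scene_cmd(video_path, threshold),
        stderr=subprocess.PIPE, text=True, check=True,
    )
    return parse_cut_times(proc.stderr)


def speech_segments(audio_path: str, min_silence_ms: int = 300) -> List[Tuple[float, float]]:
    """Return non-silent stretches in seconds using pydub's energy threshold.

    pydub is imported lazily so the parsing helpers work without it installed.
    The silence threshold (16 dB below average loudness) is an assumption.
    """
    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent

    audio = AudioSegment.from_file(audio_path)
    spans_ms = detect_nonsilent(
        audio, min_silence_len=min_silence_ms, silence_thresh=audio.dBFS - 16
    )
    return [(start / 1000.0, end / 1000.0) for start, end in spans_ms]
```

`detect_cuts("talk.mp4")` and `speech_segments("talk.wav")` (hypothetical file names) would then yield the timestamps that later stages, such as dlib face tracking and SyncNet scoring, filter further.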

Reports

Publications

Source Code

Collection Pipeline: https://github.com/sectum1919/cncvs_data_collector
xTS: TODO
VCA-GAN: TODO

Download

Public (recommended)

https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/

Local (not recommended)

https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/

Future Plans

Extract text transcriptions via OCR and ASR, followed by human checking.
Extend the baseline into a benchmark.

License

All the resources contained in the database are free for research institutes and individuals.
No commercial use is permitted.

References

Retrieved from "http://index.cslt.org/mediawiki/index.php?title=CN-CVS&oldid=39573"