CN-CVS

From cslt Wiki

Introduction

  • CN-CVS is a large-scale Chinese Mandarin audio-visual dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Chen Chen

Description

  • Collect audio and video data from more than 2,500 Mandarin speakers.
  • Automatically clip videos through a pipeline that includes shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
  • Manually annotate speaker identity and manually check data quality.
  • Create a benchmark database for the video-to-speech synthesis task.
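The first automatic step above, shot detection, can be sketched with FFmpeg's scene-change filter. This is a minimal illustration, not the project's actual script: the threshold value 0.4 and the helper names (`detect_shots`, `parse_scene_times`) are assumptions for the example.

```python
import re
import subprocess

# showinfo log lines carry a "pts_time:<seconds>" field for each selected frame
SHOWINFO_PTS = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")

def parse_scene_times(ffmpeg_stderr: str) -> list:
    """Extract shot-boundary timestamps (in seconds) from showinfo log output."""
    return [float(m.group(1)) for m in SHOWINFO_PTS.finditer(ffmpeg_stderr)]

def detect_shots(video_path: str, threshold: float = 0.4) -> list:
    """Run FFmpeg's scene filter: frames whose scene-change score exceeds
    `threshold` are kept by select and logged by showinfo as shot boundaries."""
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",          # decode only; no output file is written
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return parse_scene_times(proc.stderr)  # showinfo writes to stderr
```

The returned timestamps can then be used to cut the video into single-shot clips before the downstream face and VAD stages.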

Basic Methods

  • Environments: PyTorch, OpenCV, FFmpeg
  • Shot detection: FFmpeg
  • VAD: pydub
  • Face detection and tracking: dlib
  • Audio-visual synchronization detection: SyncNet model
  • Input: JSON files of video information
  • Output: video clips and WAV files, as well as metadata JSON files
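The VAD and face-detection stages named above can be sketched with the listed libraries. This is an illustrative sketch only: the thresholds (-40 dB, 300 ms, 100 ms padding) and helper names are assumptions, not the dataset's actual settings.

```python
def pad_ranges(ranges, pad_ms, total_ms):
    """Widen each [start_ms, end_ms] range by pad_ms on both sides,
    clipped to the bounds of the audio."""
    return [[max(0, s - pad_ms), min(total_ms, e + pad_ms)] for s, e in ranges]

def vad_segments(wav_path, silence_thresh_db=-40, min_silence_ms=300, pad_ms=100):
    """Energy-based VAD via pydub: return [start_ms, end_ms] speech ranges."""
    from pydub import AudioSegment              # third-party: pip install pydub
    from pydub.silence import detect_nonsilent
    audio = AudioSegment.from_wav(wav_path)
    speech = detect_nonsilent(audio,
                              min_silence_len=min_silence_ms,
                              silence_thresh=silence_thresh_db)
    return pad_ranges(speech, pad_ms, len(audio))

def detect_faces(frame):
    """Frontal face detection with dlib; returns boxes as (l, t, r, b) tuples."""
    import dlib                                 # third-party: pip install dlib
    detector = dlib.get_frontal_face_detector()
    rects = detector(frame, 1)                  # 1 = upsample once for small faces
    return [(r.left(), r.top(), r.right(), r.bottom()) for r in rects]
```

The detected boxes would feed dlib's `correlation_tracker` for tracking, and the aligned face crops plus VAD-trimmed audio would then go to the SyncNet check.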

Reports

Publications


Source Code

Download

  • Public (recommended)

https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/

  • Local (not recommended)

https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/

Future Plans

  • Extract text transcriptions via OCR, ASR, and human checking
  • Extend the baseline into a full benchmark

License

  • All resources contained in the database are free for research institutions and individuals.
  • No commercial use is permitted.

References