CN-CVS

From cslt Wiki

Revision as of 06:24, 25 October 2022
Introduction

  • Mandarin Visual Speech (CN-CVS), a large-scale Chinese Mandarin audio-visual dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Chen Chen

Description

  • Collect audio and video data from more than 2,500 Mandarin speakers.
  • Automatically clip videos through a pipeline of shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
  • Manually annotate speaker identity and check data quality by human review.
  • Create a benchmark database for the video-to-speech synthesis task.
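The clipping workflow above can be pictured as a chain of filters that successively narrows down candidate clips. The sketch below illustrates that structure only; the stage predicates, dict fields, and thresholds are hypothetical placeholders, not the actual collector code:

```python
# Illustrative sketch of the automatic clipping pipeline: each stage
# keeps only the candidate clips that pass its check. Field names and
# thresholds are assumptions for demonstration purposes.

def run_pipeline(candidates, stages):
    """Apply each filtering stage in order to the candidate clips."""
    for stage in stages:
        candidates = [c for c in candidates if stage(c)]
    return candidates

# Hypothetical per-stage predicates, standing in for shot detection,
# VAD, face detection/tracking, and SyncNet-style sync checking.
def has_single_shot(clip):
    return clip.get("shot_changes", 0) == 0

def has_speech(clip):
    return clip.get("speech_ratio", 0.0) > 0.5

def has_tracked_face(clip):
    # Require a face to be tracked in every frame of the clip.
    return clip.get("face_frames", 0) == clip.get("total_frames", -1)

def is_av_synced(clip):
    return clip.get("sync_confidence", 0.0) > 3.0  # assumed cutoff

candidates = [
    {"shot_changes": 0, "speech_ratio": 0.9, "face_frames": 75,
     "total_frames": 75, "sync_confidence": 5.1},
    {"shot_changes": 2, "speech_ratio": 0.8, "face_frames": 75,
     "total_frames": 75, "sync_confidence": 4.0},
]
kept = run_pipeline(
    candidates,
    [has_single_shot, has_speech, has_tracked_face, is_av_synced],
)
print(len(kept))  # the second candidate is dropped by shot detection
```

In the real pipeline each stage would call the corresponding tool (FFmpeg, pydub, dlib, SyncNet) and attach its scores to the clip metadata before filtering.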

Basic Methods

  • Environments: PyTorch, OpenCV, FFmpeg
  • Shot detection: FFmpeg
  • VAD: pydub
  • Face detection and tracking: dlib
  • Audio-visual synchronization detection: SyncNet model
  • Input: JSON files of video information.
  • Output: video clips and WAV files, as well as metadata JSON files.
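For the shot-detection step, a common FFmpeg approach is the `select` filter with the built-in `scene` change score, logging boundary timestamps via `showinfo`. The sketch below only builds the command line; the 0.4 threshold and file name are assumptions, not the project's actual settings:

```python
# Build an FFmpeg command that logs frames whose scene-change score
# exceeds a threshold -- a standard way to locate shot boundaries.
# The threshold and input path are illustrative.

def scene_detect_cmd(video_path, threshold=0.4):
    vf = f"select='gt(scene,{threshold})',showinfo"
    return [
        "ffmpeg", "-i", video_path,
        "-vf", vf,
        "-f", "null", "-",  # decode only; showinfo logs go to stderr
    ]

cmd = scene_detect_cmd("input.mp4")
print(" ".join(cmd))
```

Running this via `subprocess.run(cmd, stderr=subprocess.PIPE)` and extracting the `pts_time:` values from the `showinfo` output on stderr yields candidate cut points for clipping.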

Reports

Publications


Source Code

  • Collection Pipeline: https://github.com/sectum1919/mvs_data_collector
  • xTS: TODO
  • VCA-GAN: TODO

Download

  • Public (recommended): TODO
  • Local (not recommended): TODO

Future Plans

  • Extract text transcriptions via OCR, ASR, and human checks

License

  • All the resources contained in the database are free for research institutes and individuals.
  • No commercial usage is permitted.

References