CN-CVS

From cslt Wiki

Version as of 11:44, 30 October 2022

Introduction

  • Mandarin Visual Speech: a large-scale Chinese Mandarin audio-visual dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Chen Chen

Description

  • Collect audio and video data from more than 2,500 Mandarin speakers.
  • Automatically clip videos through a pipeline of shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
  • Manually annotate speaker identity and manually check data quality.
  • Create a benchmark database for the video-to-speech synthesis task.
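The clipping steps above can be sketched as a cascade of filters over candidate segments: each stage either passes a segment on or discards it. The stage bodies below are placeholder stubs with illustrative thresholds and function names (assumptions for this sketch, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds

def detect_shots(duration):
    # Stub: split the video into fixed 10-second shots.
    bounds = [float(t) for t in range(0, int(duration), 10)] + [duration]
    return [Segment(a, b) for a, b in zip(bounds, bounds[1:])]

def voice_active(seg):
    # Stub VAD: treat every segment longer than 1 second as containing speech.
    return seg.end - seg.start > 1.0

def face_tracked(seg):
    # Stub: pretend a face was detected and tracked through the whole segment.
    return True

def av_synchronized(seg):
    # Stub: pretend audio and lip motion passed the synchronization check.
    return True

def clip_pipeline(duration):
    """Keep only the segments that survive every stage of the cascade."""
    return [s for s in detect_shots(duration)
            if voice_active(s) and face_tracked(s) and av_synchronized(s)]

clips = clip_pipeline(25.0)
```

In the real pipeline each stub would call the corresponding tool (shot detector, VAD, face tracker, SyncNet), but the cascade structure stays the same.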

Basic Methods

  • Environments: PyTorch, OpenCV, FFmpeg
  • Shot detection: FFmpeg
  • VAD: pydub
  • Face detection and tracking: dlib
  • Audio-visual synchronization detection: SyncNet model
  • Input: JSON files of video information
  • Output: video clips and WAV files, as well as metadata JSON files
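Since both the input and the output of the pipeline are JSON metadata files, the round trip can be sketched with the standard library alone. The field names below (speaker_id, clip, sync_confidence, etc.) are hypothetical; this page does not specify the actual schema:

```python
import json

# Hypothetical metadata record for one extracted clip; the real
# CN-CVS field names may differ.
record = {
    "speaker_id": "S0001",
    "source_video": "video_0001.mp4",
    "clip": "S0001_0003.mp4",
    "audio": "S0001_0003.wav",
    "start": 12.48,              # clip start in the source video, seconds
    "end": 15.92,                # clip end, seconds
    "sync_confidence": 0.87,     # audio-visual synchronization score
}

# Write the per-clip records as a JSON metadata file ...
with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)

# ... and read them back, e.g. when building the benchmark splits.
with open("metadata.json", encoding="utf-8") as f:
    loaded = json.load(f)
```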

Reports

Publications


Source Code

Download

  • Public (recommended)

TODO

  • Local (not recommended)

TODO

Future Plans

  • Extract text transcriptions via OCR, ASR, and human checking

License

  • All the resources contained in the database are free for research institutes and individuals.
  • No commercial usage is permitted.

References