CN-CVS

From cslt Wiki
Revision as of 05:09, 25 October 2022

Introduction

  • CN-CVS (Mandarin Visual Speech) is a large-scale Chinese Mandarin audio-visual dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Chen Chen

Description

  • Collect audio and video data from more than 2,500 Mandarin speakers.
  • Automatically clip videos through a pipeline of shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
  • Manually annotate speaker identity and verify data quality.
  • Create a benchmark database for the video-to-speech synthesis task.
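The clipping pipeline above can be illustrated with its first stage, shot detection. A minimal sketch using FFmpeg's scene-change filter; the 0.4 threshold, the helper names, and the log-parsing details are assumptions for illustration, not the project's actual code:

```python
import re
import subprocess

def parse_showinfo_times(stderr_text):
    """Extract frame timestamps (seconds) from FFmpeg showinfo log lines."""
    return [float(t) for t in re.findall(r"pts_time:([\d.]+)", stderr_text)]

def detect_shot_changes(video_path, threshold=0.4):
    """Run FFmpeg's scene-change detector and return cut timestamps.

    Frames whose scene score exceeds `threshold` pass the select filter;
    showinfo logs them (with pts_time) to stderr, which we then parse.
    """
    cmd = [
        "ffmpeg", "-hide_banner", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return parse_showinfo_times(proc.stderr)
```

Consecutive timestamps bound candidate clips, which would then pass through the VAD, face, and synchronization stages.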

Basic Methods

  • Environments: PyTorch, OpenCV, FFmpeg
  • Shot detection: FFmpeg
  • VAD: pydub
  • Face detection and tracking: dlib
  • Audio-visual synchronization detection: SyncNet model
  • Input: JSON files of video information
  • Output: video clips and WAV files, plus metadata JSON files
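The page names pydub for the VAD stage; the underlying idea can be shown with a dependency-free short-time-energy sketch (the frame length and energy threshold are assumptions, and real pipelines would use pydub's silence detection or a trained VAD instead):

```python
def energy_vad(samples, frame_len=160, energy_thresh=1e5):
    """Label fixed-length frames of 16-bit PCM samples as speech or silence
    by mean short-time energy, merging adjacent speech frames into segments.

    Returns a list of (start_sample, end_sample) speech spans.
    """
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > energy_thresh and start is None:
            start = i * frame_len          # speech onset
        elif energy <= energy_thresh and start is not None:
            segments.append((start, i * frame_len))  # speech offset
            start = None
    if start is not None:                  # speech runs to end of signal
        segments.append((start, n_frames * frame_len))
    return segments
```

On a signal of silence, then loud samples, then silence, this returns the single span covering the loud region.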

Reports

Publications


Source Code

  • Collection Pipeline: TODO

Download

  • Public (recommended)

TODO

  • Local (not recommended)

TODO

Future Plans

  • Extract text transcriptions via OCR, ASR, and human checking

License

  • All resources in the database are free for research institutes and individuals.
  • No commercial usage is permitted.

References