CN-CVS

Latest revision as of 11:47, 30 October 2022

Introduction

  • CN-CVS, a large-scale Chinese Mandarin audio-visual dataset published by Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Chen Chen

Description

  • Collect audio and video data from more than 2,500 Mandarin speakers.
  • Automatically clip videos through a pipeline including shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
  • Manually annotate speaker identity and check data quality.
  • Create a benchmark database for the video-to-speech synthesis task.

Basic Methods

  • Environments: PyTorch, OpenCV, FFmpeg
  • Shot detection: FFmpeg
  • VAD: pydub
  • Face detection and tracking: dlib
  • Audio-visual synchronization detection: SyncNet model
  • Input: JSON files of video information
  • Output: video clips and WAV files, together with metadata JSON files
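
The final clip-and-metadata step of such a pipeline can be sketched in a few lines. This is a minimal stdlib-only illustration, not the released pipeline code: the function names, the metadata field names, and the 16 kHz mono audio format are assumptions, and the earlier stages (shot detection, VAD, face tracking, SyncNet scoring) rely on external tools that are not shown here.

```python
import json

def make_clip_commands(src, start, end, out_video, out_wav):
    """Build ffmpeg argument lists (not executed here) that cut the
    [start, end] span in seconds from src into a video clip and a
    separate mono 16 kHz WAV file (assumed audio format)."""
    video_cmd = ["ffmpeg", "-y", "-i", src,
                 "-ss", str(start), "-to", str(end),
                 "-c", "copy", out_video]
    wav_cmd = ["ffmpeg", "-y", "-i", src,
               "-ss", str(start), "-to", str(end),
               "-vn", "-ac", "1", "-ar", "16000", out_wav]
    return video_cmd, wav_cmd

def clip_metadata(speaker_id, src, start, end):
    """Assumed per-clip metadata record; the real schema may differ."""
    return {"speaker": speaker_id, "source": src,
            "start": start, "end": end,
            "duration": round(end - start, 3)}

# Example: cut one 3.5 s segment and emit its metadata as JSON.
v_cmd, w_cmd = make_clip_commands("ep01.mp4", 12.0, 15.5,
                                  "clip_0001.mp4", "clip_0001.wav")
meta = clip_metadata("S0001", "ep01.mp4", 12.0, 15.5)
print(json.dumps(meta))
```

The commands could be run with `subprocess.run`; they are built as argument lists rather than shell strings so that file names with spaces are handled safely.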

Reports

Publications


Source Code

  • Collection Pipeline: https://github.com/sectum1919/cncvs_data_collector
  • xTS: TODO
  • VCA-GAN: TODO
Download

  • Public (recommended)

https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/

  • Local (not recommended)

https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/

Future Plans

  • Extract text transcription via OCR & ASR & Human check
  • Extend baseline to benchmark

License

  • All the resources contained in the database are free for research institutes and individuals.
  • No commercial usage is permitted.

References