Introduction

Speech signals involve complex factors, each contributing in an unknown and hidden way. Recently developed deep learning methods provide some interesting tools for discovering these latent factors. These tools include unsupervised models such as VAE and GAN, and supervised learning methods such as multi-task learning and knowledge distillation. They allow us to decipher the secrets of speech signals from large amounts of data rather than from hypotheses.

These tools are expected to lead to a major breakthrough in speech information processing. Some early signs of this breakthrough include:

In speaker recognition, speaker factors can be learned from a very short speech segment.
In speech synthesis, speaking styles can be learned as latent variables and discovered in an unsupervised way, and speaker factors can be used to change the speaker's traits.
In speech recognition, learning multiple tasks collaboratively has been shown to be successful.

In previous studies (Phase 1), we found that, using cascade learning, speech signals can be factorized into content, speaker, and emotion factors at the frame level. In Phase 2, we will try to answer the following questions:

Can we factorize speech signals in an unsupervised way?
How can supervised and unsupervised factorizations be integrated?
How can we deal with language discrepancy in factorization?
How can we discover optimal factorization architectures?

People

Dong Wang, Yunqi Cai, Haoran Sun

Research direction

Basic research

Collaborative learning with AutoML
VAE/dVAE factorization
ASR + TTS cycle training

Applied research

Pretraining for ASR, SID, EMD (BERT in speech)
Low-resource ASR, TTS
Signal compression, cleaning up, etc.

Deep Speech Factorization-2
