Phonetic Temporal Neural LID


Project name

Phonetic Temporal Neural (PTN) Model for Language Identification


Project members

Dong Wang, Zhiyuan Tang, Lantian Li, Ying Shi


Introduction

Deep neural models, particularly the LSTM-RNN model, have shown great potential for language identification (LID). However, the use of phonetic information has been largely overlooked by most existing neural LID methods, although this information has been used very successfully in conventional phonetic LID systems. In this project, we present a phonetic temporal neural (PTN) model for LID, which is an LSTM-RNN LID system that accepts phonetic features produced by a phone-discriminative DNN as the input, rather than raw acoustic features. This new model is similar to traditional phonetic LID methods, but the phonetic knowledge here is much richer: it is at the frame level and involves compacted information of all phones. The PTN model significantly outperforms existing acoustic neural models. It also outperforms the conventional i-vector approach on short utterances and in noisy conditions.


Phonetic feature

All present neural LID methods are based on acoustic features, e.g., Mel filter banks (Fbanks) or Mel frequency cepstral coefficients (MFCCs), with phonetic information largely overlooked. This may have significantly hindered the performance of neural LID. It is a long-standing hypothesis that languages can be discriminated by their phonetic properties, either distributional or temporal; additionally, phonetic features represent information at a higher level than acoustic features, and so are more invariant to noise and channel variation.

Phonetic-feat.png

  • Phonetic DNN: the acoustic model of an ASR system.
  • Phonetic features: the output of the last hidden layer of the phonetic DNN (see the sketch below).
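
Below is a minimal PyTorch sketch of this feature extraction, intended only as an illustration: the project's phonetic DNN is the acoustic model of an ASR system (a TDNN), whereas the small feedforward net here, its layer sizes, the 40-dim Fbank input and the 200-phone output are all invented for the example.

```python
# Minimal sketch: read frame-level phonetic features from the last hidden
# layer of a phone-discriminative DNN. Layer sizes, the 40-dim Fbank input
# and the 200-phone output are illustrative assumptions, not the project's
# actual TDNN configuration.
import torch
import torch.nn as nn

class PhoneticDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=512, bottleneck_dim=50, num_phones=200):
        super().__init__()
        # Hidden stack of the (stand-in) ASR acoustic model.
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, bottleneck_dim), nn.ReLU(),  # last hidden layer
        )
        # Phone classification layer; used only when training the DNN for ASR.
        self.phone_out = nn.Linear(bottleneck_dim, num_phones)

    def forward(self, frames):
        # Phone-discriminative output, as in the ASR acoustic model.
        return self.phone_out(self.hidden(frames))

    def phonetic_features(self, frames):
        # Phonetic feature = activation of the last hidden layer, one vector per frame.
        with torch.no_grad():
            return self.hidden(frames)

# Example: extract a (num_frames, bottleneck_dim) phonetic feature sequence.
dnn = PhoneticDNN()
fbank = torch.randn(300, 40)                 # 300 frames of 40-dim Fbank (dummy data)
print(dnn.phonetic_features(fbank).shape)    # torch.Size([300, 50])
```

In the project, this DNN is the phone-discriminative acoustic model of an ASR system, so it is trained on ASR data rather than on LID labels; only its last-hidden-layer activations are passed on to the LID net.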


Phone-aware model

Phone-aware.png

The phone-aware LID system consists of a phonetic DNN (left) that produces phonetic features and an LID RNN (right) that makes the LID decision. The LID RNN receives both the phonetic features and the acoustic features as input.


Phone-aware-sys.png

The phonetic feature is read from the last hidden layer of the phonetic DNN, which is a TDNN. It is then propagated to the g function of the phonetically aware RNN LID system, while the acoustic feature remains the LID system's input.
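
A rough illustration of this setup is the PyTorch sketch below. For simplicity it concatenates the acoustic and phonetic feature streams at the LSTM input instead of injecting the phonetic feature into the g function; the dimensions, the 7-language output and the frame-averaging back-end are all assumptions made for the example.

```python
# Minimal sketch of a phone-aware LID RNN. The two feature streams are simply
# concatenated at the input here, whereas the project feeds the phonetic
# feature into the LSTM's g function; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PhoneAwareLID(nn.Module):
    def __init__(self, acoustic_dim=40, phonetic_dim=50, hidden_dim=256, num_langs=7):
        super().__init__()
        self.lstm = nn.LSTM(acoustic_dim + phonetic_dim, hidden_dim, batch_first=True)
        self.lang_out = nn.Linear(hidden_dim, num_langs)

    def forward(self, acoustic, phonetic):
        # acoustic: (batch, frames, acoustic_dim); phonetic: (batch, frames, phonetic_dim)
        x = torch.cat([acoustic, phonetic], dim=-1)
        h, _ = self.lstm(x)
        # Average the frame-level language scores to get the utterance-level decision.
        return self.lang_out(h).mean(dim=1)

# Example: one dummy utterance of 300 frames.
model = PhoneAwareLID()
acoustic = torch.randn(1, 300, 40)
phonetic = torch.randn(1, 300, 50)
print(model(acoustic, phonetic).shape)   # torch.Size([1, 7])
```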



Phonetic Temporal Neural (PTN) model

Ptn.png

The PTN model consists of a phonetic DNN (left) that produces phonetic features and an LID RNN (right) that makes the LID decision. The LID RNN receives only the phonetic features as input.


Ptn-sys.png

The phonetic feature is read from the last hidden layer of the phonetic DNN, which is a TDNN. It is then propagated to the g function of the RNN LID system and is the only input to the PTN LID system.
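
Under the same assumptions as the sketch above, the PTN variant can be illustrated as follows; the only change is that the LID LSTM receives the phonetic features alone, and the acoustic features never reach the LID net.

```python
# Minimal sketch of the PTN LID net: it receives only the phonetic feature
# stream (e.g. the output of the hypothetical PhoneticDNN sketched earlier).
# Dimensions and the 7-language output are illustrative assumptions.
import torch
import torch.nn as nn

class PTNLID(nn.Module):
    def __init__(self, phonetic_dim=50, hidden_dim=256, num_langs=7):
        super().__init__()
        self.lstm = nn.LSTM(phonetic_dim, hidden_dim, batch_first=True)
        self.lang_out = nn.Linear(hidden_dim, num_langs)

    def forward(self, phonetic):
        h, _ = self.lstm(phonetic)            # (batch, frames, hidden_dim)
        return self.lang_out(h).mean(dim=1)   # average frame-level scores

# Example: one dummy utterance of 300 phonetic feature frames.
ptn = PTNLID()
phonetic_feat = torch.randn(1, 300, 50)
print(ptn(phonetic_feat).shape)               # torch.Size([1, 7])
```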


Performance

On the Babel database

The Babel experiments cover seven languages: Assamese, Bengali, Cantonese, Georgian, Pashto, Tagalog and Turkish.

Ptn-babel.png


On the AP16-OLR database

AP16-OLR (http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OLR_Challenge_2016) contains seven languages: Mandarin, Cantonese, Indonesian, Japanese, Russian, Korean and Vietnamese.

Ptn-ap16.png


Research directions

  • Multilingual ASR with language information.
  • Joint training with a multi-task recurrent model for ASR and LID.
  • Multi-scale RNN LID.


References

[1] Zhiyuan Tang, Dong Wang*, Yixiang Chen, Lantian Li and Andrew Abel. Phonetic Temporal Neural Model for Language Identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2017.

[2] Zhiyuan Tang, Dong Wang*, Yixiang Chen, Ying Shi and Lantian Li. Phone-aware Neural Language Identification. O-COCOSDA 2017. https://arxiv.org/pdf/1705.03152.pdf

[3] Zhiyuan Tang, Lantian Li, Dong Wang* and Ravi Vipperla. Collaborative Joint Training with Multi-task Recurrent Model for Speech and Speaker Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2017. http://ieeexplore.ieee.org/document/7782371