<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://index.cslt.org/mediawiki/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://index.cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=Deep_Speaker_Feature_Learning</id>
		<title>Deep Speaker Feature Learning - Revision history</title>
		<link rel="self" type="application/atom+xml" href="http://index.cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=Deep_Speaker_Feature_Learning"/>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php?title=Deep_Speaker_Feature_Learning&amp;action=history"/>
		<updated>2026-05-10T13:48:05Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.23.3</generator>

	<entry>
		<id>http://index.cslt.org/mediawiki/index.php?title=Deep_Speaker_Feature_Learning&amp;diff=29096&amp;oldid=prev</id>
		<title>Lilt: Created page with “=Project name= Deep Speaker Feature Learning  =Project members= Dong Wang, Lantian Li, Zhiyuan Tang  =Introduction=  The key idea of speaker feature learning is simp...”</title>
		<link rel="alternate" type="text/html" href="http://index.cslt.org/mediawiki/index.php?title=Deep_Speaker_Feature_Learning&amp;diff=29096&amp;oldid=prev"/>
				<updated>2017-10-31T12:24:43Z</updated>
		
		<summary type="html">&lt;p&gt;以“=Project name= Deep Speaker Feature Learning  =Project members= Dong Wang, Lantian Li, Zhiyuan Tang  =Introduction=  The key idea of speaker feature learning is simp...”为内容创建页面&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;=Project name=&lt;br /&gt;
Deep Speaker Feature Learning&lt;br /&gt;
&lt;br /&gt;
=Project members=&lt;br /&gt;
Dong Wang, Lantian Li, Zhiyuan Tang&lt;br /&gt;
&lt;br /&gt;
=Introduction=&lt;br /&gt;
&lt;br /&gt;
The key idea of speaker feature learning is to discriminate among the training speakers from short-time frames with a deep &lt;br /&gt;
neural network (DNN), an approach dating back to Ehsan et al. in 2014 [2]. As shown below, the output units of the DNN &lt;br /&gt;
correspond to the training speakers, and the frame-level speaker features are read from the last hidden layer. The &lt;br /&gt;
basic assumption is: if the output of the last hidden layer serves as the input feature of the output layer &lt;br /&gt;
(a softmax regression classifier), these features should be speaker discriminative. &lt;br /&gt;
&lt;br /&gt;
[[File:Dnn-spk.png|300px]]&lt;br /&gt;
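&lt;br /&gt;
This frame-level learning can be sketched in a few lines of PyTorch (a minimal illustration only; the 40-dim input, layer sizes and speaker count are placeholders, not the configuration of [2]):&lt;br /&gt;
&lt;pre&gt;
import torch
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Frame-level speaker classifier; d-vectors come from the last hidden layer."""
    def __init__(self, feat_dim=40, hidden_dim=512, dvec_dim=400, num_speakers=1000):
        super().__init__()
        # Stacked fully-connected layers; the last one is the feature layer.
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, dvec_dim), nn.ReLU(),
        )
        # Output layer over the training speakers (softmax applied in the loss).
        self.classifier = nn.Linear(dvec_dim, num_speakers)

    def forward(self, frames):             # frames: (batch, feat_dim)
        feats = self.hidden(frames)        # frame-level speaker features
        logits = self.classifier(feats)    # speaker posteriors (pre-softmax)
        return logits, feats

# An utterance-level d-vector is the average of its frame features,
# e.g. dvec = feats.mean(dim=0) over all frames of one utterance.
&lt;/pre&gt;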
&lt;br /&gt;
However, the vanilla structure of Ehsan et al. performs rather poorly compared to its i-vector counterpart. One reason is &lt;br /&gt;
that the simple back-end scoring relies on averaging frame features to derive the utterance-based representations (called d-vectors);&lt;br /&gt;
another is that the vanilla DNN structure captures little temporal context or local pattern structure. We therefore&lt;br /&gt;
proposed a CT-DNN model that can learn stronger speaker features. The structure is shown below [1]:&lt;br /&gt;
&lt;br /&gt;
[[File:Ctdnn-spk.png|800px]]&lt;br /&gt;
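&lt;br /&gt;
The CT-DNN combines a convolutional front-end with time-delay layers; the sketch below uses dilated 1-D convolutions to emulate the time-delay layers (kernel sizes, channel counts and dilations are illustrative, not the exact settings of [1]):&lt;br /&gt;
&lt;pre&gt;
import torch
import torch.nn as nn

class CTDNN(nn.Module):
    """Convolutional + time-delay feature net for frame-level speaker features."""
    def __init__(self, feat_dim=40, dvec_dim=400, num_speakers=1000):
        super().__init__()
        # Convolutional front-end: learns local spectro-temporal patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Time-delay layers: dilated convolutions widen the temporal context.
        self.tdnn = nn.Sequential(
            nn.Conv1d(256, 256, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(256, dvec_dim, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
        )
        self.classifier = nn.Linear(dvec_dim, num_speakers)

    def forward(self, x):                  # x: (batch, feat_dim, num_frames)
        feats = self.tdnn(self.cnn(x))     # (batch, dvec_dim, num_frames)
        feats = feats.transpose(1, 2)      # frame-level speaker features
        logits = self.classifier(feats)    # per-frame speaker posteriors
        return logits, feats
&lt;/pre&gt;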
&lt;br /&gt;
Recently, we found that an 'all-info' training is effective for learning features. Looking back at the DNN and CT-DNN, the features &lt;br /&gt;
read from the last hidden layer are discriminative, but not 'all discriminative': some discriminative information is also encoded&lt;br /&gt;
in the last affine layer. A better strategy is to let the feature generation net (feature net) learn everything about discrimination. &lt;br /&gt;
To achieve this, we discard the parametric classifier (the last affine layer) and use the simple cosine distance to conduct the&lt;br /&gt;
classification. An iterative training scheme can implement this idea: after each epoch, the speaker features are averaged to derive&lt;br /&gt;
speaker vectors, which then replace the weights of the last affine layer. Training then proceeds as usual. The new structure is as&lt;br /&gt;
follows [4]:&lt;br /&gt;
&lt;br /&gt;
[[File:fullinfo-spk.png|600px]]&lt;br /&gt;
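&lt;br /&gt;
A minimal sketch of this iterative scheme, assuming a feature net like the ones above (the helper names and the cosine scaling factor are illustrative, not from [4]):&lt;br /&gt;
&lt;pre&gt;
import torch
import torch.nn.functional as F

def update_speaker_vectors(feature_net, loader, num_speakers, dvec_dim):
    """After each epoch, average frame-level features per training speaker."""
    sums = torch.zeros(num_speakers, dvec_dim)
    counts = torch.zeros(num_speakers)
    with torch.no_grad():
        for frames, spk_ids in loader:        # frames: (batch, feat_dim)
            feats = feature_net(frames)       # hidden stack only, e.g. model.hidden
            sums.index_add_(0, spk_ids, feats)
            counts.index_add_(0, spk_ids, torch.ones(len(spk_ids)))
    return sums / counts.clamp(min=1).unsqueeze(1)

def cosine_logits(feats, speaker_vectors, scale=10.0):
    """Cosine distance to the speaker vectors replaces the last affine layer."""
    return scale * F.normalize(feats, dim=1) @ F.normalize(speaker_vectors, dim=1).t()

# Per epoch: recompute speaker_vectors with update_speaker_vectors(), then
# train the feature net with cross-entropy on cosine_logits(feats, ...).
&lt;/pre&gt;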
&lt;br /&gt;
&lt;br /&gt;
=Research directions=&lt;br /&gt;
&lt;br /&gt;
* Adversarial factor learning&lt;br /&gt;
* Phone-aware multiple d-vector back-end for speaker recognition&lt;br /&gt;
* TTS adaptation based on speaker factors&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Reference=&lt;br /&gt;
&lt;br /&gt;
[1] Lantian Li, Yixiang Chen, Ying Shi, Zhiyuan Tang, and Dong Wang, “Deep speaker feature learning for text-independent speaker verification,” Interspeech 2017. &lt;br /&gt;
&lt;br /&gt;
[2] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker&lt;br /&gt;
verification,” ICASSP 2014.&lt;br /&gt;
&lt;br /&gt;
[3] Lantian Li, Dong Wang, Yixiang Chen, Ying Shi, Zhiyuan Tang, http://wangd.cslt.org/public/pdf/spkfact.pdf&lt;br /&gt;
&lt;br /&gt;
[4] Lantian Li, Zhiyuan Tang, Dong Wang, “Full-info Training for Deep Speaker Feature Learning,” http://wangd.cslt.org/public/pdf/mlspk.pdf&lt;br /&gt;
&lt;br /&gt;
[5] Zhiyuan Tang, Lantian Li, Dong Wang, Ravi Vipperla, “Collaborative Joint Training with Multi-task Recurrent Model for Speech and Speaker Recognition,” IEEE Trans. on Audio, Speech and Language Processing, vol. 25, no. 3, March 2017.&lt;br /&gt;
&lt;br /&gt;
[6] Dong Wang, Lantian Li, Ying Shi, Yixiang Chen, Zhiyuan Tang, “Deep Factorization for Speech Signal,” https://arxiv.org/abs/1706.01777&lt;/div&gt;</summary>
		<author><name>Lilt</name></author>	</entry>

	</feed>