Torch speaker

PyTorch Speaker is a speaker recognition toolkit written in PyTorch, intended for more than research use.

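1 Introduction to PyTorch Speaker
2 Quick installation and getting started
3 Acoustic feature extraction
4 Common backbones and their implementations
5 Common pooling layers
6 Common losses
7 Back-end scoring
8 Training tricks
9 Data loader
10 Evaluation metrics
11 Anti-Spoofing
12 Adversarial attacks and defenses
- 12.1 Adversarial attacks
  - 12.1.1 White-box attacks
  - 12.1.2 Black-box attacks
- 12.2 Adversarial defenses
13 Utility code and scripts
14 MISC
15 References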

Introduction to PyTorch Speaker

PyTorch Speaker is a speaker recognition toolkit written in PyTorch by zhangy20.

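Project features

Does not depend on Kaldi and uses no advanced shell syntax
Supports fast offline deployment on mobile phones and embedded devices
Rich data visualization support
Suitable for speech-related experiments and research in computer security (Anti-Spoofing and Adversarial Attack)
Acoustic features are extracted online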

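Known limitations

Training places high demands on disk I/O performance (to reach the highest GPU utilization during training, storing the data on an SSD is strongly recommended)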

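Speaker recognition (Speaker Recognition, SRE), also called voiceprint recognition (Voiceprint Recognition, VPR), is a biometric technology that automatically identifies a speaker from speech parameters (the "voiceprint") that reflect the speaker's physiological and behavioral characteristics in the speech signal. Speaker recognition is essentially a pattern recognition problem. Depending on the scenario and requirements, it can be divided into the three subtasks listed below: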

Chinese name	English name	Definition
说话人辨认	Speaker Identification	Determines which of several candidate speakers produced a given utterance; a "1 vs N" decision.
说话人确认	Speaker Verification	Determines whether a given utterance was spoken by a claimed speaker; a "1 vs 1" decision.
说话人追踪	Speaker Diarization	Addresses the problem of "who spoke when": partitions a conversation recording into several speech segments, each of which belongs to a single speaker.

Demo

System performance

Network architecture	Parameters	Loss function	Data augmentation	Training set	Test set	cosine EER	cosine minDCF (10^-2)	cosine minDCF (10^-3)	Tsinghua cloud drive link
resnet34_TSP [1, 32, 64, 128, 256]	6.4M	amsoftmax (margin=0.2, scale=30)	NO	voxceleb1&2-dev (7205)	vox1_clean	1.59%	0.188	0.334	download link
					vox1_E_clean	0.77%	0.099	0.187
					vox1_H_clean	1.40%	0.153	0.280
resnet34_ASP [1, 32, 64, 128, 256]	10.6M	amsoftmax (margin=0.2, scale=30)	NO	voxceleb1-dev (1211)	vox1_clean	5.42%	0.471	0.550	download link
					vox1_E_clean	0.86%	0.100	0.198
					vox1_H_clean	1.46%	0.156	0.295
resnet34_TSP [1, 32, 64, 128, 256]	6.4M	amsoftmax (margin=0.2, scale=30)	NO	voxceleb1-dev (1211)	vox1_clean	4.37%	0.453	0.577	download link
					vox1_E_clean	0.613%	0.063	0.138
					vox1_H_clean	1.03%	0.110	0.226

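Project structure

.
├── config/  # YAML configuration files
├── docs/    # documentation
├── examples/ # examples
├── README.md
├── requirements.txt
├── scripts/ # data processing and data visualization scripts
├── setup.py 
├── tools/   # training, inference, and quantized deployment scripts
└── torch_speaker/ # main implementation of the model pipeline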
    ├── backbone/
    ├── audio/
    ├── loss/
    ├── score/
    ├── module.py
    └── utils/

Quick installation and getting started

Step	Usage	Notes
1. Installation	git clone; cd torch_speaker; pip install -r requirements.txt; python setup.py develop
2. Data preparation	Build the data list CSV with pandas. python3 scripts/build_datalist.py --extension wav --dataset_dir data/train --data_list_path data/train.csv; python3 scripts/format_trials.py --voxceleb1_root your_voxceleb1_path --src_trials_path your_src_trial_path --dst_trials_path your_dst_trial_path	Audio is expected to be mono WAV files with a 16 kHz sampling rate by default.
3. Training	python3 tools/train.py --config your_yaml_path
4. Log Visualization	tensorboard --logdir your_log_dir/ --bind_all
5. Evaluation	python3 tools/evaluate.py --config your_yaml_path --trial_path your_trial_path --checkpoint_path your_checkpoint_path
6. Export
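
As a rough illustration of the data preparation step, the sketch below builds a data list with pandas. It is not the project's scripts/build_datalist.py: the function name build_datalist, the column names (speaker_name, utt_path, speaker_id), and the assumption that the first directory under the dataset root names the speaker are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def build_datalist(dataset_dir, extension="wav"):
    """Scan dataset_dir recursively and return one row per audio file."""
    root = Path(dataset_dir)
    rows = []
    for path in sorted(root.rglob(f"*.{extension}")):
        # Assumption: the first directory below the root is the speaker name.
        speaker = path.relative_to(root).parts[0]
        rows.append({"speaker_name": speaker, "utt_path": str(path)})
    df = pd.DataFrame(rows, columns=["speaker_name", "utt_path"])
    # Map speaker names to contiguous integer labels for classification losses.
    df["speaker_id"] = df["speaker_name"].astype("category").cat.codes
    return df
```

A call such as build_datalist("data/train").to_csv("data/train.csv", index=False) would then write the list to disk.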

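Acoustic feature extraction

Because the torchaudio library may contain latent bugs (it frequently emits warnings), and because I plan to run further experiments with wavelet analysis in the future, the acoustic feature extraction code is implemented entirely by hand in PyTorch, without depending on other third-party libraries.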

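Worth noting:

Kaldi's feature extraction is offline, whereas a PyTorch implementation makes online feature extraction possible;
Mel-Spectrogram is also called Fbank or FilterBank; the name Fbank is more common in Kaldi, while Mel-Spectrogram is more common in TTS and VC;
Even with identical configuration parameters, my implementation, Kaldi, librosa, and torchaudio all produce different results.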

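Feature	Pipeline	Code location
Spectrogram	1) Pre-emphasis (compensates for high-frequency loss and preserves vocal tract information); 2) windowing (Hamming window, reduces the Gibbs phenomenon) followed by the short-time Fourier transform (STFT); 3) magnitude of the complex STFT output, then logarithm (1e-9 is added to avoid log(0)); 4) Instance Norm (can be regarded as equivalent to CMVN, cepstral mean and variance normalization)	[link]
Mel-Spectrogram	1) Pre-emphasis (compensates for high-frequency loss and preserves vocal tract information); 2) windowing (Hamming window, reduces the Gibbs phenomenon) followed by the short-time Fourier transform (STFT); 3) magnitude of the complex STFT output, then logarithm (1e-9 is added to avoid log(0)); 4) Mel filtering; 5) Instance Norm (can be regarded as equivalent to CMVN, cepstral mean and variance normalization)	[link]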
MFCC
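
The Spectrogram pipeline in the table can be sketched in plain PyTorch as follows. This is a minimal sketch, not the project's implementation: the function name log_spectrogram and the parameter values (0.97 pre-emphasis coefficient, 512-point FFT, 400-sample window and 160-sample hop, i.e. 25 ms and 10 ms at 16 kHz) are assumptions, and Instance Norm is approximated here by normalizing each frequency bin over time.

```python
import torch


def log_spectrogram(wave, n_fft=512, win_length=400, hop_length=160,
                    coef=0.97, eps=1e-9):
    """wave: (batch, samples) float tensor -> (batch, n_fft // 2 + 1, frames)."""
    # 1) Pre-emphasis: y[t] = x[t] - coef * x[t - 1]; the first sample is kept.
    emphasized = torch.cat(
        [wave[:, :1], wave[:, 1:] - coef * wave[:, :-1]], dim=1)
    # 2) Hamming window + STFT.
    window = torch.hamming_window(win_length, device=wave.device)
    stft = torch.stft(emphasized, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window,
                      return_complex=True)
    # 3) Magnitude, then log with a small constant to avoid log(0).
    log_mag = torch.log(stft.abs() + eps)
    # 4) Per-utterance normalization of every frequency bin over time.
    mean = log_mag.mean(dim=-1, keepdim=True)
    std = log_mag.std(dim=-1, keepdim=True)
    return (log_mag - mean) / (std + eps)
```

Because every step is a tensor operation, the function can run on the GPU inside the training loop, which is what makes online extraction practical.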

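Common backbones and their implementations

The word backbone originally means a person's spine and, by extension, a pillar or core. In neural networks, especially in CV, the image is usually first passed through a feature extractor (common choices are VGGNet, ResNet, and Google's Inception). This part is the foundation of the whole CV task, because downstream tasks (classification, generation, and so on) all build on the extracted image features. Calling this part of the network the backbone is therefore apt: it is the pillar that lets the whole system stand.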

ResNet and its variants

ResNet

SENet

Res2Net

TDNN and its variants

TDNN(x-vector)

ECAPA-TDNN

ECAPA-TDNN architecture is based on the popular x-vector topology and it introduces several enhancements to create more robust speaker embeddings.

The pooling layer uses a channel- and context-dependent attention mechanism, which allows the network to attend to different frames per channel. 1-dimensional Squeeze-Excitation (SE) blocks rescale the channels of the intermediate frame-level feature maps to insert global context information in the locally operating convolutional blocks. Next, the integration of 1-dimensional Res2-blocks improves performance while simultaneously reducing the total parameter count by using grouped convolutions in a hierarchical way.

Finally, Multi-layer Feature Aggregation (MFA) merges complementary information before the statistics pooling by concatenating the final frame-level feature map with intermediate feature maps of preceding layers.

The network is trained by optimizing the AAM-softmax loss on the speaker identities in the training corpus. The AAM-softmax is a powerful enhancement compared to the regular softmax loss in the context of fine-grained classification and verification problems. It directly optimizes the cosine distance between the speaker embeddings.

The model performs very well on speaker verification and speaker diarization.

VGG and its variants

VGG

RepVGG

Common pooling layers

Pooling Layer
TSP
TAP
ASP
SAP
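
The two non-attentive layers in the list can be sketched as below. The page does not expand the abbreviations, so this sketch assumes that TAP means temporal average pooling and TSP means temporal statistics pooling (mean and standard deviation); the function names are hypothetical, and the attentive variants (ASP, SAP) are not covered.

```python
import torch


def temporal_average_pooling(x):
    """x: (batch, channels, frames) -> (batch, channels)."""
    return x.mean(dim=-1)


def temporal_statistics_pooling(x, eps=1e-9):
    """x: (batch, channels, frames) -> (batch, 2 * channels)."""
    mean = x.mean(dim=-1)
    # eps keeps the square root well-behaved when the variance is zero.
    std = torch.sqrt(x.var(dim=-1, unbiased=False) + eps)
    return torch.cat([mean, std], dim=1)
```

Both functions collapse the variable-length time axis, so utterances of any duration map to a fixed-size vector.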

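Common losses

Loss	Description	Formula	Implementation
softmax
Triplet Loss	The basic idea of Triplet Loss is to build triplets of an anchor, a positive, and a negative, where the anchor and the positive are different utterances from the same speaker and the negative is an utterance from a different speaker. A large number of labeled triplets are then fed to the network to train the DNN parameters. Its advantage is that the similarity between embeddings is used directly as the optimization objective: it maximizes the anchor-positive similarity while minimizing the anchor-negative similarity. Once speaker embeddings have been extracted, speaker recognition reduces to a simple similarity computation.
AM-softmax	Speaker recognition systems built with Kaldi mostly train with the softmax loss, but the softmax loss does not explicitly increase intra-class compactness or inter-class separation. To make the embeddings more discriminative, AM-softmax subtracts an additive margin from the target-class cosine similarity.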
AAM-softmax
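
A minimal sketch of an additive-margin softmax loss is given below, using the margin=0.2 and scale=30 values from the system performance table as defaults. The class name AMSoftmax and the weight initialization are assumptions, and the project's own loss code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmax(nn.Module):
    def __init__(self, embedding_dim, num_classes, margin=0.2, scale=30.0):
        super().__init__()
        self.margin = margin
        self.scale = scale
        # One weight vector per speaker class (initialization is an assumption).
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_dim) * 0.01)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin from the target-class cosine only.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```

The margin makes the target class harder to satisfy during training, which pushes embeddings of the same speaker closer together than plain softmax requires.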

Back-end scoring

Scoring method	Description	Formula	Implementation
cosine
GPLDA
LDA-GPLDA
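
Of the scoring methods listed, cosine scoring is the simplest to illustrate. The sketch below assumes that enrollment and test embeddings have already been extracted and are paired row by row; the function name cosine_score is hypothetical.

```python
import torch
import torch.nn.functional as F


def cosine_score(enroll, test):
    """enroll, test: (num_trials, dim) -> (num_trials,) scores in [-1, 1]."""
    return F.cosine_similarity(enroll, test, dim=-1)
```

A higher score means the two embeddings are more likely to come from the same speaker; the accept or reject decision is made by comparing the score with a threshold.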

Training tricks

Trick	Notes
distributed modes	Data Parallel (accelerator='dp'): multiple GPUs, 1 machine; DistributedDataParallel (accelerator='ddp'): multiple GPUs across many machines (Python script based); DistributedDataParallel (accelerator='ddp_spawn'): multiple GPUs across many machines (spawn based); DistributedDataParallel 2 (accelerator='ddp2'): DP in a machine, DDP across machines; Horovod (accelerator='horovod'): multi-machine, multi-GPU, configured at runtime; TPUs (tpu_cores=8|x): TPU or TPU pod
warm-up
learning rate schedule
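optimizer	Although SGD with momentum converges more slowly than Adam, its final results in experiments may often be better than Adam's (this claim is uncertain). However, with a large amount of data plus data augmentation, training with SGD simply takes too long, so on balance Adam is the better choice.
chunk size	Chunk size has a very significant effect on training results; keep it small rather than large. In the experiments so far, 2 seconds usually beats 2.5 and 3 seconds. It is unclear whether an even smaller chunk size would bring further gains.
data augmentation	There are two ways to implement data augmentation: (1) apply a different randomly chosen augmentation to each sample in a batch; (2) apply the same augmentation to all samples in a batch. Intuitively the first should work better, but in practice its loss converges very slowly; my guess is that mixing different augmentations within one batch makes the loss diverge to some degree. The second is still under development. If you only want to validate a research idea quickly, leave data augmentation off: some literature reports that training with augmentation takes about 6 times as long as training without it. Data augmentation is best reserved for the following cases: the amount of data is small (for example, only voxceleb1 is used); the algorithm and model have already been verified to work; you are aiming for a SOTA result; or the server can train this single model continuously for several days.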
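
To make the chunk size note concrete, here is a minimal sketch of cropping a fixed-length training chunk from an utterance. The function name random_chunk, the 16 kHz default, and the choice to repeat short utterances until they reach the chunk length are assumptions, not the project's documented behavior.

```python
import torch


def random_chunk(wave, sample_rate=16000, chunk_seconds=2.0):
    """wave: non-empty 1-D tensor -> 1-D tensor of exactly chunk_seconds."""
    chunk_len = int(sample_rate * chunk_seconds)
    if wave.numel() <= chunk_len:
        # Short utterance: repeat it until it covers the chunk, then truncate.
        repeats = chunk_len // wave.numel() + 1
        return wave.repeat(repeats)[:chunk_len]
    # Long utterance: pick a random start so each epoch sees a different crop.
    start = torch.randint(0, wave.numel() - chunk_len + 1, (1,)).item()
    return wave[start:start + chunk_len]
```

With this shape, comparing 2, 2.5, and 3 second chunks only requires changing chunk_seconds.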

Data loader

Data augmentation

WavAugment	additive noise, volume perturbation, reverberation, speed perturbation
SpecAugment	frequency mask, time (frame) mask, frequency swap, time (frame) swap
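
The frequency mask and time (frame) mask entries can be sketched as follows. This is an illustrative sketch only: the function names, the maximum mask widths, and the use of zero as the fill value are assumptions, and the swap operations are not covered.

```python
import torch


def mask_along_axis(spec, max_width, axis):
    """Zero out one random band of spec (freq, frames) along the given axis."""
    spec = spec.clone()  # leave the caller's tensor untouched
    size = spec.size(axis)
    width = min(torch.randint(0, max_width + 1, (1,)).item(), size)
    start = torch.randint(0, size - width + 1, (1,)).item()
    if axis == 0:
        spec[start:start + width, :] = 0.0  # frequency mask
    else:
        spec[:, start:start + width] = 0.0  # time (frame) mask
    return spec


def spec_augment(spec, max_freq_width=8, max_time_width=10):
    """Apply one frequency mask followed by one time mask."""
    spec = mask_along_axis(spec, max_freq_width, axis=0)
    return mask_along_axis(spec, max_time_width, axis=1)
```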

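Class balancing of training data

In natural settings, data often follows a long-tailed distribution. This trend appears in problems across every field, from the natural sciences to the social sciences. Classification and recognition systems trained directly on long-tailed data tend to overfit the head classes and therefore ignore the tail classes at prediction time. An imbalanced distribution of speaker classes can lead to a mediocre model, so some preprocessing in the data loader is needed to balance the speaker classes.

The two simplest families of methods for handling long-tailed distributions are:

Re-sampling: under-sampling the head classes and over-sampling the tail classes
Re-weighting: applied mainly to the classification loss

Both families use the known dataset distribution to weight the data inversely during learning, strengthening the learning of tail classes and offsetting the long-tail effect. In this project, I use re-sampling to address the imbalanced data distribution.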
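
One common way to realize re-sampling in PyTorch is to draw samples with probability inversely proportional to their class frequency. The sketch below uses torch.utils.data.WeightedRandomSampler for this; it is an assumption about how such balancing could be done, not a description of the project's sampler, and the function name balanced_sampler is hypothetical.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler


def balanced_sampler(labels, num_samples=None, generator=None):
    """Return a sampler that draws every speaker class about equally often."""
    counts = Counter(labels)
    # Inverse-frequency weights: rare speakers get proportionally more weight.
    weights = torch.tensor([1.0 / counts[label] for label in labels],
                           dtype=torch.double)
    if num_samples is None:
        num_samples = len(labels)
    return WeightedRandomSampler(weights, num_samples=num_samples,
                                 replacement=True, generator=generator)
```

The returned object can be passed as the sampler argument of a DataLoader. Sampling with replacement means tail-class utterances are repeated within an epoch, which is the over-sampling described above.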

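Common datasets for speaker recognition

Evaluation metrics

Confusion matrix

For a binary classification system that assigns instances to a positive class (Positive) or a negative class (Negative), the classifier has four possible outcomes:

TP (True Positive): a correct positive; the instance is positive and is judged positive;
FN (False Negative): an incorrect negative (a miss); the instance is positive but is judged negative;
FP (False Positive): an incorrect positive (a false alarm); the instance is negative but is judged positive;
TN (True Negative): a correct negative; the instance is negative and is judged negative.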
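
The four outcomes can be counted directly from ground-truth labels and predictions. The helper below is a small illustrative sketch (the name confusion_counts is hypothetical); labels use 1 for the positive class and 0 for the negative class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, 1 = positive, 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


# Three positive and three negative trials.
print(confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0]))  # (2, 1, 1, 2)
```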

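ROC curve

For the Speaker Verification task, one way to evaluate a model is to plot the ROC curve (Receiver Operating Characteristic curve). First, the following definitions:

True Positive Rate (TPR): the proportion of all positives that are correctly identified as positive
False Positive Rate (FPR): the proportion of all negatives that are incorrectly identified as positive
True Negative Rate (TNR): the proportion of all negatives that are correctly identified as negative
TPR is the sensitivity and TNR is the specificity

The ROC curve plots FPR on the x-axis and TPR on the y-axis. AUC (Area Under Curve) is the area under the ROC curve; the closer it is to 1, the better the classifier. The point closest to the top-left corner of the plot is the threshold at which sensitivity and specificity are both high. The ROC curve has a useful property: it stays unchanged when the ratio of positive to negative samples in the test set changes.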
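
The ROC points and the AUC can be computed from trial scores with a few lines of NumPy. This is a minimal sketch with hypothetical function names; it assumes that higher scores indicate the positive class and that no two trials share exactly the same score.

```python
import numpy as np


def roc_curve_points(scores, labels):
    """Return (fpr, tpr) arrays, sweeping the threshold from high to low."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")
    labels = labels[order]
    tps = np.cumsum(labels)        # positives accepted so far
    fps = np.cumsum(1 - labels)    # negatives accepted so far
    tpr = np.concatenate([[0.0], tps / labels.sum()])
    fpr = np.concatenate([[0.0], fps / (1 - labels).sum()])
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```

The AUC equals the fraction of (positive, negative) trial pairs in which the positive trial receives the higher score, which gives a quick sanity check on small examples.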

Equal Error Rate(EER)

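False alarm rate (False alarm): the proportion of all negative samples that are classified as positive
Miss rate (Missing alarm): the proportion of all positive samples that are classified as negative
The EER is the error rate at the decision threshold where the false alarm rate and the miss rate are equal.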
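
The EER is usually read off at the threshold where the false alarm rate and the miss rate coincide. The sketch below finds that point by scanning every observed score as a candidate threshold; the function name compute_eer is hypothetical, a trial is accepted when its score is at or above the threshold, and when the two rates never match exactly their average at the closest point is returned. The scan is quadratic in the number of trials, so it is meant for illustration rather than for large trial lists.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return (eer, threshold); labels use 1 for target, 0 for non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted candidate thresholds
    target = scores[labels == 1]
    nontarget = scores[labels == 0]
    # False alarm: a non-target trial is accepted (score >= threshold).
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Miss: a target trial is rejected (score < threshold).
    frr = np.array([(target < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])
```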

Minimum Detection Cost Function(minDCF)

Diarization Error Rate (DER)

Anti-Spoofing

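Adversarial attacks and defenses

When an AI model or algorithm is designed without the relevant security threats in mind, its decisions are easily influenced by a malicious attacker, causing the AI system to misjudge. The most prominent threat is the evasion attack, in which the input is modified so that the AI model can no longer recognize it correctly. Research shows that deep learning systems are vulnerable to carefully crafted input samples, known as adversarial examples (Adversarial Examples).

Speaker recognition systems have the same problem, whether they are based on deep neural networks or on the traditional statistical i-vector model.

Adversarial attacks

White-box attacks

Attack method	Computation	Implementation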
BIM
PGD
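
Both attacks in the table are iterative sign-gradient methods, so one sketch can cover them. The code below is a generic L-infinity PGD loop, not the project's implementation: the function name pgd_attack, the default eps, alpha, and steps values, and the calling convention loss_fn(model(x), y) are all assumptions.

```python
import torch


def pgd_attack(model, loss_fn, x, y, eps=0.002, alpha=0.0004, steps=10,
               random_start=True):
    """Return an adversarial copy of x with |x_adv - x| <= eps elementwise."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    if random_start:
        # PGD starts from a random point inside the eps-ball.
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Step in the direction that increases the loss.
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back onto the eps-ball around the original input.
        x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps)
    return x_adv.detach()
```

Calling the function with random_start=False gives the BIM-style iteration, which starts from the clean input. Full white-box access is required because the loop needs gradients with respect to the input.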

Black-box attacks

Adversarial defenses

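Utility code and scripts

Utility scripts

Script	Approach and workflow	Code location
Reading waveforms	Existing open-source tools read audio data in one of two main ways: one typified by MATLAB and soundfile, the other typified by Kaldi and scipy. This project uses different reading strategies for the training and evaluation stages.	[link]
Reading hyperparameters	Hyperparameter loading follows the nanodet project and uses yacs to read hyperparameters from YAML files.	[link]
Voice Activity Detection (VAD)	VAD is implemented with the Python WebRTC VAD bindings and parallelized with Python multiprocessing
Signal-to-noise ratio (SNR) computation
Accuracy computation
Interpolation (interpolate)
Documentation generation
format trials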

Visualization

Feature	Preview	Code link
Plot spectrogram
Plot 3D spectrogram (3D-spectrogram)
Plot ROC curve
Plot PR curve
Plot confusion matrix

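Model deployment

Deep learning frameworks such as PyTorch and TensorFlow integrate both model training and inference. In practice, however, getting the best performance on different kinds of platforms (cloud/Edge, CPU/GPU, and so on) requires adapting the model (quantization, knowledge distillation) and using dedicated inference libraries (ONNX, TensorRT).

ONNX Runtime is a high-performance inference engine for deploying ONNX models to production. It is optimized for cloud and Edge and runs on Linux, Windows, and Mac. It is written in C++ and also provides C, Python, C#, Java, and JavaScript (Node.js) APIs for use in a variety of environments. ONNX Runtime supports both DNN and traditional ML models and integrates with accelerators on different hardware (for example, TensorRT on NVIDIA GPUs, OpenVINO on Intel processors, and DirectML on Windows). Using ONNX Runtime lets you benefit from extensive production-grade optimization, testing, and ongoing improvement.

Code style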

MISC

Framework	Description
PyTorch	PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system.
PyTorch Lightning	The goal of PyTorch Lightning is "You do the research. Lightning will do everything else". PyTorch Lightning was started by William Falcon while completing his Ph.D. AI research at NYU CILVR and Facebook AI Research, with the vision of making it a foundational part of everyone’s deep learning research code. The framework was designed for professional and academic researchers working in AI, making state of the art AI research techniques, such as TPU training, trivial.
ONNX	Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
NCNN	ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn is deeply considerate about deployment and use on mobile phones from the beginning of its design. ncnn does not have third-party dependencies. It is cross-platform, and runs faster than all known open source frameworks on mobile phone CPUs. Developers can easily deploy deep learning algorithm models to the mobile platform by using the efficient ncnn implementation, create intelligent apps, and bring artificial intelligence to your fingertips. ncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu and so on.
YACS	YACS was created as a lightweight library to define and manage system configurations, such as those commonly found in software designed for scientific experimentation. These "configurations" typically cover concepts like hyperparameters used in training a machine learning model or configurable model hyperparameters, such as the depth of a convolutional neural network.
Sphinx	Sphinx is a tool that makes it easy to create intelligent and beautiful documentation, written by Georg Brandl and licensed under the BSD license.

References