Torch speaker
PyTorch Speaker: a research toolkit for speaker recognition, written in PyTorch.
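Introduction to PyTorch Speaker
PyTorch Speaker is a research toolkit for speaker recognition, written in PyTorch.
Speaker recognition (SRE), also known as voiceprint recognition (VPR), is a biometric technology that automatically identifies a speaker from the speech parameters (the "voiceprint") that reflect the speaker's physiological and behavioral characteristics in the speech signal. Speaker recognition is essentially a pattern recognition problem. Depending on the scenario and requirements, it can be divided into the three sub-tasks listed in the table below: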
Chinese name | English name | Definition |
说话人辨认 | Speaker Identification | Identifies the true speaker from a set of candidates. |
说话人确认 | Speaker Verification | Tests whether an alleged speaker is the true speaker. |
说话人追踪 | Speaker Diarization | Addresses the problem of "who spoke when": partitions a conversation recording into several speech segments, each of which belongs to a single speaker. |
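Features
- No dependency on Kaldi
- Supports fast offline deployment on mobile devices
- Rich data visualization support
Comparison with Other Projects
Performance
Network architecture | Number of parameters | Loss function | Data augmentation | Training set | Test set | Equal Error Rate | DCF(10^-2) | DCF(10^-3) | YAML config file |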
Project Structure
.
├── config/            # YAML configuration files
├── docs/              # documentation
├── README.md
├── requirements.txt
├── scripts/           # data processing scripts
├── setup.py
├── tools/             # training, inference, and other scripts
└── torch_speaker/     # main implementation of the model pipeline
    ├── backbone/
    ├── data/
    ├── loss/
    ├── score/
    ├── trainer.py
    └── utils/
Installation and Quick Start
Installation
git clone
cd torch_speaker
pip install -r requirements.txt
python setup.py develop
Speaker Data Preparation and Preprocessing
Data preparation is done by building a datlist.csv file with pandas.
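The following is a minimal sketch of how such a list could be built, not the project's actual script. The helper `build_datlist` and the column names `speaker_name`, `utt_path`, and `label` are hypothetical and chosen only for illustration; the real datlist.csv layout may differ.

```python
import pandas as pd


def build_datlist(utterances, csv_path="datlist.csv"):
    """Write a data list from (speaker_name, utt_path) pairs and return it.

    The column names are assumptions made for this sketch.
    """
    df = pd.DataFrame(list(utterances), columns=["speaker_name", "utt_path"])
    # Map each speaker to a contiguous integer label (alphabetical order),
    # which is the form a classification loss expects.
    df["label"] = df["speaker_name"].astype("category").cat.codes
    df.to_csv(csv_path, index=False)
    return df
```

Calling `build_datlist([("spk_b", "b/1.wav"), ("spk_a", "a/1.wav")], "datlist.csv")` writes one row per utterance; a data loader can then read the file back with `pd.read_csv`.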
Training
Evaluation
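Acoustic Feature Extraction
Because torchaudio has some bugs, and because I plan to run further experiments with wavelet analysis in the future, the acoustic feature extraction code is implemented by hand entirely in PyTorch, with no other third-party dependencies.
A few points are worth noting:
- Kaldi's feature extraction is offline, whereas a PyTorch implementation allows online feature extraction;
- Mel-Spectrogram is also called Fbank or FilterBank; the name Fbank is more common in Kaldi, while Mel-Spectrogram is more common in TTS and VC;
- Even with identical configuration parameters, my implementation, Kaldi, librosa, and torchaudio all produce different results.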
Feature | Implementation steps | Code location |
Spectrogram | | [link] |
Mel-Spectrogram | | [link] |
MFCC | | |
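As an illustration of what a hand-written feature extractor can look like, here is a minimal power-spectrogram sketch using only core PyTorch operations. It is not the project's implementation. The function name `spectrogram` and the defaults (400-sample window, 160-sample hop, 512-point FFT, Hamming window) are assumptions, picked as values commonly used for 16 kHz speech.

```python
import torch


def spectrogram(waveform, n_fft=512, win_length=400, hop_length=160):
    """Power spectrogram of a 1-D waveform.

    Returns a tensor of shape (num_frames, n_fft // 2 + 1).
    """
    # Slice the signal into overlapping frames: (num_frames, win_length)
    frames = waveform.unfold(0, win_length, hop_length)
    # Taper each frame with a Hamming window to reduce spectral leakage
    window = torch.hamming_window(win_length, periodic=False, dtype=waveform.dtype)
    frames = frames * window
    # rfft zero-pads each frame to n_fft and keeps the non-negative frequencies
    spectrum = torch.fft.rfft(frames, n=n_fft, dim=-1)
    return spectrum.abs() ** 2


torch.manual_seed(0)
wav = torch.randn(16000)  # one second of noise standing in for 16 kHz speech
spec = spectrogram(wav)
```

Because framing, windowing, and the FFT are ordinary tensor operations, the same function can run on the GPU inside the training loop. A Mel-Spectrogram would additionally multiply `spec` by a mel filterbank matrix, and MFCC would apply a log and a DCT on top of that.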
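Common Backbones and Their Implementations
The word "backbone" originally refers to the human spine and, by extension, means a pillar or core. In neural networks, especially in computer vision, an image is usually first passed through a feature extractor (common choices are VGGNet, ResNet, and Google's Inception). This part is the foundation of the whole CV task, because downstream tasks such as classification and generation all build on the extracted image features. Calling this part of the network the backbone is therefore apt: it is like the spine that lets a person stand upright.
ResNet and Its Variants
TDNN and Its Variants
Common Loss Functions
Back-End Scoring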
data loader
Data Augmentation
class imbalance sampler
Evaluation Metrics
Equal Error Rate
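The Equal Error Rate is the operating point at which the false acceptance rate (non-target trials accepted) equals the false rejection rate (target trials rejected). The sketch below is one simple way to estimate it with NumPy and is not taken from the project; the function name `compute_eer` is hypothetical. It assumes higher scores mean "same speaker" and labels are 1 for target trials and 0 for non-target trials.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return (eer, threshold); higher scores mean 'same speaker'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]
    # Try every distinct score as a decision threshold
    thresholds = np.unique(scores)
    # FRR: target trials scoring below the threshold are rejected
    frr = np.array([(target < t).mean() for t in thresholds])
    # FAR: non-target trials scoring at or above the threshold are accepted
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2, thresholds[idx]


scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
eer, threshold = compute_eer(scores, labels)
```

With a finite trial list the two curves rarely cross exactly, so this sketch averages FAR and FRR at the closest point; other tools interpolate the ROC curve instead, which can give slightly different values.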
minDCF
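minDCF is the minimum over all thresholds of the normalized detection cost, where the cost at a threshold is `C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target)`. The sketch below is an illustration with NumPy, not the project's code. The function name `compute_min_dcf` and the defaults `c_miss = c_fa = 1` are assumptions, as is reading the DCF(10^-2) and DCF(10^-3) columns of the performance table as `p_target = 0.01` and `p_target = 0.001`.

```python
import numpy as np


def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost; higher scores mean 'same speaker'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]
    # Append +inf so the "reject every trial" operating point is also considered
    thresholds = np.append(np.unique(scores), np.inf)
    p_miss = np.array([(target < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget >= t).mean() for t in thresholds])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Normalize by the cost of the better trivial system (accept all or reject all)
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return dcf.min() / c_default


trial_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
trial_labels = [1, 1, 1, 0, 0, 0]
min_dcf_01 = compute_min_dcf(trial_scores, trial_labels, p_target=0.01)
min_dcf_001 = compute_min_dcf(trial_scores, trial_labels, p_target=0.001)
```

A small `p_target` makes false acceptances far more expensive than misses, which is why minDCF often ranks systems differently from EER.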
Adversarial Example Attacks and Defenses
Adversarial Example Attacks
BIM
PGD
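Both attacks take repeated signed-gradient steps and project the perturbation back into an L-infinity ball of radius `eps` around the clean input; PGD additionally starts from a random point inside that ball. The sketch below shows the general technique in PyTorch and is not the project's implementation. The function name `pgd_attack`, the toy linear classifier, and the values of `eps`, `alpha`, and `steps` are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=0.002, alpha=0.0004, steps=10, random_start=True):
    """Untargeted L-infinity attack; random_start=False gives BIM."""
    x_adv = x.clone().detach()
    if random_start:
        # PGD: begin from a random point inside the eps-ball
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        # Step in the direction that increases the loss
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back so the perturbation stays within [-eps, eps]
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)
    return x_adv.detach()


torch.manual_seed(0)
toy_model = torch.nn.Linear(40, 5)  # stand-in for a speaker classifier
feats = torch.randn(4, 40)          # stand-in for input features
speakers = torch.tensor([0, 1, 2, 3])
adv_pgd = pgd_attack(toy_model, feats, speakers)
adv_bim = pgd_attack(toy_model, feats, speakers, random_start=False)
```

For audio, `eps` is usually chosen small relative to the waveform amplitude so the perturbation stays hard to hear; a suitable value depends on the input scaling.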
Adversarial Example Defenses
Utility Code and Scripts
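Script | Approach and workflow | Code location |
Reading waveforms | Open-source tools currently read speech data in two main ways. In this project, the waveform loading strategy differs between the training and evaluation stages. | [link] |
Reading hyperparameters | Hyperparameter loading follows the nanodet project and uses yacs to read hyperparameters from YAML files. | [link] |
Voice Activity Detection (VAD) | VAD uses PyWebrct, run with Python multiprocessing. | |
Signal-to-noise ratio (SNR) computation | | |
Accuracy computation | | |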
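The table does not describe how SNR is computed, so the following is only a sketch of the standard definition, `SNR = 10 * log10(P_signal / P_noise)`. It assumes the clean reference signal is available; the function name `snr_db` is hypothetical.

```python
import numpy as np


def snr_db(clean, noisy):
    """SNR in dB of `noisy` relative to the clean reference signal."""
    clean = np.asarray(clean, dtype=float)
    # Whatever differs from the clean reference is treated as noise
    noise = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))


clean_sig = np.ones(100)
noisy_sig = clean_sig + 0.1  # constant offset standing in for noise
snr_value = snr_db(clean_sig, noisy_sig)
```

When no clean reference exists, the noise power has to be estimated instead, for example from the non-speech frames found by VAD.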
Visualization
MISC
PyTorch
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
PyTorch Lightning
The goal of PyTorch Lightning is "You do the research. Lightning will do everything else".
PyTorch Lightning was started by William Falcon while completing his Ph.D. AI research at NYU CILVR and Facebook AI Research, with the vision of making it a foundational part of everyone’s deep learning research code. The framework was designed for professional and academic researchers working in AI, making state of the art AI research techniques, such as TPU training, trivial.
ONNX
Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
NCNN
ncnn is a high-performance neural network inference framework optimized for mobile platforms. It was designed from the start with deployment and use on mobile phones in mind. ncnn has no third-party dependencies, is cross-platform, and according to its authors runs faster than all known open-source frameworks on mobile phone CPUs. With ncnn, developers can deploy deep learning models to mobile platforms and build intelligent apps. ncnn is currently used in many Tencent applications, such as QQ, Qzone, WeChat, and Pitu.
YACS
YACS was created as a lightweight library to define and manage system configurations, such as those commonly found in software designed for scientific experimentation. These "configurations" typically cover concepts like hyperparameters used in training a machine learning model or configurable model hyperparameters, such as the depth of a convolutional neural network.