GluonAR: a Deep Learning Toolkit for Audio Recognition¶
Gluon Audio is a toolkit providing deep learning based audio recognition algorithms. The project is still under development, and currently only a Chinese introduction is provided.
GluonAR Introduction:¶
GluonAR is based on MXNet Gluon. If you are new to it, please check out the dmlc 60-minute crash course.
Although the project is named GluonAR, for now and the foreseeable future it only covers Text-Independent Speaker Recognition.
- Implemented features:
  - Audio data loading with av (the pythonic binding of ffmpeg) and librosa.
  - Modules support Hybridize(). The forward pass does not rely on pysound, librosa or scipy, which is more efficient and enables fast training and end-to-end deployment (see the export sketch below, after the example), including:
    - A Short-Time Fourier Transform (STFTBlock) and a z-score block based on nd.contrib.fft; compared with preprocessing in numpy/scipy and then loading onto the GPU, training efficiency improves by 12%.
    - MelSpectrogram, DCT1D, MFCC, PowerToDB.
    - SincBlock proposed in 1808.00158.
  - Gluon-style VOX dataset loading.
  - Speaker Verification analogous to face verification.
  - An example of training voiceprint features from spectrograms, with 1:1 verification accuracy of 0.941152±0.004926 on VOX1.
Example:

```python
import numpy as np
import mxnet as mx
import librosa as rosa
from gluonar.utils.viz import view_spec
from gluonar.nn.basic_blocks import STFTBlock

data = rosa.load(r"resources/speaker_recognition/speaker0_0.m4a", sr=16000)[0][:35840]
nd_data = mx.nd.array([data], ctx=mx.gpu())

stft = STFTBlock(35840, hop_length=160, win_length=400)
stft.initialize(ctx=mx.gpu())

# stft block forward
ret = stft(nd_data).asnumpy()[0][0]
spec = np.transpose(ret, (1, 0)) ** 2
view_spec(spec)

# stft in librosa
spec = rosa.stft(data, hop_length=160, win_length=400, window="hamming")
spec = np.abs(spec) ** 2
view_spec(spec)
```
Output:

*(spectrogram figures: STFTBlock vs. STFT in librosa)*
For more examples, please refer to examples/.
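The feature list above notes that the processing blocks support Hybridize() for end-to-end deployment. The following is a minimal sketch, assuming STFTBlock behaves like a regular Gluon HybridBlock, of hybridizing and exporting such a block:

```python
import mxnet as mx
from gluonar.nn.basic_blocks import STFTBlock

# build and initialize the block as in the example above
stft = STFTBlock(35840, hop_length=160, win_length=400)
stft.initialize(ctx=mx.cpu())

# hybridize so the forward pass runs on a cached symbolic graph
stft.hybridize()

# one forward pass triggers graph construction, after which the
# symbol and parameters can be exported for deployment
dummy = mx.nd.random.uniform(shape=(1, 35840))
_ = stft(dummy)
stft.export("stft_block")  # writes stft_block-symbol.json / stft_block-0000.params
```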
Install¶
Requirements¶
mxnet-1.5.0+, gluonfr, av, librosa, …
The choice of audio library is mainly driven by data loading speed: during training, audio decoding consumes much more time than image decoding. In practice, loading a short aac-encoded clip from disk with librosa takes about 8 times as long as with pyav.
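As a rough illustration of how such a comparison can be made (a sketch only; the file path is a placeholder and the measured ratio depends on the codec and hardware):

```python
import time
import av
import librosa

path = "sample.m4a"  # placeholder path to a short aac-encoded clip

# decode with librosa
t0 = time.time()
librosa.load(path, sr=16000)
t_librosa = time.time() - t0

# decode the same clip with pyav
t0 = time.time()
with av.open(path) as container:
    frames = [frame.to_ndarray() for frame in container.decode(audio=0)]
t_pyav = time.time() - t0

print(f"librosa: {t_librosa:.3f}s  pyav: {t_pyav:.3f}s")
```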
- librosa

  ```
  pip install librosa
  ```

- ffmpeg

  ```
  # download the ffmpeg source and enter its root directory
  ./configure --extra-cflags=-fPIC --enable-shared
  make -j
  sudo make install
  ```

- pyav (requires ffmpeg to be installed first)

  ```
  pip install av
  ```

- gluonfr

  ```
  pip install git+https://github.com/THUFutureLab/gluon-face.git@master
  ```
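After installing, a quick import check (a simple sanity sketch, not part of the project):

```python
# verify the core dependencies are importable
import mxnet
import av
import librosa
import gluonfr

print(mxnet.__version__, librosa.__version__)
```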
Datasets¶
TIMIT¶
The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) Training and Test Data. Before using this dataset, please follow the instructions at the link.
A copy of this was uploaded to Google Drive by @philipperemy here.
Pretrained Models¶
Speaker Recognition¶
ResNet18 training with VoxCeleb¶
Download: Baidu, Google Drive
I followed the ideas in the VoxCeleb2 paper (1806.05622) to train this model; the differences between them are:
GluonAR API¶
gluonar.data¶
This module provides popular audio recognition datasets.
Hint
Please refer to Datasets for the description of the datasets listed in this page, and how to download and extract them.
gluonar.data.transforms¶
This file includes various transformations that are critical to audio tasks.
API Reference¶
Functions for audio processing and transforms.
gluonar.model_zoo¶
Models for audio recognition
gluonar.model_zoo.get_model¶
Returns a pre-defined GluonAR model by name.
Hint
This is the recommended method for getting a pre-defined model.
| Function | Description |
|---|---|
| get_model | Returns a pre-defined GluonAR model by name. |
API Reference¶
Models for audio recognition
gluonar.nn¶
Neural Network Components.
Hint
Not every component listed here is a HybridBlock, which means some of them are not hybridizable. However, we are trying our best to make sure components required during inference are hybridizable so the entire network can be exported and run in other languages.
For example, encoders are usually non-hybridizable but are only required during training. In contrast, decoders are mostly `HybridBlock`s.
Basic Blocks¶
Blocks that are usually used in audio processing.
| Block | Description |
|---|---|
| SincConv1D | Sinc Conv Block from “Speaker Recognition from Raw Waveform with SincNet” paper. |
| ZScoreNormBlock | Zero Score Normalization Block. |
| STFTBlock | Short-Time Fourier Transform Block. |
| DCT1D | Compute the Discrete Cosine Transform of input data. |
| MelSpectrogram | Compute a mel-scaled spectrogram. |
| MFCC | Mel-frequency cepstral coefficients (MFCCs). |
| PowerToDB | Convert a power spectrogram (amplitude squared) to decibel (dB) units. |
API Reference¶
Basic Blocks used in GluonAR.
class gluonar.nn.basic_blocks.SincConv1D¶

Sinc Conv Block from “Speaker Recognition from Raw Waveform with SincNet” paper.
class gluonar.nn.basic_blocks.ZScoreNormBlock¶

Zero Score Normalization Block.
class gluonar.nn.basic_blocks.STFTBlock¶

Short-Time Fourier Transform Block.

Parameters:
- audio_length (int) – target audio length.
- n_fft (int > 0 [scalar]) – length of the FFT window.
- hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft.
- win_length (int <= n_fft [scalar]) – Each frame of audio is windowed by window(). The window will be of length win_length and then padded with zeros to match n_fft. If unspecified, defaults to win_length = n_fft.
- window (string [shape=(n_fft,)]) – A window specification (string, tuple, or number); see scipy.signal.get_window.
- center (boolean) – If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length].
- power (float > 0 [scalar]) – Exponent for the magnitude melspectrogram, e.g. 1 for energy, 2 for power, etc.

Inputs:
- x: the input audio signal, with shape (batch_size, audio_length).

Outputs:
- specs: specs tensor with shape (batch_size, 1, num_frames, n_fft/2).

Notes

num_frames is calculated as 1 + (len(y) - n_fft) / hop_length when center is True, and, unlike librosa, the output should be transposed before visualization.
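As a quick check of the frame-count formula above (illustrative values, not defaults taken from the library):

```python
# expected number of frames: 1 + (len(y) - n_fft) / hop_length
audio_length, n_fft, hop_length = 4096, 512, 256
num_frames = 1 + (audio_length - n_fft) // hop_length
print(num_frames)  # 15
```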
class gluonar.nn.basic_blocks.DCT1D¶

Compute the Discrete Cosine Transform of input data. This block follows the scipy implementation.

DCT1D computes the DCT along the last axis for any input with more than two dimensions.

Parameters:
- mode ({1, 2, 3}, optional) – Type of the DCT (see Notes). Default type is 2.
- N (int) – Length of the transform. The required value is N = x.shape[axis].
- norm ({None, 'ortho'}, optional) – Normalization mode (see Notes). Default is None.

Notes

Type II

There are several definitions of the DCT-II; this block uses the scipy definition (for norm=None):

\[y[k] = 2 \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi k (2n+1)}{2N}\right), \quad 0 \le k < N.\]

If norm='ortho', y[k] is multiplied by a scaling factor f:

\[f = \begin{cases} \sqrt{1/(4N)} & \text{if } k = 0, \\ \sqrt{1/(2N)} & \text{otherwise,} \end{cases}\]

which makes the corresponding matrix of coefficients orthonormal (OO' = Id).
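The DCT-II definition above can be checked against scipy directly (a purely illustrative numpy sketch; it does not call the gluonar block):

```python
import numpy as np
from scipy.fftpack import dct

x = np.array([1.0, 2.0, 3.0, 4.0])
N = len(x)
n = np.arange(N)

# direct evaluation of y[k] = 2 * sum_n x[n] * cos(pi * k * (2n + 1) / (2N))
y_manual = np.array(
    [2 * np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N))) for k in range(N)]
)

y_scipy = dct(x, type=2, norm=None)
print(np.allclose(y_manual, y_scipy))  # True
```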
class gluonar.nn.basic_blocks.MelSpectrogram¶

Compute a mel-scaled spectrogram.

Parameters:
- audio_length (int) – target audio length.
- sr (number > 0 [scalar]) – sampling rate of audio.
- n_fft (int > 0 [scalar]) – length of the FFT window.
- hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft.
- power (float > 0 [scalar]) – Exponent for the magnitude melspectrogram, e.g. 1 for energy, 2 for power, etc.
- others (additional arguments) – Mel filter bank parameters. See librosa.filters.mel for details.
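A hedged usage sketch based on the parameter list above; the keyword names and the extra n_mels argument are assumptions and should be checked against the source:

```python
import mxnet as mx
from gluonar.nn.basic_blocks import MelSpectrogram

# keyword names assumed to match the documented parameters
melspec = MelSpectrogram(audio_length=35840, sr=16000, n_fft=2048,
                         hop_length=160, power=2.0, n_mels=128)
melspec.initialize(ctx=mx.cpu())

audio = mx.nd.random.uniform(low=-1, high=1, shape=(2, 35840))
out = melspec(audio)
print(out.shape)
```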
class gluonar.nn.basic_blocks.MFCC¶

Mel-frequency cepstral coefficients (MFCCs).

Parameters:
- audio_length (int) – target audio length.
- sr (number > 0 [scalar]) – sampling rate of y.
- n_mfcc (int > 0 [scalar]) – number of MFCCs to return.
- dct_type (None, or {1, 2, 3}) – Discrete cosine transform (DCT) type. Currently only DCT type-2 is used.
- norm (None or 'ortho') – If dct_type is 2 or 3, setting norm='ortho' uses an ortho-normal DCT basis. Normalization is not supported for dct_type=1.

See also: librosa.melspectrogram, scipy.fftpack.dct
class gluonar.nn.basic_blocks.PowerToDB¶

Convert a power spectrogram (amplitude squared) to decibel (dB) units. This is modified from librosa.power_to_db to support batch input.

For input of shape (batch, channel, w, h) or (batch, w, h), this block computes power-to-dB along the last two axes.

Parameters:
- ref (float) – The amplitude abs(S) is scaled relative to ref: 10 * log10(S / ref).
- amin (float > 0 [scalar]) – minimum threshold for abs(S) and ref.
- top_db (float >= 0 [scalar]) – threshold the output at top_db below the peak: max(10 * log10(S)) - top_db.
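For reference, a minimal numpy sketch of the underlying conversion, mirroring librosa.power_to_db rather than the gluonar block itself:

```python
import numpy as np

def power_to_db(S, ref=1.0, amin=1e-10, top_db=80.0):
    # 10 * log10(S / ref), with a floor of `amin` to avoid log of zero
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # clip everything more than `top_db` below the peak
    if top_db is not None:
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec

spec = np.random.rand(128, 256) ** 2  # fake power spectrogram
print(power_to_db(spec).shape)
```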
gluonar.loss¶
Custom losses. All losses are subclasses of gluon.loss.SoftmaxCrossEntropyLoss, which is itself a HybridBlock.
| Loss | Description |
|---|---|
| gluonar.loss.ArcLoss | ArcLoss from “ArcFace: Additive Angular Margin Loss for Deep Face Recognition” paper. |
| gluonar.loss.RingLoss | Computes the Ring Loss from “Ring loss: Convex Feature Normalization for Face Recognition” paper. |
API Reference¶
Custom losses. All losses are subclasses of gluon.loss.SoftmaxCrossEntropyLoss, which is itself a HybridBlock.
gluonar.loss.get_loss(name, **kwargs)[source]¶

Parameters:
- name (str) – Loss name; check gluonar.loss for details.
- kwargs (str) – Params.

Returns: The loss.

Return type: HybridBlock
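A usage sketch; the loss name "arcloss" is an assumed registry key, so check gluonar.loss for the actual names:

```python
from gluonar.loss import get_loss

# "arcloss" is an assumed name; kwargs follow the ArcLoss parameters below
loss = get_loss("arcloss", classes=1251, m=0.5, s=64)
```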
class gluonar.loss.ArcLoss¶

ArcLoss from “ArcFace: Additive Angular Margin Loss for Deep Face Recognition” paper.

Parameters:
- classes (int) – Number of classes.
- m (float) – Margin parameter for loss.
- s (int) – Scale parameter for loss.

Outputs:
- loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
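A minimal usage sketch following the documented parameters (the class count and inputs are made-up values; as a loss block it is called like gluon.loss.SoftmaxCrossEntropyLoss):

```python
import mxnet as mx
from gluonar.loss import ArcLoss

arc_loss = ArcLoss(classes=1251, m=0.5, s=64)

# dummy logits (e.g. cosine similarities) and integer labels
pred = mx.nd.random.uniform(-1, 1, shape=(8, 1251))
label = mx.nd.array([0, 3, 5, 7, 2, 9, 1, 4])

loss = arc_loss(pred, label)
print(loss.shape)  # (8,)
```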
class gluonar.loss.RingLoss¶

Computes the Ring Loss from “Ring loss: Convex Feature Normalization for Face Recognition” paper.

\[L = -\sum_i \log \operatorname{softmax}({pred})_{i,{label}_i} + \frac{\lambda}{2m} \sum_{i=1}^{m} (\Vert \mathcal{F}({x}_i)\Vert_2 - R )^2\]

Parameters:
- lamda (float) – The loss weight enforcing a trade-off between the softmax loss and ring loss.

Outputs:
- loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
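A small numpy sketch of the ring-loss regularization term from the formula above (illustrative only, not the block's actual forward pass):

```python
import numpy as np

def ring_term(embeddings, R, lamda):
    # lambda / (2m) * sum_i (||F(x_i)||_2 - R)^2
    m = embeddings.shape[0]
    norms = np.linalg.norm(embeddings, axis=1)
    return lamda / (2 * m) * np.sum((norms - R) ** 2)

emb = np.random.randn(8, 128)  # fake embeddings F(x_i)
print(ring_term(emb, R=1.0, lamda=0.01))
```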