nnAudio

paper, doc

ABSTRACT

In this paper, we present nnAudio, a new neural-network-based audio processing framework with GPU support that uses 1D convolutional neural networks to perform time-domain to frequency-domain conversion. Because it is fast, spectrograms can be extracted in real time without storing any spectrograms on disk. Moreover, the method allows back-propagation through the waveform-to-spectrogram transformation layer, so the transformation process can be trained, further optimizing the waveform-to-spectrogram conversion for the specific task the neural network is trained on. All of the spectrogram implementations scale linearly (in Big-O terms) with the input length. nnAudio, however, leverages the compute unified device architecture (CUDA) through PyTorch's 1D convolutional neural networks, and its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than other implementations that run only on the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and it significantly reduces spectrogram extraction time from the order of seconds (using the popular python library librosa) to the order of milliseconds, for audio recordings of the same length. When applying nnAudio to variable-length input audio, extracting 34 spectrogram types with different parameters from the MusicNet dataset takes 11.5 hours on average with librosa; nnAudio needs only 2.8 hours on average, still about four times faster than librosa. Our proposed framework also outperforms existing GPU-based processing libraries such as Kapre and torchaudio in terms of processing speed.

INTRODUCTION

Spectrograms, as time-frequency representations of audio signals, have been used as inputs to neural network models since the 1980s [1]-[3]. Different types of spectrograms are suited to different applications. For example, Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs) were designed for speech-related applications [4], [5], while the constant-Q transform is best suited for music-related applications [6], [7]. Despite recent advances in end-to-end learning in the audio domain, such as WaveNet [8] and SampleCNN [9], which make model training on raw audio data possible, many recent publications still use spectrograms as model input for a variety of applications [10]. These applications include speech recognition [11], [12], speech emotion detection [13], speech-to-speech translation [14], speech enhancement [15], voice separation [16], singing voice conversion [17], music tagging [18], cover song detection [19], melody extraction [20], and polyphonic music transcription [21]. One drawback of training end-to-end models on raw audio data is the longer training time. As pointed out by Lee et al. [11], a model that takes raw audio as input requires four times longer training, and this longer training yields only slightly better performance compared to a similar model that takes spectrograms as input.

Using spectrograms as input is not without drawbacks, however. Each recording can be converted into different spectrograms using different algorithms and parameters. Finding the audio transformation best suited to a specific task may require trial and error. The usual way of running such trial-and-error experiments is to convert the audio clips into various frequency-domain representations and store each representation on the hard disk. Neural networks are then trained on each representation, and the best-performing model is selected. Once the best frequency-domain representation has been identified, the transformation parameters, such as the window size and the number of frequency bins, can be fine-tuned further to obtain even better results.

Performing a parameter search to find the best spectrogram input raises two major problems. First, a large amount of hard disk space is needed to store the different frequency-domain representations produced by the different parameter settings. Given a dataset containing 20 GB of audio recordings (e.g. MusicNet [22]), experimenting with different spectrogram types and parameters can easily consume up to 1 TB of hard disk space. A detailed case study is discussed in Section V-A. Second, the audio processing step is usually done separately from model training. To combine the processing step and model training into one continuous pipeline, on-the-fly spectrogram extraction is required. Existing methods for time-frequency conversion of audio files, however, are too slow for real-time spectrogram extraction. Most of the applications mentioned above use librosa [23], a popular CPU-based python audio processing library. To use librosa together with a neural network model, the spectrograms must be constantly transferred from the CPU to the GPU, since model training is done on the GPU. To make this process more efficient, it would be better to have a library that processes spectrograms directly on the GPU.

There have been several attempts at GPU-based spectrogram extraction. Tensorflow [24] has a tf.signal package that performs the fast Fourier transform (FFT) and short-time Fourier transform (STFT) on the GPU. There is a high-level API called Keras for those who want to quickly build neural networks without working with Tensorflow sessions, and Kapre [25] is the Keras counterpart for GPU-based audio processing. Similarly, PyTorch [26] has recently developed torchaudio, but at the time of writing this tool was not yet fully integrated into PyTorch. Moreover, torchaudio requires Libsox as an additional dependency, and its installation often requires considerable troubleshooting [27]; for example, torchaudio is currently not compatible with Windows 10 [28]. Among these three tools, only Kapre and torchaudio support audio-to-Mel-spectrogram conversion, and none of the existing libraries supports the constant-Q transform (CQT). Furthermore, only Kapre supports neural-network-based signal processing, as it is the only implementation with trainable kernels for the time-to-frequency-domain transformation. Because of its Tensorflow backend, however, Kapre cannot be integrated with the popular machine learning library PyTorch. Despite their GPU support and differentiability, torchaudio and tf.signal are not neural-network-based, meaning there are no trainable parameters that can be learned or fine-tuned during neural network training. Although torch-stft is implemented in native PyTorch without any additional dependencies, only the STFT is available.

Therefore, to bridge this gap, we introduce a fast, differentiable, and trainable neural-network-based audio processing framework called nnAudio [29]. To ensure seamless integration with one of the most popular machine learning libraries, we built our spectrogram extraction methods in PyTorch. This way, our library can be used as a PyTorch neural network layer, and all functionality available in PyTorch, such as data augmentation, can be used together with nnAudio. Compared with other existing libraries, our proposed framework also includes extended functionality such as the computation of Mel spectrograms and the constant-Q transform. More specifically, we implement the transformation algorithms with 1D convolutional layers, which makes spectrogram extraction in nnAudio a trainable process (see Section V-B). nnAudio is therefore useful when exploring different input representations for neural network models [30], [31]. Since our proposed framework is neural-network-based, audio processing can be integrated into model training, as shown in Fig. 1(b). That is, audio processing and model training no longer need to be done separately, as in the traditional approach shown in Fig. 1(a); nnAudio performs on-the-fly spectrogram extraction and model training at the same time. In Section IV-A, we discuss the performance improvement of this approach compared to the traditional approach in Fig. 1(a).

NEURAL NETWORK-BASED FRAMEWORK

A. SHORT-TIME FOURIER TRANSFORM (STFT)

We can compute the canonical discrete Fourier transform (DFT) quickly on the GPU by expressing the vector multiplication in the DFT as a one-dimensional linear convolution operation.

Discrete linear convolution of a kernel \(h\) with a signal \(x\) is defined as follows,

\[(h\text{*}x)[n]=\sum_{m=0}^{M-1}x[n-m]h[m] \]

where \(M\) is the length of the kernel \(h\).

PyTorch defines a convolution function with a stride argument. The one dimensional convolution of \(x\) with \(h\) using a stride setting of \(k\), denoted by the symbol \(\text{*}^{k}\) is,

\[(h\text{*}^{k}x)[n]=\sum_{m=0}^{M-1}x[kn-m]h[m] \]

We can use convolution with stride to make fast GPU-based implementations of the STFT.

To do this, we take each basis vector of the DFT as the filter kernel \(h\), and compute the convolution with the input signal \(x\) once for each basis vector. We set the stride value according to the amount of overlap that we want to have between each DFT window.

The following expressions are the pair of convolution kernels that represent the real and imaginary components of the \(k^{th}\) DFT basis vector respectively,

\[h_{re}[k,n]=\cos\Big(2\pi k\frac{N-n-1}{N}\Big) \]

\[h_{im}[k,n]=\sin\Big(2\pi k\frac{N-n-1}{N}\Big) \]

We can implement window smoothing efficiently by multiplying the window function elementwise with the filter kernels \(h_{re}\) and \(h_{im}\) before performing the convolution.
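
The snippet below is a minimal sketch of this strided-convolution STFT (not nnAudio's actual code); the function name `conv_stft`, the Hann window, and the default sizes are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn.functional as F

def conv_stft(x, n_fft=2048, hop=512):
    """Magnitude STFT via strided 1D convolution.
    x: (batch, samples) float tensor. Returns (batch, n_fft//2 + 1, frames)."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]    # real input -> only n_fft//2 + 1 unique bins
    window = np.hanning(n_fft)
    # F.conv1d computes cross-correlation (no kernel flip), so the DFT basis is used directly;
    # with true convolution one would use the time-reversed kernels h[k, N-n-1] from above.
    h_re = torch.tensor(np.cos(2 * np.pi * k * n / n_fft) * window, dtype=torch.float32)
    h_im = torch.tensor(np.sin(2 * np.pi * k * n / n_fft) * window, dtype=torch.float32)
    x = x.unsqueeze(1)                                    # (batch, 1, samples)
    real = F.conv1d(x, h_re.unsqueeze(1), stride=hop)     # stride = hop size between frames
    imag = F.conv1d(x, h_im.unsqueeze(1), stride=hop)
    return torch.sqrt(real ** 2 + imag ** 2)

# spec = conv_stft(torch.randn(1, 44100))   # -> (1, 1025, 83) for 1 s of 44.1 kHz audio
```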

B. MEL SPECTROGRAM

The traditional frequency to Mel scale conversion is the one mentioned in O’Shaughnessy’s book [51], which was implemented in the HTK Speech Recognition toolkit [52], shown below,

\[m=2595\log_{10}\Big(1+\frac{f}{700}\Big) \]

We refer to this form as ‘htk’ later on. The equation below shows another form that is used in the Auditory Toolbox for MATLAB [53] and in librosa (a python audio processing library) [23].

\[m= \begin{cases}\frac{3 f}{200}, & \text { if } 0 \mathrm{~Hz} \leq f \leq 1000 \mathrm{~Hz} \\ \frac{3000}{200}+\frac{27 \ln (f / 1000)}{\ln 6.4}, & \text { if } f \geq 1000 \mathrm{~Hz}\end{cases} \]
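
As a small sketch of these two Hz-to-Mel conventions (helper names are hypothetical; the constant 15 in the second branch is simply \(3\cdot 1000/200\) evaluated at 1 kHz):

```python
import numpy as np

def hz_to_mel_htk(f):
    """'htk' convention (O'Shaughnessy / HTK)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_mel_slaney(f):
    """Auditory Toolbox / librosa convention: linear below 1 kHz, logarithmic above."""
    f = np.asarray(f, dtype=float)
    linear = 3.0 * f / 200.0
    log_part = 15.0 + 27.0 * np.log(np.maximum(f, 1e-12) / 1000.0) / np.log(6.4)
    return np.where(f < 1000.0, linear, log_part)

# hz_to_mel_htk(1000.0) ~= 1000.0, while hz_to_mel_slaney(1000.0) == 15.0
```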

We obtain the STFT results using the PyTorch 1D convolutional neural network, and then we use Mel filter banks obtained from librosa. The values of the Mel filter banks are used to initialize the weights of a single-layer fully-connected neural network. Each time step of the magnitude STFT is fed forward into this fully connected layer initialized with Mel weights.
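
A minimal sketch of this idea follows; it uses `librosa.filters.mel` for the filter bank, while the variable names and the example sizes (`sr`, `n_fft`, `n_mels`) are assumptions.

```python
import torch
import torch.nn as nn
import librosa

sr, n_fft, n_mels = 22050, 2048, 128
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # (n_mels, n_fft//2 + 1)

# Single fully-connected layer whose weights are initialised with the Mel filter bank.
mel_layer = nn.Linear(n_fft // 2 + 1, n_mels, bias=False)
with torch.no_grad():
    mel_layer.weight.copy_(torch.tensor(mel_fb, dtype=torch.float32))

# magnitude STFT of shape (batch, n_fft//2 + 1, frames), e.g. from conv_stft above:
# mel_spec = mel_layer(magnitude.transpose(1, 2)).transpose(1, 2)    # (batch, n_mels, frames)
```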

C. CONSTANT-Q TRANSFORM

1) A QUICK OVERVIEW OF THE CONSTANT-Q TRANSFORM (1992 VERSION)

challenges:

  • The frequency of a musical pitch doubles with every octave.
  • If the frequencies of the basis functions of the discrete Fourier transform are modified so that the centre frequencies of the bins form a geometric series, the resulting basis is non-orthogonal, and the relationship between input and output energy becomes much more complicated.
  • Moreover, there will be wide gaps between frequency bins at higher frequencies; these gaps are so wide that high-frequency tones lying between bins will not be detected at all.

The constant-Q transform, first proposed by Brown in 1991 [6], is a modification of the discrete Fourier transform in which the window size \(N_{k_{cq}}\) scales inversely with the centre frequency of the CQT bin \(k_{cq}\), so that a fixed number of sine and cosine cycles fits within each window.

In the context of the CQT, Q is defined to be the number of cycles of oscillation in each basis vector. The corresponding equation for Q is,

\[Q=(2^{\frac{1}{b}}-1)^{-1} \]

where \(b\) is the number of bins per octave.
Once \(Q\) is known, we can calculate the window size \(N_{k_{cq}}\) for each bin \(k_{cq}\) by

\[N_{k_{cq}}=\Big\lceil\frac{s}{f_{k_{cq}}}\Big\rceil Q \]

where \(s\) is the sampling rate and \(f_{k_{cq}}\) is the centre frequency of bin \(k_{cq}\).
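
As a quick worked example (with assumed values: \(b=24\) bins per octave, sampling rate \(s=44{,}100\) Hz, and a bin centred at \(f_{k_{cq}}=55\) Hz):

\[Q=(2^{\frac{1}{24}}-1)^{-1}\approx 34.13,\qquad N_{k_{cq}}=\Big\lceil\frac{44100}{55}\Big\rceil\times 34.13\approx 27{,}370 \]

which illustrates how long the kernels become for low-frequency bins.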

The equation for the CQT is very similar to that of the DFT, with the frequency index \(k\) replaced by the constant \(Q\) and the fixed window size \(N\) replaced by the varying window size \(N_{k_{cq}}\),

\[X^{cq}[k_{cq}]=\sum_{n=0}^{N_{k_{cq}}-1}x[n]e^{-2\pi i Q\frac{n}{N_{k_{cq}}}} \]

The CQT maintains a constant ratio of centre frequency to frequency resolution by keeping a constant Q value, whereas the logarithmic-frequency STFT has a varying Q.

2) CQT USING NEURAL NETWORKS

The naive implementation of the CQT consists of looping through all of the kernels one by one and calculating the dot product between the kernel \(e^{-2\pi iQn/N_{k_{cq}}}\) and the input signal \(x\). This type of implementation, however, is not feasible for our 1D convolution approach, because most neural network frameworks only support a fixed kernel size across different channels of a 1D convolutional layer.

Youngberg and Boll [57] first proposed the concept of the CQT in 1978. Brown later proposed an efficient way to calculate the CQT in 1992 [7]. The trick is to use Parseval's equation [45],

\[\sum_{n=0}^{N-1}a[n]b[n]=\frac{1}{N}\sum_{k=0}^{N-1}A[k]B[k] \]

where \(a[n]\) and \(b[n]\) are arbitrary functions in the time domain, and \(A[k]\) and \(B[k]\) are the frequency domain versions of \(a[n]\) and \(b[n]\), respectively.

If we define \(X[k]\) and \(Y[k]\) as the DFT of the input and of the kernel \(e^{-2\pi iQn/N_{k_{cq}}}\), respectively, then this approach converts both \(x[n]\) and \(e^{-2\pi iQn/N_{k_{cq}}}\) to the frequency domain, and subsequently multiplies them together to obtain the approximated CQT as,

\[\begin{aligned} X^{cq}[k_{cq}] &=\sum_{n=0}^{N_{k_{cq}}-1} x[n]\, e^{-2\pi i Q\frac{n}{N_{k_{cq}}}} \\ &=\frac{1}{N}\sum_{k=0}^{N-1} X[k]\, Y[k, k_{cq}] \end{aligned}\]

It should be noted that both \(X[k]\) and \(Y[k]\) are matrices containing complex numbers, and \(N\) is the longest window size for the CQT kernels, i.e. the length of the kernel with the lowest frequency. Also, \(Y[k]\) is a sparse matrix in this case.

Figure: each time step of the STFT result \(X[k]\) is multiplied with the same CQT kernels \(Y[k, k_{cq}]\).

In the end we obtain a CQT matrix \(X^{cq}[k_{cq}]\) with real and imaginary parts, and the final CQT output is calculated as the element-wise magnitude \(\left|X^{cq}[k_{cq}]\right|\).
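
Below is a minimal sketch of this frequency-domain multiplication (function and argument names are hypothetical). It assumes the full \(N\)-bin complex STFT of the input has already been computed with window size \(N\), and that the frequency-domain CQT kernels \(Y\) have been precomputed.

```python
import torch

def cqt_from_stft(stft_real, stft_imag, Y_re, Y_im):
    """stft_real/imag: (batch, N, frames) full N-bin STFT of the input signal.
    Y_re/Y_im: (n_cqt_bins, N) real and imaginary parts of the CQT kernel spectra."""
    N = Y_re.shape[1]
    # Complex matrix product sum_k X[k] * Y[k, k_cq], applied independently to every frame.
    real = torch.einsum('kn,bnt->bkt', Y_re, stft_real) - torch.einsum('kn,bnt->bkt', Y_im, stft_imag)
    imag = torch.einsum('kn,bnt->bkt', Y_re, stft_imag) + torch.einsum('kn,bnt->bkt', Y_im, stft_real)
    return torch.sqrt(real ** 2 + imag ** 2) / N          # magnitude CQT with the 1/N factor
```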

major flaw: the longest time-domain CQT kernel has a window size of 54,727 samples, so the kernels, and hence the STFT window \(N\) required by this approach, become very large and computationally expensive.

3) DOWNSAMPLING

How can downsampling be done with a neural network? The anti-aliasing low-pass filter required before decimation can be expressed as a finite impulse response (FIR) filter, which is itself a 1D convolution:

\[y[n]=\sum_{i=0}^{N}b_{i}x[n-i] \]

where \(x[n-i]\) is the input signal delayed by \(i\) samples and \(b_{i}\) are the FIR filter coefficients.
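
A minimal sketch of downsampling by a factor of two follows (the function name is hypothetical, and SciPy's `firwin` is assumed for designing the low-pass kernel); the filtering and decimation are fused into a single strided 1D convolution.

```python
import torch
import torch.nn.functional as F
from scipy.signal import firwin

def downsample_by_2(x, n_taps=101):
    """Anti-aliasing low-pass FIR filter followed by keeping every second sample.
    x: (batch, samples) float tensor."""
    b = firwin(n_taps, cutoff=0.5)                        # cutoff at half the Nyquist rate
    kernel = torch.tensor(b, dtype=torch.float32).view(1, 1, -1)
    y = F.conv1d(x.unsqueeze(1), kernel, stride=2, padding=n_taps // 2)
    return y.squeeze(1)                                    # roughly (batch, samples // 2)
```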

4) CONSTANT-Q TRANSFORM (2010 VERSION)

Since low frequency audio signals can be accurately represented with lower sample rates, we can compute the lower frequency components of the CQT more efficiently by downsampling the input and using correspondingly shorter filter kernels.

5) CQT WITH TIME DOMAIN KERNELS


EXAMPLE APPLICATIONS

A. EXPLORING DIFFERENT INPUT REPRESENTATIONS

Four types of spectrograms are explored: the linear-frequency-scale spectrogram (LinSpec), the logarithmic-frequency-scale spectrogram (LogSpec), the Mel spectrogram (MelSpec), and the CQT.

For LinSpec and LogSpec, we want to explore five different sizes of Fourier kernels. For MelSpec, we will be exploring four different sizes of Fourier kernels, and for each of these kernels, the number of Mel filter banks will be varied. Finally, for CQT, ten different bins per octave will be examined. This means that there will be a total of 34 different input representations.

nnAudio is a useful tool for quickly experimenting with different spectrogram types and parameter settings, without any pre-processing or pre-computed spectrograms.

B. TRAINABLE TRANSFORMATION KERNELS

Because we implement STFT and MelSpec with a 1D convolutional neural network whereby the neuron weights correspond to the Fourier kernels and Mel filter banks, it is possible to further finetune these kernels and filter banks together with the model via gradient descent.

By allowing the neural network to further train or finetune the Mel filter banks and CQT kernels, we allow a richer spectrogram to be obtained. This provides the frequency prediction models, regardless of the network architecture, with more information so as to reach a lower MSE loss.
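
As a hedged sketch of this idea (not nnAudio's actual API; the class and parameter names are made up), the Fourier kernels from the STFT section can be wrapped as `nn.Parameter`s so that gradient descent updates them together with the downstream model:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableSTFT(nn.Module):
    def __init__(self, n_fft=2048, hop=512, trainable=True):
        super().__init__()
        self.hop = hop
        n = np.arange(n_fft)
        k = np.arange(n_fft // 2 + 1)[:, None]
        win = np.hanning(n_fft)
        re = torch.tensor(np.cos(2 * np.pi * k * n / n_fft) * win, dtype=torch.float32)
        im = torch.tensor(np.sin(2 * np.pi * k * n / n_fft) * win, dtype=torch.float32)
        # With trainable=False this is a fixed STFT front end; with True the kernels are fine-tuned.
        self.h_re = nn.Parameter(re.unsqueeze(1), requires_grad=trainable)
        self.h_im = nn.Parameter(im.unsqueeze(1), requires_grad=trainable)

    def forward(self, x):                                  # x: (batch, samples)
        x = x.unsqueeze(1)
        real = F.conv1d(x, self.h_re, stride=self.hop)
        imag = F.conv1d(x, self.h_im, stride=self.hop)
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)    # magnitude spectrogram

# front_end = TrainableSTFT(trainable=True)
# optimiser = torch.optim.Adam(list(front_end.parameters()) + list(model.parameters()))
```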

posted @ 2023-01-15 19:13  prettysky