Nonlinear residual echo suppression based on dual-stream DPRNN
EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 35 (2021)
Abstract
The acoustic echo cannot be entirely removed by linear adaptive filters due to the nonlinear relationship between the echo and the far-end signal. Usually, a post-processing module is required to further suppress the echo. In this paper, we propose a residual echo suppression method based on a modification of the dual-path recurrent neural network (DPRNN) to improve the quality of speech communication. Both the residual signal and an auxiliary signal (either the far-end signal or the output of the adaptive filter) obtained from the linear acoustic echo cancelation are adopted to form a dual stream for the DPRNN. We validate the efficacy of the proposed method in the notoriously difficult double-talk situations and discuss the impact of different auxiliary signals on performance. We also compare the performance of time-domain and time-frequency-domain processing. Furthermore, we propose an efficient and applicable way to deploy our method to off-the-shelf loudspeakers by fine-tuning the pre-trained model with a small amount of recorded-echo data.
Introduction
The acoustic echo is generated from the coupling between the loudspeaker and the microphone in full-duplex hands-free telecommunication systems or smart speakers. It severely deteriorates the quality of speech communication and significantly degrades the performance of automatic speech recognition (ASR) within the smart speakers. Typical linear acoustic echo cancelation (LAEC) methods use adaptive algorithms to identify the impulse response between the loudspeaker and the microphone [1]. Time-domain least mean square (LMS) algorithms [2, 3] are often employed in delay-sensitive situations. Frequency-domain LMS algorithms are often utilized to guarantee both fast convergence speed and low computational load [2]. The frequency-domain adaptive Kalman filter (FDKF) [4] is also a commonly used method with several efficient variations proposed recently [5, 6].
The performance of LAEC methods severely degrades when nonlinear distortion is non-negligible in the acoustic echo path [7]. Usually, a residual echo suppression (RES) module is required to further suppress the echo. The RES is usually conducted by estimating the spectrum of the residual echo based on the far-end signal, filter coefficients, and the residual signal of LAEC [8–13]. However, it is difficult for signal-processing-based RES to strike a good balance between residual echo attenuation and near-end speech distortion.
Recently, deep neural networks (DNNs) have been introduced into RES due to their powerful capability of modeling nonlinear systems, in both time-domain and time-frequency (TF) domain methods. TF-domain methods adopt the short-time Fourier transform (STFT) to extract spectral features. The fully connected network (FCN) was employed to exploit multiple-input signals in RES [14]. The bidirectional or unidirectional recurrent neural network (RNN) was also introduced to RES [15–17]. These methods ignore the coupling between magnitude and phase and are unable to recover the phase information, leading to limited performance [18]. Inspired by the fully convolutional time-domain audio separation network (Conv-TasNet) [18], we proposed a RES method based on the multi-stream Conv-TasNet, where both the residual signal of the LAEC system and the output of the adaptive filter are adopted to form multiple streams [19]. The benefit of introducing the auxiliary signals into the network was validated by simulations. However, the model employs a complicated network structure and is not efficient enough in exploiting the information of multiple streams, resulting in a large number of parameters, which restricts its practical application. Moreover, the benefit of multiple streams is yet to be validated by experiments on off-the-shelf loudspeakers.
The dual-path recurrent neural network (DPRNN) [20] was recently proposed for the speech separation task and achieves state-of-the-art (SOTA) performance on the WSJ0-2mix dataset. It utilizes an encoder module for feature extraction and employs RNNs for time series modeling. To overcome the inefficiency of RNNs in modeling long sequences, DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations iteratively. Compared with Conv-TasNet, DPRNN shows superiority in both performance and parameter number [20]. Moreover, its RNN-based structure has advantages over Conv-TasNet in memory consumption when processing online.
In this paper, we extend our previous work on multi-stream Conv-TasNet. We adopt the residual signal of LAEC and the auxiliary signal to create two streams, and propose two DPRNN-structured networks in the time domain and the TF domain respectively to effectively exploit their information. To validate the efficacy of our proposed RES methods, we compare them with several typical methods on both an artificial-echo dataset and a recorded-echo dataset. Furthermore, we regard the well-trained model on the artificial-echo dataset as a pre-trained model and fine-tune it on the recorded-echo dataset. Different fine-tuning strategies are investigated to achieve a balance between the performance and the training cost.
Model description
Problem formulation
The AEC system with the RES post-filter is depicted in Fig. 1, where x(n) is the far-end signal, \( \hat{y}(n) \) is the output of the adaptive filter, and H(z) represents the echo path transfer function. The microphone signal d(n), consisting of the echo y(n), the near-end speech s(n), and background noise v(n), can be expressed as
\[ d(n) = y(n) + s(n) + v(n). \]
The residual signal of the LAEC, s_{AEC}(n), is given by subtracting the output of the adaptive filter \( \hat{y}(n) \) from the microphone signal d(n),
\[ s_{AEC}(n) = d(n) - \hat{y}(n), \]
with
\[ \hat{y}(n) = \hat{h}(n) \ast x(n), \]
where \( \hat{h}(n) \) denotes the adaptive filter and ∗ represents the convolution operation. Due to the inevitable nonlinearity in the echo path, the LAEC cannot perfectly attenuate the echo, and s_{AEC}(n) can be regarded as the mixture of the residual echo, background noise, and the near-end signal. The RES can be designed from the viewpoint of speech separation, but unlike the standard speech separation problem, the auxiliary information extracted from the adaptive filter can be exploited to improve the performance. In this paper, we employ s_{AEC}(n) together with an auxiliary signal, x(n) or \( \hat{y}(n) \), to construct a dual-stream DPRNN (DSDPRNN).
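The signal model above can be illustrated with a minimal sketch. It is not the authors' LAEC (which is a frequency-domain Kalman filter); the filter estimate `h_hat` is simply assumed to be given, and the helper names `convolve` and `laec_residual` as well as the toy signals are hypothetical.

```python
def convolve(h, x):
    """Causal FIR filtering: y[n] = sum_k h[k] * x[n - k], truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def laec_residual(d, x, h_hat):
    """Return (y_hat, s_aec) with y_hat = h_hat * x and s_aec = d - y_hat."""
    y_hat = convolve(h_hat, x)
    s_aec = [dn - yn for dn, yn in zip(d, y_hat)]
    return y_hat, s_aec


# Toy example: a purely linear echo path that the filter has identified exactly.
x = [1.0, 0.0, -1.0, 0.5]      # far-end signal x(n)
h = [0.5, 0.25]                # hypothetical echo path impulse response
s = [0.1, 0.2, 0.0, -0.1]      # near-end speech s(n)
v = [0.0, 0.0, 0.0, 0.0]       # background noise v(n), silent here
y = convolve(h, x)             # echo y(n)
d = [yn + sn + vn for yn, sn, vn in zip(y, s, v)]  # microphone signal d(n)
y_hat, s_aec = laec_residual(d, x, h_hat=h)
```

In this idealized linear case the residual equals the near-end speech. Once the echo path contains a nonlinearity, `s_aec` also carries residual echo, which is the component the RES network is meant to suppress.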
Model design
Figure 2 outlines the structure of our proposed DPRNN-based RES method, which consists of two encoder modules, a suppression module, and a decoder module. The two encoder modules are used to extract features from s_{AEC}(n) and the auxiliary signal to form two streams, streams A and B, respectively. The suppression module suppresses the residual echo and recovers the near-end signal by exploiting the information of streams A and B. The decoder transforms the output of the suppression module into masks and converts the masked feature back to the waveform. The difference between the time-domain and the TF-domain methods mainly lies in the encoder and the decoder, while the structure of the suppression module is the same.
Figure 3 shows the structure of the encoder and the decoder in the time-domain method. The encoder takes a time-domain waveform u as input and converts it into a time series of N-dimensional representations using a 1D convolutional layer with a kernel size L and 50% overlap, followed by a ReLU activation function,
\[ \boldsymbol{W} = \mathrm{ReLU}\left(\mathrm{Conv1D}(\boldsymbol{u})\right), \]
where \( \boldsymbol{W}\in {\mathbb{R}}^{G\times N} \) with length G is the output of the operation. Then, W is transformed into C-dimensional representations by a fully connected layer and divided into T=2G/K−1 chunks of length K, where the overlap between chunks is 50%. All chunks are then stacked together to form a 3D tensor \( \mathcal{W}\in {\mathbb{R}}^{T\times K\times C} \). The decoder applies the overlap-add operation to the output of the suppression module \( {\mathcal{Y}}_s\in {\mathbb{R}}^{T\times K\times C} \), followed by a PReLU activation [21], to form the output \( \boldsymbol{Q}\in {\mathbb{R}}^{G\times C} \). Then, an N-dimensional fully connected layer with a ReLU activation is applied to Q to obtain the mask of W, and the estimate of the clean speech representation \( \hat{\boldsymbol{S}} \) is obtained by
\[ \boldsymbol{Q} = \mathrm{PReLU}\left({f}_{\mathrm{OA}}\left({\mathcal{Y}}_s\right)\right), \qquad \hat{\boldsymbol{S}} = \boldsymbol{W}\odot \mathrm{ReLU}\left({f}_{{\mathrm{FC}}_2}\left(\boldsymbol{Q}\right)\right), \]
where \( {f}_{{\mathrm{FC}}_i},\kern1em i=1,2 \) represents the fully connected layers of the encoder and the decoder respectively, f_{OA} represents the overlap-add operation, and ⊙ denotes the element-wise multiplication. A 1D transposed convolution layer is utilized to convert the masked representation back to the waveform signal \( \hat{\boldsymbol{s}} \).
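The chunking with 50% overlap and the overlap-add step can be sketched in a few lines. This is a simplified illustration on plain lists rather than feature tensors, and it assumes that G is a multiple of K/2 so that exactly T = 2G/K − 1 chunks fit without padding; the function names are hypothetical.

```python
def segment(frames, K):
    """Split a length-G sequence into T = 2G/K - 1 chunks of length K (50% overlap)."""
    hop = K // 2
    G = len(frames)
    assert K % 2 == 0 and G % hop == 0 and G >= K, "sketch assumes no padding is needed"
    T = 2 * G // K - 1
    return [frames[t * hop: t * hop + K] for t in range(T)]


def overlap_add(chunks, K):
    """Inverse layout of `segment`: sum overlapping chunks back into a length-G sequence."""
    hop = K // 2
    G = (len(chunks) + 1) * hop
    out = [0.0] * G
    for t, chunk in enumerate(chunks):
        for k, value in enumerate(chunk):
            out[t * hop + k] += value
    return out


frames = [float(i) for i in range(8)]   # G = 8 encoder frames (scalars for brevity)
chunks = segment(frames, K=4)           # T = 2 * 8 / 4 - 1 = 3 chunks
restored = overlap_add(chunks, K=4)
```

Interior frames are covered by two chunks and are therefore summed twice by the overlap-add, while the first and last K/2 frames are covered once. In the network this sum is applied to learned features, so no explicit renormalization is shown here.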
The intra-chunk operation of DPRNN can also be applied in the frequency domain. Figure 4 shows the structure of the encoder and the decoder in the TF-domain method. We first obtain the TF representation \( \boldsymbol{Z}\in {\mathbb{C}}^{T^{\prime}\times F} \) by the STFT operation with a Q-point Hamming window and 50% overlap, where F=Q/2+1 is the number of effective frequency bins. We concatenate the real and imaginary components of Z to form a 3D tensor \( \mathcal{Z}\in {\mathbb{R}}^{T^{\prime}\times F\times 2} \). The 3D representation \( {\mathcal{W}}^{\prime}\in {\mathbb{R}}^{T^{\prime}\times {K}^{\prime}\times {C}^{\prime }} \) is then obtained by a 2D convolutional layer with C^{′} output channels. The kernel size is 5×5 and the stride is 1×2, where \( {K}^{\prime }=\frac{F-3}{2} \) is the number of downsampled frequency bins. The number of frames, the chunk size, and the feature dimension T^{′},K^{′},C^{′} correspond to T,K,C in the time-domain encoder respectively, and the output is further processed by the same suppression module. The decoder takes the output of the suppression module \( {\mathcal{Y}}_s^{\prime } \) as input and successively applies two fully connected layers, followed by a PReLU and a ReLU activation respectively, to form the output \( {\mathcal{Q}}^{\prime}\in {\mathbb{R}}^{T^{\prime}\times {K}^{\prime}\times {C}^{\prime }} \). Then, \( {\mathcal{Q}}^{\prime } \) is processed by two independent 2D transposed convolutional layers, called Trans Conv_A and Trans Conv_P, with the kernel size 5×5 and the stride 1×2. Trans Conv_A with a ReLU activation function is utilized to estimate the mask of TF bins. Trans Conv_P followed by a normalization operation for each TF bin is employed to estimate the real part and imaginary part of the phase information. Finally, the spectrogram of the output signal \( {\hat{S}}^{\prime } \) is estimated by
where \( {f}_{\mathrm{TC}}^A \), \( {f}_{\mathrm{TC}}^P \), and Norm represent the functions of Trans Conv_A, Trans Conv_P, and the normalization operation for each TF bin respectively. We use \( {1}^{I_1\times {I}_2\times \dots \times {I}_M} \), ∘, and ×_{i} to denote an all-ones tensor, the outer product, and the mode-i product [22]. The outer product between the tensor \( \mathcal{H}\in {\mathbb{R}}^{I_1\times {I}_2\times \dots \times {I}_M} \) and the vector \( \boldsymbol{g}\in {\mathbb{R}}^J \) is defined as
\[ {\left(\mathcal{H}\circ \boldsymbol{g}\right)}_{i_1{i}_2\dots {i}_Mj}={h}_{i_1{i}_2\dots {i}_M}{g}_j. \]
The mode-i product between the tensor \( \mathcal{H} \) and the matrix \( \boldsymbol{D}\in {\mathbb{R}}^{I_M\times J} \) is defined as
Similar to the operation in [23], the outputs of the Trans Conv_A branch and of the normalized Trans Conv_P branch in Eqs. 8 and 9 act as the amplitude mask and the phase prediction result respectively. After that, an inverse STFT operation is applied to convert \( {\hat{S}}^{\prime } \) back to the waveform signal \( {\hat{\boldsymbol{s}}}^{\prime } \).
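As a quick consistency check of the TF-domain dimensions, the sketch below evaluates the frequency-axis sizes for the window length used later in the experiments (Q = 400) and shows the per-bin phase normalization in isolation. It assumes that no padding is applied along the frequency axis, which is what the relation K′ = (F−3)/2 implies for a kernel of 5 and a stride of 2; the helper names are hypothetical.

```python
import math


def conv_out_len(n_in, kernel, stride):
    """Output length of a convolution along one axis without padding."""
    return (n_in - kernel) // stride + 1


Q = 400                                            # STFT window length in samples
F = Q // 2 + 1                                     # effective frequency bins
K_prime = conv_out_len(F, kernel=5, stride=2)      # downsampled frequency bins


def normalize_phase(re, im, eps=1e-8):
    """Scale one (real, imag) pair to unit magnitude, as done for each TF bin."""
    mag = math.sqrt(re * re + im * im) + eps
    return re / mag, im / mag


unit_re, unit_im = normalize_phase(3.0, 4.0)
```

With Q = 400 this gives F = 201 and K′ = 99, matching the value reported in the experiment configuration. Along the time axis the stride is 1, so keeping T′ unchanged would additionally require padding in time, which is an assumption not spelled out in the text.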
The suppression module consists of six DSDPRNN blocks, each of which contains two dual-stream RNN (DSRNN) blocks corresponding to intra-chunk and inter-chunk processing respectively. Figure 5 presents the structure of the proposed DSRNN block, where each stream is successively processed by an RNN layer, a fully connected layer, and a normalization layer. The RNN layer in each intra-chunk block is a bidirectional RNN layer applied along the chunk dimension with C/2 output channels for each direction, while the RNN layer in each inter-chunk block is a unidirectional RNN layer with C output channels and is applied along the frame dimension. Let \( {\mathcal{V}}_i^0\in {\mathbb{R}}^{T\times K\times C} \) denote the input tensor of stream i; then the output of the RNN layer \( {\mathcal{V}}_i^1 \) can be expressed as
\[ {\mathcal{V}}_i^1={f}_{{\mathrm{RNN}}_i}\left({\mathcal{V}}_i^0\right),\kern1em i=A,B, \]
where \( {f}_{{\mathrm{RNN}}_i} \) represents the function of the RNN layer. The feature in \( {\mathcal{V}}_A^1 \) and \( {\mathcal{V}}_B^1 \) is then mixed by
where \( \boldsymbol{\alpha}, \boldsymbol{\beta} \in {\mathbb{R}}^C \) are trainable parameters. The output \( {\mathcal{V}}_i^2 \) is concatenated to the corresponding raw input \( {\mathcal{V}}_i^0 \) and then processed by a fully connected layer with C output channels. \( {\mathcal{V}}_i^3 \) is obtained with a residual connection and can be formulated as
where [·,·] represents the concatenation operation. The concatenation and projection are applied along the chunk dimension in intra-chunk blocks, and these operations are also applied along the feature dimension in inter-chunk blocks. The output \( {\mathcal{V}}_i^4 \) of the DSRNN block is then obtained by applying a normalization layer to \( {\mathcal{V}}_i^3 \), except for those in the last DSDPRNN block where \( {\mathcal{V}}_i^3 \) serves as the output,
\[ {\mathcal{V}}_i^4={f}_{{\mathrm{Norm}}_i}\left({\mathcal{V}}_i^3\right), \]
where \( {f}_{{\mathrm{Norm}}_i} \) denotes the function of the normalization layer. The features of streams A and B are processed iteratively by the intra-chunk and the inter-chunk DSRNN blocks, and the output of stream A in the last DSDPRNN block is regarded as the output of the suppression module. We use Group Normalization [24] with a group number of 2. The input feature of the normalization layer \( \mathcal{X}\in {\mathbb{R}}^{T\times K\times C} \) is first divided into two groups as
and the output is formulated as
with
and
where the subscripts l,k,c denote the index of the 3D tensor, \( {\boldsymbol{\gamma}}^i,{\boldsymbol{\beta}}^i\in {\mathbb{R}}^{C/2} \) are trainable parameters, and ε is a small constant for numerical stability.
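A two-group normalization of this kind can be sketched for a single frame as follows. The axes over which the statistics are taken are an assumption here (all K positions and the C/2 channels of a group, separately for each frame), since the defining equations are not reproduced above; `gamma` and `beta` hold the concatenated per-group gains and biases, and the function name is hypothetical.

```python
import math


def group_norm_frame(x, gamma, beta, groups=2, eps=1e-8):
    """Group-normalize one frame x[k][c] of shape K x C.

    Assumption: mean and variance are computed over all K positions and the
    C/groups channels belonging to a group, independently for every frame.
    """
    K, C = len(x), len(x[0])
    size = C // groups
    out = [[0.0] * C for _ in range(K)]
    for g in range(groups):
        cols = range(g * size, (g + 1) * size)
        vals = [x[k][c] for k in range(K) for c in cols]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        for k in range(K):
            for c in cols:
                out[k][c] = (x[k][c] - mean) / math.sqrt(var + eps) * gamma[c] + beta[c]
    return out


x = [[1.0, 2.0, 10.0, 20.0],
     [3.0, 4.0, 30.0, 40.0]]            # K = 2 positions, C = 4 channels
y = group_norm_frame(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With unit gains and zero biases, each group of the output has approximately zero mean and unit variance regardless of the very different scales of the two input groups.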
Training target
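We choose the maximization of the scale-invariant source-to-noise ratio (SI-SNR) [18] as the training target,
\[ {\boldsymbol{s}}_{\mathrm{target}}=\frac{\left\langle \hat{\boldsymbol{s}},\boldsymbol{s}\right\rangle \boldsymbol{s}}{{\left\Vert \boldsymbol{s}\right\Vert}^2},\kern1em {\boldsymbol{e}}_{\mathrm{noise}}=\hat{\boldsymbol{s}}-{\boldsymbol{s}}_{\mathrm{target}},\kern1em \mathrm{SI}\text{-}\mathrm{SNR}=10{\log}_{10}\frac{{\left\Vert {\boldsymbol{s}}_{\mathrm{target}}\right\Vert}^2}{{\left\Vert {\boldsymbol{e}}_{\mathrm{noise}}\right\Vert}^2}, \]
where \( \hat{\boldsymbol{s}},\boldsymbol{s} \) are the estimated and the target clean sources respectively, <·,·> represents the dot product of vectors, and ‖s‖ denotes the l_{2} norm of s.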
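The SI-SNR can be computed directly from its usual definition in [18], as in the sketch below. The small constant `eps` is added only for numerical safety, and mean removal of the signals is omitted because the text does not mention it; both choices are assumptions of this illustration.

```python
import math


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB between an estimate and a reference."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    # Project the estimate onto the reference to get the target component.
    s_target = [dot / ref_energy * r for r in ref]
    e_noise = [e - t for e, t in zip(est, s_target)]
    num = sum(t * t for t in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


ref = [1.0, 0.0]
value = si_snr([1.0, 1.0], ref)         # equal target and noise energy
value_scaled = si_snr([3.0, 3.0], ref)  # same estimate, rescaled
```

Because the target component is obtained by projection, rescaling the estimate leaves the value unchanged. For training, the negative SI-SNR would be minimized.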
Experiments
Dataset
Unlike telecommunication systems, where the far-end signal is usually speech, music often acts as the “far-end” signal for smart loudspeakers. Therefore, we use both speech and music as the far-end signal, and the near-end signal is speech. We choose LibriSpeech [25] as the speech dataset and MUSAN [26] as the music dataset. We randomly choose 225, 25, and 40 different speakers from LibriSpeech, and 497, 48, and 115 pieces of music from MUSAN for training, validation, and test respectively. The audio data is sampled at 16 kHz and split into 4-s segments. In total, we use 26,556, 1083, and 920 segments of 4-s speech and 101,956, 1083, and 920 segments of 4-s music for training, validation, and test respectively.
The clipping function and the sigmoidal function, although not precise models for the actual nonlinearity of the loudspeakers, are commonly utilized numerical models in many previous works on nonlinear acoustic echo suppression [15, 17]. Thus, the clipping function, sigmoidal function, and convolution operation are successively applied to the far-end signal to generate the simulated echo. The clipping function is either a soft-clipping or a hard-clipping function [27]
where x_{max}=Θ·max(abs(x(n))) determines the maximum value of the clipping function. Three types of soft-clipping and three types of hard-clipping functions are utilized with the parameter Θ set to 0.6, 0.8, and 0.9.
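The two clipping variants can be sketched as follows. The threshold follows the definition of x_{max} above; the hard-clipping form is the standard saturation, while the soft-clipping expression x_{max}·x/sqrt(x_{max}² + x²) is a commonly used form and is an assumption here, since the exact formula from [27] is not reproduced in the text. The function names are hypothetical and a non-silent input is assumed.

```python
import math


def clip_threshold(x, theta):
    """x_max = theta * max(|x(n)|)."""
    return theta * max(abs(v) for v in x)


def hard_clip(x, theta):
    """Saturate every sample to the interval [-x_max, x_max]."""
    x_max = clip_threshold(x, theta)
    return [max(-x_max, min(x_max, v)) for v in x]


def soft_clip(x, theta):
    """Smooth saturation; assumed form x_max * x / sqrt(x_max^2 + x^2)."""
    x_max = clip_threshold(x, theta)
    return [x_max * v / math.sqrt(x_max ** 2 + v ** 2) for v in x]


far_end = [-1.0, 0.2, 0.5, 1.0]
hard = hard_clip(far_end, theta=0.6)
soft = soft_clip(far_end, theta=0.6)
```

Hard clipping leaves samples below the threshold untouched, whereas the soft variant compresses all samples and approaches x_{max} only asymptotically.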
We also use the sigmoidal function [28] to approximate the nonlinearity of a loudspeaker
where the parameter (a_{p},a_{n}) is chosen from {(4,3), (4,1), (2,3), (1,3), (3,3), (1,1)}.
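For illustration, one widely used sigmoidal loudspeaker model can be written as below. The specific form, with b(n) = 1.5x(n) − 0.3x²(n) and a slope that switches between a_p and a_n depending on the sign of b(n), is an assumption borrowed from the nonlinear echo cancelation literature and is not confirmed by the text; the `gain` argument and the function name are likewise hypothetical.

```python
import math


def sigmoidal_nonlinearity(x, a_p, a_n, gain=1.0):
    """Asymmetric sigmoid: slope a_p for positive b(n), a_n otherwise (assumed form)."""
    out = []
    for v in x:
        b = 1.5 * v - 0.3 * v * v            # assumed memoryless pre-distortion
        a = a_p if b > 0 else a_n
        out.append(gain * (2.0 / (1.0 + math.exp(-a * b)) - 1.0))
    return out


distorted = sigmoidal_nonlinearity([0.0, 0.5, -0.5], a_p=4, a_n=1)
```

With unequal slopes such as (a_p, a_n) = (4, 1), positive and negative excursions are compressed differently, which gives the asymmetric distortion that the parameter pairs listed above are meant to vary.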
For the convolution operation, we construct 40, 3, and 7 simulated rooms for training, validation, and test respectively. The length and width of these rooms are randomly chosen from [3, 8] m and the height is randomly chosen from [2.5, 4.5] m. The reverberation time T_{60} is randomly chosen from [200, 400] ms. The image method [29] is employed to generate 10 room impulse responses (RIRs) for each room, resulting in 400, 30, and 70 RIRs for training, validation, and test respectively.
The frequency-domain Kalman filter [4] acts as the LAEC to generate the residual echo, and the mean of its echo attenuation on the artificial-echo dataset is about 17.0 dB. To obtain the simulated s_{AEC}(n), we add both the clean speech signal and the colored noise to the residual echo. The inverse-frequency-power of the colored noise [30] is randomly chosen between 0 and 2. For the training and validation sets, the signal-to-echo ratio (SER) (before processing of LAEC) is randomly chosen from {−14.2,−16.2,−18.2,−20.2} dB and the colored noise is added with the signal-to-noise ratio (SNR) randomly chosen from {30, 20, 10} dB. For the test set, the SER is −18.2 dB and the SNR is 20 dB.
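Setting a prescribed SER or SNR amounts to rescaling the interfering signal relative to the near-end speech. The following sketch shows one straightforward way to do this, assuming the ratio is defined on average signal power over the whole segment; the function names and toy signals are hypothetical.

```python
import math


def power(sig):
    """Average power of a signal."""
    return sum(v * v for v in sig) / len(sig)


def scale_to_ratio(target, interferer, ratio_db):
    """Scale `interferer` so that 10*log10(P_target / P_interferer) equals ratio_db."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (ratio_db / 10)))
    return [gain * v for v in interferer]


speech = [1.0, -1.0, 1.0, -1.0]          # near-end speech (toy values)
echo = [0.1, 0.2, -0.1, 0.3]             # echo before scaling (toy values)
echo_scaled = scale_to_ratio(speech, echo, ratio_db=-18.2)   # test-set SER
achieved_ser = 10 * math.log10(power(speech) / power(echo_scaled))
```

The same helper can be reused for the colored noise with `ratio_db` set to one of the SNR values. A negative SER such as −18.2 dB means the echo is considerably stronger than the near-end speech at the microphone.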
In total, we generate 106,224 segments of speech residual echo and 101,956 segments of music residual echo for training, 1083 segments of speech residual echo and 1083 segments of music residual echo for validation, and 920 segments of speech residual echo and 920 segments of music residual echo for test.
The approach used to generate the artificial nonlinear echo is only a rough approximation of the nonlinearity of the loudspeaker. To evaluate the performance of our model in practical applications, we also record echo signals from off-the-shelf loudspeakers using the microphone, AcousticSensing CHZ221. A pair of EDIFIER R12U (ER) loudspeakers and a pair of LOYFUN LF501 (LL) loudspeakers are used to record the echo signals in an office with room size 6 m × 6 m × 3.2 m. The recording environment is shown in Fig. 6. For each loudspeaker model, we obtain 10,800 segments of 4-s recorded-echo signals (5400 segments of speech and music respectively) from one loudspeaker for training and 1840 segments (920 segments of speech and music respectively) from another loudspeaker of the same kind for test. The mean of the LAEC’s echo attenuation on the recorded-echo dataset is about 24.3 dB. For the training set, the SER of ER echo is randomly chosen from {−18.2,−20.2,−22.2,−24.2} dB, the SER of LL echo is randomly chosen from {−22.2,−24.2,−26.2,−28.2} dB, and the colored noise is added with the SNR randomly chosen from {30, 20, 10} dB. For the test set, the SER of ER echo is −22.2 dB, the SER of LL echo is −26.2 dB, and the SNR is 20 dB. It should be noted that the recorded-echo training set is only used in the fine-tuning stage.
Experiment configuration
We control the parameter number and processing delay in the time-domain and the TF-domain methods for a fair comparison. For the time-domain method, the number of filters N, kernel size L, chunk size K, and feature dimension C in the encoder are 256, 8, 100, and 128 respectively. For the TF-domain method, the frame length Q, the number of downsampled frequency bins K^{′}, and feature dimension C^{′} in the encoder are 400, 99, and 128 respectively. Thus, the output tensors of the encoder in the time-domain and the TF-domain methods have the dimensions T×100×128 and T^{′}×99×128 respectively. The gated recurrent unit (GRU) [31] is used as the RNN layer.
The model is trained by the Adam optimizer [32] for 80 epochs, with each epoch containing 26,556 pairs of training data and each batch containing 8 pairs. The initial learning rate is set to 0.001 and is halved every time the validation loss does not improve in two successive epochs. We apply l_{2} norm gradient clipping with a maximum of 5. PyTorch is employed for model implementation and four Nvidia GeForce GTX 1080 Ti GPUs are used for training.
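The learning-rate rule and the gradient clipping described above can be sketched without any framework. This is only an illustration of the logic, not the authors' training code; the class and function names are hypothetical, and the validation losses are made-up numbers.

```python
import math


class HalveOnPlateau:
    """Halve the learning rate when the validation loss fails to improve for `patience` epochs."""

    def __init__(self, lr=1e-3, patience=2):
        self.lr = lr
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= 0.5
                self.bad_epochs = 0
        return self.lr


def clip_grad_norm(grads, max_norm=5.0):
    """Rescale a flat gradient vector so that its l2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * max_norm / norm for g in grads]
    return list(grads)


scheduler = HalveOnPlateau()
lrs = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.92, 0.8]]
clipped = clip_grad_norm([6.0, 8.0])
```

In a PyTorch implementation, `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.5` and `torch.nn.utils.clip_grad_norm_` provide similar behavior, although their exact patience semantics may differ slightly from this sketch.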
Evaluation metrics
We use three metrics for performance evaluation: the perceptual evaluation of speech quality (PESQ) [33], the signal-to-distortion ratio (SDR) [34, 35], and the short-time objective intelligibility (STOI) [36]. The echo return loss enhancement (ERLE) of the DNN-based methods in single-talk situations has been shown to be sufficiently high in the previous work [19]. In this paper, we pay particular attention to RES performance in the most difficult low-SER double-talk situations, and the PESQ, SDR, and STOI are regarded as better choices than the ERLE since they can more effectively evaluate the processed near-end speech quality. Furthermore, the desired signal is the near-end speech in most AEC scenarios, while the interference might be either speech from the far end, as in common communication applications, or music played by the smart speakers. Therefore, we use the PESQ instead of the perceptual evaluation of audio quality as an objective metric to measure the quality of the processed near-end speech.
Results and discussions
Performance comparison
We compare the proposed methods with some typical DNN-based RES methods to validate the efficiency of our model. In the following comparison, we name our proposed methods DSDPRNN. We further use the suffixes “t” and “f” to represent the time-domain and the TF-domain methods respectively and use the suffixes “x” and “y” to distinguish between the models in which x(n) or \( \hat{y}(n) \) is used as the auxiliary signal. The LSTM-based model (LSTM) [17] and the multi-stream Conv-TasNet model (MSTasNet) are utilized for comparison. The models [14, 15], which have shown significantly inferior performance in our previous work [19], are omitted from this comparison.
The total number of trainable parameters and the multiply-accumulate operations per second (MACCPs) of these models are shown in Table 1. The model size of our proposed methods is only 1/5 of the model size of MSTasNet, and the computation cost is also slightly lower.
The time latency of MSTasNet is set to 410 samples for a fair comparison. The performance in terms of PESQ, SDR, and STOI is shown in Table 2. The DSDPRNN methods outperform the LSTM and the MSTasNet in all artificial-echo conditions, validating that our proposed methods provide an efficient way to exploit the information of the dual stream. For recorded echo, the advantage of the DSDPRNN methods over MSTasNet is less obvious, but their generalization capability in practical applications is still validated. The comparison between the time-domain and the TF-domain methods shows that the former tends to achieve slightly better SDR scores, while the latter has slightly better performance in terms of PESQ and STOI. Furthermore, we observe that the methods with the auxiliary signal \( \hat{y}(n) \) achieve better performance in the attenuation of recorded echo, implying their better generalization capability compared with the methods using x(n) as the auxiliary signal.
Fine-tuning for off-the-shelf loudspeakers
Though the proposed methods generalize well to real loudspeakers, better performance can be expected by training on echo recorded from loudspeakers. The well-trained model on the artificial-echo dataset can be regarded as a pre-trained model and then fine-tuned on the recorded-echo dataset in practice. We only test the performance of the DSDPRNN with the auxiliary signal \( \hat{y}(n) \). The purpose of the fine-tuning is to improve the performance under limited supplementary training data. We have tried fine-tuning the suppression module, but found that the model overfits severely with a small amount of recorded data. Thus, we propose two strategies to fine-tune the model by mainly retraining the decoder. (1) Train the decoder module only and freeze the other parameters. (2) Train the decoder and the last DSDPRNN block and freeze the other parameters. We conduct two experiments in the fine-tuning stage for cross validation. In each experiment, we only use 12 h of echo signals from one loudspeaker as the training set. The batch size is set to 16 and the exponential-decay strategy is used to halve the learning rate every 1350 steps. The fine-tuning stage uses two Nvidia GeForce GTX 1080 Ti GPUs and takes only about 3 h for training, since the partly frozen parameters reduce the computational complexity for training and the size of the recorded training data is far below the size of the artificial-echo data. We use “Time” and “TF” to distinguish the time-domain and the TF-domain DSDPRNN methods and use the suffixes “1” and “2” to represent the models using the above two fine-tuning strategies respectively. The performance of the pre-trained model is presented with no suffix as a benchmark. Compared to strategy 2, the training time in the fine-tuning stage of strategy 1 decreases by 14% and the memory cost is reduced by half.
The performances of the proposed methods after fine-tuning with the ER echo dataset and the LL echo dataset are shown in Tables 3 and 4 respectively. In artificial-echo conditions, the performance degrades slightly after fine-tuning, and similar results are observed using both fine-tuning strategies. The test results of the model fine-tuned using the recorded training dataset from the same kind of loudspeaker are highlighted in blue font. The efficacy of both fine-tuning strategies can be seen, and strategy 2 has significantly better performance when the model is fine-tuned on the training dataset from the same kind of loudspeaker. It should also be noted that the performance improves slightly even when the model is fine-tuned with training data from different loudspeakers, indicating the generalization capability of the fine-tuning method. Considering that only very limited data is required in the fine-tuning stage, this scheme is easy to apply to any off-the-shelf loudspeaker.
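The two fine-tuning strategies and the step-wise learning-rate decay can be sketched as follows. The parameter names (such as `decoder.` and `suppression.block5.`) are hypothetical placeholders rather than names from the released code, the initial fine-tuning learning rate `lr0` is left as an argument because it is not stated, and the decay is interpreted as a staircase that halves the rate after every 1350 steps.

```python
def trainable_parameters(names, strategy, n_blocks=6):
    """Select the parameters to update; everything else stays frozen.

    Strategy 1: decoder only. Strategy 2: decoder plus the last DSDPRNN block.
    """
    prefixes = ["decoder."]
    if strategy == 2:
        prefixes.append(f"suppression.block{n_blocks - 1}.")
    return [n for n in names if n.startswith(tuple(prefixes))]


def finetune_lr(step, lr0, half_every=1350):
    """Staircase decay: halve the learning rate after every `half_every` steps."""
    return lr0 * 0.5 ** (step // half_every)


names = [
    "encoder_a.conv.weight",           # hypothetical parameter names
    "suppression.block0.rnn.weight",
    "suppression.block5.rnn.weight",
    "decoder.fc.weight",
]
strategy_1 = trainable_parameters(names, strategy=1)
strategy_2 = trainable_parameters(names, strategy=2)
steps_per_epoch = 10800 // 16          # recorded segments / batch size
```

With 10,800 recorded segments and a batch size of 16, one epoch takes 675 steps, so halving every 1350 steps corresponds to halving every two epochs. In PyTorch, freezing would typically be done by setting `requires_grad = False` on the parameters that are not selected.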
Conclusion
In this paper, we propose efficient RES methods in both the time domain and the TF domain based on a modification of DPRNN. We adopt the residual signal and the auxiliary signal extracted from the LAEC system to form a dual stream for the DPRNN. Experiments validate the efficacy of the proposed methods in double-talk situations compared with several typical RES methods. Furthermore, we propose an efficient and applicable way to improve the performance on off-the-shelf loudspeakers by regarding the well-trained model on the artificial-echo dataset as a pre-trained model and fine-tuning it on the recorded-echo dataset. Two fine-tuning strategies are evaluated in experiments, showing that the fine-tuning strategy of training the decoder and the last DSDPRNN block achieves more effective echo suppression on the recorded-echo dataset.
Availability of data and materials
The source codes of the network are released on https://github.com/Moyun/DSDPRNN, and exemplary audio samples are available online at https://github.com/Moyun/dsdprnnsamples. Further materials are also available from the corresponding author upon request.
Abbreviations
ASR: Automatic speech recognition
Conv-TasNet: Fully convolutional time-domain audio separation network
DNN: Deep neural network
DPRNN: Dual-path recurrent neural network
DSDPRNN: Dual-stream dual-path recurrent neural network
DSRNN: Dual-stream recurrent neural network
ER: Edifier R12U
ERLE: Echo return loss enhancement
FCN: Fully connected network
FDKF: Frequency-domain adaptive Kalman filter
GRU: Gated recurrent unit
LAEC: Linear acoustic echo cancelation
LL: Loyfun LF501
LMS: Least mean square
LSTM: Long short-term memory
MACCPs: Multiply-accumulate operations per second
PESQ: Perceptual evaluation of speech quality
RES: Residual echo suppression
RIR: Room impulse response
RNN: Recurrent neural network
SDR: Signal-to-distortion ratio
SER: Signal-to-echo ratio
SNR: Signal-to-noise ratio
SOTA: State-of-the-art
STFT: Short-time Fourier transform
STOI: Short-time objective intelligibility
TF: Time-frequency
References
 1
E. Hänsler, G. U. Schmidt, Hands-free telephones–joint control of echo cancellation and postfiltering. Signal Process. 80(11), 2295–2305 (2000).
 2
S. S. Haykin, Adaptive Filter Theory (Prentice Hall, New Jersey, 2002).
 3
F. Albu, H. K. Kwan, in 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No.04CH37512), 3. Combined echo and noise cancellation based on Gauss-Seidel pseudo affine projection algorithm, (2004), p. 505.
 4
G. Enzner, P. Vary, Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones. Signal Process. 86(6), 1140–1156 (2006).
 5
F. Yang, G. Enzner, J. Yang, Frequency-domain adaptive Kalman filter with fast recovery of abrupt echo-path changes. IEEE Signal Process. Lett. 24(12), 1778–1782 (2017).
 6
W. Fan, K. Chen, J. Lu, J. Tao, Effective improvement of under-modeling frequency-domain Kalman filter. IEEE Signal Process. Lett. 26(2), 342–346 (2019).
 7
A. N. Birkett, R. A. Goubran, in Proceedings of 1995 Workshop on Applications of Signal Processing to Audio and Acoustics. Limitations of hands-free acoustic echo cancellers due to nonlinear loudspeaker distortion and enclosure vibration effects, (1995), pp. 103–106.
 8
S. Gustafsson, R. Martin, P. Vary, Combined acoustic echo control and noise reduction for hands-free telephony. Signal Process. 64(1), 21–32 (1998).
 9
E. A. P. Habets, S. Gannot, I. Cohen, P. C. W. Sommen, Joint dereverberation and residual echo suppression of speech signals in noisy environments. IEEE Trans. Audio Speech Lang. Process. 16(8), 1433–1451 (2008).
 10
N. K. Desiraju, S. Doclo, M. Buck, T. Wolff, Online estimation of reverberation parameters for late residual echo suppression. IEEE Trans. Audio Speech Lang. Process. 28, 77–91 (2020).
 11
S. Gustafsson, R. Martin, P. Jax, P. Vary, A psychoacoustic approach to combined acoustic echo cancellation and noise reduction. IEEE Trans. Speech Audio Process. 10(5), 245–256 (2002).
 12
A. S. Chhetri, A. C. Surendran, J. W. Stokes, J. C. Platt, in Proc. IWAENC, 5. Regression-based residual acoustic echo suppression, (2005).
 13
M. L. Valero, E. Mabande, E. A. P. Habets, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Signal-based late residual echo spectral variance estimation, (2014), pp. 5914–5918.
 14
G. Carbajal, R. Serizel, E. Vincent, E. Humbert, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multiple-input neural network-based residual echo suppression, (2018), pp. 231–235.
 15
H. Zhang, D. Wang, Deep learning for acoustic echo cancellation in noisy and double-talk scenarios. Training. 161(2), 322 (2018).
 16
F. Kuech, W. Kellermann, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, 1. Nonlinear residual echo suppression using a power filter model of the acoustic echo path, (2007), pp. 73–76.
 17
C. Zhang, X. Zhang, in Proc. Interspeech. A robust and cascaded acoustic echo cancellation based on deep learning, (2020), pp. 3940–3944.
 18
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019).
 19
H. Chen, T. Xiang, K. Chen, J. Lu, in Proc. Interspeech. Nonlinear residual echo suppression based on multi-stream Conv-TasNet, (2020), pp. 3959–3963.
 20
Y. Luo, Z. Chen, T. Yoshioka, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation (IEEE, 2020), pp. 46–50.
 21
K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE International Conference on Computer Vision. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, (2015), pp. 1026–1034.
 22
A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, H. A. Phan, Tensor decompositions for signal processing applications: from two-way to multiway component analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015). https://doi.org/10.1109/MSP.2013.2297439.
 23
D. Yin, C. Luo, Z. Xiong, W. Zeng, PHASEN: a phase-and-harmonics-aware speech enhancement network. arXiv preprint arXiv:1911.04697 (2019).
 24
Y. Wu, K. He, in Proceedings of the European Conference on Computer Vision (ECCV). Group normalization, (2018), pp. 3–19.
 25
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Librispeech: an ASR corpus based on public domain audio books (IEEE, 2015), pp. 5206–5210.
 26
D. Snyder, G. Chen, D. Povey, Musan: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484 (2015).
 27
S. Malik, G. Enzner, State-space frequency-domain adaptive filtering for nonlinear acoustic echo cancellation. IEEE Trans. Audio Speech Lang. Process. 20(7), 2065–2079 (2012). https://doi.org/10.1109/TASL.2012.2196512.
 28
D. Comminiello, M. Scarpiniti, L. A. Azpicueta-Ruiz, J. Arenas-García, A. Uncini, in 2017 25th European Signal Processing Conference (EUSIPCO). Full proportionate functional link adaptive filters for nonlinear acoustic echo cancellation, (2017), pp. 1145–1149.
 29
E. A. Lehmann, A. M. Johansson, Diffuse reverberation model for efficient image-source simulation of room impulse responses. IEEE Trans. Audio Speech Lang. Process. 18(6), 1429–1439 (2009).
 30
N. J. Kasdin, Discrete simulation of colored noise and stochastic processes and 1/f^α power law noise generation. Proc. IEEE. 83(5), 802–827 (1995).
 31
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
 32
D. P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 33
A. W. Rix, J. G. Beerends, M. P. Hollier, A. P. Hekstra, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, (2001), pp. 749–752.
 34
E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process.14(4), 1462–1469 (2006).
 35
C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR. mir_eval: a transparent implementation of common MIR metrics (Citeseer, 2014).
 36
C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. A short-time objective intelligibility measure for time-frequency weighted noisy speech, (2010), pp. 4214–4217.
Acknowledgements
This work was supported by the National Science Foundation with grant no. 11874219.
Author information
Contributions
H.C., G.C., and J.L. analyzed the DNN-based RES method. H.C. implemented the method. H.C., K.C., and J.L. conducted the experiments. H.C. and J.L. drafted the manuscript. All authors reviewed the results and read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, H., Chen, G., Chen, K. et al. Nonlinear residual echo suppression based on dual-stream DPRNN. J AUDIO SPEECH MUSIC PROC. 2021, 35 (2021). https://doi.org/10.1186/s13636-021-00221-8
Keywords
 Residual echo suppression
 Dual-path recurrent neural network
 Dual-stream