2024
B. Rubenchik, E. Hadad, E. Tzirkel, E. Fetaya, and S. Gannot,
"Low-latency single-microphone speaker separation with temporal convolutional networks using speaker representations",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Aalborg, Denmark, Sep. 2024.
Y. Yemini, A. Shamsian, L. Bracha, S. Gannot, and E. Fetaya,
"LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading",
in The 12th International Conference on Learning Representations (ICLR), 2024. Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism, where a pre-trained automatic speech recognition (ASR) model serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, readily perceptible while listening, and is empirically reflected in a substantial reduction of the word error rate (WER) metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer’s superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page: https://lipvoicer.github.io
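For context on the classifier-guidance mechanism mentioned in this abstract: in the standard formulation (Dhariwal and Nichol, 2021), the diffusion model's noise prediction is shifted by the gradient of a classifier's log-likelihood. A sketch of the general rule, with y the lip-read text, p_phi(y|x_t) the ASR-based classifier, and s a guidance scale (the exact conditioning used in LipVoicer follows the paper itself):

```latex
\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t)
  - s\,\sqrt{1-\bar{\alpha}_t}\;\nabla_{x_t} \log p_\phi(y \mid x_t)
```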
2023
A. Schwartz, E. Hadad, S. Gannot, and S. E. Chazan,
"Array configuration mismatch in deep DOA estimation: Towards robust training",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2023. Deep direction of arrival (DOA) models commonly require a perfect match between the array configurations in the training and test stages and consequently cannot be applied to unfamiliar microphone array constellations. In this paper, we present a deep DOA estimation method that circumvents this requirement. In our approach, we first cast the DOA estimation as a classification problem in each time-frequency (TF) bin, thus facilitating the localization of multiple concurrent speakers. We utilize a high-resolution spatial image, based on a narrow-band variant of the steered response power phase transform (SRP-PHAT) processor, as an input feature. The model is trained with simulated data using a single microphone array configuration in various acoustic conditions. In the test stage, the algorithm is applied with unfamiliar microphone array constellations, namely with a different number of microphones and different inter-microphone distances. An elaborated experimental study with real-life room impulse response (RIR) recordings demonstrates the effectiveness of the proposed input feature and the training scheme. Our approach achieves comparable results in familiar microphone array constellations and, more importantly, can accurately estimate the DOA of multiple concurrent speakers even with unfamiliar microphone arrays.
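The narrow-band SRP-PHAT feature described above can be sketched compactly; the following is an illustrative stand-alone implementation under simplifying assumptions (one TF bin, candidate-direction TDOAs given), not the authors' code:

```python
import numpy as np

def narrowband_srp_phat(X, tdoas, freq):
    """Narrow-band SRP-PHAT score of a single TF bin for one candidate direction.

    X     : (M,) complex STFT coefficients of the M microphones at this bin
    tdoas : (M,) time delays of each microphone w.r.t. a reference,
            implied by the candidate direction [s]
    freq  : center frequency of the bin [Hz]
    """
    Xn = X / (np.abs(X) + 1e-12)               # PHAT: keep phase, drop magnitude
    steer = np.exp(2j * np.pi * freq * tdoas)  # align channels to the candidate
    return np.abs(np.sum(Xn * steer)) ** 2     # coherent power after alignment

# Evaluating this score over a grid of candidate directions for every TF bin
# yields a high-resolution spatial image of the kind used as the network input.
```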
A. Eisenberg, S. Gannot, and S. E. Chazan,
"A two-stage speaker extraction algorithm un-der adverse acoustic conditions using a single-microphone",
in 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland, Sep. 2023. In this work, we present a two-stage method for speaker extraction under reverberant and noisy conditions. Given a reference signal of the desired speaker, the clean, but still reverberant, desired speaker is first extracted from the noisy mixed signal. In the second stage, the extracted signal is further enhanced by joint dereverberation and reduction of residual noise and interference. The proposed architecture comprises two sub-networks, one for the extraction task and the second for the dereverberation task. We present a training strategy for this architecture and show that the performance of the proposed method is on par with other state-of-the-art (SOTA) methods when applied to the WHAMR! dataset. Furthermore, we present a new dataset with more realistic adverse acoustic conditions and show that our method outperforms the competing methods when applied to this dataset as well.
D. Sherman, G. Hazan, and S. Gannot,
"Study of speech emotion recognition using BLSTM with attention",
in 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland,
Sep. 2023. We present a study of a neural network-based method for speech emotion recognition that uses audio-only features. In the studied scheme, acoustic features are extracted from the audio utterances and fed to a neural network that consists of convolutional neural network (CNN) layers, a bidirectional long short-term memory (BLSTM) layer combined with an attention mechanism, and a fully-connected layer. To illustrate and analyze the classification capabilities of the network, we used the t-distributed stochastic neighbor embedding (t-SNE) method. We evaluate our model on the Ryerson audio-visual database of emotional speech and song (RAVDESS) and the interactive emotional dyadic motion capture (IEMOCAP) datasets, achieving weighted accuracy (WA) of 80% and 66%, respectively.
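A minimal PyTorch sketch of the pipeline described in this abstract (CNN layers, a BLSTM with frame-level attention, and a fully-connected classifier); all layer sizes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SERNet(nn.Module):
    """CNN -> BLSTM -> attention -> fully-connected, as outlined above."""
    def __init__(self, n_feats=40, n_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32 * n_feats, 128, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(256, 1)                  # one score per frame
        self.fc = nn.Linear(256, n_classes)

    def forward(self, x):                              # x: (batch, time, n_feats)
        z = self.cnn(x.unsqueeze(1))                   # (batch, 32, time, n_feats)
        z = z.permute(0, 2, 1, 3).flatten(2)           # (batch, time, 32*n_feats)
        h, _ = self.blstm(z)                           # (batch, time, 256)
        w = torch.softmax(self.attn(h), dim=1)         # attention weights over frames
        ctx = (w * h).sum(dim=1)                       # weighted frame average
        return self.fc(ctx)                            # emotion logits
```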
H. Kafri, M. Olivieri, F. Antonacci, M. Moradi, A. Sarti, and S. Gannot,
"GRAD-CAM-inspired interpretation of nearfield acoustic holography using physics-informed explainable neural net-work",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Rhodes Island, Greece, Jun. 2023. The interpretation and explanation of decision-making processes of neural networks are becoming a key factor in the deep learning field. Although several approaches have been presented for classification problems, the application to regression models needs to be further investigated. In this manuscript we propose a Grad-CAM-inspired approach for the visual explanation of neural network architecture for regression problems. We apply this methodology to a recent physics-informed approach for Nearfield Acoustic Holography, called Kirchhoff-Helmholtz-based Convolutional Neural Network (KHCNN) architecture. We focus on the interpretation of KHCNN using vibrating rectangular plates with different boundary conditions and violin top plates with complex shapes. Results highlight the more informative regions of the input that the network exploits to correctly predict the desired output. The devised approach has been validated in terms of NCC and NMSE using the original input and the filtered one coming from the algorithm.
Y. Hu, S. Gannot, and T. D. Abhayapala,
"Generalized relative harmonic coefficients",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, Jun. 2023. In the literature, sound source localization in the far- and near-field scenarios is mostly addressed as two independent tasks using different approaches. This entails the tedious task of detecting the type of sound field, whereas in practice there may not be a clear boundary between the far- and near-field regimes. In contrast, this paper proposes a multi-channel feature, denoted generalized relative harmonic coefficients (generalized RHC), in the spherical harmonics domain, which can equally localize both far- and near-field sound sources without requiring any adjustments. We derive the analytical expression of this feature and summarize its unique properties, which facilitate two single-source direction-of-arrival estimators: (i) using a full grid search over the directional space; and (ii) a closed-form solution without any grid search. An experimental study in realistic noisy and reverberant environments, under both near-field and far-field conditions, validates the efficacy of the proposed algorithm.
2022
O. Shmaryahu and S. Gannot,
"On the importance of acoustic reflections in beamforming",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2022. Acoustic reflections are known to limit the ability of traditional beamformers (BFs), which are based on zero-order steering vectors, to extract a desired source and to suppress interference signals from noisy measurements. To alleviate these performance limitations, echo-aware BFs, which take into account the acoustic reflections of the source and interfering signals, were introduced more than two decades ago. In this paper, we propose a systematic methodology to analyze the performance of these BFs, highlighting the importance of the acoustic reflections in the BF design. Under this methodology, we redefine beampatterns to consider the entire reflection pattern, while the directions of arrival (DOAs) of the sources are merely used as an indication of the positions of the sources that impinge on the array from a circle around it. We further define measures of the quality of the BFs, namely the beampattern shape, the width of the main beam, the directivity, the null depth, and the signal-to-interference ratio (SIR) improvement. Using this methodology, we are able to clearly demonstrate the advantages of echo-aware BFs over traditional BFs that only consider the direct arrival of the sources in their design.
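For reference, the conventional beampattern that the paper's reflection-aware redefinition generalizes scans zero-order (direct-path, free-field) steering vectors; a minimal NumPy sketch with an illustrative 4-microphone linear array:

```python
import numpy as np

# Beampattern of a beamformer with weights w over candidate DOAs, using
# zero-order (direct path only) steering vectors -- exactly the modeling
# assumption whose limitations the paper analyzes.
c, f = 343.0, 1000.0                                   # speed of sound [m/s], frequency [Hz]
mics = np.array([[i * 0.05, 0.0] for i in range(4)])   # 4-mic ULA, 5 cm spacing

def steering(theta):
    d = np.array([np.cos(theta), np.sin(theta)])       # plane-wave propagation direction
    delays = mics @ d / c                              # per-mic arrival delays
    return np.exp(-2j * np.pi * f * delays)

w = steering(np.pi / 2) / len(mics)                    # delay-and-sum BF aimed at broadside
thetas = np.linspace(0, np.pi, 181)
pattern_db = [20 * np.log10(np.abs(np.conj(w) @ steering(t)) + 1e-12) for t in thetas]
```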
E. Hadad, S. Doclo, S. Nordholm, and S. Gannot,
"Pareto optimal binaural MVDR beamformer with controllable interference suppression",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2022. The objective of binaural multi-microphone speech enhancement algorithms can be viewed as a multi-criteria design problem, as there are several requirements to be met. When applying distortionless beamforming, it is necessary to suppress interfering sources and ambient background noise, and to extract an undistorted replica of the target source. In the binaural versions, it is also important to preserve the binaural cues of the target and the interference sources. In this paper, we propose a unified Pareto optimization framework for binaural distortionless beamformers, which is achieved by defining a multi-objective problem (MOP) to control the amount of interference suppression and noise reduction simultaneously. The derivation is given for the multi-interference case by introducing separate mean squared error (MSE) cost functions for each of the respective interference sources and for the background noise. A Pareto optimal set of solutions is provided for any set of parameters. The performance of the proposed method in a noisy and reverberant environment is presented, demonstrating the impact of the trade-off parameters using real-signal recordings.
A. Eisenberg, S. Gannot, and S. E. Chazan,
"Single microphone speaker extraction using unified time-frequency Siamese-Unet",
in 30th European Signal Processing Conference (EUSIPCO),
Aug. 2022, pp. 762–766. In this paper, we present a unified time-frequency method for speaker extraction in clean and noisy conditions. Given a mixed signal, along with a reference signal, the common approaches for extracting the desired speaker are applied either in the time domain or in the frequency domain. In our approach, we propose a Siamese-Unet architecture that uses both representations. The Siamese encoders are applied in the frequency domain to infer the embeddings of the noisy and reference spectra, respectively. The concatenated representations are then fed into the decoder to estimate the real and imaginary components of the desired speaker, which are then inverse-transformed to the time domain. The model is trained with the scale-invariant signal-to-distortion ratio (SI-SDR) loss to exploit the time-domain information. The time-domain loss is also regularized with a frequency-domain loss to preserve the speech patterns. Experimental results demonstrate that the unified approach is not only very easy to train, but also provides superior results compared with state-of-the-art (SOTA) blind source separation (BSS) methods, as well as a commonly used speaker extraction approach.
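The SI-SDR objective mentioned in this abstract has a standard closed form (Le Roux et al., 2019); a minimal NumPy version (not the paper's code):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((target @ target) / (noise @ noise + eps))

# Training minimizes -si_sdr(model_output, clean_reference).
```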
Y. Hu and S. Gannot,
"Comparison of learning-based DOA estimation between SH domain features",
in 30th European Signal Processing Conference (EUSIPCO), Aug. 2022, pp. 329–333. Accurate direction-of-arrival (DOA) estimation in noisy and reverberant environments is a long-standing challenge in the field of acoustic signal processing. One of the promising research directions utilizes the decomposition of the multi-microphone measurements into the spherical harmonics (SH) domain. This paper presents an evaluation and comparison of learning-based single-source DOA estimation using two recently introduced SH-domain features, denoted relative harmonic coefficients (RHC) and relative modal coherence (RMC), respectively. Both features were shown to be independent of the time-varying source signal, even in reverberant environments, thus facilitating training with a synthesized, continuously active noise signal rather than with speech signals. The inspected features are fed into a convolutional neural network, trained as a DOA classifier. Extensive validations confirm that the RHC-based method outperforms the RMC-based method, especially under unfavorable scenarios with severe noise and reverberation.
Y. Hu and S. Gannot,
"Closed-form single source direction-of-arrival estimator using first-order relative harmonic coefficients.",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 726–730 The relative harmonic coefficients (RHC), recently introduced as a multi-microphone spatial feature, demonstrates promising performance when applied to direction-of-arrival (DOA) estimation. All existing RHC-based DOA estimators suffer from a resolution limitation due to the inherent grid-based search. In contrast, this paper utilizes the first-order RHC to propose a closed-form DOA estimator by deriving a direction vector, which points towards to the desired source direction. Two objective metrics, namely localization accuracy and algorithm complexity, are adopted for the evaluation and comparison with existing RHC-based and intensity based localization approaches, in both simulated and real-life environments.
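The closed-form estimator itself follows the paper's derivation; for orientation, the classical intensity-based baseline it is compared against also yields a direction vector in closed form from first-order (B-format) coefficients. A hedged NumPy sketch of that baseline:

```python
import numpy as np

def intensity_doa(W, X, Y, Z):
    """Closed-form DOA from first-order ambisonic STFT coefficients
    (pseudo-intensity method, shown here as the classical baseline).

    W is the omnidirectional channel; X, Y, Z are the dipole channels
    (arrays of the same shape). Returns a unit direction vector.
    """
    v = np.stack([
        np.real(np.conj(W) * X).sum(),   # active intensity, x component
        np.real(np.conj(W) * Y).sum(),   # y component
        np.real(np.conj(W) * Z).sum(),   # z component
    ])
    return v / (np.linalg.norm(v) + 1e-12)
```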
N. Raviv, O. Schwartz, and S. Gannot,
"Low resources online single-microphone speech en-hancement with harmonic emphasis",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 8807–8811. In this paper, we propose a deep neural network (DNN)-based single-microphone speech enhancement algorithm characterized by short latency and low computational resources. Many speech enhancement algorithms suffer from low noise reduction capabilities between pitch harmonics, and in severe cases, the harmonic structure may even be lost. Recognizing this drawback, we propose a new weighted loss that emphasizes pitch-dominated frequency bands. For that, we propose a method, applied only at the training stage, to detect these frequency bands. The proposed method is applied to speech signals contaminated by several noise types, in particular typical domestic noise drawn from the ESC-50 and DEMAND databases, demonstrating its applicability to ‘stay-at-home’ scenarios.
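A hedged illustration of a loss that up-weights pitch-dominated bands; the simple energy-threshold detector below is only a stand-in, the paper's training-stage detection method is described therein:

```python
import numpy as np

def weighted_spectral_loss(est_mag, clean_mag, harmonic_weight=4.0, thresh_db=-20.0):
    """MSE over log-magnitudes with larger weight on pitch-dominated bins.

    est_mag, clean_mag : (freq, time) magnitude spectrograms
    Bins whose clean energy lies within thresh_db of the frame peak are
    treated as harmonic and up-weighted (a crude stand-in detector).
    """
    log_est = np.log10(est_mag + 1e-8)
    log_clean = np.log10(clean_mag + 1e-8)
    peak = clean_mag.max(axis=0, keepdims=True)            # per-frame peak energy
    rel_db = 20 * np.log10(clean_mag / (peak + 1e-12) + 1e-12)
    weights = np.where(rel_db > thresh_db, harmonic_weight, 1.0)
    return np.mean(weights * (log_est - log_clean) ** 2)
```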
2021
Y. Hu, P. Samarasinghe, S. Gannot, and T. Abhayapala,
"Evaluation and comparison of three source direction-of-arrival estimators using relative harmonic coecients",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, Jun. 2021. A spherical harmonics domain source feature called relative harmonic coefficients (RHC) has recently been applied to the source direction-of-arrival (DOA) estimation problem. This paper presents a compact evaluation and comparison of two existing RHC-based DOA estimators: (i) a method using a full grid search over the two-dimensional (2-D) directional space; and (ii) a decoupled estimator which uses one-dimensional (1-D) searches to separately localize the source’s elevation and azimuth. We also propose a new estimator using a gradient descent search over the 2-D directional grid space. Extensive experiments in both simulated and real-life environments are conducted to examine and analyze the performance of all the underlying DOA estimators. Two objective metrics, namely localization accuracy and algorithm complexity, are adopted for the evaluation and comparison of all estimators.
G. F. Miller, A. Brendel, W. Kellermann, and S. Gannot,
"Misalignment recognition in acoustic sensor networks using a semi-supervised source estimation method and Markov random fields",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, Jun. 2021. In this paper, we consider the problem of acoustic source localization by acoustic sensor networks (ASNs) using a promising, learning-based technique that adapts to the acoustic environment. In particular, we look at the scenario where a node in the ASN is displaced from its position during training. As the mismatch between the ASN used for learning the localization model and the one after a node displacement leads to erroneous position estimates, a displacement has to be detected and the displaced nodes need to be identified. We propose a method that considers the disparity in position estimates made by leave-one-node-out (LONO) sub-networks and uses a Markov random field (MRF) framework to infer the probability of each LONO position estimate being aligned, misaligned or unreliable, while accounting for the noise inherent to the estimator. This probabilistic approach is advantageous over naïve detection methods, as it outputs a normalized value that encapsulates conditional information provided by each LONO sub-network on whether the reading is in misalignment with the overall network. Experimental results confirm that the performance of the proposed method is consistent in identifying compromised nodes in various acoustic conditions.
A. Eisenberg, B. Schwartz, and S. Gannot,
"Online blind audio source separation using recursive expectation-maximization",
in Interspeech, Brno, Czech Republic, 2021.
S. E. Chazan, J. Goldberger, and S. Gannot,
"Speech enhancement with mixture of deep experts with clean clustering pre-training",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, Jun. 2021. In this study, we present a mixture of deep experts (MoDE) neural-network architecture for single-microphone speech enhancement. Our architecture comprises a set of deep neural networks (DNNs), each of which is an ‘expert’ in a different speech spectral pattern, such as a phoneme. A gating DNN is responsible for the latent variables, which are the weights assigned to each expert’s output given a speech segment. The experts estimate a mask from the noisy input, and the final mask is then obtained as a weighted average of the experts’ estimates, with the weights determined by the gating DNN. A soft spectral attenuation, based on the estimated mask, is then applied to enhance the noisy speech signal. As a byproduct, we gain a reduction in complexity at test time. We show that the experts’ specialization allows better robustness to unfamiliar noise types.
R. Opochinsky, G. Chechik, and S. Gannot,
"Deep ranking-based DOA tracking algorithm",
submitted to 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021.
2020
A. Bross, B. Laufer-Goldshtein, and S. Gannot,
"Multiple speaker localization using mixture of Gaussian model with manifold-based centroids",
in 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 2020. A data-driven approach for multiple-speaker localization in reverberant enclosures is presented. The approach combines semi-supervised learning on multiple manifolds with unsupervised maximum likelihood estimation. The relative transfer functions (RTFs), which are known to be related to the source positions, are used in both stages of the proposed algorithm as feature vectors. The microphone positions are not known. In the training stage, a nonlinear, manifold-based mapping between RTFs and source locations is inferred using single-speaker utterances. The inference procedure utilizes two RTF datasets: a small set of RTFs with their associated position labels, and a large set of unlabelled RTFs. This mapping is used to generate a dense grid of localized sources that serve as the centroids of a Mixture of Gaussians (MoG) model, which is used in the test stage of the algorithm to cluster RTFs extracted from multiple-speaker utterances. Clustering is performed with the expectation-maximization (EM) procedure, relying on the sparsity and intermittency of the speech signals. A preliminary experimental study, with either two or three overlapping speakers at various reverberation levels, demonstrates that the proposed scheme achieves high localization accuracy compared to a baseline method using a simpler propagation model.
Y. Hu, T. Abhayapala, P. N. Samarasinghe, and S. Gannot,
"Decoupled direction-of-arrival estimations using relative harmonic coefficients,",
in 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 2020. Traditional source direction-of-arrival (DOA) estimation algorithms generally localize the elevation and azimuth simultaneously, requiring an exhaustive search over the two-dimensional (2-D) space. By contrast, this paper presents two decoupled source DOA estimation algorithms using a recently introduced source feature called the relative harmonic coefficients. They are capable to recover the source’s elevation and azimuth separately, since the elevation and azimuth components in the relative harmonic coefficients are decoupled. The proposed algorithms are highlighted by a large reduction of computational complexity, thus enable a direct application for sound source tracking. Simulation results, using both a static and moving sound source, confirm the proposed methods are computationally efficient while achieving competitive localization accuracy.
J. Cmejla, T. Kounovsky, S. Gannot, Z. Koldovsky, and P. Tandeitnik,
"MIRaGe: Multichannel database of room impulse responses measured on high-resolution cube-shaped grid",
in 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 2020. We introduce a database of multi-channel recordings performed in an acoustic lab with adjustable reverberation time. The recordings provide detailed information about the room acoustics for positions of a source within a confined area. In particular, the main positions correspond to 4104 vertices of a cube-shaped dense grid within a 46 × 36 × 32 cm volume. The database can serve for simulations of real-world situations and as a tool for detailed analyses of beampatterns of spatial processing methods. It can also be used for training and testing of mathematical models of the acoustic field.
Y. Laufer and S. Gannot,
"A Bayesian hierarchical model for blind audio source separation",
in 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 2020. This paper presents a fully Bayesian hierarchical model for blind audio source separation in a noisy environment. Our probabilistic approach is based on Gaussian priors for the speech signals, Gamma hyperpriors for the speech precisions and a Gamma prior for the noise precision. The time-varying acoustic channels are modelled with a linear-Gaussian state-space model. The inference is carried out using a variational Expectation-Maximization (VEM) algorithm, leading to a variant of the multi-speaker multichannel Wiener filter (MCWF) to separate and enhance the audio sources, and a Kalman smoother to infer the acoustic channels. The VEM speech estimator can be decomposed into two stages: A multi-speaker linearly constrained minimum variance (LCMV) beamformer followed by a variational multi-speaker postfilter. The proposed algorithm is evaluated in a static scenario using recorded room impulse responses (RIRs) with two reverberation levels, showing superior performance compared to competing methods.
M. J. Bianco, P. Gerstoft, and S. Gannot,
"Semi-supervised source localization with deep generative modeling",
in 30th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalto University, Espoo, Finland, Sep. 2020. We propose a semi-supervised localization approach based on deep generative modeling with variational autoencoders (VAEs). Localization in reverberant environments remains a challenge, which machine learning (ML) has shown promise in addressing. Even with large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. We address this issue by performing semi-supervised learning (SSL) with convolutional VAEs. The VAE is trained to generate the phase of relative transfer functions (RTFs), in parallel with a DOA classifier, on both labeled and unlabeled RTF samples. The VAE-SSL approach is compared with SRP-PHAT and fully-supervised CNNs. We find that VAE-SSL can outperform both SRP-PHAT and CNN in label-limited scenarios.
A. Eisenberg, B. Schwartz, and S. Gannot,
"Blind audio source separation using two expectation-maximization algorithms",
in 30th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalto University, Espoo, Finland, Sep. 2020. The problem of multi-microphone blind audio source separation in a noisy environment is addressed. The estimation of the acoustic signals and the associated parameters is carried out using the expectation-maximization (EM) algorithm. Two separation algorithms are developed, using either a deterministic representation or a stochastic Gaussian distribution for modelling the speech signals. Under the deterministic model, the speech sources are estimated in the M-step by applying multiple minimum variance distortionless response (MVDR) beamformers in parallel, while under the stochastic model, the speech signals are estimated in the E-step by applying multiple multichannel Wiener filters (MCWF) in parallel. In the simulation study, we generated a large dataset of microphone signals by convolving speech signals, with overlapping activity patterns, with measured acoustic impulse responses. It is shown that the proposed methods outperform a baseline method in terms of speech quality and intelligibility.
Y. Laufer and S. Gannot,
"A Bayesian hierarchical mixture of Gaussian model for multi-speaker DOA estimation and separation",
in 30th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalto University, Espoo, Finland, Sep. 2020. In this paper, we propose a fully Bayesian hierarchical model for multi-speaker direction of arrival (DOA) estimation and separation in noisy environments, utilizing the W-disjoint orthogonality property of the speech sources. Our probabilistic approach employs a mixture of Gaussians formulation, with centroids associated with a grid of candidate speakers’ DOAs. The hierarchical Bayesian model is established by attributing priors to the various parameters. We then derive a variational Expectation-Maximization algorithm that estimates the DOAs by selecting the most probable candidates, and separates the speakers using a variant of the multichannel Wiener filter that takes into account the responsibility of each candidate in describing the received data. The proposed algorithm is evaluated using real room impulse responses from a freely-available database, in terms of both DOA estimation accuracy and separation scores. It is shown that the proposed method outperforms competing methods.
E. Hadad and S. Gannot,
"Maximum likelihood multi-speaker direction of arrival estimation utilizing a weighted histogram",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020. In this contribution, a novel maximum likelihood (ML)-based direction of arrival (DOA) estimator for concurrent speakers in a noisy reverberant environment is presented. The DOA estimation task is formulated in the short-time Fourier transform (STFT) domain in two stages. In the first stage, a single local DOA per time-frequency (TF) bin is selected, using the W-disjoint orthogonality property of the speech signal in the STFT domain. The local DOA is obtained as the maximum of the narrow-band likelihood localization spectrum at each TF bin. In addition, for each local DOA, a confidence measure is calculated, determining the confidence in the local estimate. In the second stage, the wide-band localization spectrum is calculated using a weighted histogram of the local DOA estimates, with the confidence measures as weights. Finally, the wide-band DOA estimates are obtained by selecting the peaks in the wide-band localization spectrum. The results of our experimental study demonstrate the benefit of the proposed algorithm in a reverberant environment as compared with the classical steered response power phase transform (SRP-PHAT) algorithm.
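The second stage reduces to a confidence-weighted histogram over local estimates; a minimal NumPy illustration with synthetic local DOAs (all values hypothetical):

```python
import numpy as np

# local_doas: one DOA estimate per TF bin [deg]; conf: its confidence weight.
rng = np.random.default_rng(0)
local_doas = np.concatenate([rng.normal(40, 5, 500), rng.normal(110, 5, 500)])
conf = rng.uniform(0.0, 1.0, local_doas.shape)

# Wide-band localization spectrum: histogram of local DOAs, weighted by confidence.
spectrum, edges = np.histogram(local_doas, bins=np.arange(0, 181, 2), weights=conf)
centers = 0.5 * (edges[:-1] + edges[1:])

# Wide-band DOA estimates: peaks of the spectrum (here simply the two largest bins).
peaks = centers[np.argsort(spectrum)[-2:]]
```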
Y. Yemini, S. E. Chazan, J. Goldberger, and S. Gannot,
"A composite DNN architecture for speech enhancement",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020. In speech enhancement, the use of supervised algorithms in the form of deep neural networks (DNNs) has become tremendously popular in recent years. The target function of the DNN (and the associated estimators) is often either a masking function applied to the noisy spectrum, or the clean log-spectrum. In this work, we show that neither cost function alone is suitable for dealing with narrowband noise, and propose a new composite estimator in the log-spectrum domain. The new technique relies on a single DNN that outputs both a masking function and an estimated log-spectrum. Both outputs are used for the composite enhancement. The proposed estimator demonstrates superior performance for speech utterances contaminated by additive narrowband noise, while maintaining the enhancement quality of the baseline algorithms for wideband noise.
Y. Opochinsky, S. E. Chazan, S. Gannot, and J. Goldberger,
"K-autoencoders deep clustering",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020. In this study, we propose a deep clustering algorithm that extends the k-means algorithm. Each cluster is represented by an autoencoder instead of a single centroid vector. Each data point is associated with the autoencoder which yields the minimal reconstruction error. The optimal clustering is found by learning a set of autoencoders that minimize the global reconstruction mean-square error loss. The network architecture is a simplified version of a previous method that is based on mixture-of-experts. The proposed method is evaluated on standard image corpora and performs on par with state-of-the-art methods that are based on much more complicated network architectures.
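A minimal PyTorch sketch of the k-autoencoders idea described above (assign each point to its best-reconstructing autoencoder and back-propagate only that error); sizes and training details are illustrative, not the paper's setup:

```python
import torch
import torch.nn as nn

def make_autoencoder(dim=784, hidden=32):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

K = 10                                        # number of clusters / autoencoders
aes = nn.ModuleList([make_autoencoder() for _ in range(K)])
opt = torch.optim.Adam(aes.parameters(), lr=1e-3)

def train_step(x):                            # x: (batch, dim)
    # Reconstruction error of every autoencoder for every point: (batch, K)
    errs = torch.stack([((ae(x) - x) ** 2).mean(dim=1) for ae in aes], dim=1)
    # Hard assignment: each point belongs to its best-reconstructing AE ...
    assign = errs.argmin(dim=1)
    # ... and only that AE's error enters the global loss (k-means-like step).
    loss = errs.gather(1, assign.unsqueeze(1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```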
Y. Hu, P. Samarasinghe, T. Abhayapala, and S. Gannot,
"Unsupervised multiple source localization using relative harmonic coefficient",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020. This paper presents an unsupervised multi-source localization algorithm using a recently introduced feature called the relative harmonic coefficients. We derive a closed-form expression of the feature and briefly summarize its unique properties. We then exploit this feature to develop a single-source frame/bin detector which simplifies the challenging problem of multiple source localization into a single-source localization problem. We show that the underlying method is suitable for localization using overlapped, disjoint, as well as simultaneous multi-source recordings. Experimental results in both simulated and real-life reverberant environments confirm the improved localization accuracy of the proposed method in comparison with an existing state-of-the-art approach.
O. Schwartz, E. Habets, and S. Gannot,
"Low complexity NLMS for multiple loudspeaker acoustic echo canceller using relative loudspeaker transfer functions",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020. Speech signals captured by a microphone mounted to a smart soundbar or speaker are inherently contaminated by echoes. Modern smart devices are usually characterized by low computational capabilities and low memory resources; in these cases, a low-complexity acoustic echo canceller (AEC) may be preferred, even though a tolerable degradation in the cancellation occurs. In principle, devices with multiple loudspeakers need an individual AEC for each loudspeaker, because the transfer function (TF) from each loudspeaker to the microphone must be estimated. In this paper, we present a normalized least mean square (NLMS) algorithm for the multi-loudspeaker case using relative loudspeaker transfer functions (RLTFs). In each iteration, the RLTFs between each loudspeaker and the reference loudspeaker are estimated first, and then the primary TF between the reference loudspeaker and the microphone. Assuming loudspeakers that are close to each other, the RLTFs can be estimated using fewer coefficients than the primary TF, yielding a reduction of 3:4 in computational complexity and 1:2 in memory usage. The algorithm is evaluated using both simulated and real room impulse responses (RIRs) of two loudspeakers, with a reverberation time set to 0.3 s and several distances between the loudspeakers.
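For context, the core NLMS recursion underlying the described AEC is standard; a single-channel NumPy sketch (the RLTF decomposition itself follows the paper):

```python
import numpy as np

def nlms_step(h, x_buf, mic_sample, mu=0.5, eps=1e-8):
    """One NLMS iteration of an echo-path estimate.

    h          : (L,) current filter estimate (echo path)
    x_buf      : (L,) most recent loudspeaker samples (newest first)
    mic_sample : current microphone sample
    Returns the updated filter and the echo-cancelled error sample.
    """
    e = mic_sample - h @ x_buf                       # residual echo after cancellation
    h = h + mu * e * x_buf / (x_buf @ x_buf + eps)   # normalized gradient step
    return h, e
```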
M. J. Bianco, S. Gannot, E. Fernandez-Grande, and P. Gerstoft,
"Semi-supervised source localization in reverberant environments using deep generative modeling",
The Journal of the Acoustical Society of America. We present a method for acoustic source localization in reverberant environments based on semi-supervised machine learning (ML) with deep generative models. Source localization in the presence of reverberation remains a major challenge, which recent ML techniques have shown promise in addressing. Despite often large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. In semi-supervised learning, ML systems are trained using many examples with only a few labels, with the goal of exploiting the natural structure of the data. We use variational autoencoders (VAEs), which are generative neural networks (NNs) that rely on explicit probabilistic representations, to model the latent distribution of reverberant acoustic data. VAEs consist of an encoder NN, which maps complex input distributions to simpler parametric distributions (e.g., Gaussian), and a decoder NN, which approximates the training examples. The VAE is trained to generate the phase of relative transfer functions (RTFs) between two microphones in reverberant environments, in parallel with a DOA classifier, on both labeled and unlabeled RTF samples. The performance of this VAE-based approach is compared with conventional and ML-based localization in simulated and real-world scenarios.
2019
A. Brendel, B. Laufer-Goldshtein, S. Gannot, and W. Kellermann,
"Learning-based acoustic source localization using directional spectra",
in IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Le Gosier, Guadeloupe, French West Indies, Dec. 2019. This paper proposes to use directional spectra as new features for manifold-learning-based acoustic source localization. We claim that directional spectra not only contain directional information, but rather are discriminative between different positions in a reverberant enclosure. We use the proposed features to build a manifold-learning-based localization algorithm, which is applied to single-array localization as well as to acoustic sensor network (ASN) localization. The performance of the proposed algorithm is benchmarked by comprehensive experiments carried out in a simulated environment, with comparisons to a blind approach based on triangulation and to Gaussian process regression (GPR)-based localization.
K. Weisberg and S. Gannot,
"Multiple speaker tracking using coupled HMM in the STFT domain",
in IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Le Gosier, Guadeloupe, French West Indies, Dec. 2019. We present a multi-microphone multi-speaker direction of arrival (DOA) tracking algorithm. In the proposed algorithm, the DOA values are discretized to a set of candidate DOAs. Accordingly, and following the W-disjoint orthogonality (WDO) property of the speech signal, each time-frequency (TF) bin in the short-time Fourier transform (STFT) domain is associated with a single DOA candidate. The conditional probability of each TF observation, given its corresponding DOA association, is modeled as a multivariate complex-Gaussian distribution, with the power spectral density (PSD) of each source as an unknown parameter. By applying the Fisher-Neyman factorization, it can be shown that this conditional probability is proportional to the signal-to-noise ratio (SNR) at the outputs of minimum variance distortionless response (MVDR) beamformers (BFs) directed towards all candidate DOAs. We model these observations as either a frequency-wise parallel hidden Markov model (HMM) or as a coupled HMM with coupling between adjacent frequency bins. The posterior probability of these associations is inferred by applying an extended forward-backward (FB) algorithm, and the actual DOAs can be inferred from this posterior. An experimental study demonstrates the benefits of the proposed algorithm using both a simulated dataset and real recordings drawn from the acoustic source localization and tracking (LOCATA) dataset.
S. E. Chazan, S. Gannot, and J. Goldberger,
"Deep clustering based on a mixture of autoencoders",
in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Pittsburgh, PA, USA, Oct. 2019. In this paper, we propose a Deep Autoencoder Mixture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders, where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture-of-autoencoders network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.
R. Opochinsky, B. Laufer, S. Gannot, and G. Chechik,
"Deep Ranking-Based sound source localization",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2019. Sound source localization is a cumbersome task in challenging reverberation conditions. Recently, there has been growing interest in developing learning-based localization methods. In this approach, acoustic features are extracted from the measured signals and then given as input to a model that maps them to the corresponding source positions. Typically, a massive dataset of labeled samples from known positions is required to train such models. Here, we present a novel weakly-supervised deep-learning localization method that exploits only a few labeled (anchor) samples with known positions, together with a larger set of unlabeled samples, for which we only know their relative physical ordering. We design an architecture that uses a stochastic combination of triplet-ranking loss for the unlabeled samples and physical loss for the anchor samples, to learn a nonlinear deep embedding that maps acoustic features to an azimuth angle of the source. The combined loss can be optimized effectively using a standard gradient-based approach. Evaluating the proposed approach on simulated data, we demonstrate its significant improvement over two previous learning-based approaches for various reverberation levels, while maintaining consistent performance with varying sizes of labeled data.
J. R. Jensen, U. Saqib, and S. Gannot,
"An EM method for multichannel TOA and DOA estimation of acoustic echoes",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2019. The time-of-arrivals (TOAs) of acoustic echoes are a prerequisite in, e.g., room geometry estimation and localization of acoustic reflectors, which can be an enabling technology for autonomous robots and drones. However, solving these problems using TOAs alone introduces the difficult problem of echo labeling. Moreover, it is typically suggested to estimate the TOAs by estimating the room impulse response and finding its peaks, but this approach is vulnerable to noise (e.g., ego noise). We therefore propose an expectation-maximization (EM) method for estimating both the TOAs and directions-of-arrival (DOAs) of acoustic echoes using a loudspeaker and a uniform circular array (UCA). Our results show that this approach is more robust to noise compared to the traditional peak-finding approach. Moreover, they show that the TOA and DOA information can be combined to estimate wall positions directly, without considering echo labeling.
Y. Soussana and S. Gannot,
"Variational inference for DOA estimation in reverberant conditions",
in 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, Sep. 2019. A direction of arrival (DOA) estimator for concurrent speakers in a reverberant environment is presented. The reverberation phenomenon, if not properly addressed, is known to degrade the performance of DOA estimators. In this paper, we investigate a variational Bayesian (VB) inference framework for clustering time-frequency (TF) bins to candidate angles. The received microphone signals are modelled as a sum of anechoic speech and a reverberation component. Our model relies on a Gaussian prior for the speech signal and a Gamma prior for the speech precision. The noise covariance matrix is modelled by a time-invariant full-rank coherence matrix multiplied by a time-varying gain, with a Gamma prior as well. The benefits of the presented model are verified in a simulation study using measured room impulse responses.
N. Cohen, G. Hazan, B. Schwartz, and S. Gannot,
"An EM algorithm for joint Dual-Speaker sep-aration and dereverberation",
in 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, Sep. 2019. The scenario of a mixture of two speakers captured by a microphone array in a noisy and reverberant environment is considered. If the problems of source separation and dereverberation are treated separately, performance degradation may result. It is well-known that the performance of blind source separation (BSS) algorithms degrades in the presence of reverberation, unless reverberation effects are properly addressed (leading to the so-called convolutive BSS algorithms). Similarly, the performance of common dereverberation algorithms will severely degrade if an interference signal is also captured by the same microphone array. The aim of the proposed method is to jointly separate and dereverberate the two speech sources, by extending the Kalman expectation-maximization for dereverberation (KEMD) algorithm, previously proposed by the authors. A statistical model is attributed to this scenario, using the convolutive transfer function (CTF) approximation, and the expectation-maximization (EM) scheme is applied to obtain a maximum likelihood (ML) estimate of the parameters. In the expectation step, the separated clean signals are extracted from the observed data by applying a Kalman filter, utilizing the parameters estimated in the previous iteration. The maximization step updates the parameter estimates according to the E-step output. Simulation results show that the proposed method improves both the separation of the signals and their overall quality.
S. E. Chazan, H. Hammer, G. Hazan, J. Goldberger, and S. Gannot,
"Multi-Microphone speaker separation based on deep DOA estimation",
in 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, Sep. 2019. In this paper, we present a multi-microphone speech separation algorithm based on masking inferred from the speakers’ directions of arrival (DOAs). According to the W-disjoint orthogonality property of speech signals, each time-frequency (TF) bin is dominated by a single speaker. Each TF bin can therefore be associated with a single DOA. In our procedure, we apply a deep neural network (DNN) with a U-net architecture to infer the DOA of each TF bin from a concatenated set of the spectra of the microphone signals. Separation is obtained by multiplying the reference microphone signal by the masks associated with the different DOAs. Our proposed deep direction estimation for speech separation (DDESS) method is inspired by recent advances in deep clustering methods. Unlike already established methods that apply clustering in a latent embedded space, in our approach the embedding is closely associated with the spatial information, as manifested by the different speakers’ directions of arrival.
K. Weisberg, S. Gannot, and O. Schwartz,
"An online multiple-speaker DOA tracking using the Cappe-Moulines recursive expectation-maximization algorithm",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), 2019, pp. 656–660. In this paper, we present a multiple-speaker direction of arrival (DOA) tracking algorithm with a microphone array that utilizes the recursive EM (REM) algorithm proposed by Cappé and Moulines. In our model, all sources can be located in one of a predefined set of candidate DOAs. Accordingly, the received signals from all microphones are modeled as Mixture of Gaussians (MoG) vectors in which each speaker is associated with a corresponding Gaussian. The localization task is then formulated as a maximum likelihood (ML) problem, where the MoG weights and the power spectral density (PSD) of the speakers are the unknown parameters. The REM algorithm is then utilized to estimate the ML parameters in an online manner, facilitating multiple source tracking. By using Fisher-Neyman factorization, the outputs of the minimum variance distortionless response (MVDR)-beamformer (BF) are shown to be sufficient statistics for estimating the parameters of the problem at hand. With that, the terms for the E-step are significantly simplified to a scalar form. An experimental study demonstrates the benefits of the using proposed algorithm in both a simulated data-set and real recordings from the acoustic source localization and tracking (LOCATA) data-set.
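The Cappé-Moulines recursive EM replaces the batch E-step with a stochastic-approximation update of the sufficient statistics; a schematic NumPy sketch for the weights of a Gaussian mixture over candidate DOAs (scalar observations, fixed component parameters, all values illustrative rather than the paper's formulation):

```python
import numpy as np

def rem_update(stats, weights, y, mus, sigma, t):
    """One recursive-EM step for the weights of a Gaussian mixture.

    stats   : (K,) running sufficient statistics (responsibility averages)
    weights : (K,) current mixture weights over candidate DOAs
    y       : new scalar observation; mus, sigma: fixed component parameters
    """
    gamma = 1.0 / (t + 1) ** 0.6                    # decaying step size
    like = weights * np.exp(-0.5 * ((y - mus) / sigma) ** 2)
    resp = like / like.sum()                        # E-step for this frame only
    stats = (1 - gamma) * stats + gamma * resp      # recursive statistics update
    weights = stats / stats.sum()                   # M-step: weights from statistics
    return stats, weights
```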
A. Brendel, B. Laufer-Goldshtein, S. Gannot, R. Talmon, and W. Kellermann,
"Localization of an unknown number of speakers in adverse acoustic conditions using reliability information and diarization",
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7898–7902. This paper investigates the localization of an arbitrary number of simultaneously active speakers in an acoustic enclosure. We propose an algorithm capable of estimating the number of speakers, using reliability information to obtain robust estimation results in adverse acoustic scenarios, and estimating individual probability distributions describing the position of each speaker using convex geometry tools. To this end, we start from an established EM-based algorithm for the localization of acoustic sources, in which the estimation of the number of sources as well as the handling of reverberation had not been addressed sufficiently. We show improvements in the localization of a higher number of sources and in the robustness in adverse conditions, including interference from competing speakers, reverberation and noise.
2018
Y. Laufer and S. Gannot,
"A Bayesian hierarchical model for speech dereverberation",
in International Conference on the Science of Electrical Engineering (ICSEE), Eilat, Israel, Dec. 2018. In this paper, the problem of speech dereverberation in a noiseless scenario is addressed in a hierarchical Bayesian framework. Our probabilistic approach relies on a Gaussian model for the early speech signal, combined with a multichannel Gaussian model for the relative early transfer function (RETF). The late reverberation is modelled as a Gaussian additive interference, and the speech and reverberation precisions are modelled with Gamma distributions. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of the multichannel Wiener filter (MCWF) to infer the early speech component while suppressing the late reverberation. The proposed algorithm was evaluated using real room impulse responses (RIRs) recorded in our acoustic lab, with reverberation times of 0.36 s and 0.61 s. It is shown that a significant improvement is obtained with respect to the reverberant signal, and that the proposed algorithm outperforms a baseline algorithm. In terms of channel alignment, a superior channel estimate is demonstrated.
O. Schwartz, A. David, O. Shahen-Tov, and S. Gannot,
"Multi-microphone voice activity detector based on steered-response power output entropy",
in International Conference on the Science of Electrical Engineering (ICSEE), Eilat, Israel, Dec. 2018. Voice activity detection (VAD), namely determining whether a speech signal is active or inactive, and single-talk detection (STD), namely detecting that only one speaker is active, are important building blocks in many speech processing applications. A speaker-localization stage (such as the steered response power (SRP)) is often concurrently implemented on the same device. In this paper, the spatial properties of the SRP are utilized for improving the performance of both the voice activity detector (VAD) and the STD. We propose to measure the entropy at the SRP output and compare it with the typical entropy of noise-only frames. This feature utilizes spatial information and may therefore become advantageous in nonstationary noise environments. The STD can then be implemented by determining local minimum values of the entropy measure of the SRP. The proposed VAD was tested for a single speaker in two cases: directional background noise with changing level, and a background music source. The proposed STD was tested using real recordings of two concurrent speakers.
E. Hadad and S. Gannot,
"Multi-speaker direction of arrival estimation using SRP-PHAT algorithm with a weighted histogram",
in International Conference on the Science of Electrical Engineering (ICSEE), Eilat, Israel, Dec. 2018. A direction of arrival (DOA) estimator for concurrent speakers in a reverberant environment is presented. The DOA estimation task is formulated in the short-time Fourier transform (STFT) domain in two stages. In the first stage, a single narrow-band DOA per time-frequency (T-F) bin is selected, since the speech sources are assumed to exhibit disjoint activity in the STFT domain. The narrow-band DOA is obtained as the maximum of the narrow-band steered response power phase transform (SRP-PHAT) localization spectrum at that T-F bin. In addition, for each narrow-band DOA, a quality measure is calculated, which provides the confidence in the estimated decision. In the second stage, the wide-band localization spectrum is calculated using a weighted histogram of the narrow-band DOAs, with the quality measures as weights. Finally, the wide-band DOA estimates are obtained by selecting the peaks in the wide-band localization spectrum. The results of our experimental study demonstrate the benefit of the proposed algorithm as compared to the wide-band SRP-PHAT algorithm in a reverberant environment.
A. Adler, O. Schwartz, and S. Gannot,
"A weighted multichannel Wiener filter and its decomposition to LCMV beamformer and post-filter for source separation and noise reduction",
in International Conference on the Science of Electrical Engineering (ICSEE), Eilat, Israel, Dec. 2018 (best paper award). Speech enhancement and source separation are well-known challenges in the context of hands-free communication and automatic speech recognition. The multichannel Wiener filter (MCWF), which satisfies the minimum mean square error (MMSE) criterion, is a fundamental speech enhancement tool. However, it can suffer from speech distortion, especially when the noise level is high. The speech distortion weighted multichannel Wiener filter (SDW-MWF) was therefore proposed to control the tradeoff between noise reduction and speech distortion for the single-speaker case. In this paper, we generalize this estimator and propose a method for controlling this tradeoff in the multi-speaker case. The proposed estimator is decomposed into two successive stages: 1) a multi-speaker linearly constrained minimum variance (LCMV) beamformer, which is solely determined by the spatial characteristics of the speakers; and 2) a multi-speaker Wiener postfilter (PF), which is responsible for reducing the residual noise. The proposed PF consists of several controlling parameters that can almost independently control the tradeoff between the distortion of each speaker and the total noise reduction.
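For context, the single-speaker SDW-MWF that this paper generalizes trades speech distortion against noise reduction through a single parameter; in the standard notation of the literature (speech and noise covariance matrices Phi_s and Phi_n, reference-microphone selector e_1, trade-off mu):

```latex
\mathbf{w}_{\mathrm{SDW\text{-}MWF}}
  = \left(\boldsymbol{\Phi}_s + \mu\,\boldsymbol{\Phi}_n\right)^{-1}
    \boldsymbol{\Phi}_s\,\mathbf{e}_1 ,
```

with mu = 1 recovering the MMSE-optimal MCWF and larger mu favoring noise reduction over low distortion.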
A. Barnov, V. B. Bracha, S. Markovich-Golan, and S. Gannot,
"Spatially robust GSC beamforming with controlled white noise gain",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, Sep. 2018. Adaptive beamforming is widely used for speech enhancement in telephony and speech recognition applications. We focus on scenarios with a single desired speaker in non-stationary environmental noise. Many modern beamformers are designed using the desired speaker’s transfer function (TF), or the respective relative transfer function (RTF). If the relative source position is fixed, tracking the RTF can be avoided. On top of reducing the computational complexity, this may also prevent the beamformer from enhancing competing sources. In this work, to target such applications, we propose a technique for obtaining a spatially robust generalized sidelobe canceler (GSC) beamformer with controlled white noise gain (WNG). The proposed implementation introduces robustness to mismatch between the assumed and actual RTFs, while maintaining a sufficiently large WNG. It allows for high flexibility in shaping the desired response, while maintaining low computational complexity.
S. E. Chazan, S. Gannot, and J. Goldberger,
"Attention-based neural network for joint diarization and speaker extraction",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, Sep. 2018. Multi-microphone, DNN-based, speech enhancement and speaker separation/extraction algorithms have recently gained increasing popularity. The enhancement capabilities of a spatial processor can be very high, provided that all its building blocks are accurately estimated. Data-driven estimation approaches can be very attractive since they do not rely on accurate statistical models, which are usually unavailable. However, training a DNN with multi-microphone data is a challenging task, due to inevitable differences between the training and test phases. In this work, we present an estimation procedure for controlling a linearly-constrained minimum variance (LCMV) beamformer for speaker extraction and noise reduction. We propose an attention-based DNN for speaker diarization that is applicable to the task at hand. In the proposed scheme, each microphone signal propagates through a dedicated DNN, and an attention mechanism selects the most informative microphone. This approach has the potential of mitigating the mismatch between the training and test phases and can therefore lead to improved speaker extraction performance.
S. Markovich-Golan and S. Gannot,
"A probability distribution model for the relative transfer function in a reverberant environment",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, Sep. 2018. The relative transfer function (RTF) is a generalization of the delay-based array manifold, which is applicable to reverberant environments with multiple reflections. Beamformers that utilize the RTF are known to outperform simpler beamforming techniques that use delay-based steering vectors. Adopting established models of the acoustic transfer functions, and utilizing recent contributions that derive the probability distribution of the ratio of independent complex-Gaussian random variables, we derive a probability distribution model for the RTF. The model is verified by comparison to the empirical distribution in multiple Monte-Carlo experiments.
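A toy Monte-Carlo experiment in the spirit of the paper's verification, assuming (for illustration only) that each acoustic transfer function is complex Gaussian with a deterministic direct-path mean; all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
# Illustrative model: TF at each mic = direct-path term + complex-Gaussian
# reverberant part (values chosen for demonstration, not from the paper).
mu1, mu2 = 1.0 + 0.5j, 0.8 - 0.3j
sigma = 0.3
noise = lambda: (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
h1 = mu1 + sigma * noise()
h2 = mu2 + sigma * noise()
rtf = h2 / h1                      # empirical samples of the RTF
# Summaries one would compare against the derived distribution:
print("empirical mean:", rtf.mean())
print("99.9% quantile of |RTF| (heavy-tail check):", np.quantile(np.abs(rtf), 0.999))
```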
B. Laufer, R. Talmon, and S. Gannot,
"Diarization and separation based on a Data-Driven simplex",
in The 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, Sep. 2018. Separation of underdetermined speech mixtures, where the number of speakers is greater than the number of microphones, is a challenging task. Due to the intermittent behaviour of human conversations, typically, the instantaneous number of active speakers does not exceed the number of microphones, namely the mixture is locally (over-)determined. This scenario is addressed in this paper using a dual stage approach: diarization followed by separation. The diarization stage is based on spectral decomposition of the correlation matrix between different time frames. Specifically, the spectral gap reveals the overall number of speakers, and the computed eigenvectors form a simplex of the activity of the speakers across time. In the separation stage, the diarization results are utilized for estimating the mixing acoustic channels, as well as for constructing an unmixing scheme for extracting the individual speakers. The performance is demonstrated in a challenging scenario with six speakers and only four microphones. The proposed method shows perfect recovery of the overall number of speakers, close to perfect diarization accuracy, and high separation capabilities in various reverberation conditions.
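A compact sketch of the diarization stage's eigen-analysis under simplifying assumptions (generic per-frame feature vectors, a capped speaker count); the paper's simplex-vertex analysis and the subsequent channel estimation are not reproduced here:

```python
import numpy as np

def eigengap_diarization(feats, max_spk=8):
    """Eigen-analysis of the frame-correlation matrix (toy version).
    feats: (T, D) per-frame feature vectors (e.g., normalized log-spectra).
    Returns the estimated number of speakers and the frame embedding."""
    F = feats - feats.mean(axis=0)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    W = F @ F.T                                   # (T, T) correlation between frames
    evals, evecs = np.linalg.eigh(W)
    evals, evecs = evals[::-1], evecs[:, ::-1]    # sort descending
    gaps = evals[:max_spk] - evals[1:max_spk + 1]
    n_spk = int(np.argmax(gaps)) + 1              # spectral gap -> number of speakers
    return n_spk, evecs[:, :n_spk]                # rows lie (approx.) on a simplex
```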
O. Schwartz and S. Gannot,
"Recursive Expectation-Maximization algorithm for online Multi-Microphone noise reduction",
in The 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, Sep. 2018. Speech signals, captured by a microphone array mounted to a smart loudspeaker device, can be contaminated by ambient noise. In this paper, we present an online multichannel algorithm, based on the recursive EM (REM) procedure, to suppress ambient noise and enhance the speech signal. In the E-step of the proposed algorithm, a multichannel Wiener filter (MCWF) is applied to enhance the speech signal. The MCWF parameters, that is, the power spectral density (PSD) of the anechoic speech, the steering vector, and the PSD matrix of the noise, are estimated in the M-step. The proposed algorithm is specifically suitable for online applications since it uses only past and current observations and requires no iterations. To evaluate the proposed algorithm we used two sets of measurements. In the first set, static scenarios were generated by convolving speech utterances with real room impulse responses (RIRs) recorded in our acoustic lab with reverberation time set to 0.16 s and several signal to directional noise ratio (SDNR) levels. The second set was used to evaluate dynamic scenarios by using real recordings acquired by CEVA “smart and connected” development platform. Two practical use cases were evaluated: 1) estimating the steering vector with a known noise PSD matrix and 2) estimating the noise PSD matrix with a known steering vector. In both use cases, the proposed algorithm outperforms baseline multichannel denoising algorithms.
S. E. Chazan, J. Goldberger, and S. Gannot,
"LCMV beamformer with DNN-based multichannel concurrent speakers detector",
in The 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, Sep. 2018. Application of the linearly constrained minimum variance (LCMV) beamformer (BF) to speaker extraction tasks in real-life scenarios necessitates a sophisticated control mechanism to facilitate the estimation of the noise spatial cross-power spectral density (cPSD) matrix and the relative transfer function (RTF) of all sources of interest. We propose a deep neural network (DNN)-based multichannel concurrent speakers detector (MCCSD) that utilizes all available microphone signals to detect the activity patterns of all speakers. Time frames classified as no active speaker frames will be utilized to estimate the cPSD, while time frames with a single detected speaker will be utilized for estimating the associated RTF. No estimation will take place during concurrent speaker activity. Experimental results show that the multi-channel approach significantly improves its single-channel counterpart.
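The closed-form LCMV weights that such a control mechanism feeds can be written compactly; a generic sketch (names are illustrative) used implicitly by this and several neighboring entries:

```python
import numpy as np

def lcmv_weights(Rn, C, f=None):
    """Closed-form LCMV: w = Rn^{-1} C (C^H Rn^{-1} C)^{-1} f.
    Rn: (M, M) noise spatial cPSD matrix; C: (M, K) RTFs of constrained sources;
    f: (K,) desired responses (default: extract source 0, null the others)."""
    M, K = C.shape
    if f is None:
        f = np.zeros(K, dtype=complex)
        f[0] = 1.0
    Rn_inv_C = np.linalg.solve(Rn, C)            # Rn^{-1} C
    gram = C.conj().T @ Rn_inv_C                 # C^H Rn^{-1} C
    return Rn_inv_C @ np.linalg.solve(gram, f)   # (M,) beamformer weights
```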
O. Ernst, S. E. Chazan, S. Gannot, and J. Goldberger,
"Speech dereverberation using fully convolutional networks",
in The 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, Sep. 2018. Speech dereverberation using a single microphone is addressed in this paper. Motivated by the recent success of fully convolutional networks (FCNs) in many image processing applications, we investigate their applicability to enhance the speech signal represented by short-time Fourier transform (STFT) images. We present two variations: a “U-Net”, which is an encoder-decoder network with skip connections, and a generative adversarial network (GAN) with the U-Net as generator, which yields a more intuitive cost function for training. To evaluate our method we used the data from the REVERB challenge, and compared our results to other methods under the same conditions. We have found that our method outperforms the competing methods in most cases.
S. Markovich-Golan, S. Gannot, and W. Kellermann,
"Performance analysis of the Covariance-Whitening and the Covariance-Subtraction methods for estimating the relative transfer function",
in The 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, Sep. 2018. Estimation of the relative transfer function (RTF) vector of a desired speech source is a fundamental problem in the design of data-dependent spatial filters. We present two common estimation methods, namely the covariance-whitening (CW) and the covariance-subtraction (CS) methods. The CW method has been shown in prior work to outperform the CS method; however, thus far its performance has not been analyzed. In this paper, we analyze the performance of the CW and CS methods and show that in the cases of spatially white noise, and of desired-speech and coherent-interference powers that are uniform over all microphones, the CW method is superior. The derivations are validated by comparing them to their empirical counterparts in Monte-Carlo experiments. In fact, the CW method outperforms the CS method in all tested scenarios, although rare scenarios for which this does not hold may exist.
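Sketches of the two estimators being compared, for sample covariance matrices Ry (noisy) and Rv (noise-only), following the standard CS and CW definitions:

```python
import numpy as np

def rtf_cs(Ry, Rv, ref=0):
    """Covariance subtraction: column of (Ry - Rv), normalized at the reference mic."""
    Rx = Ry - Rv
    return Rx[:, ref] / Rx[ref, ref]

def rtf_cw(Ry, Rv, ref=0):
    """Covariance whitening: principal eigenvector in the noise-whitened domain."""
    L = np.linalg.cholesky(Rv)               # Rv = L L^H
    A = np.linalg.solve(L, Ry)               # L^{-1} Ry
    Ry_w = np.linalg.solve(L, A.conj().T)    # L^{-1} Ry L^{-H} (Ry is Hermitian)
    _, V = np.linalg.eigh(Ry_w)
    h = L @ V[:, -1]                         # de-whiten the top eigenvector
    return h / h[ref]
```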
A. Brendel, S. Gannot, and W. Kellermann,
"Localization of multiple simultaneously active speakers in an acoustic sensor network",
in IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Sheffield, United Kingdom (Great Britain), Jul. 2018. This paper addresses the localization of an unknown number of acoustic sources in an enclosure. We extend a well-established algorithm for the localization of acoustic sources, which is based on the expectation-maximization (EM) algorithm for clustering phase differences with a Gaussian mixture model. The von Mises distribution, a more appropriate probabilistic model for spherical data such as directions of arrival or phase differences, is used to derive a localization algorithm for multiple simultaneously active sources. Experiments with simulated room impulse responses confirm the superiority of the proposed algorithm over the existing method in terms of localization performance.
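A self-contained EM sketch for a von Mises mixture on circular data (e.g., wrapped phase differences), using the standard resultant-vector M-step and a common approximation for the concentration update; initialization and constants are illustrative:

```python
import numpy as np
from scipy.special import i0

def vm_pdf(x, mu, kappa):
    """von Mises density on the circle."""
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * i0(kappa))

def em_von_mises_mixture(x, K, n_iter=50, seed=0):
    """EM for a K-component von Mises mixture over angles x (radians)."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-np.pi, np.pi, K)
    kappa = np.full(K, 2.0)
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        p = np.stack([w[k] * vm_pdf(x, mu[k], kappa[k]) for k in range(K)])
        gamma = p / (p.sum(axis=0, keepdims=True) + 1e-300)      # (K, N)
        # M-step: mean directions from the weighted resultant vector
        C, S = gamma @ np.cos(x), gamma @ np.sin(x)
        Nk = gamma.sum(axis=1)
        mu = np.arctan2(S, C)
        Rbar = np.sqrt(C**2 + S**2) / (Nk + 1e-12)
        # Standard approximation to the inverse of A(kappa) = I1/I0
        kappa = np.clip(Rbar * (2 - Rbar**2) / (1 - Rbar**2 + 1e-12), 1e-3, 1e3)
        w = Nk / x.size
    return mu, kappa, w
```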
X. Li, B. Mourgue, L. Girin, S. Gannot, and R. P. Horaud,
"Online localization of multiple moving speakers in reverberant environments",
in IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Sheffield, United Kingdom (Great Britain), Jul. 2018. This paper addresses the problem of online localization of multiple moving speakers in reverberant environments. The direct-path relative transfer function (DP-RTF), defined as the ratio between the first taps of the convolutive transfer functions (CTFs) of two microphones, encodes the inter-channel direct-path information and is thus used as a localization feature that is robust against reverberation. The CTF estimation is based on the cross-relation method. In this work, the recursive least-squares method is proposed to solve the cross-relation problem, due to its relatively low computational cost and good convergence rate. The DP-RTF feature estimated at each time-frequency bin is assumed to correspond to a single speaker. A complex Gaussian mixture model is used to assign each observed feature to one among several speakers. The recursive expectation-maximization algorithm is adopted to update the model parameters online. The method is evaluated with a new dataset containing multiple moving speakers, where the ground-truth speaker trajectories are recorded with a motion capture system.
S. E. Chazan, S. Gannot, and J. Goldberger,
"Training strategies for deep latent models and applications to speech presence probability estimation",
in The 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Guildford, UK, Jul. 2018. In this study we address models with latent variables in the context of neural networks. We analyze a neural network architecture, the mixture of deep experts (MoDE), that models latent variables using the mixture-of-experts paradigm. Learning the parameters of latent variable models is usually done by the expectation-maximization (EM) algorithm. However, it is well known that back-propagation gradient-based algorithms are the preferred strategy for training neural networks. We show that in the case of neural networks with latent variables, the back-propagation algorithm is actually a recursive variant of the EM that is more suitable for training neural networks. To demonstrate the viability of the proposed MoDE network, it is applied to the task of speech presence probability estimation, which is widely applicable to many speech processing problems, e.g., speaker diarization and separation, speech enhancement, and noise reduction. Experimental results show the benefits of the proposed architecture over standard fully-connected networks with the same number of parameters.
S. E. Chazan, J. Goldberger, and S. Gannot,
"DNN-based concurrent speakers detector and its application to speaker extraction with LCMV beamforming",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018. In this paper, we present a new control mechanism for LCMV beamforming. Application of the LCMV beamformer to speaker separation tasks requires accurate estimates of its building blocks, e.g. the noise spatial cross-power spectral density (cPSD) matrix and the relative transfer function (RTF) of all sources of interest. An accurate classification of the input frames to various speaker activity patterns can facilitate such an estimation procedure. We propose a DNN-based concurrent speakers detector (CSD) to classify the noisy frames. The CSD, trained in a supervised manner using a DNN, classifies noisy frames into three classes: 1) all speakers are inactive – used for estimating the noise spatial cPSD matrix; 2) a single speaker is active – used for estimating the RTF of the active speaker; and 3) more than one speaker is active – discarded for estimation purposes. Finally, using the estimated blocks, the LCMV beamformer is constructed and applied for extracting the desired speaker from a noisy mixture of speakers.
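A toy control loop showing how the three CSD classes could route frames to the different estimators; the recursive averaging and the covariance-subtraction RTF step are illustrative simplifications, not the paper's exact procedure:

```python
import numpy as np

def route_frames(Y, labels, alpha=0.95):
    """Y: (T, M) STFT vectors of one frequency bin; labels: (T,) CSD class per
    frame, with 0 = all inactive, 1 = single speaker, 2 = concurrent speakers."""
    M = Y.shape[1]
    Rn = np.eye(M, dtype=complex)        # noise cPSD (recursively averaged)
    Rs = np.eye(M, dtype=complex)        # single-speaker-frame cPSD
    for y, c in zip(Y, labels):
        R = np.outer(y, y.conj())
        if c == 0:
            Rn = alpha * Rn + (1 - alpha) * R    # noise-only: update noise cPSD
        elif c == 1:
            Rs = alpha * Rs + (1 - alpha) * R    # single speaker: update for RTF
        # c == 2: concurrent activity -> discarded for estimation purposes
    Rx = Rs - Rn
    rtf = Rx[:, 0] / (Rx[0, 0] + 1e-12)          # simple CS-style RTF estimate
    return Rn, rtf
```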
B. Laufer-Goldshtein, R. Talmon, I. Cohen, and S. Gannot,
"Multi-view source localization based on power ratios",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018. Despite attracting significant research efforts, the problem of source localization in noisy and reverberant environments remains challenging. Novel learning-based methods attempt to solve the problem by modelling the acoustic environment from the observed data. Typically, appropriate feature vectors are defined, and then used for constructing a model, which maps the extracted features to the corresponding source positions. In this paper, we focus on localizing a source using a distributed network with several arrays of unidirectional microphones. We introduce new feature vectors, which utilize the special characteristic of unidirectional microphones, receiving different parts of the reverberated speech. The new features are computed locally for each array, using the power-ratios between its measured signals, and are used to construct a local model, representing the unique view point of each array. The models of the different arrays, conveying distinct and complementing structures, are merged by a Multi-View Gaussian Process (MVGP), mapping the new features to their corresponding source positions. Based on this unifying model, a Bayesian estimator is derived, exploiting the relations conveyed by the covariance terms of the MVGP. The resulting localizer is shown to be robust to noise and reverberation, utilizing a computationally efficient feature extraction.
X. Li, S. Gannot, L. Girin, and R. Horaud,
"Multisource MINT using convolutive transfer function",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018. The multichannel inverse filtering method, i.e., the multiple input/output inverse theorem (MINT), is widely used. However, it is usually performed in the time domain based on long room impulse responses; it therefore has a high computational complexity and a large number of near-common zeros. In this paper, we propose to perform MINT in the short-time Fourier transform (STFT) domain, in which the time-domain filter is approximated by the convolutive transfer function. The oversampled STFT is used to avoid frequency aliasing, which however leads to a common-zero region in the subband frequency response due to the frequency response of the STFT window. A new inverse filtering target function concerning the STFT window is proposed to overcome this problem. In addition, unlike most studies using MINT for single-source dereverberation, a multisource MINT is proposed for joint source separation and dereverberation.
Y. Laufer and S. Gannot,
"A Bayesian hierarchical model for speech enhancement",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018. This paper addresses the problem of blind adaptive beamforming using a hierarchical Bayesian model. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state-space model for the possibly time-varying acoustic channel. Furthermore, we assume a Gamma prior for the ambient noise precision. We present a variational Expectation-Maximization (VEM) algorithm that employs a variant of multi-channel Wiener filter (MCWF) to estimate the sound source and a Kalman smoother to estimate the acoustic channel of the room. It is further shown that the VEM speech estimator can be decomposed into two stages: A multichannel minimum variance distortionless response (MVDR) beamformer and a subsequent single-channel variational postfilter. The proposed algorithm is evaluated in terms of speech quality, for a static scenario with recorded room impulse responses (RIRs). It is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed algorithm outperforms a baseline algorithm. In terms of channel alignment, a superior channel estimate is demonstrated compared to the causal Kalman filter.
2017
D. Kounades-Bastian, R. P. Horaud, L. Girin, X. Alameda-Pineda, and S. Gannot,
"Exploiting the intermittency of speech for joint separation and diarization of speech signals",
in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2017. Natural conversations are spontaneous exchanges involving two or more people speaking in an intermittent manner. One therefore expects such conversations to have intervals where some of the speakers are silent. Yet, most (multichannel) audio source separation (MASS) methods consider the sound sources to be continuously emitting over the total duration of the processed mixture. In this paper we propose a probabilistic model for MASS in which the sources may have pauses. The activity of the sources is modeled as a hidden state, the diarization state, enabling us to activate/de-activate the sound sources at time-frame resolution. We plug the diarization model into the spatial covariance matrix model proposed for MASS in [1], and obtain an improvement in performance over the state of the art when separating mixtures with intermittent speakers.
S. E. Chazan, J. Goldberger, and S. Gannot,
"Deep recurrent mixture of experts for speech enhancement",
in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2017. Deep neural networks (DNNs) have recently become a viable methodology for single-microphone speech enhancement. The most common approach is to feed the noisy speech features into a fully-connected DNN to either directly enhance the speech signal or to infer a mask which can be used for the speech enhancement. In this case, one network has to deal with the large variability of the speech signal. Most approaches also discard the speech continuity. In this paper, we propose a deep recurrent mixture of experts (DRMoE) architecture that addresses these two issues. In order to reduce the large speech variability, we split the network into a mixture of networks (denoted experts), each of which specializes in a specific and simpler task, and a gating network. The time-continuity of the speech signal is taken into account by implementing the experts and the gating network as recurrent neural networks (RNNs). An experimental study shows that the proposed algorithm produces higher objective measurement scores compared to both a single RNN and a deep mixture of experts (DMoE) architecture.
O. Shwartz, A. Plinge, E. Habets, and S. Gannot,
"Blind microphone geometry calibration using one reverberant speech event",
in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2017. A novel approach to calibrate the geometry of microphones using a single sound event is proposed. A variant of the expectation-maximization algorithm is employed to estimate the spatial coherence matrix of the reverberant sound field directly from the microphone signals. By matching the spatial coherence to theoretical models, the pairwise microphone distances are estimated. From this, the overall geometry is computed. Simulations and lab recordings are used to show that the proposed method outperforms a related approach that assumes a perfectly diffuse sound field.
D. Y. Levin, S. Markovich-Golan, and S. Gannot,
"Distributed LCMV beamforming: Considerations of spatial topology and local preprocessing",
in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2017. A linearly constrained minimum variance (LCMV) beamformer aims to completely remove interference and optimize the signal-to-noise ratio (SNR). We examine an array geometry consisting of multiple sub-arrays. Our analysis shows that the increased intersensor distance typical of such setups is beneficial for the task of signal separation. Another unique feature of distributed arrays is the necessity of sharing information from different locations, which may pose a burden in terms of power and bandwidth resources. We discuss a scheme with minimalistic transmission requirements involving a preprocessing operation at each sub-array node. Expressions for the penalties due to preprocessing with local parameters are derived and corroborated with computer simulations.
D. Cherkassky, S. E. Chazan, J. Goldberger, and S. Gannot,
"Successive relative transfer function identification using single microphone speech enhancement",
in The 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, Aug. 2017. Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this contribution, we consider the RTF identification challenge in a multi-source scenario. We propose a successive RTF identification (SRI) method, based on the sole assumption that sources become successively active. The proposed algorithm identifies the RTF of the i-th speech source, assuming that the RTFs of all other sources in the environment and the power spectral density (PSD) matrix of the noise were previously estimated. The proposed RTF identification algorithm is based on the neural network Mix-Max (NN-MM) single-microphone speech enhancement algorithm, followed by a least-squares (LS) system identification method. The proposed RTF estimation algorithm is validated by simulation.
A. Malek, S. E. Chazan, I. Malka, V. Tourbabin, J. Goldberger, E. Tzirkel-Hancock, and S. Gannot,
"Speaker extraction using LCMV beamformer with DNN-based SPP and RTF identification scheme",
in The 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, Aug. 2017. The linearly constrained minimum variance (LCMV)-beamformer (BF) is a viable solution for desired source extraction from a mixture of speakers in a noisy environment. The performance in terms of speech distortion, interference cancellation and noise reduction depends on the estimation of a set of parameters. This paper presents a new mechanism to update the parameters of the LCMV-BF. A new speech presence probability (SPP)-based voice activity detector (VAD) controls the noise covariance matrix update, and a speaker position identifier (SPI) procedure controls the relative transfer functions (RTFs) update. A postfilter is then applied to the BF output to further attenuate the residual noise signal. A series of experiments using real-life recordings confirm the speech enhancement capabilities of the proposed algorithm.
O. Schwartz, Y. Dorfan, M. Taseska, E. A. Habets, and S. Gannot,
"DOA estimation in noisy environment with unknown noise power using the EM algorithm",
in The 5th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San-Francisco, CA, USA, Mar. 2017. A direction of arrival (DOA) estimator for concurrent speakers in a noisy environment with unknown noise power is presented. Spatially colored noise, if not properly addressed, is known to degrade the performance of DOA estimators. In our contribution, the DOA estimation task is formulated as a maximum likelihood (ML) problem, which is solved using the expectation-maximization (EM) procedure. The received microphone signals are modelled as a sum of the speech and noise components. The noise power spectral density (PSD) matrix is modelled by a time-invariant full-rank coherence matrix multiplied by the noise power. The PSDs of the speech and noise components are estimated as part of the EM procedure. The benefit of the presented algorithm in a simulated noisy environment using measured room impulse responses is demonstrated.
C. Evers, Y. Dorfan, S. Gannot, and P. A. Naylor,
"Source tracking using moving microphone arrays for robot audition",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), New-Orleans, LA, USA, Mar. 2017. Intuitive spoken dialogues are a prerequisite for human-robot interaction. In many practical situations, robots must be able to identify and focus on sources of interest in the presence of interfering speakers. Techniques such as spatial filtering and blind source separation are therefore often used, but rely on accurate knowledge of the source location. In practice, sound emitted in enclosed environments is subject to reverberation and noise. Hence, sound source localization must be robust both to diffuse noise due to late reverberation and to spurious detections due to early reflections. For improved robustness against reverberation, this paper proposes a novel approach for sound source tracking that constructively exploits the spatial diversity of a microphone array installed in a moving robot. In previous work, we developed speaker localization methods based on expectation-maximization (EM) and on Bayesian inference. In this paper we combine the EM and Bayesian approaches in one framework for improved robustness against reverberation and noise.
D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud,
"An EM algorithm for joint source separation and diarization of multichannel convolutive speech mixtures",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), New-Orleans, LA, USA, Mar. 2017. We present a probabilistic model for joint source separation and diarisation of multichannel convolutive speech mixtures. We build upon the framework of local Gaussian model (LGM) with non-negative matrix factorization (NMF). The diarisation is introduced as a temporal labeling of each source in the mix as active or inactive at the short-term frame level. We devise an EM algorithm in which the source separation process is aided by the diarisation state, since the latter indicates the sources actually present in the mixture. The diarisation state is tracked with a Hidden Markov Model (HMM) with emission probabilities calculated from the estimated source signals. The proposed EM has separation performance comparable with a state-of-the-art LGM NMF method, while outperforming a state-of-the-art speaker diarisation pipeline.
E. Hadad, D. Marquardt, W. Pu, S. Gannot, S. Doclo, Z.-Q. Luo, I. Merks, and T. Zhang,
"Comparison of two binaural beamforming approaches for hearing aids",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), New-Orleans, LA, USA, Mar. 2017. Beamforming algorithms in binaural hearing aids are crucial to improve speech understanding in background noise for hearing-impaired persons. In this study, we compare and evaluate the performance of two recently proposed minimum variance (MV) beamforming approaches for binaural hearing aids. The binaural linearly constrained MV (BLCMV) beamformer applies linear constraints to maintain the target source and mitigate the interfering sources, taking into account the reverberant nature of sound propagation. The inequality constrained MV (ICMV) beamformer applies inequality constraints to maintain the target source and mitigate the interfering sources, utilizing estimates of the directions of arrival (DOAs) of the target and interfering sources. The similarities and differences between these two approaches are discussed, and the performance of both algorithms is evaluated using simulated data and real-world recordings, particularly focusing on the robustness to estimation errors of the relative transfer functions (RTFs) and DOAs. The BLCMV achieves good performance if the RTFs are accurately estimated, while the ICMV shows good robustness to DOA estimation errors.
O. Schwartz, S. Braun, S. Gannot, and E. A. Habets,
"Source separation, dereverberation and noise reduction using LCMV beamformer and postfilter",
in The 13th International Conference on Latent Variable Analysis and Signal Separation (LVA-ICA), Grenoble, France, Feb. 2017. The problem of source separation, dereverberation and noise reduction using a microphone array is addressed in this paper. The observed speech is modeled by two components, namely the early speech (including the direct path and some early reflections) and the late reverberation. The minimum mean square error (MMSE) estimator of the early speech components of the various speakers is derived, which jointly suppresses the noise and the overall reverberation from all speakers. The overall time-varying level of the reverberation is estimated using two different estimators, an estimator based on a temporal model and an estimator based on a spatial model. The experimental study consists of measured acoustic transfer functions (ATFs) and directional noise with various signal-to-noise ratio levels. The separation, dereverberation and noise reduction performance is examined in terms of perceptual evaluation of speech quality (PESQ) and signal-to-interference plus noise ratio improvement.
B. Laufer-Goldshtein, R. Talmon, and S. Gannot,
"Speaker tracking on multiple-manifolds with distributed microphones",
in The 13th International Conference on Latent Variable Analysis and Signal Separation (LVA-ICA), Grenoble, France, Feb. 2017.
Speaker tracking in a reverberant enclosure with an ad hoc network of multiple distributed microphones is addressed in this paper. A set of prerecorded measurements in the enclosure of interest is used to construct a data-driven statistical model. The function mapping the measurement-based features to the corresponding source position represents complex unknown relations, hence it is modelled as a random Gaussian process. The process is defined by a covariance function which encapsulates the relations among the available measurements and the different views presented by the distributed microphones. This model is intertwined with a Kalman filter to capture both the smoothness of the source movement in the time-domain and the smoothness with respect to patterns identified in the set of available prerecorded measurements. Simulation results demonstrate the ability of the proposed method to localize a moving source in reverberant conditions.
2016
Y. Dorfan, O. Schwartz, B. Schwartz, E. A. Habets, and S. Gannot,
"Multiple DOA estimation and blind source separation using expectation-maximization algorithm",
in International conference on the science of electrical engineering (ICSEE), Eilat, Israel, Nov. 2016. A blind source separation technique in a noisy environment is proposed, based on spectral masking and the minimum variance distortionless response (MVDR) beamformer (BF). Formulating the maximum likelihood estimation of the directions of arrival (DOAs) and solving it using expectation-maximization enables the extraction of the masks and the associated MVDR BF as byproducts. The proposed DOA estimator uses an explicit model of the ambient noise, which results in more accurate DOA estimates and good blind source separation. The experimental study demonstrates both the DOA estimation results and the separation capabilities of the proposed method, using real room impulse responses in a diffuse noise field.
E. Hadad, D. Marquardt, S. Doclo, and S. Gannot,
"Comparison of binaural multichannel Wiener filters with binaural cue preservation of the interfering source",
in International conference on the science of electrical engineering (ICSEE), Eilat, Israel, Nov. 2016. An important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of the sources, in addition to noise reduction. The binaural multichannel Wiener filter (MWF) preserves the binaural cues of the target but distorts the noise binaural cues. To optimally benefit from binaural unmasking and to preserve the spatial impression for the hearing aid user, two extensions of the binaural MWF have therefore been proposed, namely, the MWF with partial noise estimation (MWF-N) and MWF with interference reduction (MWF-IR). In this paper, the binaural cue preservation of these extensions is analyzed theoretically. Although both extensions are aimed at incorporating the binaural cue preservation of the interferer in the binaural MWF cost function, their properties are different. For the MWF-N, while the binaural cues of the target are preserved, there is a tradeoff between the noise reduction and the preservation of the binaural cues of the interferer component. For the MWF-IR, while the binaural cues of the interferer are preserved, those of the target may be slightly distorted. The theoretical results are validated by simulations using binaural hearing aids, demonstrating the capabilities of these beamformers in a reverberant environment.
D. Y. Levin and S. Gannot,
"A statistical model for room impulse responses encompassing early and late reflections",
in International conference on the science of electrical engineering (ICSEE), Eilat, Israel, Nov. 2016.
B. Laufer-Goldshtein, R. Talmon, and S. Gannot,
"A real life experimental study on semi-supervised source localization based on manifold regularization",
in International conference on the science of electrical engineering (ICSEE), Eilat, Israel, Nov. 2016. Recently, we have presented a semi-supervised approach for sound source localization based on manifold regularization. The idea is to estimate the function that maps each relative transfer function (RTF) to its corresponding position. The estimation is based on an optimization problem which takes into consideration the geometric structure of the RTF samples, empirically deduced from prerecorded training measurements. The solution is appropriately constrained to be smooth, meaning that similar RTFs are mapped to close positions. In this paper, we conduct a comprehensive experimental study with real-life recordings to examine the algorithm's performance in actual noisy and reverberant conditions. The influence of the amount of training data, as well as of changes in the environmental conditions, is also examined. We show that the algorithm attains accurate localization in such challenging conditions.
A. Barnov, A. Cohen, M. Agmon, V. B. Bracha, S. Markovich-Golan, and S. Gannot,
"A dynamic TF-GSC beamformer for distributed arrays with dual-resolution speech-presence-probability estimators",
in International conference on the science of electrical engineering (ICSEE), Eilat, Israel, Nov. 2016. The problem of speech enhancement using a distributed microphones array in a dynamic scenario where speaker, noise and microphone arrays are free to move is considered. The transfer function generalized sidelobe canceler (TF-GSC) spatial filter [1] which optimizes the minimum variance distortionless response (MVDR) criterion is used for enhancing the desired speech signal. A novel speech presence probability (SPP) estimator is proposed based on [2]. By using a dual-resolution SPP, the proposed estimator is able to detect noise dominant frequencies during speech, and thus improve noise tracking capability. We test the proposed algorithm in real dynamic scenarios, and demonstrate its consistent signal to noise ratio (SNR) improvement using a distributed microphone array consisting of 2 devices and 4 microphones.
S. E. Chazan, S. Gannot, and J. Goldberger,
"A phoneme-based pre-training approach for deep neural network with application to speech enhancement",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China, Sep. 2016. In this study, we present a new phoneme-based deep neural network (DNN) framework for single microphone speech enhancement. While most speech enhancement algorithms overlook the phoneme structure of the speech signal, our proposed framework comprises a set of phoneme-specific DNNs (pDNNs), one for each phoneme, together with an additional phoneme-classification DNN (cDNN). The cDNN is responsible for determining the posterior probability that a specific phoneme was uttered. Concurrently, each of the pDNNs estimates a phoneme-specific speech presence probability (pSPP). The speech presence probability (SPP) is then calculated as a weighted average of the phoneme-specific pSPPs, with the weights determined by the posterior phoneme probabilities. A soft spectral attenuation, based on the SPP, is then applied to enhance the noisy speech signal. We further propose a compound training procedure, where each pDNN is first pre-trained using the phoneme labeling and the cDNN is trained to classify phonemes. Since these labels are unavailable in the test phase, the entire network is then trained using the noisy utterance, with the cDNN providing the phoneme classification. A series of experiments with different noise types verifies the applicability of the new algorithm to the task of speech enhancement. Moreover, the proposed scheme outperforms other schemes that either do not consider the phoneme structure or use a simpler training methodology.
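The weighted-average combination described above reduces to a mixture-of-experts sum over phonemes; a minimal sketch (array shapes and the gain floor are assumptions):

```python
import numpy as np

def combined_spp(phoneme_post, pspp):
    """SPP as a phoneme-posterior-weighted average of phoneme-specific SPPs.
    phoneme_post: (T, K) cDNN posteriors; pspp: (T, K, F) pDNN outputs."""
    return np.einsum('tk,tkf->tf', phoneme_post, pspp)     # (T, F) SPP

def soft_attenuation(noisy_mag, spp, g_min=0.1):
    """Soft spectral attenuation driven by the SPP (the floor g_min is illustrative)."""
    return np.maximum(spp, g_min) * noisy_mag
```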
O. Schwartz, Y. Dorfan, E. A. P. Habets, and S. Gannot,
"Multi-speaker doa estimation in reverberation conditions using expectation-maximization",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China, Sep. 2016. A novel direction of arrival (DOA) estimator for concurrent speakers in a reverberant environment is presented. Reverberation, if not properly addressed, is known to degrade the performance of DOA estimators. In our contribution, the DOA estimation task is formulated as a maximum likelihood (ML) problem, which is solved using the expectation-maximization (EM) procedure. The received microphone signals are modelled as a sum of anechoic and reverberant components. The reverberant components are modelled by a time-invariant coherence matrix multiplied by a time-varying reverberation power spectral density (PSD). The PSDs of the anechoic speech and reverberant components are estimated as part of the EM procedure. It is shown that the DOA estimates obtained by the proposed algorithm are less affected by reverberation than those of competing algorithms that ignore the reverberation. An experimental study demonstrates the benefit of the presented algorithm in a reverberant environment using measured room impulse responses (RIRs).
S. Markovich-Golan, D. Y. Levin, and S. Gannot,
"Performance analysis of a dual microphone superdirective beamformer and approximate expressions for the near-field propagation regime",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China, Sep. 2016. A linear array of sensors with small spacing (compared to the wavelength) can be processed with superdirective beamforming. Specifically, when applying minimum variance distortionless response (MVDR) weights designed for a diffuse noise field, high gains are attainable in theory. A classical result relating to the far-field regime states that the gain with respect to diffuse noise (i.e., the directivity factor) for a source in the endfire direction may approach the number of sensors squared (N^2). However, as the wavelength increases, the beamformer encounters increasingly severe robustness issues. Results pertaining to the near-field regime are less well known. In this paper we analyze MVDR beamforming in a generic dual-microphone array scenario. Our analysis is not restricted to the far-field regime. We derive precise expressions for the directivity factor and the white-noise gain, as well as simplified approximations for the near- and far-field regimes. We show that in the near-field regime the directivity factor approaches infinity as the wavelength increases, and that the white-noise gain depends only on the ratio of the distance from the source to the distance between the sensors. These properties of the beamformer (BF) behave differently than in the far-field regime.
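A numeric far-field sketch of the classical endfire result quoted above: as the spacing shrinks, the directivity factor of a two-microphone MVDR beamformer in diffuse noise approaches N^2 = 4 while the white-noise gain collapses. The paper's near-field expressions are not reproduced here:

```python
import numpy as np

def superdirective_metrics(f, d_mic, c=343.0, theta=0.0):
    """Directivity factor (DF) and white-noise gain (WNG) of a 2-mic MVDR
    beamformer in a diffuse noise field, far-field steering (theta = 0: endfire)."""
    tau = d_mic * np.cos(theta) / c                         # inter-mic delay
    d = np.array([1.0, np.exp(-2j * np.pi * f * tau)])      # steering vector
    coh = np.sinc(2 * f * d_mic / c)                        # diffuse-field coherence
    Gamma = np.array([[1.0, coh], [coh, 1.0]])
    Gi_d = np.linalg.solve(Gamma + 1e-9 * np.eye(2), d)     # regularized Gamma^{-1} d
    DF = float(np.real(d.conj() @ Gi_d))                    # DF = d^H Gamma^{-1} d
    w = Gi_d / (d.conj() @ Gi_d)                            # MVDR weights
    WNG = float(1.0 / np.real(w.conj() @ w))
    return DF, WNG

for f in (200.0, 1000.0, 4000.0):
    print(f, superdirective_metrics(f, d_mic=0.01))         # DF -> 4, tiny WNG at low f
```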
X. Li, R. Horaud, L. Girin, and S. Gannot,
"Voice activity detection based on statistical likelihood ratio with adaptive thresholding",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China, Sep. 2016. The statistical likelihood ratio test is a widely used voice activity detection (VAD) method, in which the likelihood ratio of the current temporal frame is compared with a threshold. A fixed threshold is typically used, but no single value is suitable for all types of noise. In this paper, an adaptive threshold is proposed as a function of the local statistics of the likelihood ratio. This threshold represents an upper bound of the likelihood ratio for the non-speech frames, whereas it remains generally lower than the likelihood ratio for the speech frames. As a result, a high non-speech hit rate can be achieved, while keeping the speech hit rate as high as possible.
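One plausible reading of such an adaptive threshold, sketched on a precomputed per-frame likelihood-ratio sequence; the recursive statistics and the mean-plus-scaled-deviation bound are assumptions, not the paper's exact statistic:

```python
import numpy as np

def adaptive_threshold_vad(lr, alpha=0.99, beta=3.0, init=10):
    """lr: (T,) per-frame likelihood ratios. The threshold tracks mean +
    beta * std of the LR over recent low-LR frames, approximating an upper
    bound of the non-speech LR statistics."""
    mean, var = np.mean(lr[:init]), np.var(lr[:init])
    vad = np.zeros(len(lr), dtype=bool)
    for t, x in enumerate(lr):
        thr = mean + beta * np.sqrt(var)
        vad[t] = x > thr
        if not vad[t]:                  # update local statistics on non-speech only
            mean = alpha * mean + (1 - alpha) * x
            var = alpha * var + (1 - alpha) * (x - mean) ** 2
    return vad
```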
S. Braun, B. Schwartz, S. Gannot, and E. A.P. Habets,
"Late reverberation PSD estimation for single-channel dereverberation using relative convolutive transfer functions",
in International Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China, Sep. 2016. The estimation accuracy of the late reverberation power spectral density (PSD) is of paramount importance in single-channel frequency-domain dereverberation algorithms. In this domain, the reverberant signal can be modeled by the convolution of an early speech component and a relative convolutive transfer function (RCTF). In this work, the RCTF coefficients are modeled by a first-order Markov chain, which is well-suited to model time-varying scenarios. The RCTF coefficients are estimated online by a Kalman filter and are then used to compute the late reverberation PSD, which is used in a spectral enhancement filter to achieve dereverberation and noise reduction. It is shown that the proposed reverberation PSD estimator yields similar performance to other estimators, which impose a model on the reverberant tail and which depend on additional information like the reverberation time and the direct-to-reverberation ratio.
Y. Dorfan, C. Evers, S. Gannot, and P. A. Naylor,
"Speaker localization with moving microphone arrays",
in The 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, Aug. 2016. Speaker localization algorithms often assume a static location for all sensors. This assumption simplifies the models used, since all acoustic transfer functions are then linear time-invariant. In many applications this assumption is not valid. In this paper we address the localization challenge with moving microphone arrays. We propose two algorithms to find the speaker position. The first is a batch algorithm based on the maximum likelihood criterion, optimized via expectation-maximization iterations. The second is a particle filter for sequential Bayesian estimation. The performance of both approaches is evaluated and compared on simulated reverberant audio data from a microphone array with two sensors.
Y. Biderman, B. Rafaely, S. Gannot, and S. Doclo,
"Efficient relative transfer function estimation framework in the spherical harmonics domain",
in The 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, Aug. 2016. In acoustic conditions with reverberation and coherent sources, various spatial filtering techniques, such as the linearly constrained minimum variance (LCMV) beamformer, require accurate estimates of the relative transfer functions (RTFs) between the sensors with respect to the desired speech source. However, the time-domain support of these RTFs may affect the estimation accuracy in several ways. First, short RTFs justify the multiplicative transfer function (MTF) assumption when the length of the signal time frames is limited. Second, they require fewer parameters to be estimated, hence reducing the effect of noise and model errors. In this paper, a spherical microphone array based framework for RTF estimation is presented, where the signals are transformed to the spherical harmonics (SH)-domain. The RTF time-domain supports are studied under different acoustic conditions, showing that SH-domain RTFs are shorter compared to conventional space-domain RTFs.
O. Shwartz, S. Gannot, and E. Habets,
"Joint estimation of late reverberant and speech power spectral densities in noisy environments using frobenius norm",
in The 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, Aug. 2016. Various dereverberation and noise reduction algorithms require power spectral density estimates of the anechoic speech, reverberation, and noise. In this work, we derive a novel multichannel estimator for the power spectral densities (PSDs) of the reverberation and the speech that is also suitable for noisy environments. The speech and reverberation PSDs are estimated from all the entries of the received signals' PSD matrix. The Frobenius norm of a general error matrix is minimized to find the best-fitting PSDs. Experimental results show that the proposed estimator provides accurate estimates of the PSDs and outperforms competing estimators. Moreover, when used in a multi-microphone noise reduction and dereverberation algorithm, the estimated reverberation and speech PSDs are shown to provide improved performance measures as compared with the competing estimators.
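Minimizing the Frobenius norm over all entries of the observed PSD matrix can be posed as a small linear least-squares problem; a sketch assuming a known steering/RTF vector a, reverberation coherence matrix Gamma, and noise PSD matrix Rv:

```python
import numpy as np

def fit_psds_frobenius(Ry, a, Gamma, Rv):
    """Fit phi_s, phi_r minimizing || Ry - (phi_s a a^H + phi_r Gamma + Rv) ||_F.
    Ry, Gamma, Rv: (M, M) Hermitian matrices; a: (M,) steering/RTF vector."""
    A = np.stack([np.outer(a, a.conj()).ravel(), Gamma.ravel()], axis=1)  # (M^2, 2)
    b = (Ry - Rv).ravel()
    # Real parameters with complex data: stack real and imaginary parts
    Ar = np.concatenate([A.real, A.imag])
    br = np.concatenate([b.real, b.imag])
    phi, *_ = np.linalg.lstsq(Ar, br, rcond=None)
    return np.maximum(phi, 0.0)          # [phi_s, phi_r], clipped to valid PSDs
```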
E. Hadad, S. Doclo, and S. Gannot,
"A generalized binaural MVDR beamformer with interferer relative transfer function preservation",
in The 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, Aug. 2016. In addition to interference and noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of both the target and the undesired sound sources. For directional sources, this can be achieved by preserving the relative transfer function (RTF). The recently proposed binaural minimum variance distortionless response (BMVDR) beamformer preserves the RTF of the target, but typically distorts the RTF of the interfering sources. Recently, two extensions of the BMVDR beamformer were proposed that preserve the RTFs of both the target and the interferer, namely, the binaural linearly constrained minimum variance (BLCMV) and the BMVDR-RTF beamformers. In this paper, we generalize the BMVDR-RTF to trade off interference reduction and noise reduction. Three special cases of the proposed beamformer are examined, maximizing either the signal-to-interference-and-noise ratio (SINR), the signal-to-noise ratio (SNR), or the signal-to-interference ratio (SIR). Experiments in an office environment validate our theoretical results.
A. Plinge and S. Gannot,
"Multi-microphone speech enhancement informed by auditory scene analysis",
in IEEE 9th Sensor Array and Multichannel Signal Processing Workshop (SAM), Rio de Janeiro, Brazil, Jul. 2016. A multitude of multi-microphone speech enhancement methods is available. In this paper, we focus our attention on the well-known minimum variance distortionless response (MVDR) beamformer, due to its ability to preserve a distortionless response towards the desired speaker while minimizing the output noise power. We explore two alternatives for constructing the steering vectors towards the desired speech source: one uses only the direct path of the speech propagation, in the form of delay-only filters, while the other uses the entire room impulse response (RIR). All beamforming methods require some control information to accomplish the task of enhancing a desired speech signal. In this paper, an acoustic event detection method using biologically-inspired features is employed. It can interpret the auditory scene by detecting the presence of different auditory objects, and is used to control the estimation procedures of the beamformer. The resulting system provides a blind method of speech enhancement that can improve intelligibility independently of any additional information. Experiments with real recordings show the practical applicability of the method, with a significant gain in frequency-weighted segmental SNR (fwSNRseg). Compared to using the direct path only, the use of the entire RIR proves beneficial.
O. Schwartz, S. Gannot, and E. A. Habets,
"Joint maximum likelihood estimation of late reverberant and speech power spectral density in noisy environments",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Shanghai, China, Mar. 2016. An estimate of the power spectral density (PSD) of the late reverberation is often required by dereverberation algorithms. In this work, we derive a novel multichannel maximum likelihood (ML) estimator for the PSD of the reverberation that can be applied in noisy environments. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. As a closed-form solution for the maximum likelihood estimator is unavailable, a Newton method for maximizing the ML criterion is derived. Experimental results show that the proposed estimator provides an accurate estimate of the PSD, and outperforms competing estimators. Moreover, when used in a multi-microphone dereverberation and noise reduction algorithm, the best performance in terms of the log-spectral distance is achieved when employing the proposed PSD estimator.
X. Li, L. Girin, R. Horaud, and S. Gannot,
"Noise power spectral density estimation based on regional statistics",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Shanghai, China, Mar. 2016.
E. Hadad, D. Marquardt, S. Doclo, and S. Gannot,
"Extensions of the binaural mwf with interference reduction preserving the binaural cues of the interfering source",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Shanghai, China, Mar. 2016. Recently, an extension of the binaural multichannel Wiener filter (BMWF), referred to as BMWF-IRo, was presented, in which an interference rejection constraint was added to the BMWF cost function. Although the BMWF-IRo aims to entirely suppress the interfering source, residual interfering sources (as well as unconstrained noise sources) are undesirably perceived as impinging on the array from the desired source direction. In this paper, we propose two extensions of the BMWF-IRo that address this issue by preserving the spatial impression of the interfering source. In the first extension, the binaural cues of the interfering source are preserved, while those of the desired source may be slightly distorted. In the second extension, the binaural cues of both the desired and interfering sources are preserved. Simulation results show that the noise reduction performance of both proposed extensions is comparable to that of the BMWF-IRo.
D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud,
"An inverse-Gamma source variance prior with factorized parameterization for audio source separation",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Shanghai, China, Mar. 2016. In this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application to multichannel audio source separation (MASS). The source signal is modeled with the local Gaussian model (LGM) and we propose to model its variance with an inverse-Gamma distribution, whose scale parameter is factorized as a rank-1 model. We discuss the interest of this approach and evaluate it in a MASS task with underdetermined convolutive mixtures. For this aim, we derive a variational EM algorithm for parameter estimation and source inference. The proposed model shows a benefit in source separation performance compared to a state-of-the-art LGM NMF-based technique.
D. Marquardt, E. Hadad, S. Gannot, and S. Doclo,
"Incorporating relative transfer function preservation into the binaural multi-channel wiener lter",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Shanghai, China, Mar. 2016. Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and an interfering source, e.g., competing speaker, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering source. To this end, in this paper we propose an extension of the binaural MWF, i.e. the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source. Analytical expressions for the performance of the binaural MWF and the MWF-RTF in terms of noise reduction and binaural cue preservation are derived, using which their performance is thoroughly compared. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions, showing that the MWF-RTF yields a better performance than the binaural MWF in terms of the signal-to-interference ratio and binaural cue preservation of the interfering source, while the overall noise reduction performance is slightly degraded.
B. Laufer-Goldshtein, R. Talmon, and S. Gannot,
"Manifold-based Bayesian inference for semi-supervised source localization",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Shanghai, China, Mar. 2016. Sound source localization is addressed by a novel Bayesian approach using a data-driven geometric model. The goal is to recover the target function that attaches each acoustic sample, formed by the measured signals, with its corresponding position. The estimation is derived by maximizing the posterior probability of the target function, computed on the basis of acoustic samples from known locations (labelled data) as well as acoustic samples from unknown locations (unlabelled data). To form the posterior probability we use a manifold-based prior, which relies on the geometric structure of the manifold from which the acoustic samples are drawn. The proposed method is shown to be analogous to a recently presented semi-supervised localization approach based on manifold regularization. Simulation results demonstrate the robustness of the method in noisy and reverberant environments.
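A bare-bones Gaussian-process regression sketch of the mapping from acoustic samples to positions; the Gaussian kernel and its hyperparameters are illustrative, and the paper's manifold-based prior over labelled and unlabelled data is not reproduced:

```python
import numpy as np

def gp_localize(feats_train, pos_train, feat_test, sigma_n=1e-2, ell=1.0):
    """feats_train: (N, D) acoustic features from known positions pos_train (N, 2);
    feat_test: (D,) feature of the sample to localize. Returns the posterior mean."""
    def kern(X, Y):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell**2)
    K = kern(feats_train, feats_train) + sigma_n * np.eye(len(feats_train))
    k_star = kern(feat_test[None, :], feats_train)       # (1, N)
    return (k_star @ np.linalg.solve(K, pos_train))[0]   # posterior-mean position
```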
2015
W. S. Woods, E. Hadad, I. Merks, B. Xu, S. Gannot, and T. Zhang,
"A real-world recording database for ad hoc microphone arrays",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2015. We report on a recently-recorded database for use in processing of ad hoc microphone constellations. Twenty-four microphones were positioned in various locations at a central table in a large room, and their outputs were recorded while 4 target talkers at the table both read from a list of sentences in a constrained way and also maintained a natural conversation for several minutes. This was done in the quiet and in the presence of 8, 24, and 56 other simultaneous talkers surrounding the central table at various distances. We also recorded without the 4 target talkers active in each of these conditions, and used a loudspeaker to measure impulse responses to the microphones from various positions in the room. We provide details of the recording setup and demonstrate use of this database via an application of linearly constrained minimum variance beam-forming. The database will become available to researchers in the field.
B. Schwartz, S. Gannot, and E. A. Habets,
"An online dereverberation algorithm for hearing aids with binaural cues preservation",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2015. A dereverberation method for a single speaker in binaural hearing aids is proposed. Thanks to binaural cues, listeners are capable of localizing sound sources even in reverberant enclosures. Since dereverberation algorithms aim to reduce the sound reflections, they alter the binaural cues of the reverberant signal. A recently proposed algorithm estimates both the early speech component and the room impulse response (RIR) in an online fashion. In this paper, we develop a binaural extension of this algorithm which enables a tradeoff between the amount of dereverberation and the preservation of the binaural cues of the reverberant signal. The method is tested using a database of binaural RIRs at different reverberation levels and source-listener distances. It is shown that the proposed method enables a tradeoff between improvement in the frequency-weighted signal-to-noise ratio (WSNR) scores and the preservation of the cues.
O. Schwartz, S. Braun, S. Gannot, and E. A. Habets,
"Maximum likelihood estimation of the late reverberant power spectral density in noisy environments",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2015. An estimate of the power spectral density (PSD) of the late reverberation is often required by dereverberation algorithms. In this work, we derive a novel multichannel maximum likelihood (ML) estimator for the PSD of the reverberation that can be applied in noisy environments. The direct path is first blocked by a blocking matrix and the output is considered as the observed data. Then, the ML criterion for estimating the reverberation PSD is stated. As a closed-form solution for the maximum likelihood estimator (MLE) is unavailable, a Newton method for maximizing the ML criterion is derived. Experimental results show that the proposed estimator provides an accurate estimate of the PSD and outperforms competing estimators. Moreover, when used in a multi-microphone noise reduction and dereverberation algorithm, the estimated reverberation PSD is shown to provide improved performance measures as compared with the competing estimators.
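To make the Newton search concrete, here is a hedged numerical sketch. The blocked-output model R(phi) = phi * Phi + R_n, with Phi a known spatial coherence matrix of the reverberation and R_n the noise covariance, is an illustrative assumption rather than the paper's exact formulation:

```python
import numpy as np

def ml_reverb_psd(S_hat, Phi, R_n, phi0=1.0, iters=15):
    """Newton search for the reverberation PSD phi in one TF bin,
    maximizing the complex-Gaussian log-likelihood
        L(phi) = -log det R(phi) - tr(R(phi)^{-1} S_hat),
    where S_hat is the sample covariance of the blocked output and
    R(phi) = phi * Phi + R_n (assumed model). Sketch only: no
    safeguards against divergence far from the optimum."""
    phi = phi0
    for _ in range(iters):
        R_inv = np.linalg.inv(phi * Phi + R_n)
        A = R_inv @ Phi                              # recurring product
        g = (-np.trace(A) + np.trace(A @ R_inv @ S_hat)).real   # dL/dphi
        h = (np.trace(A @ A)
             - 2 * np.trace(A @ A @ R_inv @ S_hat)).real        # d2L/dphi2
        phi = max(phi - g / h, 1e-12)                # keep the PSD positive
    return phi
```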
D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. P. Horaud,
"A variational EM algorithm for the separation of moving sound sources",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2015,
best student paper award. This paper addresses the problem of separation of moving sound sources. We propose a probabilistic framework based on the complex Gaussian model combined with non-negative matrix factorization. The properties associated with moving sources are modeled using time-varying mixing filters described by a stochastic temporal process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the mixing filters. The sound sources are separated by means of Wiener filters, built from the estimators provided by the proposed VEM algorithm. Preliminary experiments with simulated data show that, while for static sources we obtain results comparable with the baseline method [1], in the case of moving sources our method outperforms a piece-wise version of the baseline method.
Y. Dorfan, D. Cherkassky, and S. Gannot,
"Speaker localization and separation using distributed expectation-maximization",
in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, Aug. 2015. A network of microphone pairs is utilized for the joint task of localizing and separating multiple concurrent speakers. The recently presented incremental distributed expectation-maximization (IDEM) algorithm addresses the first task, namely detection and localization. Here we extend this algorithm to address the second task, namely blindly separating the speech sources. We show that the proposed algorithm, denoted distributed algorithm for localization and separation (DALAS), is capable of separating speakers in a reverberant enclosure without a priori information on their number and locations. In the first stage of the proposed algorithm, the IDEM algorithm is applied to blindly detect the active sources and estimate their locations. In the second stage, the location estimates are utilized for selecting the most useful node of microphones for the subsequent separation stage. Separation is finally obtained by utilizing the hidden variables of the IDEM algorithm to construct masks for each source in the relevant node.
A. Deleforge, S. Gannot, and W. Kellermann,
"Towards a generalization of relative transfer functions to more than one source",
in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, Aug. 2015. We propose a natural way to generalize relative transfer functions (RTFs) to more than one source. We first prove that such a generalization is not possible using a single multichannel spectro-temporal observation, regardless of the number of microphones. We then introduce a new transform for multichannel multi-frame spectrograms, i.e., containing several channels and time frames in each time-frequency bin. This transform allows a natural generalization which satisfies the three key properties of RTFs, namely, they can be directly estimated from observed signals, they capture spatial properties of the sources and they do not depend on emitted signals. Through simulated experiments, we show how this new method can localize multiple simultaneously active sound sources using short spectro-temporal windows, without relying on source separation.
X. Li, R. P. Horaud, L. Girin, and S. Gannot,
"Local relative transfer function for sound source localization",
in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, Aug. 2015. The relative transfer function (RTF), i.e. the ratio of acoustic transfer functions between two sensors, can be used for sound source localization / beamforming based on a microphone array. The RTF is usually defined with respect to a unique reference sensor. Choosing the reference sensor may be a difficult task, especially for dynamic acoustic environments and setups. In this paper we propose to use a locally normalized RTF, in short local-RTF, as an acoustic feature to characterize the source direction. The local-RTF takes a neighboring sensor as the reference channel for a given sensor. The estimated local-RTF vector can thus avoid the adverse effects of a noisy unique reference and has a smaller estimation error than conventional RTF estimators. We propose two estimators for the local-RTF and concatenate the values across sensors and frequencies to form a high-dimensional vector which is utilized for source localization. Experiments with real-world signals show the merit of this approach.
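As an illustration only, a hedged sketch of how a local-RTF feature of this flavor could be computed; the paper proposes two specific estimators, and the simple Welch/cross-PSD ratio below is a stand-in, not the paper's method:

```python
import numpy as np
from scipy.signal import csd, welch

def local_rtf_features(y, fs=16000, nperseg=512):
    """y: (M, T) array of microphone signals. Each sensor m uses its
    neighbor m-1 as the reference channel, instead of one global
    reference. Returns the concatenated feature vector."""
    rtfs = []
    for m in range(1, y.shape[0]):
        # scipy's csd(x, z) averages conj(X)*Z, so csd(ref, mic) / welch(ref)
        # estimates the relative transfer function from sensor m-1 to m
        _, S_xm = csd(y[m - 1], y[m], fs=fs, nperseg=nperseg)
        _, S_xx = welch(y[m - 1], fs=fs, nperseg=nperseg)
        rtfs.append(S_xm / S_xx)
    # concatenate across sensors and frequencies (high-dimensional feature)
    return np.concatenate(rtfs)
```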
D. Cherkassky, S. Markovich-Golan, and S. Gannot,
"Performance analysis of MVDR beamformer in WASN with sampling rate osets and blind synchronization",
in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, Aug. 2015. In wireless acoustic sensor networks (WASNs), sampling rate offsets (SROs) between nodes are inevitable, and are recognized as one of the challenges that have to be resolved for coherent array processing. A simplified free-space propagation model is considered, with a single desired source impinging on the WASN from the far-field, contaminated by diffuse noise. In this paper, we analyze the theoretical performance of a fixed superdirective beamformer (SDBF) in the presence of SROs. The SDBF performance loss due to SROs is manifested as a distortion of the nominal beampattern and an excess noise power at the output of the beamformer. We also propose an iterative algorithm for SRO estimation. The theoretical results are validated by simulation.
B. Laufer, R. Talmon, and S. Gannot,
"A study on manifolds of acoustic responses",
in Latent Variable Analysis and Independent Component Analysis (LVA ICA), Liberec, Czech Republic, Aug. 2015. The construction of a meaningful metric between acoustic responses, one which respects the source locations, is addressed. By comparing three alternative distance measures, we verify the existence of the acoustic manifold and give an insight into its nonlinear structure. From such a geometric viewpoint, we demonstrate the limitations of linear approaches to infer physical adjacencies. Instead, we introduce the diffusion framework, which combines local and global processing in order to find an intrinsic nonlinear embedding of the data on a low-dimensional manifold. We present the diffusion distance, which is related to the geodesic distance on the manifold. In particular, simulation results demonstrate the ability of the diffusion distance to organize the samples according to the source direction of arrival (DOA).
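As a rough sketch of the diffusion framework referred to here (the textbook construction, with kernel width eps and diffusion time t as free parameters; not necessarily the paper's exact pipeline):

```python
import numpy as np

def diffusion_embedding(X, eps, t=1, dim=2):
    """X: (N, D) matrix of acoustic-response features, one row per sample.
    Builds a Gaussian affinity, normalizes it into a Markov matrix, and
    embeds the samples so that Euclidean distances in the embedding
    approximate diffusion distances on the underlying manifold."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dist.
    W = np.exp(-d2 / eps)                                # affinity kernel
    P = W / W.sum(axis=1, keepdims=True)                 # row-stochastic
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # drop the trivial constant eigenvector (eigenvalue 1)
    return (vals[1:dim + 1] ** t) * vecs[:, 1:dim + 1]
```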
X. Li, L. Girin, R. Horaud, and S. Gannot,
"Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015. This paper addresses the problem of relative transfer function (RTF) estimation in the presence of stationary noise. We propose an RTF identification method based on segmental power spectral density (PSD) matrix subtraction. First, the multichannel microphone signals are divided into segments corresponding to speech-plus-noise activity and noise-only. Then, the subtraction of the two segmental PSD matrices leads to an almost noise-free PSD matrix, reducing the stationary noise component and preserving the non-stationary speech component. This noise-free PSD matrix is used for single-speaker RTF identification by eigenvalue decomposition. Experiments are performed in the context of sound source localization to evaluate the efficiency of the proposed method.
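A hedged sketch of the described pipeline, assuming the speech-plus-noise and noise-only segments are already given (STFT parameters and averaging are illustrative choices):

```python
import numpy as np
from scipy.signal import stft

def rtf_by_psd_subtraction(y_sn, y_n, fs, ref=0, nperseg=512):
    """y_sn: (M, T) speech-plus-noise segment, y_n: (M, T') noise-only
    segment. Subtracting the averaged spatial PSD matrices cancels the
    stationary noise; the dominant eigenvector of the (almost) noise-free
    matrix then yields the RTF, normalized at the reference microphone."""
    def psd_matrices(y):
        _, _, Y = stft(y, fs=fs, nperseg=nperseg)        # (M, F, L)
        return np.einsum('mfl,nfl->fmn', Y, Y.conj()) / Y.shape[-1]

    D = psd_matrices(y_sn) - psd_matrices(y_n)           # (F, M, M)
    rtf = np.empty(D.shape[:2], dtype=complex)
    for f in range(D.shape[0]):
        Df = 0.5 * (D[f] + D[f].conj().T)                # re-hermitize
        _, V = np.linalg.eigh(Df)
        a = V[:, -1]                                     # dominant eigvec
        rtf[f] = a / a[ref]
    return rtf                                           # (F, M)
```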
S. Markovich-Golan and S. Gannot,
"Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015. Microphone array processing utilizes the spatial separation between the desired speaker and the interference signal for speech enhancement. The transfer functions (TFs) relating the speaker component at a reference microphone with all other microphones, denoted as the relative TFs (RTFs), play an important role in beamforming design criteria such as the minimum variance distortionless response (MVDR) and the speech distortion weighted multichannel Wiener filter (SDW-MWF). Two common methods for estimating the RTF are surveyed here, namely, the covariance subtraction (CS) and the covariance whitening (CW) methods. We analyze the performance of the CS method theoretically and empirically validate the results of the analysis through extensive simulations. Furthermore, an empirical comparison of the two methods in various scenarios shows that the CW method outperforms the CS method.
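For orientation, hedged one-bin sketches of the two surveyed estimators as they are commonly stated, given a noisy-input covariance R_y and a noise-only covariance R_v (regularization and estimation details omitted):

```python
import numpy as np

def rtf_cs(R_y, R_v, ref=0):
    """Covariance subtraction: dominant eigenvector of R_y - R_v,
    normalized at the reference microphone."""
    _, V = np.linalg.eigh(R_y - R_v)
    a = V[:, -1]
    return a / a[ref]

def rtf_cw(R_y, R_v, ref=0):
    """Covariance whitening: whiten R_y with a square root of R_v,
    take the dominant eigenvector, de-whiten, then normalize."""
    L = np.linalg.cholesky(R_v)                 # R_v = L L^H
    Li = np.linalg.inv(L)
    _, V = np.linalg.eigh(Li @ R_y @ Li.conj().T)
    a = L @ V[:, -1]                            # de-whitening
    return a / a[ref]
```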
O. Schwartz, S. Gannot, and E. A. P. Habets,
"Nested generalized sidelobe canceller for joint dereverberation and noise reduction",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015. Speech signals are often contaminated by both room reverberation and ambient noise. In this contribution, we propose a nested generalized sidelobe canceller (GSC) beamforming structure, comprising inner and outer GSC beamformers (BFs), that decouples the speech dereverberation and noise reduction operations. The BFs are implemented in the short-time Fourier transform (STFT) domain. Two alternative reverberation models are adopted. In the first, used in the inner GSC, reverberation is assumed to comprise a coherent early component and a late reverberant component. In the second, used in the outer GSC, the influence of the entire acoustic transfer function (ATF) is modeled as a convolution along the frame index in each frequency. Unlike other BF designs for this problem that must be updated in each time-frame, the proposed BF is time-invariant in static scenarios. Experiments in both simulated and recorded environments verify the effectiveness of the proposed structure.
E. Hadad, D. Marquardt, S. Doclo, and S. Gannot,
"Binaural multichannel Wiener filter with directional interference rejection",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015. In this paper we consider an acoustic scenario with a desired source and a directional interference picked up by hearing devices in a noisy and reverberant environment. We present an extension of the binaural multichannel Wiener filter (BMWF), obtained by adding an interference rejection constraint to its cost function, in order to combine the advantages of spatial and spectral filtering while mitigating directional interferences. We prove that this algorithm can be decomposed into the binaural linearly constrained minimum variance (BLCMV) algorithm followed by a single-channel Wiener post-filter. The proposed algorithm yields improved interference rejection capabilities compared with the BMWF. Moreover, by utilizing the spectral information on the sources, it demonstrates better SNR measures than the BLCMV.
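As a hedged sketch of the extended cost function described here (notation assumed: stacked signals y, desired components x_L and x_R at the reference microphones, interference RTF vector b; full rejection is shown for illustration, whereas the exact constraint value is the paper's design choice):

```latex
\min_{\mathbf{w}_L,\mathbf{w}_R}\;
  \mathbb{E}\{|x_L-\mathbf{w}_L^H\mathbf{y}|^2\}
 +\mathbb{E}\{|x_R-\mathbf{w}_R^H\mathbf{y}|^2\}
\qquad\text{s.t.}\qquad
  \mathbf{w}_L^H\mathbf{b}=0,\;\;\mathbf{w}_R^H\mathbf{b}=0.
% Per the abstract, the minimizer factors into a BLCMV beamformer
% followed by a single-channel Wiener post-filter.
```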
2014
E. Hadad, D. Fishman, and S. Gannot,
"A study of 3d audio rendering by headphones",
in The IEEE 28th Convention of IEEE Israel (IEEEI), Eilat, Israel, Dec. 2014. An efficient implementation of a three-dimensional audio rendering system (3D-ARS) over headphones is presented and its ability to render natural spatial sound is analyzed. In its most straightforward implementation, spatial rendering is achieved by convolving a monophonic signal with the head-related transfer function (HRTF). Several methods were proposed in the literature to improve the naturalness of the spatial sound and the ability of the headphones' wearer to localize sound sources. Among these methods, externalization by incorporation of room reflections, personalization to the anthropometric attributes of the user, and the introduction of head movements are known to yield improved performance. This work provides a unified and flexible platform incorporating the various optional components, together with software tools to statistically analyze their contribution. Preliminary statistical analysis suggests that the additional components indeed contribute to the overall localization ability of the user.
D. Cherkassky and S. Gannot,
"Blind synchronization in wireless sensor networks with application to speech enchantment",
in International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes - Juan les Pins, France, Sep. 2014. The sampling rate offset (SRO) phenomenon in wireless acoustic sensor networks (WASNs) is considered in this work. The use of a different clock source in each node results in a drift between the nodes' signals. The aim of this work is to estimate these SROs and to re-synchronize the network, enabling coherent multi-microphone processing. First, the link between the SRO and the Doppler effect is derived. Then, a wideband correlation processor for SRO estimation, which is equivalent to the continuous wavelet transform (CWT), is proposed. Finally, node synchronization is achieved by re-sampling the signals at each node. An experimental study using an actual WASN demonstrates the ability of the proposed algorithm to re-synchronize the network and to regain the performance loss due to SRO.
J. Cao, A. W. H. Khong, and S. Gannot,
"On the performance of widely linear quaternion based MVDR beamformer for an acoustic vector sensor",
in International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes - Juan les Pins, France, Sep. 2014. The widely linear model has recently been used in signal processing applications due to its ability to achieve better performance than conventional linear filtering for non-circular complex random variables (CRVs) and improper quaternion random variables (QRVs). In this paper, we study the time-domain widely linear quaternion model based minimum variance distortionless response beamformer (WL-QMVDR) for a single acoustic vector sensor (AVS) and analyze its performance through the use of beampatterns. We verify by simulation results that the estimated output of the WL-QMVDR is identical to that of the conventional linear model based MVDR beamformer when applied to an AVS in the non-reverberant and ideal sensor response scenario.
E. Hadad, F. Heese, P. Vary, and S. Gannot,
"Multichannel audio database in various acoustic environments",
in International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes - Juan les Pins, France, Sep. 2014. In this paper we describe a new multichannel room impulse response database. The impulse responses were measured in a room with a configurable reverberation level, resulting in three different acoustic scenarios with reverberation times (RT60) of 160 ms, 360 ms, and 610 ms. The measurements were carried out in recording sessions for several source positions on a spatial grid (angle range of -90° to 90° in 15° steps, at 1 m and 2 m distance from the microphone array). The signals in all sessions were captured by three microphone array configurations. The database is accompanied by software utilities to easily access and manipulate the data. Besides the description of the database, we demonstrate its use in a spatial source separation task.
B. Schwartz, S. Gannot, and E. Habets,
"LPC-based speech dereverberation using Kalman-EM algorithm",
in International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes - Juan les Pins, France, Sep. 2014. An algorithm for multichannel speech dereverberation is proposed that simultaneously estimates the clean signal, the linear prediction (LP) parameters of speech, and the acoustic parameters of the room. The received signals are processed in short segments to reduce the algorithm latency, and several expectation-maximization (EM) iterations are carried out on each segment to improve the signal estimation. In the expectation step, the fixed-lag Kalman smoother (FLKS) is applied to extract the clean signal from the data utilizing the estimated parameters. In the maximization step, the LP and room parameters are updated using the output of the FLKS. Experimental results show that multiple EM iterations and the application of the LP model improve the quality of the output signal.
M. Taseska, S. Markovich-Golan, E. Habets, and S. Gannot,
"Near-field source extraction using speech presence probabilities for ad hoc microphone arrays",
in International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes - Juan les Pins, France, Sep. 2014. Ad hoc wireless acoustic sensor networks (WASNs) hold great potential for improved performance in speech processing applications, thanks to better coverage and higher diversity of the received signals. We consider a multiple speaker scenario where each of the WASN nodes, an autonomous system comprising sensing, processing, and communication capabilities, is positioned in the near-field of one of the speakers. Each node aims at extracting its nearest speaker while suppressing the other speakers and noise. The ad hoc network is characterized by an arbitrary number of speakers/nodes with an uncontrolled microphone constellation. In this paper we propose a distributed algorithm which shares information between nodes. The algorithm requires each node to transmit a single audio channel in addition to a soft time-frequency (TF) activity mask for its nearest speaker. The TF activity masks are computed as a combination of estimates of a model-based speech presence probability (SPP), direct-to-reverberant ratio (DRR), and direction of arrival (DOA) per TF bin. The proposed algorithm, although sub-optimal compared to the centralized solution, is superior to the single-node solution.
D. Marquardt, E. Hadad, S. Gannot, and S. Doclo,
"Optimal binaural LCMV beamformers for combined noise reduction and binaural cue preservation",
in International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes - Juan les Pins, France, Sep. 2014. Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of both desired and undesired sound sources. Recently, the binaural linearly constrained minimum variance (BLCMV) beamformer has been proposed, which aims to preserve the desired speech component and suppress the undesired directional interference component while preserving the binaural cues of both components. Since the performance of the BLCMV beamformer highly depends on the amount of interference rejection determined by the interference rejection parameter, in this paper we propose several performance criteria to optimize the interference rejection parameters for the left and the right hearing aid. Experimental results show how the performance of the BLCMV beamformer is affected by the different optimal parameter combinations.
Y. Dorfan, G. Hazan, and S. Gannot,
"Multiple acoustic sources localization using distributed Expectation-Maximization algorithm",
in The 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Nancy, France, May 2014,
best student paper award. The challenge of localizing a number of concurrent acoustic sources in reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and develop a distributed expectation-maximization (DEM) procedure, based on the incremental EM (IEM) framework. The algorithm enables localization of the speakers without a central processing point. Unlike direction search, localization is a distributed task by nature, since the sensors must be spatially deployed. Taking advantage of the distributed constellation of the sensors, we propose a distributed algorithm that enables multiple processing nodes and considers the communication constraints between them. The proposed DEM has surprising advantages over conventional expectation-maximization (EM) schemes. Firstly, it is less sensitive to initial conditions. Secondly, it converges much faster than the conventional EM. The proposed algorithm is tested by an extensive simulation study.
J. Málek, D. Botka, Z. Koldovský, and S. Gannot,
"Methods to learn bank of filters steering nulls toward potential positions of a target source",
in The 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Nancy, France, May 2014. In signal enhancement applications, a reference signal which provides information about interferences and noise is desired. It can be obtained via a multichannel filter that performs a spatial null in the target position, a so-called target-cancelation filter. The filter must adapt to the target position, which is difficult when noise is active. When the target location is confined to a small area, a solution could be based on preparing a bank of target-cancelation filters for potential positions of the target. In this paper, we propose two methods to learn such banks from noise-free recordings. We show by experiments that learned banks have practical advantages compared to banks that were prepared manually by collecting filters for selected positions.
D. Cherkassky and S. Gannot,
"Multichannel Wiener filter performance analysis in presence of mis-modeling",
in IEEE International Conference on Audio and Acoustic Signal Processing (ICASSP), Florence, Italy, May 2014. A randomly positioned microphone array is considered in this work. In many applications, the locations of the array elements are known only up to a certain degree of random mismatch. We derive a novel statistical model for the performance analysis of the multi-channel Wiener filter (MWF) beamformer under random mismatch in the sensor locations. We consider the scenario of one desired source and one interfering source arriving from the far-field and impinging on a linear array. A theoretical model for predicting the MWF mean squared error (MSE) for a given variation in the sensor locations is developed and verified by simulations. It is postulated that the probability density function (p.d.f.) of the MSE of the MWF obeys a Γ distribution. This claim is verified empirically by simulations.
2013
K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E.A.P. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, and B. Raj,
"The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2013. Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques, and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge, and provides a detailed description of the evaluation framework and benchmark results.
B. Laufer, R. Talmon, and S. Gannot,
"Relative transfer function modeling for supervised source localization",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2013. Speaker localization is one of the most prevalent problems in speech processing. Despite significant efforts in the last decades, high reverberation levels still limit the performance of localization algorithms. Furthermore, using conventional localization methods, the information that can be extracted from dual microphone measurements is restricted to the time difference of arrival (TDOA). In the far-field regime, this is equivalent to estimating either the azimuth or the elevation angle. A full description of the speaker's coordinates necessitates several microphones. In this contribution we tackle these two limitations by taking a manifold learning perspective on system identification. We present a training-based algorithm, motivated by the concept of diffusion maps, that aims at recovering the fundamental controlling parameters driving the measurements. This approach turns out to be more robust to reverberation, and capable of recovering the speech source location using merely two microphone signals.
K. Reindl, S. Markovich-Golan, H. Barfuss, S. Gannot, and W. Kellermann,
"Geometrically constrained TRINICON-based relative transfer function estimation in underdetermined scenarios",
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, Oct. 2013. Speech extraction in a reverberant enclosure using a linearly-constrained minimum variance (LCMV) beamformer usually requires reliable estimates of the relative transfer functions (RTFs) of the desired source to all microphones. In this contribution, a geometrically constrained (GC)-TRINICON concept for RTF estimation is proposed. This approach is applicable in challenging multiple-speaker scenarios and in underdetermined situations, where the number of simultaneously active sources exceeds the number of available microphone signals. As a practically relevant and distinctive feature, this concept does not require any voice-activity-based control mechanism. It only requires coarse reference information on the target direction of arrival (DoA). The proposed GC-TRINICON method is compared to a recently proposed subspace method for RTF estimation relying on voice-activity control. Experimental results confirm the effectiveness of GC-TRINICON in realistic conditions.
J. Malek, Z. Koldovský, S. Gannot, and P. Tichavský,
"Informed generalized sidelobe canceler utilizing sparsity of speech signals",
in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Southampton, UK, Sept. 22-25 2013. This report proposes a novel variant of the generalized sidelobe canceler. It assumes that a set of prepared relative transfer functions (RTFs) is available for several potential positions of a target source within a confined area. The key problem here is to select the correct RTF at any time, even when the exact position of the target is unknown and interfering sources are present. We propose to select the RTF based on the ℓp-norm, p ≤ 1, measured at the blocking matrix output in the frequency domain. Subsequent experiments show that this approach significantly outperforms previously proposed selection methods when the target and interferer signals are speech signals.
R. Talmon, I. Cohen, S. Gannot, and R. Coifman,
"Graph-Based bayesian approach for transient interference suppression",
in 21st European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, Sep. 2013. In this paper, we present a method for transient interference suppression. The main idea is to learn the intrinsic geometric structure of the transients instead of relying on estimates of noise statistics. The transient interference structure is captured via a parametrization of a graph constructed from the measurements. This parametrization is viewed as an empirical model for transients and is used for building a filter that extracts transients from noisy speech. We present a model-based supervised algorithm, in which the graph-based empirical model is constructed in advance from training recordings, and then extended to new incoming measurements. This paper extends previous studies and presents a new Bayesian approach for empirical model extension that takes into account both the structure of the transients as well as the dynamics of speech signals.
B. Schwartz, S. Gannot, and E. Habets,
"Multi-Microphone speech dereverberation using Expectation-Maximization and Kalman smoothing",
in 21st European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, Sep. 2013. Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown. In this paper, a multi-microphone algorithm that simultaneously estimates the acoustic system and the clean signal is proposed. An expectation-maximization (EM) scheme is employed to iteratively obtain the maximum likelihood (ML) estimates of the acoustic parameters. In the expectation step, the Kalman smoother is applied to extract the clean signal from the data utilizing the estimated parameters. In the maximization step, the parameters are updated according to the output of the Kalman smoother. Experimental results show significant dereverberation capabilities of the proposed algorithm with only low speech distortion.
R. Talmon and S. Gannot,
"Relative transfer function identi cation on manifolds for supervised GSC beamformers",
in 21st European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, Sep. 2013. Identification of a relative transfer function (RTF) between two microphones is an important component of multichannel hands-free communication systems in reverberant and noisy environments. In this paper, we present an RTF identification method on manifolds for supervised generalized sidelobe canceler beamformers. We propose to learn the manifold of typical RTFs in a specific room using a novel extendable kernel method, which relies on common manifold learning approaches. Then, we exploit the extendable learned model and propose a supervised identification method that relies on both the a priori learned geometric structure and the measured signals. Experimental results show significant improvements over a competing method that relies merely on the measurements, especially in noisy conditions.
D. Levin, E. Habets, and S. Gannot,
"Robust beamforming using sensors with nonidentical directivity patterns",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013. The optimal weights of a maximum-directivity beamformer are often found to be severely lacking in terms of robustness. Although an ideal implementation of the beamformer with these weights provides high directivity, minor perturbations of the weights or of the sensor placement cause severe degradation. Therefore, a robustness constraint is often imposed during the beamformer's design stage. The classical method of diagonal loading is commonly used for this purpose. There are known results in this field which pertain to an array consisting of sensors with identical directivity patterns and orientations. We extend these results to account for sensors with nonidentical directivity patterns, and sensors which share placement errors. We show that in such cases, modifying the classical loading scheme to incorporate nonidentical diagonal elements and off-diagonal elements is beneficial.
2012
S. Markovich-Golan, S. Gannot, and I. Cohen,
"A weighted multichannel Wiener filter for multiple sources scenarios",
in The IEEE 27th Convention of IEEE Israel (IEEEI), Eilat, Israel, Nov. 2012,
best student paper award. The scenario of P speakers received by an M microphone array in a reverberant enclosure is considered. We extend the single source speech distortion weighted multichannel Wiener filter (SDW-MWF) to deal with multiple speakers. The mean squared error (MSE) is extended by introducing P weights, each controlling the distortion of one of the sources. The P weights enable further control in the design of the beamformer (BF). Two special cases of the proposed BF are the SDW-MWF and the linearly constrained minimum variance (LCMV)-BF. We provide a theoretical analysis for the performance of the proposed BF. Finally, we exemplify the ability of the proposed method to control the tradeoff between noise reduction (NR) and distortion levels of various speakers in an experimental study.
F. Heese, E. Hadad, M. Schäfer, S. Markovich-Golan, P. Vary, and S. Gannot,
"Comparison of supervised and semi-supervised beamformers using real audio recordings",
in The IEEE 27th Convention of IEEE Israel (IEEEI), Eilat, Israel, Nov. 2012. In this contribution two different disciplines for designing microphone array beamformers are explored. On the one hand, a fixed beamformer based on numerical near-field optimization is employed. On the other hand, an adaptive beamformer algorithm based on the linearly constrained minimum variance (LCMV) method is applied. For the evaluation, an audio database of microphone array impulse responses and audio recordings (speech and noise) was created. Different acoustic scenarios were constructed, consisting of various audio sources (desired speaker, interfering speaker, and directional noise) distributed around the microphone array at different angles and distances. The algorithms were compared based on both objective measures (signal-to-noise ratio, signal-to-interference ratio, and speech distortion) and subjective tests (assessment of sonograms and informal listening tests).
E. Hadad, S. Gannot, and S. Doclo,
"Binaural linearly constrained minimum variance beamformer for hearing aid applications",
in The International Workshop on Acoustic Signal Enhancement (IWAENC), Aachen, Germany, Sep. 2012. In many cases hearing impaired persons suffer from hearing loss in both ears, necessitating two hearing apparatuses. In such cases, the applied speech enhancement algorithms should be capable of preserving the so-called binaural cues. In this paper, a binaural extension of the linearly constrained minimum variance (LCMV) beamformer is proposed. The proposed algorithm, denoted binaural linearly constrained minimum variance (BLCMV) beamformer, is capable of extracting desired speakers while suppressing interfering speakers. The BLCMV maintains the binaural cues of both the desired and the interference sources in the constrained space. The ability to preserve the binaural cues makes the BLCMV beamformer particularly suitable for hearing aid applications. It is further proposed to obtain a reduced-complexity implementation by sharing common blocks between both sides of the hearing aid device. The performance of the proposed method, in terms of imposed distortion, interference cancellation and cue preservation, is verified by an extensive experimental study using signals recorded by a dummy head in an actual room.
S. Markovich-Golan, S. Gannot, and I. Cohen,
"Distributed GSC beamforming using the relative transfer function",
in The European Signal Processing Conference (EUSIPCO), Bucharest, Romania, Aug. 2012,
invited paper. A speech enhancement algorithm in a noisy and reverberant enclosure for a wireless acoustic sensor network (WASN) is derived. The proposed algorithm is structured as a two-stage beamformer (BF) scheme, where the outputs of the first stage are transmitted in the network. Designing the second-stage BF requires estimating the desired signal components in the transmitted signals. The contribution here is twofold. First, in spatially static scenarios, the first-stage BFs are designed to maintain a fixed response towards the desired signal, as opposed to competing algorithms, where the response changes and repeated estimation thereof is required. Second, the proposed algorithm is implemented in a generalized sidelobe canceler (GSC) form, separating the treatment of the desired speech and the interferences and enabling a simple time-recursive implementation of the algorithm. A comprehensive experimental study demonstrates the equivalent performance of the centralized GSC and of the proposed algorithm for both narrowband and speech signals.
S. Markovich-Golan, S. Gannot, and I. Cohen,
"A sparse blocking matrix for multiple constraints GSC beamformer",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, Apr. 2012, pp. 197–200. Modern high performance speech processing applications incorporate large microphone arrays. Complicated scenarios comprising multiple sources motivate the use of the linearly constrained minimum variance (LCMV) beamformer (BF) and specifically its efficient generalized sidelobe canceler (GSC) implementation. The complexity of applying the GSC is dominated by the blocking matrix (BM). A common approach for constructing the BM is to use a projection matrix to the null-subspace of the constraints. The latter BM is denoted as the eigen-space BM, and requires M^2 complex multiplications, where M is the number of microphones. In the current contribution, a novel systematic scheme for constructing a multiple-constraints sparse BM is presented. The sparsity of the proposed BM substantially reduces the complexity to K × (M – K) complex multiplications, where K is the number of constraints. A theoretical analysis of the signal leakage and of the blocking ability of the proposed sparse BM and of the eigen-space BM is derived. It is proven analytically, and tested for narrowband signals and for speech signals, that the blocking abilities of the sparse and of the eigen-space BMs are equivalent.
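To fix ideas, a hedged sketch of the eigen-space construction the abstract compares against; the projection below is the standard null-space form, and the complexity remark mirrors the abstract:

```python
import numpy as np

def eigenspace_blocking_matrix(C):
    """C: (M, K) constraint matrix whose columns are the constrained
    steering/RTF vectors. Returns the M x M projection onto their null
    space; applying it to a snapshot costs O(M^2) multiplications per
    bin, which the paper's sparse BM reduces to K*(M-K)."""
    M = C.shape[0]
    return np.eye(M) - C @ np.linalg.pinv(C)   # I - C (C^H C)^{-1} C^H

# usage: u = B @ y   # noise reference with the constrained sources blocked
```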
S. Gannot,
"On the importance of room acoustics in multi-microphone speech enhancement",
in The 163rd meeting of the Acoustical Society of America and Acoustics 2012, vol. 131, no. 4, Hong Kong, China, May 2012, pp. 3209–3209,
invited paper. Speech quality might significantly deteriorate in the presence of interference. Multi-microphone measurements can be utilized to enhance speech quality and intelligibility only if the room acoustics is taken into consideration. The vital role of the acoustic transfer function (ATF) between the sources and the microphones is demonstrated in two important cases: the minimum variance distortionless response (MVDR) and the linearly constrained minimum variance (LCMV) beamformers. The LCMV deals with the more general case of multiple desired speakers. It is argued that the MVDR beamformer exhibits a tradeoff between the amount of speech dereverberation and noise reduction. The level of noise reduction, sacrificed when complete dereverberation is required, is shown to depend on the direct-to-reverberation ratio. When the reverberation level is tolerable, practical beamformers can be designed by substituting the ATFs with their corresponding relative transfer functions (RTFs). As no dereverberation is performed by these beamformers, a higher level of noise reduction can be achieved. In comparison with the ATFs, the RTFs exhibit shorter impulse responses. Moreover, since non-blind procedures can be adopted, accurate RTF estimates might be obtained. Three such RTF estimation methods are discussed. Finally, a comprehensive experimental study in real acoustical environments demonstrates the benefits of using the proposed beamformers.
S. Markovich-Golan, S. Gannot, and I. Cohen,
"Blind sampling rate offset estimation and compensation in wireless acoustic sensor networks with application to beamforming",
in The International Workshop on Acoustic Signal Enhancement (IWAENC), Aachen, Germany, Sep. 2012,
final list for best student paper award. Beamforming methods for speech enhancement in wireless acoustic sensor networks (WASNs) have recently attracted the attention of the research community. One of the major obstacles to implementing speech processing algorithms in WASNs is the sampling rate offset between the nodes. As the nodes utilize individual clock sources, sampling rate offsets are inevitable and may cause severe performance degradation. In this paper, a blind procedure for estimating the sampling rate offsets is derived. The procedure is applicable to speech-absent time segments with slowly time-varying interference statistics. The proposed procedure is based on the phase drift of the coherence between two signals sampled at different sampling rates. Resampling the signals with the Lagrange polynomial interpolation method compensates for the sampling rate offsets. An extensive experimental study, utilizing the transfer function generalized sidelobe canceller (TF-GSC), exemplifies the problem and its solution.
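As a toy illustration of the phase-drift principle (not the paper's estimator): with a relative offset eps, the delay between the two signals grows linearly in time, so the cross-spectral phase at frequency f drifts at a rate proportional to f*eps; fitting that drift recovers eps. The frame size, the median fit, and the rational-resampling compensation below are illustrative choices, and resample_poly merely stands in for the paper's Lagrange-polynomial interpolation:

```python
import numpy as np
from fractions import Fraction
from scipy.signal import stft, resample_poly

def estimate_sro(x1, x2, fs, nperseg=4096):
    """Estimate a small sampling-rate offset between two recordings from
    the inter-frame phase increment of their cross-spectra (assumes the
    drift per hop stays below pi; sign convention is illustrative)."""
    f, t, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    cross = X1 * np.conj(X2)                     # per-frame cross-spectra
    hop = t[1] - t[0]                            # frame hop in seconds
    eps = []
    for k in range(1, len(f)):                   # skip DC
        inc = np.angle(np.mean(cross[k, 1:] * np.conj(cross[k, :-1])))
        eps.append(-inc / (2 * np.pi * f[k] * hop))
    return float(np.median(eps))                 # robust over frequency

def compensate_sro(x, eps):
    """Resample by a rational approximation of 1/(1+eps)."""
    r = Fraction(1.0 / (1.0 + eps)).limit_denominator(10**6)
    return resample_poly(x, r.numerator, r.denominator)
```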
2011
R. Talmon, I. Cohen, and S. Gannot,
"Supervised source localization using diffusion kernels",
in The IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, USA, Oct. 2011, pp. 245–248. Recently, we introduced a method to recover the controlling parameters of linear systems using diffusion kernels. In this paper, we apply our approach to the problem of source localization in a reverberant room using measurements from a single microphone. Prior recordings of signals from various known locations in the room are required for training and calibration. The proposed algorithm relies on a computation of a diffusion kernel with a specially-tailored distance measure. Experimental results in a real reverberant environment demonstrate accurate recovery of the source location.
D. Levin, S. Gannot, and E. Habets,
"Direction-of-arrival estimation using acoustic vector sensors in the presence of noise",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011, pp. 105–108. A vector-sensor consisting of a monopole sensor collocated with orthogonally oriented dipole sensors can be used for direction-of-arrival (DOA) estimation. A method is proposed to estimate the DOA based on the direction of maximum power. Algorithms mentioned in earlier works are shown to be special cases of the proposed method. An iterative algorithm based on the principle of gradient ascent is presented for the solution of the maximum power problem. The proposed maximum-power method is shown to approach the Cramer-Rao lower bound (CRLB) with a suitable choice of parameter.
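A hedged 2-D sketch of maximum-power DOA search by gradient ascent; this simplified variant steers only the two dipole (particle-velocity) channels, whereas the paper's method also involves the monopole channel and a tunable parameter:

```python
import numpy as np

def avs_doa_max_power(vx, vy, iters=200, lr=0.1):
    """vx, vy: collocated orthogonal particle-velocity channels.
    Climb the gradient of the steered-output power E[y(theta)^2]."""
    theta = 0.0
    for _ in range(iters):
        y = vx * np.cos(theta) + vy * np.sin(theta)    # steered dipole
        dy = -vx * np.sin(theta) + vy * np.cos(theta)  # d y / d theta
        theta += lr * 2 * np.mean(y * dy)              # gradient ascent
    return np.mod(theta, 2 * np.pi)                    # azimuth estimate
```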
R. Talmon, I. Cohen, and S. Gannot,
"Clustering and suppression of transient noise in speech signals using diffusion maps",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011, pp. 5084–5087. Recently we have presented a novel approach for transient noise reduction that relies on non-local (NL) filtering. In this paper, we modify and extend our approach to support clustering and suppression of a few transient noise types simultaneously, by introducing two novel concepts. We observe that voiced speech spectral components are slowly varying compared to transient noise. Thus, by applying an algorithm for noise power spectral density (PSD) estimation, configured to track faster variations than pseudo-stationary noise, the PSD of speech components may be estimated. In addition, we utilize diffusion maps to embed the measurements into a new domain. We obtain a new representation which enables clustering of different transient noise types. The new representation is incorporated into an NL filter as a better affinity metric for averaging over transient instances. Experimental results show that the proposed algorithm enables clustering and suppression of multiple transient interferences.
S. Markovich-Golan, S. Gannot, and I. Cohen,
"Performance analysis of a randomly spaced wireless microphone array",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011, pp. 121–124. A randomly distributed microphone array is considered in this work. In many applications exact design of the array is impractical. The performance of these arrays, characterized by a large number of microphones deployed in vast areas, cannot be analyzed by traditional deterministic methods. We therefore derive a novel statistical model for performance analysis of the MWF beamformer. We consider the scenario of one desired source and one interfering source arriving from the far-field and impinging on a uniformly distributed linear array. A theoretical model for the MMSE is developed and verified by simulations. The applicability of the proposed statistical model for speech signals is discussed.
2010
L. Ehrenberg, S. Gannot, A. Leshem, and E. Zehavi,
"Sensitivity analysis of MVDR and MPDR beamformers",
in The 26th Convention of IEEE Israel (IEEEI), Eilat, Israel, Nov. 2010, pp. 416–420,
best student paper award. A sensitivity analysis of two distortionless beamformers is presented in this paper. Specifically, two well-known variants, namely the minimum power distortionless response (MPDR) and minimum variance distortionless response (MVDR) beamformers, are considered. In our scenario, which is typical of many modern communications systems, waves emitted by multiple point sources are received by an antenna array. An analytical expression for the signal to interference and noise ratio (SINR) improvement obtained by both beamformers under steering errors is derived. These expressions are experimentally evaluated and compared with the robust Capon beamformer (RCB), a robust variant of the MPDR beamformer. We show that the MVDR beamformer, which uses the noise correlation matrix in its minimization criterion, is more robust to steering errors than its counterparts, which use the received signal correlation matrix. Furthermore, even if the noise correlation matrix is erroneously estimated due to steering errors in the interference direction, the MVDR advantage is still maintained for a reasonable range of steering errors. These conclusions conform with Cox's findings. Only the line-of-sight propagation regime is considered in the current contribution. Ongoing research extends this work to fading channels.
D. Levin, S. Gannot, and E. Habets,
"Impact of source signal coloration on intensity vector based DOA estimation",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel-Aviv, Israel, Nov. 2010. An acoustic vector sensor provides measurements of both the pressure and particle velocity of a sound field in which it is placed. These measurements are vectorial in nature and can be used for the purpose of source localization. A straightforward approach towards determining the direction of arrival (DOA) utilizes the acoustic intensity vector, which is the product of pressure and particle velocity. The accuracy of an intensity vector based DOA estimator in the presence of sensor noise or reverberation has been analyzed previously for the case of a white source signal. In this paper, the effects of reverberation upon the accuracy of such a DOA estimator in the presence of a colored source signal are examined. The analysis is done with the aid of an extension to Polack's statistical room impulse response model which accounts for particle velocity as well as acoustic pressure. It is shown that signal coloration brings about a degradation in performance.
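For reference, the straightforward intensity-based estimator mentioned in this abstract reduces, in a 2-D sketch, to time-averaging the pressure-velocity products and taking the direction of the resulting vector (an illustration, with estimator details omitted):

```python
import numpy as np

def intensity_doa(p, vx, vy):
    """p: pressure channel; vx, vy: orthogonal particle-velocity
    channels. The averaged acoustic intensity vector points (ideally)
    toward the source; its angle is the azimuth DOA estimate."""
    ix = np.mean(p * vx)
    iy = np.mean(p * vy)
    return np.arctan2(iy, ix)
```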
S. Markovich-Golan, S. Gannot, and I. Cohen,
"A reduced bandwidth binaural MVDR beamformer",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel-Aviv, Israel, Aug. 2010,
best student paper award. In this contribution a novel reduced-bandwidth iterative binaural MVDR beamformer is proposed. The proposed method reduces the bandwidth requirement between hearing aids to a single channel, regardless of the number of microphones. The algorithm is proven to converge to the optimal binaural MVDR in the case of a rank-1 desired source correlation matrix. Comprehensive simulations of narrow-band and speech signals demonstrate the convergence and the optimality of the algorithm.
Y. Yeminy, S. Gannot, and Y. Keller,
"Speech enhancement using a multidimensional Mixture- Maximum model",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel-Aviv, Israel, Aug. 2010. We present a single-microphone speech enhancement algorithm that models the log-spectrum of the noise-free speech signal by a multidimensional Gaussian mixture. The proposed estimator is based on an earlier study which uses the single-dimensional mixture-maximum (MIXMAX) model for the speech signal. The experimental study shows that there is only a marginal difference between the proposed extension and the original algorithm in terms of both objective and subjective performance measures.
B. Castro, S. Gannot, N.D. Gaubitch, E.A.P. Habets, P. A. Naylor, and S. Grant,
"Subband scale factor ambiguity correction using multiple filterbanks",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel-Aviv, Israel, Aug. 2010. One of the problems with blind system identification in subbands is that the subband systems can only be identified correctly up to an arbitrary scale factor. This scale factor ambiguity is the same across all channels but can differ between the subbands and therefore limits the usability of such estimates. In this contribution, a method is proposed that uses multiple filterbanks, utilizing the overlapping passband regions between these filterbanks to find scalar correction factors that make the scale factor ambiguity uniform across all subbands. Simulation results are provided, showing that the proposed method accurately identifies and corrects for these scale factors at the cost of an increased computational burden.
S. Markovich-Golan, S. Gannot, and I. Cohen,
"Subspace tracking of multiple sources and its application to speakers extraction",
in The IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, Texas, USA, Mar. 2010, pp. 201–204. In this paper we introduce a novel algorithm for extracting desired speech signals uttered by moving speakers contaminated by competing speakers and stationary noise in a reverberant environment. The proposed beamformer uses eigenvectors spanning the desired and interference signals subspaces. It relaxes the common requirement on the activity patterns of the various sources. A novel mechanism for tracking the desired and interferences subspaces is proposed, based on the projection approximation subspace tracking (deflation) (PASTd) procedure and on a union of subspaces procedure. This contribution extends previously proposed methods to deal with multiple speakers in dynamic scenarios.
R. Talmon, I. Cohen, and S. Gannot,
"Speech enhancement in transient noise environment using diffusion filtering",
in The IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, Texas, USA, Mar. 2010, pp. 4782–4785. Recently, we have presented a transient noise reduction algorithm for speech signals that relies on non-local diffusion filtering. By exploiting the repetitive nature of transient noises we proposed a simple and efficient algorithm, which enabled suppression of various noise types. In this paper, we incorporate a modified diffusion operator in order to obtain a more robust algorithm and further enhancement of the speech. We demonstrate the performance of the modified algorithm and compare it with a competing solution. We show that the proposed algorithm enables improved suppression of various transient interferences without any further computational burden.
2009
E. Habets, J. Benesty, S. Gannot, P. Naylor, and I. Cohen,
"On the application of the LCMV beamformer to speech enhancement",
in The IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, USA, Oct. 2009, pp. 141–144. In theory, the linearly constrained minimum variance (LCMV) beamformer can achieve perfect dereverberation and noise cancellation when the acoustic transfer functions (ATFs) between all sources (including interferences) and the microphones are known. However, blind estimation of the ATFs remains a difficult task. In this paper the noise reduction of the LCMV beamformer is analyzed and compared with the noise reduction of the minimum variance distortionless response (MVDR) beamformer. In addition, it is shown that the constraint of the LCMV can be modified such that only relative transfer functions, rather than ATFs, are required to achieve perfect cancellation of coherent interferences. Finally, we evaluate the noise reduction performance achieved by the LCMV and MVDR beamformers for two coherent sources: one desired and one undesired.
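For context, the standard per-bin closed forms usually quoted for these two beamformers (notation assumed: covariance matrix R, constraint matrix C with desired-response vector g, steering/transfer-function vector a):

```latex
\mathbf{w}_{\mathrm{LCMV}}
  = R^{-1} C \left( C^H R^{-1} C \right)^{-1} \mathbf{g},
\qquad
\mathbf{w}_{\mathrm{MVDR}}
  = \frac{R^{-1}\mathbf{a}}{\mathbf{a}^H R^{-1}\mathbf{a}},
% MVDR is the single-constraint special case C = a, g = 1; the paper's
% point is that C may hold relative transfer functions rather than ATFs.
```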
R. Talmon, I. Cohen, and S. Gannot,
"Multichannel speech enhancement using convolutive transfer function approximation in reverberant environments",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 3885–3888. Recently, we have presented a transfer-function generalized sidelobe canceler (TF-GSC) beamformer in the short-time Fourier transform domain, which relies on a convolutive transfer function approximation of relative transfer functions between distinct sensors. In this paper, we combine a delay-and-sum beamformer with the TF-GSC structure in order to suppress the speech signal reflections captured at the sensors in reverberant environments. We demonstrate the performance of the proposed beamformer and compare it with the TF-GSC. We show that the proposed algorithm enables suppression of reverberations and further noise reduction compared with the TF-GSC beamformer.
E. Habets, J. Benesty, I. Cohen, and S. Gannot,
"On a tradeoff between dereverberation and noise reduction using the MVDR beamformer",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 3741–3744, invited paper. The minimum variance distortionless response (MVDR) beamformer can be used for both speech dereverberation and noise reduction. In this paper we analyse the tradeoff between the amount of speech dereverberation and noise reduction achieved by the MVDR beamformer. We show that the amount of noise reduction that is sacrificed when desiring both speech dereverberation and noise reduction depends on the direct-to-reverberation ratio of the acoustic transfer function between the desired source and a reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction.
2008
E. Habets, S. Gannot, and I. Cohen,
"Speech dereverberation using backward estimation of the late reverberant spectral variance",
in The 25th Convention of IEEE Israel (IEEEI), Eilat, Israel, Dec. 2008, pp. 384–388. In speech communication systems the received microphone signals are degraded by room reverberation and ambient noise. This signal degradation can decrease the fidelity and intelligibility of the desired speaker. Reverberant speech can be separated into two components, viz. an early speech component and a late reverberant speech component. Reverberation suppression algorithms that are feasible in practice have been developed to suppress late reverberant speech or, in other words, to estimate the early speech component. The main challenge is to develop an estimator for the so-called late reverberant spectral variance (LRSV). In this contribution a generalized statistical reverberation model is proposed that can be used to estimate the LRSV. Novel and existing estimators can be derived from this model. One novel estimator is a so-called backward estimator that uses an estimate of the early speech component to obtain an estimate of the LRSV. Advantages and possible disadvantages of the estimators are discussed, and experimental results using simulated reverberant speech are presented.
S. Markovich, S. Gannot, and I. Cohen,
"A comparison between alternative beamforming strategies for interference cancelation in noisy and reverberant environment",
in The 25th Convention of IEEE Israel (IEEEI), Eilat, Israel, Dec. 2008, pp. 203–207. In speech communication systems the received microphone signals are often degraded by competing speakers, noise signals and room reverberation. Microphone arrays are commonly utilized to enhance the desired speech signal. In this paper two important design criteria, namely the minimum variance distortionless response (MVDR) and the linearly constrained minimum variance (LCMV) beamformers, are explored. These structures differ in their treatment of the interference sources. Experimental results in a simulated reverberant enclosure are used to compare the two strategies. It is shown that the LCMV beamformer outperforms the MVDR beamformer provided that the acoustic environment is time-invariant.
R. Talmon, I. Cohen, and S. Gannot,
"Identification of the relative transfer function between microphones in reverberant environments",
in The 25th Convention of IEEE Israel (IEEEI), Eilat, Israel, Dec. 2008, pp. 208–212. Recently, a relative transfer function (RTF) identification method based on the convolutive transfer function (CTF) approximation was developed. This method is adapted to speech sources in reverberant environments and exploits the non-stationarity and presence probability of the speech signal. In this paper, we present experimental results that demonstrate the advantages and robustness of the proposed method. Specifically, we show the robustness of this method to the environment and to a variety of recorded noise signals.
S. Gannot,
"Multi-microphone speech dereverberation based on eigen-decomposition: A study",
in The 42nd Asilomar Conference on Signals, Systems and Computers, Monterey, CA, USA, Oct. 2008, pp. 801–805,
invited paper. A family of approaches for multi-microphone speech dereverberation in colored noise environments, which uses the eigen-decomposition of the data correlation matrix, is studied in this paper. A recently proposed method shows that the Room Impulse Responses (RIRs), relating the speech source and the microphones, are embedded in the null subspace of the received signals. In cases where the channel order is overestimated, a closed-form algorithm for extracting the RIR is proposed. A variant, in which the subspace method is incorporated into a subband framework, is given as well. In the last stage of the proposed method, the desired signal is reconstructed, using the estimated RIRs, by applying either the Matched Filter Beamformer (MBF) or the Multi-channel Inverse filter Theorem (MINT) algorithms. The emphasis of the current work is a comprehensive experimental study of the eigen-decomposition based dereverberation methods and the required channel inversion algorithms. This study supports the potential of the presented method, and provides insight into its limitations.
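The null-subspace step of this family of methods can be sketched as a generalized eigen-decomposition; the subsequent RIR extraction from the Sylvester structure and the MBF/MINT reconstruction are omitted. A minimal sketch, assuming spatio-temporal correlation matrices of the data and of the noise are available:

```python
import numpy as np
from scipy.linalg import eigh

def rir_null_subspace(R_data, R_noise, n_null):
    """Eigenvectors of the smallest generalized eigenvalues of
    (R_data, R_noise) span the null subspace in which the RIR
    coefficients are embedded."""
    _, vecs = eigh(R_data, R_noise)   # generalized eigenpairs, ascending order
    return vecs[:, :n_null]
```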
E. Habets, S. Gannot, and I. Cohen,
"Robust early echo cancellation and late echo suppression in the STFT domain",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Seattle, Washington, USA, Sep. 2008, pp. 4565–4568. Acoustic echo arises due to acoustic coupling between the loudspeaker and the microphone of a communication device. Acoustic echo cancellation and suppression techniques are used to reduce the acoustic echo. In this work we propose to first cancel the early echo, which is related to the early part of the echo path, and subsequently suppress the late echo, which is related to the later part of the echo path. The identification of the echo path is carried out in the Short-Time Fourier Transform (STFT) domain, where a trade-off is facilitated between distortion of the near-end speech, residual echo, convergence rate, and robustness to echo path changes. Experimental results demonstrate that the system achieves high echo and noise reduction while maintaining low distortion of the near-end speech. In addition, it is shown that the proposed system is more robust to echo path changes compared to an acoustic canceller alone.
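A hedged time-domain sketch of the canceller half of such a system, using plain NLMS to identify the early echo path (the paper itself works in the STFT domain and follows the canceller with a late-echo suppressor):

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filt_len=256, mu=0.5, eps=1e-8):
    """Identify the early echo path with NLMS and return the
    echo-cancelled microphone signal and the path estimate."""
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    h = np.zeros(filt_len)                  # echo-path estimate
    out = mic.copy()
    for n in range(filt_len, len(mic)):
        x = far_end[n - filt_len:n][::-1]   # latest far-end samples, newest first
        e = mic[n] - h @ x                  # residual after echo removal
        h += mu * e * x / (x @ x + eps)     # normalized LMS update
        out[n] = e
    return out, h
```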
S. Gannot,
"A filter design and implementation experiment using simulink and Texas Instruments C6713DSK board",
in European DSP Education and Research Symposium (EDERS), Tel-Aviv, Israel, Jun. 2008.
A. Meiri, S. Melman, J. Fainguelernt, and S. Gannot,
"Real time implementation of convolutive blind source separation using TI-6713DSK board,",
in European DSP Education and Research Symposium (EDERS), Tel-Aviv, Israel, Jun. 2008.
A. Abramson, E. Habets, S. Gannot, and I. Cohen,
"Dual-microphone speech dereverberation using GARCH modeling",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, Nevada, USA, Apr. 2008, pp. 4565–4568. In this paper, we develop a dual-microphone speech dereverberation algorithm for noisy environments, which is aimed at suppressing late reverberation and background noise. The spectral variance of the late reverberation is obtained with adaptively-estimated direct path compensation. A Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) model is used to estimate the spectral variance of the desired signal, which includes the direct sound and early reverberation. Experimental results demonstrate the advantage of the proposed algorithm compared to a decision-directed-based algorithm.
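As a rough illustration of the variance modeling involved, a single-regime GARCH(1,1)-style recursion per STFT bin (the paper's Markov-switching model selects among regime-dependent parameter sets, which is not shown; all names are illustrative):

```python
import numpy as np

def garch_spectral_variance(stft, omega, alpha, beta):
    """Recursive spectral-variance estimate per time-frequency bin.

    stft : (frames, bins) complex STFT of the observed signal
    """
    var = np.empty(stft.shape)
    var[0] = np.abs(stft[0]) ** 2                    # initialization
    for t in range(1, stft.shape[0]):
        prev_sq = np.abs(stft[t - 1]) ** 2           # previous squared magnitude
        var[t] = omega + alpha * prev_sq + beta * var[t - 1]
    return var
```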
L. Ehrenberg, S. Gannot, A. Leshem, and E. Zehavi,
"Performance bounds for channel tracking algorithms for MIMO systems",
in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, Nevada, USA, Apr. 2008, pp. 3085–3088. In this paper we derive performance bounds for tracking a time-varying OFDM multiple-input multiple-output (MIMO) communication channel in the presence of additive white Gaussian noise (AWGN). We discuss two channel tracking schemes. The first tracks the filter coefficients directly in the time domain, while the second separately tracks each tone in the frequency domain. The Kalman filter, with known channel statistics, is utilized for evaluating the performance bounds. It is shown that the time-domain tracking scheme, which exploits the sparseness of the channel impulse response, outperforms the computationally more efficient frequency-domain tracking scheme, which does not exploit the smooth frequency response of the channel.
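A minimal sketch of the time-domain tracking idea for a single channel, with a random-walk state model and a Kalman update per pilot observation (the paper's MIMO-OFDM setting and its frequency-domain per-tone variant are not shown; all names are illustrative):

```python
import numpy as np

def kalman_channel_tracker(X, y, q, r):
    """Track time-domain channel taps from pilot observations.

    X : (T, n_taps) rows of delayed pilot symbols (regressors)
    y : (T,) received pilot samples
    q, r : process- and measurement-noise variances
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_taps = X.shape[1]
    h = np.zeros(n_taps)                  # channel estimate
    P = np.eye(n_taps)                    # error covariance
    for x, obs in zip(X, y):
        P = P + q * np.eye(n_taps)        # predict: random-walk channel
        k = P @ x / (x @ P @ x + r)       # Kalman gain
        h = h + k * (obs - x @ h)         # innovation update
        P = P - np.outer(k, x) @ P        # covariance update
    return h
```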
2007
A. Leshem and S. Gannot,
"Robust sequential interference cancellation for space division multiple access communications",
in The European Signal Processing Conference (EUSIPCO), Poznan, Poland, Sep. 2007. In this paper, we consider a multiuser detection scheme for space division multiple access communication systems. Sequential interference cancellation (SIC) procedures are subject to performance degradation when the antenna array is only partially calibrated. We propose to incorporate robust beamforming algorithms into the SIC procedure to compensate for the array misalignment. We show by a simulation study that the proposed combination outperforms conventional SIC procedures for various degrees of array misalignment, different SNR values, several array configurations, and two modulation constellations (namely, QPSK and 16-QAM).
E. Habets and S. Gannot,
"Dual-microphone speech dereverberation using a reference signal",
in the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, USA, Apr. 2007. Speech signals recorded with a distant microphone usually contain reverberation, which degrades the fidelity and intelligibility of speech, and the recognition performance of automatic speech recognition systems. In this paper we propose a speech dereverberation system which uses two microphones. A generalized sidelobe canceller (GSC) type of structure is used to enhance the desired speech signal. The GSC structure is used to create two signals. The first signal is the output of a standard delay and sum beamformer, and the second signal is a reference signal which is constructed such that the direct speech signal is blocked. We propose to utilize the reverberation which is present in the reference signal to enhance the output of the delay and sum beamformer. The power envelope of the reference signal and the power envelope of the output of the delay and sum beamformer are used to estimate the residual reverberation in the output of the delay and sum beamformer. The output of the delay and sum beamformer is then enhanced using a spectral enhancement technique. The proposed method only requires an estimate of the direction of arrival of the desired speech source. Experiments using simulated room impulse responses are presented and show significant reverberation reduction while keeping the speech distortion low.
2006
S. Gannot, A. Leshem, O. Shayevitz, and E. Zehavi,
"Tracking a MIMO channel singular value decomposition via projection approximation",
in The 24th Convention of IEEE Israel (IEEEI), Eilat, Israel, 2006, pp. 91–94. A bidirectional multiple-input multiple-output (MIMO) time varying channel is considered. The projection approximation subspace tracking (PAST) algorithm is used on both terminals in order to track the singular value decomposition of the channel matrix. Simulations using an autoregressive channel model and also a sampled MIMO indoor channel are performed, and the expected capacity degradation due to the estimation error is evaluated.
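For reference, one RLS-style iteration of the PAST recursion used for the subspace tracking (a hedged sketch; applying it at both terminals to follow the channel SVD, as in the paper, is not shown):

```python
import numpy as np

def past_update(W, P, x, beta=0.99):
    """One PAST iteration: track the dominant subspace from snapshot x.

    W : (M, r) current subspace estimate
    P : (r, r) inverse correlation matrix of the projected snapshots
    """
    y = W.conj().T @ x                       # project snapshot onto subspace
    h = P @ y
    g = h / (beta + y.conj() @ h)            # RLS-style gain vector
    P = (P - np.outer(g, h.conj())) / beta   # update inverse correlation
    e = x - W @ y                            # projection error
    W = W + np.outer(e, g.conj())            # subspace update
    return W, P
```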
E. Habets, I. Cohen, and S. Gannot,
"MMSE log-spectral amplitude estimator for multiple interferences",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Paris, France, Sep. 2006. In this paper we present an algorithm for robust speech enhancement based on an Optimal Modified Minimum Mean-Square Error Log-Spectral Amplitude (OM-LSA) estimator for multiple interferences. In the original OM-LSA one interference was taken into account. However, there are many situations where multiple interferences are present. Since the human ear is more sensitive to a small amount of residual non-stationary interference than to a stationary interference we would like to reduce the non-stationary interference signal down to the residual noise level of the stationary interference. Possible applications for the proposed algorithm are joint speech dereverberation and noise reduction, and joint residual echo suppression and noise reduction. Additionally, we present two possible methods to estimate the a priori Signal to Noise Ratio of each of the interferences.
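The core of any OM-LSA-type estimator is the log-spectral amplitude gain applied per time-frequency bin; a minimal sketch of that gain is below (the paper's contribution, estimating the a priori SNR per interference, is not reproduced here):

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, gamma):
    """Log-spectral amplitude gain from the a priori SNR xi and the
    a posteriori SNR gamma (arrays broadcast per TF bin)."""
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))
```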
S. Gannot and V. Avrin,
"A Simulink© and texas instruments C6713® based digital signal processing laboratory",
in The European Signal Processing Conference (EUSIPCO), Florence, Italy, Sep. 2006. In this contribution, a digital signal processing educational lab, established at the School of Electrical and Computer Engineering at Bar-Ilan University, Israel, is presented. A unique educational approach is adopted. In this approach sophisticated algorithms can be implemented in an intuitive top-level design using Simulink©. Simultaneously, our approach gives the students the opportunity to conduct hands-on experiments with real signals and hardware, using Texas Instruments (TI) C6713 evaluation boards. By taking this combined approach, we tried to focus the efforts of the students on the DSP problems themselves rather than on the actual programming. A comprehensive ensemble of experiments, which exposes the students to a wide spectrum of DSP concepts, is introduced in this paper. The experiments were designed to enable the illustration and demonstration of theoretical aspects, already acquired in several DSP courses in the curriculum.
E. Habets, S. Gannot, and I. Cohen,
"Dual-microphone speech dereverberation in a noisy environment",
in The IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Vancouver, Canada, Aug. 2006, pp. 651–655. Speech signals recorded with a distant microphone usually contain reverberation and noise, which degrade the fidelity and intelligibility of speech, and the recognition performance of automatic speech recognition systems. In earlier work, Habets (2005) presented a multi-microphone speech dereverberation algorithm to suppress late reverberation in a noise-free environment. In this paper we show how an estimate of the late reverberant energy can be obtained from noisy observations. A more sophisticated speech enhancement technique based on the optimally-modified log spectral amplitude (OM-LSA) estimator is used to suppress the undesired late reverberant signal and noise. The speech presence probability used in the OM-LSA is extended to improve the decision between speech, late reverberation and noise. Experiments using simulated and real acoustic impulse responses are presented and show significant reverberation reduction with little speech distortion.
S. Tabiby, N. Tal, J. Fainguelernt, and S. Gannot,
"Real-time implementation of a subspace dere- verberation method",
in European DSP Education and Research Symposium (EDERS), Munich, Germany, Apr. 2006.
H. Bluemanfeld, Y. Rahamim, and S. Gannot,
"Real-time implementation of an energy-based voice activity detector",
in European DSP Education and Research Symposium (EDERS), Munich, Germany, Apr. 2006.
2005
G. Reuven, S. Gannot, and I. Cohen,
"Dual source TF-GSC and its application to echo cancellation",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Eindhoven, the Netherlands, Sep. 2005, pp. 89–92.
T. Dvorkind and S. Gannot,
"Speaker localization using the unscented Kalman filter",
in The Joint Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA), Rutgers University, Piscataway, New Jersey, USA, Mar. 2005.
2004
G. Reuven, S. Gannot, and I. Cohen,
"Multichannel acoustic echo cancellation and noise reduction in reverberant environments using the transfer-function GSC",
in the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, USA, Apr. 2007.
G. Reuven, S. Gannot, and I. Cohen,
"Joint acoustic echo cancellation and transfer function GSC in the frequency domain",
in The 23rd Convention of IEEE Israel (IEEEI), Herzliya, Israel, Sep. 2004, pp. 412–415.
2003
S. Gannot and M. Moonen,
"Speech dereverberation via sub-band implementation of subspace methods",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, pp. 95–98. A novel approach for sub-band based multi-microphone speech dereverberation is presented. In a recent contribution, a method was proposed that utilizes the null subspace of the spatial-temporal correlation matrix of the received signals, obtained by the generalized eigenvalue decomposition (GEVD) procedure. The desired acoustic transfer functions (ATFs) are shown to be embedded in these generalized eigenvectors. The special Sylvester structure of the filtering matrix related to this subspace was exploited for deriving a total least squares (TLS) estimate of the ATFs. The high sensitivity of the GEVD procedure to noise, especially when the involved ATFs are very long, and the wide dynamic range of the speech signal make the proposed method problematic in realistic scenarios. In this contribution we suggest incorporating the TLS subspace method into a sub-band structure. The novel method proves to be efficient, although some new problems arise and others remain open. A preliminary experimental study supports the potential of the proposed method.
T. Dvorkind and S. Gannot,
"Speaker localization exploiting spatial-temporal information",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, pp. 295–298,
Distinguished paper. Determining the spatial position of a speaker is of growing interest in video conferencing scenarios where automated camera steering and tracking are required. Speaker localization can be achieved with a dual-step approach. In the preliminary stage, a microphone array is used to extract the time difference of arrival (TDOA) of the speech signal. These readings are then used by the second stage for the actual localization. Since the speaker trajectory must be smooth, estimates of nearby speaker positions can be used to improve the current position estimate. However, many methods, although exploiting the spatial information obtained by different microphone pairs, do not exploit this temporal information. In this contribution we present two localization schemes which exploit the temporal information. The first is the well-known extended Kalman filter (EKF). The second is a recursive form of a Gauss method, which we denote Recursive Gauss (RG). An experimental study supports the potential of the proposed methods.
S. Gannot and M. Moonen,
"On the application of the unscented kalman filter to speech processing",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, pp. 8–11,
Distinguished paper. In a series of recent studies, a new approach for applying the Kalman filter to nonlinear systems, referred to as the unscented Kalman filter (UKF), was proposed. In this contribution we apply the UKF to several speech processing problems in which a model with unknown parameters is assumed for the measured signals. We show that the nonlinearity arises naturally in these problems. Preliminary simulation results for artificial signals manifest the potential of the method.
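The building block of the UKF is the unscented transform, which propagates a Gaussian through a nonlinearity via deterministically chosen sigma points; a minimal real-valued sketch (kappa and the symmetric weighting are one common choice, not necessarily the paper's):

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    """Mean and covariance of f(x) for x ~ N(mean, cov); f must
    return a 1-D array."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.size
    S = np.linalg.cholesky((n + kappa) * cov)          # matrix square root
    sigma = np.vstack([mean, mean + S.T, mean - S.T])  # 2n+1 sigma points
    w = np.full(2 * n + 1, 0.5 / (n + kappa))          # symmetric weights
    w[0] = kappa / (n + kappa)
    fx = np.array([np.atleast_1d(f(s)) for s in sigma])
    m = w @ fx                                         # transformed mean
    d = fx - m
    return m, (w[:, None] * d).T @ d                   # mean, covariance
```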
T. Dvorkind and S. Gannot,
"Approaches for time difference of arrival estimation in a noisy and reverberant environment",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, pp. 215–218. Determining the spatial position of a speaker is of growing interest in video conferencing scenarios where automated camera steering and tracking are required. As a preliminary step for the localization, a microphone array can be used to extract the time difference of arrival (TDOA) of the speech signal. The direction of arrival of the speech signal is then determined by the relative time delay between each pair of spatially separated microphones. In this work we present novel frequency-domain approaches for TDOA calculation in a reverberant and noisy environment. Our methods are based on the quasi-stationarity property of speech, and on the fact that the speech and the noise are uncorrelated. The proposed methods are supported by an extensive experimental study.
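For orientation, the classical frequency-domain baseline against which such TDOA methods are usually judged is GCC-PHAT; a minimal sketch (the paper's own quasi-stationarity-based estimators are not reproduced):

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """TDOA estimate between two microphone signals via the
    generalized cross-correlation with phase transform."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds
```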
I. Cohen, S. Gannot, and B. Berdugo,
"Real-time TF-GSC in nonstationary noise environments",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, pp. 183–186. Adaptive beamforming techniques are inefficient for eliminating transient noise components that randomly arrive from unpredictable directions. In this paper, we present a real-time transfer function generalized sidelobe canceller (TF-GSC) for such nonstationary noise environments. Hypothesis testing in the spectral domain indicates either absence of transients, presence of an interfering transient, or presence of desired source components. The noise canceller branch of the TF-GSC is updated only during absence of transients, while the identification of the acoustical transfer function is carried out only when desired source components are present. Following the beamforming and the hypothesis testing, estimates for the signal presence probability, the noise power spectral density, and the desired speech log-spectral amplitude are derived. Experimental results demonstrate the usefulness of the proposed approach under nonstationary noise conditions.
S. Gannot and I. Cohen,
"Speech enhancement based on the general transfer function GSC and postfiltering",
in the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, China, Apr. 2003. In speech enhancement applications, microphone array postfiltering allows additional reduction of noise components at a beamformer output. Among microphone array structures, the recently proposed general transfer function generalized sidelobe canceller (TF-GSC) has shown impressive noise reduction abilities in a directional noise field, while still maintaining low speech distortion. However, in a diffuse noise field, less significant noise reduction is obtainable. The performance is even further degraded when the noise signal is nonstationary. In this contribution we propose three postfiltering methods for improving the performance of microphone arrays. Two of these are based on single-channel speech enhancers, making use of recently proposed algorithms concatenated to the beamformer output. The third is a multichannel speech enhancer which exploits noise-only components constructed within the TF-GSC structure. This work concentrates on the assessment of the proposed postfiltering structures. An extensive experimental study, which consists of both objective and subjective evaluation in various noise fields, demonstrates the advantage of the multichannel postfiltering compared to the single-channel techniques.
2002
T. Dvorkind and S. Gannot,
"Speaker localization in a reverberant environment",
in The 22nd Convention of IEEE Israel (IEEEI), Tel-Aviv University, Israel, Dec. 2002, pp. 7–9. The problem of speaker localization is addressed in this work. We present a novel approach for estimating the time difference of arrival (TDOA) of the speech signal to a microphone array, in a reverberant and noisy environment. By estimating acoustical transfer function (ATF) ratios, the TDOA is extracted from a relatively short impulse response. Our approach shows superior performance, compared with the traditional generalized cross correlation (GCC) method.
2001
S. Gannot and M. Moonen,
"Subspace methods for multi-microphone speech dereverberation",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Darmstadt, Germany, Sep. 2001.
S. Gannot, D. Burshtein, and E. Weinstein,
"Theoretical analysis of the general transfer function GSC",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Darmstadt, Germany, Sep. 2001. In recent work we considered the use of a microphone array located in a reverberant room, where general acoustic transfer functions (ATFs) relate the source signal and the microphones, for enhancing a speech signal contaminated by interference. The resulting frequency-domain algorithm enables dealing with a complicated ATF in the same simple manner as the Griffiths & Jim GSC algorithm deals with delay-only arrays. In this contribution a general expression of the enhancer output is derived. This expression is used for evaluating two figures of merit, i.e., the noise reduction ability and the amount of distortion imposed. The performance is shown to depend on the ATFs involved, the noise field, and the quality of estimation of the ATF ratios. An analytical performance evaluation of the method is obtained. It is shown that the proposed method maintains its good performance even in the general ATF case.
1999
S. Gannot, D. Burshtein, and E. Weinstein,
"Beamforming methods for multi-channel speech enhancement",
in The International Workshop on Acoustic Echo and Noise Control (IWAENC), Pocono Manor, Pennsylvania, USA, Sep. 1999, pp. 96–99.
S. Gannot and D. Burshtein,
"Speech enhancement using a mixture-maximum model",
in EuroSpeech, Budapest, Hungary, Sep. 1999.
1997
S. Gannot, D. Burshtein, and E. Weinstein,
"Iterative-batch and sequential algorithms for single microphone speech enhancement",
in the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, Munich, Germany, 1997, pp. 1215–1218. Speech quality and intelligibility might significantly deteriorate in the presence of background noise, especially when the speech signal is subject to subsequent processing. In this paper we present a class of Kalman-filter based speech enhancement algorithms with some extensions, modifications, and improvements. The first algorithm employs the estimate-maximize (EM) method to iteratively estimate the spectral parameters of the speech and noise signals. The enhanced speech signal is obtained as a by-product of the parameter estimation algorithm. The second algorithm is a sequential, computationally efficient, gradient descent algorithm. We discuss various topics concerning the practical implementation of these algorithms. An experimental study using real speech and noise signals is provided to compare these algorithms with alternative speech enhancement algorithms, and to compare the performance of the iterative and sequential algorithms.
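A hedged sketch of the Kalman filtering core shared by both algorithms, assuming fixed AR parameters for the clean speech and white measurement noise (the paper's contribution, the EM-based iterative and the sequential gradient-descent parameter estimation, is not shown):

```python
import numpy as np

def kalman_enhance(noisy, ar_coeffs, q, r):
    """Enhance a noisy speech signal with a Kalman filter driven by
    an AR model of the clean speech, in companion (state-space) form.

    ar_coeffs : (p,) AR coefficients a_1..a_p of the clean speech
    q, r      : excitation and measurement noise variances
    """
    noisy = np.asarray(noisy, dtype=float)
    p = len(ar_coeffs)
    F = np.zeros((p, p))                 # state transition, companion form
    F[0, :] = ar_coeffs
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros(p); H[0] = 1.0          # observe the newest speech sample
    x, P = np.zeros(p), np.eye(p)
    out = np.empty_like(noisy)
    for t, y in enumerate(noisy):
        x = F @ x                        # predict next state
        P = F @ P @ F.T
        P[0, 0] += q                     # excitation enters the first state
        k = P @ H / (H @ P @ H + r)      # Kalman gain
        x = x + k * (y - H @ x)          # update with the noisy sample
        P = P - np.outer(k, H) @ P
        out[t] = x[0]                    # enhanced sample estimate
    return out
```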
1996
S. Gannot, D. Burshtein, and E. Weinstein,
"Algorithms for single microphone speech enhancement",
in the 19th Convention of IEEE Israel (IEEEI), Jerusalem, Israel, 1996, pp. 94–97.
Copyright Notice
Downloading of any paper is permitted for personal use only.
Permission to reprint / republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the author(s) and the respective publisher.
Copyright and all other rights therein are retained by authors or by other copyright holders.
All persons downloading this information are expected to adhere to the terms and constraints invoked by each publisher and author’s copyright.
In most cases, these works may not be reposted without the explicit permission of the copyright holder.