In the last decade, the signal processing (SP) community has witnessed a paradigm shift from model-based to data-driven methods. Machine learning (ML) methodologies, deep learning in particular, are nowadays widely used in all SP fields, e.g., audio, speech, image, video, multimedia, and multimodal/multisensor processing. Many data-driven methods also incorporate domain knowledge to improve problem modeling, especially when computational burden, training-data scarcity, and memory size are important constraints.
In this paper, we propose a data-driven approach for multiple-speaker tracking in reverberant enclosures. The method comprises two stages. The first stage performs single-source localization using semi-supervised learning on multiple manifolds. The second stage uses unsupervised time-varying maximum likelihood estimation for tracking. The feature vectors used by both stages are the relative transfer functions (RTFs), which are known to be related to source positions. The number of sources is assumed to be known, while the microphone positions are unknown. In the training stage, a large database of RTFs is given. A small percentage of the data is attributed with exact positions, and the rest is assumed to be unlabelled, i.e., the respective positions are unknown. Then, a nonlinear, manifold-based mapping function between the RTFs and the source positions is inferred. Applying this mapping function to all unlabelled RTFs constructs a dense grid of localized sources. In the test phase, this RTF grid serves as the centroids for a mixture of Gaussians (MoG) model. The MoG parameters are estimated by applying recursive expectation-maximization (EM). The EM procedure relies on the sparsity and intermittency of the speech signals. A comprehensive simulation study, with two overlapping speakers at various reverberation levels, demonstrates the usability of the proposed scheme, achieving a high level of accuracy compared to a baseline method using a simpler propagation model.
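The test-phase estimator described above can be sketched in a few lines. Below is a minimal, hypothetical 1-D example in which EM updates only the mixture weights of a MoG whose centroids are the pre-localized grid points; the 1-D setting, variable names, and fixed shared variance are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def mog_weight_em(x, centroids, sigma=1.0, n_iter=50):
    """Estimate mixture weights of a MoG with fixed centroids via EM.

    The grid of localized sources acts as fixed Gaussian centroids, and
    EM only updates the mixture weights (interpretable as source
    activity) from the observed 1-D features x.
    """
    K = len(centroids)
    w = np.full(K, 1.0 / K)                       # uniform initial weights
    for _ in range(n_iter):
        # E-step: responsibility of each centroid for each sample
        d2 = (x[:, None] - centroids[None, :]) ** 2
        lik = w * np.exp(-0.5 * d2 / sigma**2)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update weights only (centroids stay on the grid)
        w = r.mean(axis=0)
    return w
```

With most samples clustered around one centroid, the corresponding weight dominates after a few iterations.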
This paper addresses the problem of tracking a moving source, e.g., a robot equipped with both receivers and a source, that tracks its own location while simultaneously estimating the locations of multiple plane reflectors. We assume noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that, for the examples tested in simulation, the HCRB is an asymptotically tight bound and the EKF is an optimal estimator. Whether this property holds in general remains an open question.
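The EKF recursion underlying such a tracker can be sketched generically. The following is a textbook predict/update cycle, not the paper's specific kinematic or reflector model; all function and variable names are placeholders.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F, H, Q, R):
    """One predict/update cycle of an extended Kalman filter.

    f, h are the (possibly nonlinear) motion and observation functions;
    F, H return their Jacobians evaluated at the current estimate;
    Q, R are the process and measurement noise covariances.
    """
    # Predict: propagate the state through the motion model
    x_pred = f(x, u)
    Fx = F(x, u)
    P_pred = Fx @ P @ Fx.T + Q
    # Update: linearize the observation model around the prediction
    Hx = H(x_pred)
    S = Hx @ P_pred @ Hx.T + R
    K = P_pred @ Hx.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ Hx) @ P_pred
    return x_new, P_new
```

For a linear model, the recursion reduces to the standard Kalman filter.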
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about the times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo-labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker at its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they can provide accurate estimates at SNRs 10 dB lower than those required for TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods to mapping an indoor environment.
Ad hoc acoustic networks comprising multiple nodes, each consisting of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple-speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops, and hearing aids, using them collaboratively for acoustic processing is a promising platform for emerging applications. These devices make up an acoustic sensor network comprising nodes, i.e., distributed devices equipped with microphone arrays, a communication unit, and a processing unit. Algorithms for speaker separation and localization using such a network require precise knowledge of the nodes’ locations and orientations. To acquire this knowledge, a recently introduced approach proposed a combined direction of arrival (DoA) and time difference of arrival (TDoA) target function for off-line calibration with dedicated recordings. This paper proposes an extension of this approach to a novel online method with two new features: First, by employing an evolutionary algorithm on incremental measurements, it is online and fast enough for real-time application. Second, by using the sparse spike representation computed in a cochlear model for TDoA estimation, the amount of information shared between the nodes by transmission is reduced while the accuracy is increased. The proposed approach is able to calibrate an acoustic sensor network online during a meeting in a reverberant conference room.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, often based on statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
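As an illustration of how an estimated late-reverberation PSD feeds into the MWF, here is a minimal single-bin sketch under a rank-1 speech model and a spatially diffuse late-reverb model. The variable names (rtf, phi_s, phi_late, gamma_diff) are assumptions for illustration, not the notation of the surveyed estimators.

```python
import numpy as np

def mwf_weights(rtf, phi_s, phi_late, gamma_diff, mu=1.0):
    """Multichannel Wiener filter for one frequency bin.

    rtf        : relative transfer function of the source (M,)
    phi_s      : speech PSD at the reference microphone
    phi_late   : estimated late-reverberation PSD
    gamma_diff : diffuse-field coherence matrix (M, M)
    mu         : noise-reduction trade-off parameter
    """
    phi_x = phi_s * np.outer(rtf, np.conj(rtf))   # rank-1 speech covariance
    phi_v = phi_late * gamma_diff                 # late-reverb covariance
    # Wiener solution producing the speech estimate at reference mic 0
    w = np.linalg.solve(phi_x + mu * phi_v, phi_x[:, 0])
    return w
```

The quality of phi_late, supplied by one of the PSD estimators compared in the paper, directly determines how well the filter suppresses late reverberation.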
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes:
a) the acoustic impulse response model
b) the spatial filter design criterion
c) the parameter estimation algorithm
d) optional postfiltering.
We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
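One of the four axes, the spatial filter design criterion, can be illustrated with the classical MVDR solution. This is a minimal sketch for a single frequency bin, assuming a known steering vector and noise covariance; it is one representative criterion, not a summary of all techniques surveyed.

```python
import numpy as np

def mvdr_weights(steering, noise_cov):
    """Minimum variance distortionless response (MVDR) spatial filter.

    Minimizes the output noise power w^H R w subject to a distortionless
    response toward the steering vector: w^H d = 1.
    """
    Rn_inv_d = np.linalg.solve(noise_cov, steering)
    return Rn_inv_d / (np.conj(steering) @ Rn_inv_d)
```

For spatially white noise the MVDR weights reduce to a delay-and-sum beamformer, which makes the distortionless constraint easy to verify.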
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign designed to evaluate such speech enhancement (SE) and ASR techniques, to reveal the state-of-the-art techniques, and to obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue, and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that sources do not become simultaneously active. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP results in a null steering toward the RTF of interest, by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal-to-noise ratio (SNR) and signal-to-interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.
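The linear-algebra core of an oblique projection, on which BOP relies, can be sketched as follows. This is the standard construction of a projector onto one subspace along another, not the paper's full BOP-SRI procedure; matrix names A and B are generic placeholders.

```python
import numpy as np

def oblique_projection(A, B):
    """Oblique projector onto range(A) along range(B).

    Applied to measurements, the projector passes components lying in
    range(A) unchanged and nulls components lying in range(B), which is
    the kind of null steering BOP exploits.
    """
    # Orthogonal projector onto the complement of range(B)
    PB_perp = np.eye(B.shape[0]) - B @ np.linalg.pinv(B)
    # E = A (A^H PB_perp A)^{-1} A^H PB_perp, via the pseudoinverse
    return A @ np.linalg.pinv(PB_perp @ A) @ PB_perp
```

By construction, E @ A = A and E @ B = 0, even when range(A) and range(B) are not orthogonal.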
The challenge of blindly resynchronizing the data acquisition processes in a wireless acoustic sensor network (WASN) is addressed in this paper. The sampling rate offset (SRO) is precisely modeled as a time scaling. The applicability of a wideband correlation processor for estimating the SRO, even in a reverberant and multiple source environment, is presented. An explicit expression for the ambiguity function, which in our case involves time scaling of the received signals, is derived by applying truncated band-limited interpolation. We then propose the recursive band-limited interpolation (RBI) algorithm for recursive SRO estimation. A complete resynchronization scheme utilizing the RBI algorithm, in parallel with the SRO compensation module, is presented. The resulting resynchronization method operates in the time domain in a sequential manner and is thus capable of tracking a potentially time-varying SRO. We compared the performance of the proposed RBI algorithm to other available methods in a simulation study. The importance of resynchronization in a beamforming application is demonstrated by both a simulation study and experiments with a real WASN. Finally, we present an experimental study evaluating the expected SRO level between typical data acquisition devices.
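The time-scaling model of the SRO and its compensation by truncated band-limited interpolation can be sketched as follows. This is a naive offline version for illustration, not the recursive RBI algorithm; the parameter names and the fixed kernel half-length are assumptions.

```python
import numpy as np

def compensate_sro(x, sro_ppm, half_len=16):
    """Resample a signal to undo a sampling rate offset (SRO).

    The SRO is modeled as a time scaling t -> (1 + sro_ppm * 1e-6) * t;
    each output sample is interpolated at the rescaled fractional
    position with a truncated sinc (band-limited) kernel.
    """
    scale = 1.0 + sro_ppm * 1e-6
    t = np.arange(len(x)) * scale          # rescaled fractional times
    y = np.zeros(len(x))
    for i, ti in enumerate(t):
        k0 = int(np.floor(ti))
        k = np.arange(k0 - half_len, k0 + half_len + 1)
        valid = (k >= 0) & (k < len(x))
        y[i] = np.sum(x[k[valid]] * np.sinc(ti - k[valid]))
    return y
```

For typical SRO levels of tens to hundreds of ppm, the fractional offset grows slowly along the signal, which is what makes a recursive, tracking formulation attractive.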
Distributed acoustic tracking estimates the trajectories of source positions using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can then be probabilistically triangulated by exploiting the spatial diversity of all nodes.
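The reliability-weighted triangulation idea can be sketched as a weighted least-squares intersection of bearing lines. In this minimal 2-D example the weights stand in for a CDR-based reliability measure; node positions, azimuth-only DoAs, and variable names are illustrative assumptions.

```python
import numpy as np

def triangulate_doa(nodes, doas, weights):
    """Weighted least-squares triangulation from per-node DoA estimates.

    Each node at position p contributes a bearing line with direction
    u = (cos(theta), sin(theta)); the solution minimizes the weighted
    squared distance to all bearing lines.
    """
    dim = nodes.shape[1]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for p, theta, w in zip(nodes, doas, weights):
        u = np.array([np.cos(theta), np.sin(theta)])
        proj = np.eye(dim) - np.outer(u, u)   # projector onto line normal
        A += w * proj
        b += w * proj @ p
    return np.linalg.solve(A, b)
```

Down-weighting a node whose DoA is unreliable (low coherent-to-diffuse ratio) pulls the solution toward the bearings of the more reliable nodes.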
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method based on human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of multi-path conditions. The proposed method is evaluated using both simulated data and real-life recordings.
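TDoA measurements of the kind used here are commonly obtained with a generalized cross-correlation. Below is a minimal full-band GCC-PHAT sketch; the paper uses subband TDoAs and a filtering front-end, which this example does not reproduce.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, max_lag):
    """TDoA estimation with the GCC-PHAT cross-correlation.

    The phase transform whitens the cross-spectrum so the correlation
    peak depends only on the time shift. Returns the lag (in samples) of
    x1 relative to x2 (positive if x1 is delayed).
    """
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    cc = np.fft.irfft(cross, n)
    # Rearrange so lags run from -max_lag to +max_lag
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return np.argmax(np.abs(cc)) - max_lag
```

For broadband signals the PHAT weighting sharpens the correlation peak, which is what makes GCC-PHAT a common choice in mildly reverberant conditions.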
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets) and their different priorities regarding communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
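The closed-form LCMV solution that these distributed algorithms reproduce at each node can be sketched for a single frequency bin. This is a minimal centralized version under assumed variable names, not any of the three distributed realizations.

```python
import numpy as np

def lcmv_weights(C, f, noise_cov):
    """Linearly constrained minimum variance (LCMV) beamformer weights.

    Solves min_w w^H R w subject to C^H w = f, where the columns of C
    hold the steering vectors (e.g., RTFs) of the speakers and f the
    desired responses toward them.
    """
    Rinv_C = np.linalg.solve(noise_cov, C)                 # R^{-1} C
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)
```

With the responses in f set to 1 for desired speakers and 0 for interferers, the beamformer extracts one speaker while nulling the others.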
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides good noise reduction performance and preserves the binaural cues of the desired source, it does not allow controlling the reduction of the interfering sources, and it distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming to merge the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N, and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints on both the desired and the interfering sources. Besides its capability to reduce interference and noise, it also enables preserving the binaural cues of both the desired and interfering sources, hence making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., the desired amount of interference and noise reduction and the presence of estimation errors in the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To this end, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance as the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using the factor graph statistical model. In our model, the speakers can be located in one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs, and then applying a TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm in both simulated data and real-life measurements recorded in our laboratory.
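The final masking step, assigning each TF bin to one speaker, can be sketched as follows. This is a minimal illustration assuming a bin-to-speaker association map (here called assoc) is already available from the inference stage; it does not include the beamforming or LBP components.

```python
import numpy as np

def separate_by_mask(stft_mix, assoc, n_speakers):
    """Separate sources with binary time-frequency (TF) masking.

    assoc[k, l] holds the speaker index estimated for TF bin (k, l);
    each speaker's STFT keeps only the bins associated with it, under
    the usual W-disjoint orthogonality assumption for speech.
    """
    outs = []
    for s in range(n_speakers):
        mask = (assoc == s).astype(float)   # binary mask for speaker s
        outs.append(stft_mix * mask)
    return outs
```

Because the masks are disjoint and exhaustive, the per-speaker STFTs sum back to the mixture.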
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with time-varying audio channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
Distributed acoustic tracking estimates the trajectories of source positions using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can thereby be probabilistically triangulated by exploiting the spatial diversity of all nodes.
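The triangulation step admits a compact weighted least-squares form. The sketch below is a toy illustration of reliability-weighted triangulation from per-node DoA estimates; the node layout, weights, and function names are our own constructions, not taken from the letter:

```python
import numpy as np

def triangulate(nodes, doas, weights):
    """Weighted least-squares intersection of bearing lines.

    nodes   : (K, 2) array of known node positions
    doas    : (K,) DoA estimates [rad], one per node
    weights : (K,) reliability weights (e.g., coherent-to-diffuse ratios)
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, theta, w in zip(nodes, doas, weights):
        u = np.array([np.cos(theta), np.sin(theta)])
        P = np.eye(2) - np.outer(u, u)  # projector onto the line's normal space
        A += w * P
        b += w * P @ p
    return np.linalg.solve(A, b)

# Synthetic example: three nodes observing one source with exact bearings
nodes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
src = np.array([2.0, 1.5])
doas = np.array([np.arctan2(*(src - p)[::-1]) for p in nodes])
pos = triangulate(nodes, doas, np.array([1.0, 0.5, 0.2]))
```

With exact bearings the weighted solution recovers the source position regardless of the weights; the weights matter once the DoAs are noisy, down-weighting unreliable nodes.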
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities: high-dimensional acoustic features representing the full reverberant acoustic channels, and low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process, used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops and hearing aids, using them collaboratively for acoustic processing is a promising platform for emerging applications. These devices make up an acoustic sensor network comprised of nodes, i.e., distributed devices equipped with microphone arrays, a communication unit and a processing unit. Algorithms for speaker separation and localization using such a network require precise knowledge of the nodes' locations and orientations. To acquire this knowledge, a recently introduced approach proposed a combined direction of arrival (DoA) and time difference of arrival (TDoA) target function for off-line calibration with dedicated recordings. This paper proposes an extension of this approach to a novel online method with two new features: First, by employing an evolutionary algorithm on incremental measurements, it is online and fast enough for real-time application. Second, by using the sparse spike representation computed in a cochlear model for TDoA estimation, the amount of information shared between the nodes by transmission is reduced while the accuracy is increased. The proposed approach is able to calibrate an acoustic sensor network online during a meeting in a reverberant conference room.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adapt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, merely controls the white noise gain (WNG) of the beamformer. We also show that, in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
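The LCMP beamformer referenced above has the closed form w = R⁻¹C(CᴴR⁻¹C)⁻¹g. A minimal numerical sketch follows; the array size, look/null directions, and noise covariance are illustrative assumptions, not the letter's setup:

```python
import numpy as np

def lcmp_weights(R, C, g):
    """LCMP weights: w = R^{-1} C (C^H R^{-1} C)^{-1} g."""
    Ri_C = np.linalg.solve(R, C)
    return Ri_C @ np.linalg.solve(C.conj().T @ Ri_C, g)

N = 6  # sensors of a half-wavelength ULA (assumed geometry)
steer = lambda th: np.exp(-1j * np.pi * np.arange(N) * np.sin(th))
C = np.column_stack([steer(0.0), steer(np.pi / 4)])  # look + null direction
g = np.array([1.0, 0.0])                             # distortionless + null

rng = np.random.default_rng(0)
X = rng.standard_normal((N, 200)) + 1j * rng.standard_normal((N, 200))
R = X @ X.conj().T / 200 + np.eye(N)  # sample covariance with loading
w = lcmp_weights(R, C, g)
```

By construction the weights satisfy the linear constraints exactly: unit response towards the look direction and a spatial null at the interferer.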
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
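The Kalman-based E-step recursion can be illustrated with a scalar stand-in: a Kalman filter tracking a slowly varying random-walk coefficient (a toy surrogate for the time-varying acoustic system) from noisy observations. All signals and parameters below are synthetic assumptions, not the paper's multichannel model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, q, r = 500, 1e-4, 0.1  # steps, process variance, observation variance
# True coefficient: a slow random walk around 1.0
h = np.cumsum(np.sqrt(q) * rng.standard_normal(T)) + 1.0
y = h + np.sqrt(r) * rng.standard_normal(T)  # noisy observations

x, P = 0.0, 1.0  # prior mean and variance
for t in range(T):
    P += q                  # predict step (random-walk model)
    K = P / (P + r)         # Kalman gain
    x += K * (y[t] - x)     # update with the innovation
    P *= (1 - K)            # posterior variance
```

The posterior variance settles near sqrt(q*r), so the online estimate follows the drifting coefficient far more closely than any single raw observation would.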
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominant speakers, a spectral mask is computed and utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
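The convex-geometry step can be sketched on synthetic data: the leading eigenvectors of the frame correlation matrix embed the frames in a simplex whose vertices are single-speaker frames, and a successive projection pass (a standard vertex-detection tool, used here as a stand-in for the papers' exact procedure) recovers those frames. The activity matrix below is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
J, T = 3, 30  # speakers, time frames
# Speaker-activity probabilities: mostly mixed frames...
P = rng.dirichlet(np.ones(J) * 2.0, size=T)
# ...plus three single-speaker ("vertex") frames at indices 0, 10, 20
P[0], P[10], P[20] = np.eye(J)

R = P @ P.T                    # frame correlation matrix
_, vecs = np.linalg.eigh(R)
E = vecs[:, -J:]               # rows: simplex embedding of the frames

def spa(E, J):
    """Successive projection: pick the max-norm row, deflate, repeat."""
    X, idx = E.copy(), []
    for _ in range(J):
        k = int(np.argmax(np.linalg.norm(X, axis=1)))
        idx.append(k)
        v = X[k] / np.linalg.norm(X[k])
        X -= np.outer(X @ v, v)  # project out the found vertex direction
    return sorted(idx)

vertices = spa(E, J)
```

Because every mixed frame is a convex combination of the single-speaker frames, the maximum-norm row at each deflation step is always a simplex vertex, so the three planted pure frames are recovered.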
Objective: The present study implements an automatic method of assessing arousal in vocal data as well as dynamic system models to explore intrapersonal and interpersonal affect dynamics within psychotherapy and to determine whether these dynamics are associated with treatment outcomes. Method: The data of 21,133 mean vocal arousal observations were extracted from 279 therapy sessions in a sample of 30 clients treated by 24 therapists. Before and after each session, clients self-reported their well-being level, using the Outcome Rating Scale. Results: Both clients’ and therapists’ vocal arousal showed intrapersonal dampening. Specifically, although both therapists and clients departed from their baseline, their vocal arousal levels were “pulled” back to these baselines. In addition, both clients and therapists exhibited interpersonal dampening. Specifically, both the clients’ and the therapists’ levels of arousal were “pulled” toward the other party’s arousal level, and clients were “pulled” by their therapists’ vocal arousal toward their own baseline. These dynamics exhibited a linear change over the course of treatment: whereas interpersonal dampening decreased over time, there was an increase in intrapersonal dampening over time. In addition, higher levels of interpersonal dampening were associated with better session outcomes. Conclusions: These findings demonstrate the advantages of using automatic vocal measures to capture nuanced intrapersonal and interpersonal affect dynamics in psychotherapy and demonstrate how these dynamics are associated with treatment gains.
The IEEE Audio and Acoustic Signal Processing Technical Committee (AASP TC) is one of 13 TCs in the IEEE Signal Processing Society. Its mission is to support, nourish, and lead scientific and technological development in all areas of AASP. These areas are currently seeing increased levels of interest and significant growth, providing a fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning (ML), from consumer electronics devices to blue-sky research, this scope encompasses countless technical challenges and many hot topics. The TC has roughly 30 elected volunteer members drawn equally from leading academic and industrial organizations around the world, unified by the common aim of offering their expertise in the service of the scientific community.
The gain achieved by a superdirective beamformer operating in a diffuse noise field is significantly higher than the gain attainable with conventional delay-and-sum weights. A classical result states that for a compact linear array consisting of N sensors which receives a plane-wave signal from the endfire direction, the optimal superdirective gain approaches N². It has been noted that in the near-field regime higher gains can be attained. The gain can increase, in theory, without bound for increasing wavelength or decreasing source-receiver distance. We aim to address the phenomenon of near-field superdirectivity in a comprehensive manner. We derive the optimal performance for the limiting case of an infinitesimal-aperture array receiving a spherical-wave signal. This is done with the aid of a sequence of linear transformations. The resulting gain expression is a polynomial, which depends on the number of sensors employed, the wavelength, and the source-receiver distance. The resulting gain curves are optimal and outperform weights corresponding to other superdirectivity methods. The practical case of a finite-aperture array is discussed. We present conditions for which the gain of such an array would approach that predicted by the theory of the infinitesimal case. The white noise gain (WNG) metric of robustness is shown to increase in the near-field regime.
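The classical endfire limit can be checked numerically. Assuming the standard diffuse-field coherence sin(kd)/(kd) and an endfire plane-wave steering vector (array spacing and sizes below are illustrative choices), the optimal gain dᴴΓ⁻¹d approaches N² as the aperture shrinks:

```python
import numpy as np

def superdirective_gain(N, spacing_wl):
    """Optimal diffuse-noise gain of an endfire ULA, spacing in wavelengths."""
    k_d = 2 * np.pi * spacing_wl           # inter-sensor spacing in kd units
    pos = np.arange(N) * k_d
    # Diffuse-field coherence sin(kd)/(kd); np.sinc(x) = sin(pi x)/(pi x)
    Gamma = np.sinc((pos[:, None] - pos[None, :]) / np.pi)
    d = np.exp(-1j * pos)                  # endfire plane-wave steering vector
    return float(np.real(d.conj() @ np.linalg.solve(Gamma, d)))

G2 = superdirective_gain(2, 0.005)  # small aperture: expect gain near 4
G3 = superdirective_gain(3, 0.005)  # small aperture: expect gain near 9
```

For comparison, delay-and-sum weights on such a compact array achieve almost no diffuse-noise gain, since the noise itself is nearly coherent across the tiny aperture.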
This paper addresses the problem of tracking a moving source, e.g., a robot, equipped with both receivers and a source, that is tracking its own location and simultaneously estimating the locations of multiple plane reflectors. We assume a noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates a successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that for the examples tested in simulation, the HCRB is an asymptotically tight bound and that the EKF is an optimal estimator. Whether this property is true in general remains an open question.
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g., the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is smaller than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator, which jointly estimates the speech, reverberation and noise PSDs; the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator that jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator that blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
A recursive maximum-likelihood algorithm (RML) is proposed that can be used when both the observations and the hidden data have continuous values and are statistically dependent between different time samples. The algorithm recursively approximates the probability density functions of the observed and hidden data by analytically computing the integrals with respect to the state variables, where the parameters are updated using gradient steps. A full convergence proof is given, based on the ordinary differential equation approach, which shows that the algorithm converges to a local minimum of the Kullback-Leibler divergence between the true and the estimated parametric probability density functions, a result that is useful even for a misspecified parametric model. Compared to other RML algorithms proposed in the literature, this contribution extends the state-space model and provides a theoretical analysis in a non-trivial statistical model that was not analyzed so far. We further extend the RML analysis to constrained parameter estimation problems. Two examples, including nonlinear state-space models, are given to highlight this contribution.
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSEs of the two estimators indeed follow the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints for both the desired and the interfering sources. Besides its capabilities to reduce interference and noise, it also enables preserving the binaural cues of both the desired and interfering sources, hence making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., the desired amount of interference and noise reduction and the presence of estimation errors in the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To address this, in this paper we propose two extensions of the binaural MWF, namely the binaural MWF with RTF preservation (MWF-RTF), aiming to preserve the RTF of the interfering source, and the binaural MWF with interference rejection (MWF-IR), aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance to the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look and null directions. In this contribution, we show that, when the optimal DF is averaged over all look directions, the result equals the number of sensors minus the number of null constraints. This result holds regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
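The stated identity is easy to verify numerically in a toy 2-D setting of our own construction (grid-based isotropic noise, random sensor positions); the sketch uses the Schur-complement form of the null-constrained optimal DF:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 6, 2, 360                      # sensors, null constraints, grid size
pos = rng.uniform(0, 2, size=(N, 2))     # arbitrary sensor positions (wavelengths)
grid = 2 * np.pi * np.arange(M) / M      # dense grid of candidate directions

def steer(th):
    u = np.array([np.cos(th), np.sin(th)])
    return np.exp(-2j * np.pi * pos @ u)  # plane-wave steering, unit wavelength

D = np.column_stack([steer(th) for th in grid])  # N x M steering matrix
Gamma = D @ D.conj().T / M               # grid-based isotropic noise covariance

B = D[:, [0, 90]]                        # null directions: 0 and 90 degrees
Gi_B = np.linalg.solve(Gamma, B)
# Optimal null-constrained DF is d^H Q d with the deflated inverse Q:
Q = np.linalg.inv(Gamma) - Gi_B @ np.linalg.solve(B.conj().T @ Gi_B, Gi_B.conj().T)
df = np.real(np.einsum('nm,nk,km->m', D.conj(), Q, D))  # DF per look direction
avg_df = df.mean()
```

Averaging dᴴQd over the same grid that defines Gamma gives tr(QGamma) = N - K exactly, matching the claimed identity for any sensor layout.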
This paper presents a new dataset of measured multichannel Room Impulse Responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflectors position estimation. The dataset is provided with software utilities to easily access, manipulate and visualize the data as well as baseline methods for echo-related tasks.
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they provide accurate estimates at SNRs 10 dB lower than the commonly used TOA extraction from room impulse responses. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping of an indoor environment.
The problem of blind and online speaker localization and separation using multiple microphones is addressed
based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is
proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer
function (RTF) estimation. The DOA estimation task uses only the time frequency (TF) bins dominated by a
single speaker while the entire frequency range is not required to accomplish this task. In contrast, the RTF
estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin.
Accordingly, a different statistical model is used for the two tasks. The first REM model is applied under the
assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG)
model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are
estimated using these bins. The second REM model is applied under the assumption that the speakers are
concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate
the speakers. As a result of the assumption of the concurrent speakers, a more precise TF map of the speakers’
activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which are
constructed using the DOAs obtained in the previous stage. Next, using the linearly constrained minimum
variance (LCMV)-BF that utilizes the estimated RTFs, the speech signals are separated. The algorithm is
evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the
estimated DOAs and the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA
estimation and speaker separation algorithm.
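As an illustration of the MoG-based E-step described above, the following minimal Python sketch (a toy with assumed scalar features and unit variances, not the paper's exact statistical model) computes the posterior probability that each TF bin is dominated by each of K candidate speakers:

```python
import numpy as np

def mog_responsibilities(features, means, variances, weights):
    """Posterior probability that each TF bin (one scalar spatial
    feature per bin) belongs to each of K mixture components.
    features: (N,); means, variances, weights: (K,)."""
    features = np.asarray(features, dtype=float)[:, None]   # (N, 1)
    # Gaussian likelihood of each bin under each component
    lik = np.exp(-0.5 * (features - means) ** 2 / variances)
    lik /= np.sqrt(2.0 * np.pi * variances)
    post = weights * lik                                    # (N, K)
    return post / post.sum(axis=1, keepdims=True)

resp = mog_responsibilities([0.1, 2.9],
                            means=np.array([0.0, 3.0]),
                            variances=np.array([1.0, 1.0]),
                            weights=np.array([0.5, 0.5]))
# Each row sums to one; bins near a component mean are attributed to it.
```

Bins whose responsibility is concentrated on one component would then be treated as single-speaker dominated and used for DOA estimation.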
This paper addresses the problem of a moving agent, e.g., a robot, equipped with both receivers and a source, that tracks its own location and simultaneously estimates the locations of multiple plane reflectors. We assume noisy knowledge of the robot’s movement. We formulate this problem, also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that, for the examples tested in simulation, the HCRB is an asymptotically tight bound and the EKF is an optimal estimator. Whether this property holds in general remains an open question.
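The EKF recursion at the heart of the approach above can be sketched in miniature. The following Python example (a generic predict-update cycle with an assumed 2-D constant-velocity state and a nonlinear range observation, not the paper's hybrid SLAM formulation) shows the linearization via the observation Jacobian:

```python
import numpy as np

def ekf_step(x, P, z, F, Q, R):
    """One EKF predict + update for state x with covariance P,
    range measurement z, linear dynamics F, noises Q and R."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Nonlinear range observation h(x) = ||position|| and its Jacobian
    r = np.hypot(x[0], x[1])
    H = np.array([[x[0] / r, x[1] / r, 0.0, 0.0]])
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + (K @ np.atleast_1d(z - r)).ravel()
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 0.1
F = np.block([[np.eye(2), dt * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])
x, P = np.array([3.0, 4.0, 0.0, 0.0]), np.eye(4)
x, P = ekf_step(x, P, z=5.2, F=F, Q=0.01 * np.eye(4), R=np.array([[0.1]]))
# The update pulls the predicted range (5.0) towards the measurement (5.2).
```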
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target tracking, we apply a non-Bayesian approach, which makes no assumptions on the target trajectories other than a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach and is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated in an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the common causal REM.
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple-speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested in an extensive simulation study.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, which are often based in statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
A recursive maximum-likelihood (RML) algorithm is proposed that can be used when both the observations and the hidden data have continuous values and are statistically dependent between different time samples. The algorithm recursively approximates the probability density functions of the observed and hidden data by analytically computing the integrals with respect to the state variables, where the parameters are updated using gradient steps. A full convergence proof is given, based on the ordinary differential equation approach, which shows that the algorithm converges to a local minimum of the Kullback-Leibler divergence between the true and the estimated parametric probability density functions; a result which is useful even for a misspecified parametric model. Compared to other RML algorithms proposed in the literature, this contribution extends the state-space model and provides a theoretical analysis of a non-trivial statistical model that has not been analyzed so far. We further extend the RML analysis to constrained parameter estimation problems. Two examples, including nonlinear state-space models, are given to highlight this contribution.
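The flavor of such recursive gradient-based ML updates can be conveyed by a toy example (an illustration only, not the paper's algorithm): online estimation of a Gaussian mean with known unit variance, where the gradient of the instantaneous log-likelihood reduces to the innovation and the step size diminishes over time:

```python
import numpy as np

# Toy recursive ML: estimate the mean of N(2, 1) from a stream of
# samples. The score of log N(x; theta, 1) w.r.t. theta is (x - theta),
# so each gradient step nudges theta towards the new sample.
rng = np.random.default_rng(0)
theta = 0.0
for n, x in enumerate(rng.normal(loc=2.0, scale=1.0, size=5000), start=1):
    step = 1.0 / n                  # diminishing step size
    theta += step * (x - theta)     # gradient step on the log-likelihood
# theta converges towards the true mean 2.0
```

With this particular step-size schedule the recursion coincides with the running sample mean, which makes the convergence easy to verify; the RML algorithms above handle far richer state-space dependence, but the update structure is analogous.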
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which obviously increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions.
In this study, we propose to combine a method based on human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm.
Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of the multi-path conditions. The proposed
method is evaluated using both simulated data and real-life recordings.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we
define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step, the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic and unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the
multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
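The minimum variance distortionless response beamformer invoked in the M-step above has a well-known closed form: it minimizes output noise power subject to a distortionless constraint in the steering direction. A minimal sketch (the covariance and steering vector are toy values chosen for illustration):

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d): pass the steering
    direction undistorted while minimizing noise power at the output."""
    Rinv_d = np.linalg.solve(R_noise, steering)
    return Rinv_d / (steering.conj() @ Rinv_d)

d = np.ones(4, dtype=complex)            # toy broadside steering, 4 mics
R = np.eye(4) + 0.5 * np.ones((4, 4))    # toy spatially correlated noise
w = mvdr_weights(R, d)
# Distortionless constraint: w^H d equals 1 by construction.
```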
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information
corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend the DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test to be used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections, and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation components. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times of 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington’s scheme, is a Newton-based recursion. In this work we also extend Titterington’s method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappé and Moulines’ scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.
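The Cappé and Moulines variant mentioned above smooths the E-step sufficient statistics with a diminishing step size before each M-step. A toy Python sketch (a two-component Gaussian mixture with known means and variances, so that only the component weights are estimated; the data-generation setup is an assumption for the example):

```python
import numpy as np

# Recursive EM for mixture weights: for each new sample, compute its
# posterior (E-step), blend it into a running sufficient statistic with
# step size 1/n (stochastic approximation), and re-maximize (M-step).
means, var = np.array([0.0, 4.0]), 1.0
weights = np.array([0.5, 0.5])
s = weights.copy()                        # running sufficient statistic

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 1500),
                       rng.normal(4.0, 1.0, 500)])
rng.shuffle(data)

for n, x in enumerate(data, start=1):
    lik = np.exp(-0.5 * (x - means) ** 2 / var)
    post = weights * lik
    post /= post.sum()                    # E-step for the new sample
    gamma = 1.0 / n                       # diminishing step size
    s = (1.0 - gamma) * s + gamma * post  # smooth the statistics
    weights = s                           # M-step: weights = smoothed stats
# weights approach roughly [0.75, 0.25], the true mixing proportions
```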
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities corresponding
to high-dimensional acoustic features representing the full reverberant acoustic channels as well as low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the
observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
The problem of single-source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed from measurements collected in advance, and consists of a limited number of labelled measurements, attached with corresponding positions, and a larger number of unlabelled measurements from unknown locations. Further information about the enclosure
characteristics or the microphone positions is not required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters
of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation
and noise levels.
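The Gaussian-process mapping from measurement-based features to positions described above can be illustrated on toy one-dimensional data. The RBF kernel and its hyperparameters here are assumptions for the example, not the paper's covariance function:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between row-vector sample sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Toy labelled training set: one acoustic feature per sample, mapped
# to a known source position (here a linear relation, for clarity).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
X_test = np.array([[1.5]])

K = rbf_kernel(X_train, X_train) + 1e-6 * np.eye(len(X_train))  # jitter
k_star = rbf_kernel(X_test, X_train)
y_pred = k_star @ np.linalg.solve(K, y_train)   # GP posterior mean
# y_pred lands near 1.5, between the two closest training positions.
```

In the paper's setting the kernel additionally encodes relations between microphone pairs, and its parameters are fitted by maximizing the ML criterion rather than fixed by hand.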
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, which involves smoothness constraints on possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted as new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
In this study, we present a deep neural network-based online multi-speaker localization algorithm based on a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.
Localization in reverberant environments remains an open challenge. Recently, supervised learning approaches have demonstrated very promising results in addressing reverberation. However, even with large data volumes, the number of labels available for supervised learning in such environments is usually small. We propose to address this issue with a semi-supervised learning (SSL) approach, based on deep generative modeling. Our chosen deep generative model, the variational autoencoder (VAE), is trained to generate the phase of relative transfer functions (RTFs) between microphones. In parallel, a direction of arrival (DOA) classifier network based on RTF-phase is also trained. The joint generative and discriminative model, termed VAE-SSL, is trained using labeled and unlabeled RTF-phase sequences. In learning to generate and classify the sequences, the VAE-SSL extracts the physical causes of the RTF-phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. This facilitates effective end-to-end operation of the VAE-SSL, which requires minimal preprocessing of RTF-phase. VAE-SSL is compared with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully supervised CNNs. The approaches are compared using data from two real acoustic environments – one of which was recently obtained at the Technical University of Denmark specifically for our study. We find that VAE-SSL can outperform the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples which capture the physics of the acoustic environment. Thus, the generative modeling in VAE-SSL provides a means of interpreting the learned representations.
To the best of our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling.
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT), and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, obtaining an estimate of the STFT magnitude of the source signal. Experiments regarding both speech enhancement and automatic speech recognition are conducted, which demonstrate that the proposed method can effectively suppress reverberation, even for the difficult case of a moving speaker.
In this study, we present a deep neural network-based online multi-speaker localization algorithm based on a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. Elaborated experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.
This paper addresses the problem of tracking a moving source, e.g., a robot, equipped with both receivers and a source, that is tracking its own location and simultaneously estimating the locations of multiple plane reflectors. We assume a noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates a successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which shows that the EKF approaches the hybrid Cramér-Rao bound (CRB) (HCRB) as the number of observation increases. This result implies that for the examples tested in simulation, the HCRB is an asymptotically tight bound and that the EKF is an optimal estimator. Whether this property is true in general remains an open question.
Localization in reverberant environments remains an open challenge. Recently, supervised learning approaches have demonstrated very promising results in addressing reverberation. However, even with large data volumes, the number of labels available for supervised learning in such environments is usually small. We propose to address this issue with a semi-supervised learning (SSL) approach based on deep generative modeling. Our chosen deep generative model, the variational autoencoder (VAE), is trained to generate the phase of relative transfer functions (RTFs) between microphones. In parallel, a direction of arrival (DOA) classifier network based on the RTF phase is also trained. The joint generative and discriminative model, dubbed VAE-SSL, is trained using labeled and unlabeled RTF-phase sequences. In learning to generate and classify the sequences, the VAE-SSL extracts the physical causes of the RTF phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. This facilitates effective end-to-end operation of the VAE-SSL, which requires minimal preprocessing of the RTF phase. VAE-SSL is compared with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully supervised CNNs. The approaches are compared using data from two real acoustic environments, one of which was recently obtained at the Technical University of Denmark specifically for our study. We find that VAE-SSL can outperform the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples which capture the physics of the acoustic environment. Thus, the generative modeling in VAE-SSL provides a means of interpreting the learned representations.
To the best of our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling.
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target tracking, we apply a non-Bayesian approach, which makes no assumptions about the target trajectories, except for a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach and is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated in an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the common causal REM.
This paper develops a semi-supervised algorithm to address the challenging multi-source localization problem in a noisy and reverberant environment, using a spherical harmonics domain source feature, the relative harmonic coefficients. We present a comprehensive study of this source feature, including (i) an illustration confirming its sole dependence on the source position, (ii) a feature estimator in the presence of noise, and (iii) a feature selector exploiting its inherent directivity over space. Source features at varied spherical harmonic modes, each representing a unique characterization of the soundfield, are fused by Multi-Mode Gaussian Process modeling. Based on the unifying model, we then formulate the mapping function revealing the underlying relationship between the source feature(s) and position(s) using a Bayesian inference approach. The issue of overlapped components is addressed by a pre-processing technique performing overlapped frame detection, which in turn reduces this challenging problem to single-source localization. We highlight that this data-driven method has strong potential for practical implementation because only a limited number of labeled measurements is required. We evaluate the proposed algorithm using simulated recordings of multiple speakers in diverse environments, and extensive results confirm improved performance in comparison with state-of-the-art methods. Additional assessments using real-life recordings further prove the effectiveness of the method, even in unfavorable circumstances with severe source overlapping.
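The Gaussian-process mapping idea underlying such semi-supervised localization can be caricatured in one dimension. Everything below is a synthetic placeholder (an RBF kernel, a scalar feature, a linear feature-to-position relation); it only shows how a handful of labeled feature/position pairs yield a posterior-mean prediction at unlabeled features.

```python
import numpy as np

# One-dimensional caricature of GP-based localization: labeled (feature,
# position) pairs train a GP; its posterior mean predicts positions for
# unlabeled features. Kernel, data, and the linear ground truth are synthetic.
def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

feat_train = np.linspace(0.0, 1.0, 8)       # labeled features
pos_train = 2.0 * feat_train + 0.3          # their (synthetic) positions

feat_test = np.array([0.25, 0.75])          # unlabeled features
K = rbf(feat_train, feat_train) + 1e-6 * np.eye(8)   # jitter for stability
k_star = rbf(feat_test, feat_train)
pos_est = k_star @ np.linalg.solve(K, pos_train)     # GP posterior mean
```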
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using a factor graph statistical model. In our model, the speakers can be located at one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs and then applying TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm on both simulated data and real-life measurements recorded in our laboratory.
Ad hoc acoustic networks comprising multiple nodes, each consisting of several microphones, are addressed. Owing to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers' locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (outdoors and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
Distributed acoustic tracking estimates the trajectories of sound sources using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can then be probabilistically triangulated by exploiting the spatial diversity of all nodes.
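The triangulation step can be illustrated with a weighted least-squares sketch: each node contributes a bearing line (node position plus DoA unit vector) and a scalar reliability weight standing in for the coherent-to-diffuse ratio, and the source estimate is the point minimizing the weighted squared distance to all lines. The geometry and weights below are synthetic.

```python
import numpy as np

# Weighted least-squares triangulation sketch: each node supplies a bearing
# line (position + DoA unit vector) and a reliability weight standing in for
# the coherent-to-diffuse ratio; the estimate minimizes the weighted squared
# distance to all lines.
def triangulate(nodes, doas, weights):
    A, b = np.zeros((2, 2)), np.zeros(2)
    for p, u, w in zip(nodes, doas, weights):
        proj = np.eye(2) - np.outer(u, u)   # projector orthogonal to bearing
        A += w * proj
        b += w * proj @ p
    return np.linalg.solve(A, b)

src = np.array([2.0, 3.0])                  # synthetic ground truth
nodes = [np.array([0.0, 0.0]), np.array([5.0, 0.0]), np.array([0.0, 6.0])]
doas = [(src - p) / np.linalg.norm(src - p) for p in nodes]  # exact bearings
weights = [1.0, 0.5, 2.0]                   # mock reliability weights
est = triangulate(nodes, doas, weights)
```

With exact bearings the lines are concurrent and the weighted solution recovers the source regardless of the weights; with noisy bearings the weights tilt the estimate toward the reliable nodes.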
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities corresponding
to high-dimensional acoustic features representing the full reverberant acoustic channels as well as low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the
observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, whose influence naturally increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method inspired by human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase difference measurements modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of multipath conditions. The proposed method is evaluated using both simulated data and real-life recordings.
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test and used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The problem of single source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed by prerecorded measurements collected in advance, and consists of a limited number of labelled measurements, attached with their corresponding positions, and a larger number of unlabelled measurements from unknown locations. No further information about the enclosure characteristics or the microphone positions is required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters
of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation
and noise levels.
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an interframe spectral subtraction algorithm is proposed, which enables estimation of the noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most acoustic conditions.
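The PSD-ratio principle behind RTF-type features (though not the full DP-RTF estimator, which additionally isolates the direct path and subtracts noise) can be shown in a few lines: with a true per-frequency channel ratio, the frame-averaged cross-PSD divided by the auto-PSD recovers that ratio even though each frame's source spectrum is random. All signals are synthetic and noiseless.

```python
import numpy as np

rng = np.random.default_rng(1)

# PSD-ratio sketch (not the full DP-RTF estimator): with a true per-frequency
# channel ratio h, the frame-averaged cross-PSD divided by the auto-PSD
# recovers h, even though each frame's source spectrum is random.
F, T = 5, 200
h_true = rng.standard_normal(F) + 1j * rng.standard_normal(F)
s = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))  # source STFT
x1, x2 = s, h_true[:, None] * s            # two noiseless channels

phi_11 = np.mean(x1 * np.conj(x1), axis=1)  # auto-PSD of the reference
phi_21 = np.mean(x2 * np.conj(x1), axis=1)  # cross-PSD
h_est = phi_21 / phi_11
```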
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, which involves smoothness constraints on possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted as new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
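The generalized cross-correlation baseline mentioned above has a compact PHAT-weighted form: whitening the cross-spectrum keeps only its phase, so the inverse FFT peaks at the inter-microphone delay. The signals and the 5-sample delay below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# GCC-PHAT sketch: the phase transform whitens the cross-spectrum, so its
# inverse FFT peaks at the inter-microphone delay. Delay and signals synthetic.
N, delay = 1024, 5
x1 = rng.standard_normal(N)
x2 = np.roll(x1, delay)                     # second mic: same signal, delayed

X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
cross = X2 * np.conj(X1)
gcc = np.fft.irfft(cross / np.abs(cross))   # PHAT weighting keeps phase only
est_delay = int(np.argmax(gcc))
```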
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (such as ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington’s scheme, is a Newton-based recursion. In this work we also extend Titterington’s method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappé and Moulines’ scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.
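The recursive EM flavor in the spirit of Cappé and Moulines' scheme can be illustrated on a toy problem (this is not the constrained multi-speaker tracker of the paper): the E-step sufficient statistics of a two-component 1-D Gaussian mixture are smoothed with a decaying step size, and the M-step re-solves the component means in closed form. All data and parameters are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy recursive EM in the spirit of Cappé and Moulines' scheme: E-step
# sufficient statistics are smoothed with a decaying step size, and the
# M-step re-solves the parameters in closed form. Here the parameters are
# the two means of a 1-D Gaussian mixture (a stand-in for speaker locations);
# responsibilities use a unit-variance model for simplicity.
true_mu = np.array([-3.0, 4.0])
mu = np.array([-1.0, 1.0])                  # initial means
s0 = np.array([0.5, 0.5])                   # smoothed responsibility mass
s1 = mu * s0                                # smoothed weighted observations

for n in range(2000):
    x = true_mu[rng.integers(2)] + 0.3 * rng.standard_normal()
    # E-step: responsibilities of the current observation.
    logp = -0.5 * (x - mu) ** 2
    r = np.exp(logp - logp.max())
    r /= r.sum()
    # Stochastic approximation of the sufficient statistics.
    gamma = 1.0 / (n + 2)
    s0 = (1 - gamma) * s0 + gamma * r
    s1 = (1 - gamma) * s1 + gamma * r * x
    # M-step: closed-form update of the means.
    mu = s1 / s0
```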
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides good noise reduction performance and preserves the binaural cues of the desired source, it does not allow control of the reduction of the interfering sources and distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming to merge the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown
and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator, that jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator, that blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao Bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals, shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with time-varying audio channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
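The MVDR interpretation mentioned in the abstract rests on the textbook beamformer w = R⁻¹d / (dᴴR⁻¹d), which gives a distortionless response toward the steering vector d while minimizing output power. The sketch below uses toy values for d and the noise covariance, not quantities estimated by the VEM algorithm.

```python
import numpy as np

# Textbook MVDR weights, w = R^{-1} d / (d^H R^{-1} d): distortionless toward
# the steering vector d, minimum output power otherwise. d and the noise
# covariance here are toy values, not VEM estimates.
M = 3
d = np.exp(1j * 0.4 * np.arange(M))                    # toy steering vector
Rn = np.eye(M, dtype=complex) + 0.5 * np.ones((M, M))  # toy noise covariance
Rn_inv_d = np.linalg.solve(Rn, d)
w = Rn_inv_d / (d.conj() @ Rn_inv_d)                   # MVDR weights
```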
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable to blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common-zeros problem of CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we use only the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
The reverberation power spectral density (PSD) is often required by dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) of the two ML estimators of the reverberation PSD, the Cramér-Rao Bounds (CRBs) for the two estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multi-speaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: i) a multi-speaker linearly constrained minimum variance (LCMV) beamformer (BF), and ii) a subsequent multi-speaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers (as manifested by the respective acoustic transfer functions (ATFs)) and the noise spatial correlation matrix, while the second stage exploits the speakers' power spectral density matrix to reduce the residual noise at the output of the first stage. The output vector of the multi-speaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multi-speaker case is also derived given the multi-speaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multi-speaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
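The first stage of the decomposition above, the multi-speaker LCMV beamformer, reduces to a short linear-algebra sketch: with a constraint matrix C holding the two speakers' steering vectors and desired responses g = [1, 0], the weights w = R⁻¹C (CᴴR⁻¹C)⁻¹ g pass speaker 1 undistorted and null speaker 2. The steering vectors and white-noise covariance below are synthetic.

```python
import numpy as np

# LCMV sketch: with constraint matrix C holding two steering vectors and
# desired responses g = [1, 0], the weights w = R^{-1} C (C^H R^{-1} C)^{-1} g
# pass speaker 1 with unit gain and null speaker 2. Values are synthetic.
M = 4
a1 = np.ones(M, dtype=complex)                  # steering vector, speaker 1
a2 = np.exp(1j * np.pi * np.arange(M))          # steering vector, speaker 2
C = np.stack([a1, a2], axis=1)                  # (M, 2) constraint matrix
g = np.array([1.0, 0.0], dtype=complex)         # keep speaker 1, null speaker 2
Rn = np.eye(M, dtype=complex)                   # noise covariance (white here)

Rn_inv_C = np.linalg.solve(Rn, C)
w = Rn_inv_C @ np.linalg.solve(C.conj().T @ Rn_inv_C, g)   # LCMV weights
```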
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes:
a) the acoustic impulse response model,
b) the spatial filter design criterion,
c) the parameter estimation algorithm,
d) optional postfiltering.
We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using relative transfer functions (RTFs) of the sources, which are ratios of acoustic transfer functions (ATFs) between any microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, without resorting to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized to ensure convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we estimate RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone, we suggest using the output signals of LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
In this paper, we present a single-microphone speech enhancement algorithm. A hybrid approach is proposed, merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm is executed in two phases: the training phase, which does not recur, and the test phase. First, the noise-free speech log-power spectral density is modeled as an MoG, representing the phoneme-based diversity in the speech signal. A DNN is then trained on a phoneme-labeled database of clean speech signals for phoneme classification, with mel-frequency cepstral coefficients as the input features. In the test phase, a noisy utterance of an untrained speaker is processed. Given the phoneme classification results of the noisy speech utterance, a speech presence probability (SPP) is obtained using both the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while, simultaneously, the noise estimate is updated. The discriminative DNN maintains the continuity of the speech, and the generative phoneme-based MoG preserves the speech spectral structure. An extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms and show that it obtains a significant improvement over previous methods in terms of speech quality measures. Finally, we analyze the contribution of all components of the proposed algorithm, indicating their combined importance.
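The SPP-controlled attenuation mentioned above can be illustrated with a simple spectral gain. The exponential gain rule below is a common choice in SPP-driven enhancers and is an assumption made here for illustration, not necessarily the paper's exact rule:

```python
import numpy as np

def spp_gain(noisy_stft, spp, g_min=0.1):
    """SPP-controlled attenuation: leave bins with high speech presence
    probability untouched, attenuate the rest towards the floor g_min.

    The gain g_min**(1 - spp) equals 1 where spp = 1 and g_min where spp = 0.
    """
    gain = g_min ** (1.0 - spp)
    return gain * noisy_stft

# Toy STFT magnitudes with per-bin speech presence probabilities.
X = np.array([[1.0 + 0j, 2.0], [3.0, 4.0]])
p = np.array([[1.0, 0.0], [0.5, 1.0]])
Y = spp_gain(X, p, g_min=0.1)
print(Y)
```

Bins deemed speech-present (spp = 1) pass unchanged, while speech-absent bins are attenuated by 20 dB here; the spectral floor g_min avoids musical-noise artifacts from zeroing bins outright.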
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user’s voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses’ temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending itself to more effective noise suppression). Since changes in the array’s position correspond to the desired speaker’s movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user’s head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation components. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times set to 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely-used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adapt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, merely controls the white noise gain (WNG) of the beamformer. We also show that, in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
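The affine projection algorithm (APA), to which the KFB is shown to reduce in static scenarios, has a simple recursion: w ← w + μ U (UᴴU + δI)⁻¹ (d − Uᴴw). A hedged numpy sketch follows; the system-identification toy setup, the use of independent regressor columns, and all parameter values are illustrative assumptions:

```python
import numpy as np

def apa_update(w, U, d, mu=1.0, delta=1e-12):
    """One affine projection algorithm (APA) step.

    U : (L, P) matrix whose columns are the last P input (regressor) vectors
    d : (P,)   corresponding desired samples
    """
    e = d - U.conj().T @ w                          # a priori errors
    return w + mu * U @ np.linalg.solve(
        U.conj().T @ U + delta * np.eye(U.shape[1]), e)

# Toy system identification: recover an unknown 4-tap filter h from
# noiseless data, using P = 2 regressors per step. (For simplicity the
# regressors are independent random vectors, not a shifted signal.)
rng = np.random.default_rng(1)
L, P = 4, 2
h = rng.standard_normal(L)
w = np.zeros(L)
X = rng.standard_normal((L, 400))
for k in range(P, 400):
    U = X[:, k - P:k]
    d = U.T @ h
    w = apa_update(w, U, d)
print(np.linalg.norm(w - h))
```

With step size μ = 1 each update projects the weight error onto the orthogonal complement of the current regressor span, so in the noiseless case the error shrinks monotonically to (numerical) zero.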
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To this end, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance as the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions the result equals the number of sensors minus the number of null constraints. This result holds, regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
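The stated result, that the optimal DF averaged over all look directions equals the number of sensors minus the number of null constraints, can be checked numerically in a discretized setting. The sketch below replaces the continuous diffuse field by a finite set of random steering vectors, an assumption made purely for illustration; the averaging identity still holds exactly in this discrete setting:

```python
import numpy as np

rng = np.random.default_rng(2)
M, G, K = 5, 200, 2
# Random "steering vectors" standing in for the array response of each of
# G discrete directions; no structure is assumed, matching the theorem.
D = (rng.standard_normal((M, G)) + 1j * rng.standard_normal((M, G))) / np.sqrt(2)
Gamma = D @ D.conj().T / G                  # diffuse-noise coherence (discrete)
N = D[:, :K]                                # K fixed null directions

Gi_D = np.linalg.solve(Gamma, D)            # Gamma^{-1} d for every direction
NGN = N.conj().T @ Gi_D[:, :K]              # N^H Gamma^{-1} N

# Optimal DF with nulls, via the Schur-complement form
# DF(d0) = d0^H G^{-1} d0 - b^H (N^H G^{-1} N)^{-1} b,  b = N^H G^{-1} d0.
df = []
for g in range(G):
    d0, gid0 = D[:, g], Gi_D[:, g]
    b = N.conj().T @ gid0
    df.append((d0.conj() @ gid0 - b.conj() @ np.linalg.solve(NGN, b)).real)
print(np.mean(df))                          # equals M - K up to round-off
```

The average is exactly M − K here by the same trace identity that underlies the continuous result: summing dᴴΓ⁻¹d over the direction set gives G·tr(I_M), and the null-projection term contributes G·K.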
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper, we apply a novel strategy based on ideas of compressed sensing. The relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data, but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained through finding its sparsest representation in the time-domain: that is, through computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method, such as the nonstationarity-based estimator by Gannot or blind source separation. Second, frequencies are determined for which the RTF estimate appears to be accurate. Third, the RTF is reconstructed through solving a weighted l1 convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted. It shows that, in most situations, the proposed method is capable of improving upon many of the conventional estimators used as the first step.
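The third step, completing an RTF measured only at reliable frequencies by seeking the sparsest relative impulse response, is an instance of l1-regularized recovery from partial DFT samples. The sketch below uses plain (unweighted) ISTA rather than the weighted SpaRSA variant proposed in the paper, and a short synthetic sparse response; all sizes and the regularization weight are illustrative:

```python
import numpy as np

def soft(x, t):
    """Complex soft-thresholding (proximal operator of the l1 norm)."""
    mag = np.abs(x)
    return np.where(mag > t, (1 - t / np.maximum(mag, 1e-12)) * x, 0)

# Toy setup: a sparse "relative impulse response" observed only through an
# incomplete set of DFT (RTF) samples -- a stand-in for the frequencies at
# which the RTF estimate is deemed reliable.
n = 64
x_true = np.zeros(n)
x_true[[3, 10, 25]] = [1.0, -0.8, 0.5]
F = np.fft.fft(np.eye(n)) / np.sqrt(n)       # normalized DFT matrix
rng = np.random.default_rng(3)
idx = rng.choice(n, size=48, replace=False)  # "reliable" frequency bins
A, y = F[idx], F[idx] @ x_true

# ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1; step size 1 is valid
# since A is a row-subset of a unitary matrix, so ||A||_2 <= 1.
lam, x = 5e-3, np.zeros(n, dtype=complex)
for _ in range(3000):
    x = soft(x + A.conj().T @ (y - A @ x), lam)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(rel_err)
```

The sparse impulse response is recovered accurately from the incomplete RTF; the paper's weighted formulation additionally scales the data-fidelity term per frequency according to the reliability of each RTF sample.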
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets), and their different priorities for communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes. The first is experimentally shown to provide separation results comparable with contemporary HMM-based speech separation approaches; the second allows a reduced computational burden.
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on both real, single-channel, and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
The problem of blind and online speaker localization and separation using multiple microphones is addressed
based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is
proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer
function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a
single speaker; the entire frequency range is not required to accomplish this task. In contrast, the RTF
estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin.
Accordingly, a different statistical model is used for the two tasks. The first REM model is applied under the
assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG)
model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are
estimated using these bins. The second REM model is applied under the assumption that the speakers are
concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate
the speakers. As a result of the concurrent-speaker assumption, a more precise TF map of the speakers’
activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which are
constructed using the DOAs obtained in the previous stage. Next, using the linearly constrained minimum
variance (LCMV)-BF that utilizes the estimated RTFs, the speech signals are separated. The algorithm is
evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the
estimated DOAs and the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA
estimation and speaker separation algorithm.
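The first-stage statistical model, a MoG over TF-bin features identifying bins dominated by a single speaker, can be sketched in its batch form; the recursive (REM) variant replaces the batch sums with recursively updated sufficient statistics. The one-dimensional "DOA-like" feature and all values below are synthetic illustrations, not the paper's multichannel model:

```python
import numpy as np

# Stand-in data: a scalar "DOA-like" feature per TF bin, clustered around
# each speaker's true direction (20 and 60 degrees) plus noise.
rng = np.random.default_rng(4)
true_doas = np.array([20.0, 60.0])
obs = np.concatenate([true_doas[0] + 3 * rng.standard_normal(500),
                      true_doas[1] + 3 * rng.standard_normal(500)])

# Batch EM for a 2-component MoG over the TF-bin features.
mu = np.array([10.0, 70.0])                  # initial "DOA" guesses
var = np.array([25.0, 25.0])
pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: posterior probability that each bin belongs to each speaker
    # (the 1/sqrt(2*pi) constant cancels in the normalization).
    lik = pi * np.exp(-0.5 * (obs[:, None] - mu) ** 2 / var) / np.sqrt(var)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate the weights, means (DOAs), and variances
    nk = post.sum(axis=0)
    pi, mu = nk / len(obs), (post * obs[:, None]).sum(axis=0) / nk
    var = (post * (obs[:, None] - mu) ** 2).sum(axis=0) / nk
print(np.round(np.sort(mu), 1))
```

The posteriors `post` play the role of the TF activity map: bins with a posterior close to one for some component are the single-speaker-dominated bins used for the DOA estimates.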
This paper presents a new dataset of measured multichannel Room Impulse Responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate and visualize the data, as well as baseline methods for echo-related tasks.
The gain achieved by a superdirective beamformer operating in a diffuse noise-field is significantly higher than the gain attainable with conventional delay-and-sum weights. A classical result states that for a compact linear array consisting of N sensors which receives a plane-wave signal from the endfire direction, the optimal superdirective gain approaches N^2. It has been noted that in the near-field regime higher gains can be attained. The gain can increase, in theory, without bound for increasing wavelength or decreasing source-receiver distance. We aim to address the phenomenon of near-field superdirectivity in a comprehensive manner. We derive the optimal performance for the limiting case of an infinitesimal-aperture array receiving a spherical-wave signal. This is done with the aid of a sequence of linear transformations. The resulting gain expression is a polynomial, which depends on the number of sensors employed, the wavelength, and the source-receiver distance. The resulting gain curves are optimal and outperform weights corresponding to other superdirectivity methods. The practical case of a finite aperture array is discussed. We present conditions for which the gain of such an array would approach that predicted by the theory of the infinitesimal case. The white noise gain (WNG) metric of robustness is shown to increase in the near-field regime.
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of an additional short latency. Unlike classical target tracking, we apply a non-Bayesian approach, which does not make assumptions with respect to the target trajectories, except for assuming a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach. The new method is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated using an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the regular causal REM.
This paper develops a semi-supervised algorithm to address the challenging multi-source localization problem in a noisy and reverberant environment, using a spherical harmonics domain source feature, the relative harmonic coefficients. We present comprehensive research of this source feature, including (i) an illustration confirming its sole dependence on the source position, (ii) a feature estimator in the presence of noise, and (iii) a feature selector exploiting its inherent directivity over space. Source features at varied spherical harmonic modes, representing unique characterizations of the soundfield, are fused by Multi-Mode Gaussian Process modeling. Based on the unifying model, we then formulate the mapping function revealing the underlying relationship between the source feature(s) and position(s) using a Bayesian inference approach. The issue of overlapped components is addressed by a pre-processing technique performing overlapped frame detection, which in turn reduces this challenging problem to single source localization. It is highlighted that this data-driven method has a strong potential to be implemented in practice because only a limited number of labeled measurements is required. We evaluate the proposed algorithm using simulated recordings between multiple speakers in diverse environments, and extensive results confirm improved performance in comparison with state-of-the-art methods. Additional assessments using real-life recordings further prove the effectiveness of the method, even under unfavorable circumstances with severe source overlapping.
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides a good noise reduction performance and preserves the binaural cues of the desired source, it does not allow control of the reduction of the interfering sources and distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming at merging the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications, which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they can provide accurate estimates at an SNR 10 dB lower than TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping of an indoor environment.
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominant speakers, a spectral mask is computed and is utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require knowledge of the speech, reverberation, and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: the first is a non-blocking-based estimator, which jointly estimates the PSDs of the speech, reverberation, and noise components; the second is a blocking-based estimator, which blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare the two methods, the corresponding Cramér-Rao bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals shows that the proposed estimators outperform competing estimators and greatly reduce the effect of reverberation and noise.
Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that sources do not become simultaneously active. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP results in a null steering toward the RTF of interest, by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal to noise ratio (SNR) and signal to interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, and for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model, but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT), and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, obtaining an estimate of the STFT magnitude of the source signal. Experiments regarding both speech enhancement and automatic speech recognition are conducted, which demonstrate that the proposed method can effectively suppress reverberation, even in the difficult case of a moving speaker.
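The MINT idea above, designing multichannel inverse filters whose combined output is a unit impulse, can be sketched in its basic time-domain least-squares form. This is a toy illustration with short random channels, not the paper's online STFT-magnitude variant; the helper `conv_matrix` and the filter lengths are choices made here for the example.

```python
import numpy as np

def conv_matrix(h, n_cols):
    """Convolution (Toeplitz) matrix: conv_matrix(h, n) @ g == np.convolve(h, g)."""
    H = np.zeros((len(h) + n_cols - 1, n_cols))
    for j in range(n_cols):
        H[j:j + len(h), j] = h
    return H

def mint_inverse_filters(h1, h2, n_g):
    """Least-squares MINT: find g1, g2 with h1*g1 + h2*g2 ~ a unit impulse."""
    H = np.hstack([conv_matrix(h1, n_g), conv_matrix(h2, n_g)])
    d = np.zeros(len(h1) + n_g - 1)
    d[0] = 1.0                                  # target: a delta (exact equalization)
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return g[:n_g], g[n_g:]

rng = np.random.default_rng(0)
h1, h2 = rng.standard_normal(8), rng.standard_normal(8)  # toy 2-channel impulse responses
g1, g2 = mint_inverse_filters(h1, h2, n_g=7)
eq = np.convolve(h1, g1) + np.convolve(h2, g2)           # should be close to a delta
```

With two channels of length L and inverse filters of length L-1 the system is square, so random channels (almost surely free of common zeros) are equalized essentially exactly.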
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with a time-varying acoustic channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state-space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational expectation-maximization (VEM) algorithm which uses a variant of the multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable to blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common-zeros problem of the CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we only use the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
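The speaker-counting step via eigenvalue decay can be illustrated on synthetic data. Here a surrogate frame-correlation matrix is built directly from randomly drawn per-frame speaker probabilities (standing in for the real correlation matrix between time frames); the 1% eigenvalue threshold is an arbitrary choice for the toy.

```python
import numpy as np

rng = np.random.default_rng(1)
J, T = 3, 200                                   # 3 speakers, 200 time frames
# synthetic per-frame speaker probabilities: each frame dominated by one speaker
P = rng.dirichlet(alpha=[0.2] * J, size=T)      # shape (T, J), rows sum to 1
W = P @ P.T                                     # surrogate frame-correlation matrix
evals = np.linalg.eigvalsh(W)[::-1]             # eigenvalues in descending order
n_src = int(np.sum(evals > 0.01 * evals[0]))    # count via eigenvalue decay
```

Since W has rank J by construction, exactly J eigenvalues stand out; on real data the decay is softer, which is where the threshold choice matters.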
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities: high-dimensional acoustic features representing the full reverberant acoustic channels, and low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
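A minimal version of the resulting EKF can be sketched with a hand-coded TDOA observation model. This toy uses a static source and a plain random-walk propagation model in place of the learned, data-driven one; the microphone geometry, noise levels, and step count are all illustrative assumptions.

```python
import numpy as np

c = 343.0                                    # speed of sound [m/s]
pairs = [((0.0, 0.0), (4.0, 0.0)),           # two microphone pairs [m]
         ((0.0, 0.0), (0.0, 4.0))]

def h(x):
    """Observation model: TDOA of each microphone pair for source position x."""
    return np.array([(np.linalg.norm(x - np.array(a))
                      - np.linalg.norm(x - np.array(b))) / c for a, b in pairs])

def H_jac(x):
    """Jacobian of h at x: difference of unit vectors, scaled by 1/c."""
    rows = []
    for a, b in pairs:
        da, db = x - np.array(a), x - np.array(b)
        rows.append((da / np.linalg.norm(da) - db / np.linalg.norm(db)) / c)
    return np.array(rows)

rng = np.random.default_rng(2)
x_true = np.array([2.5, 1.0])                # static source position
x_est, P = np.array([2.2, 1.3]), np.eye(2)   # deliberately offset initial guess
Q = 1e-6 * np.eye(2)                         # random-walk propagation noise
R = (1e-5) ** 2 * np.eye(2)                  # TDOA measurement noise (10 us std)

for _ in range(50):
    z = h(x_true) + 1e-5 * rng.standard_normal(2)
    P = P + Q                                # predict (random-walk propagation)
    Hk = H_jac(x_est)
    S = Hk @ P @ Hk.T + R
    K = P @ Hk.T @ np.linalg.inv(S)
    x_est = x_est + K @ (z - h(x_est))       # measurement update
    P = (np.eye(2) - K @ Hk) @ P
```

The estimate converges toward the true source position because the two non-parallel microphone pairs make the position locally observable from the TDOA readings.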
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which obviously increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method based on human hearing mechanisms and a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of the multi-path conditions. The proposed method is evaluated using both simulated data and real-life recordings.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic and unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex-Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
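The E-step/M-step alternation at the heart of such algorithms can be shown in its simplest form: EM for a two-component one-dimensional Gaussian mixture. This is a generic stand-in, not the paper's multichannel model, and all data values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic data: two Gaussian clusters standing in for two latent sources
x = np.concatenate([rng.normal(-2.0, 0.5, 400), rng.normal(3.0, 0.8, 600)])

mu, sig, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(100):
    # E-step: posterior probability of each component for every sample
    pdf = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    gamma = pdf / pdf.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances from the posteriors
    Nk = gamma.sum(axis=0)
    w = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sig = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
```

With well-separated clusters the estimated means settle near the true values of -2 and 3, and the weights near the 0.4/0.6 class proportions.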
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test to be used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) of the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The problem of single source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed by prerecorded measurements collected in advance, and consists of a limited number of labelled measurements, attached with corresponding positions, and a larger number of unlabelled measurements from unknown locations. Further information about the enclosure characteristics or the microphone positions is not required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation and noise levels.
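The core Gaussian-process regression step, a posterior mean under an RBF covariance, can be sketched for a single "view"; the paper's pairwise-microphone covariance construction and ML hyperparameter optimization are omitted, and the kernel length-scale, noise level, and toy feature-to-position map below are arbitrary assumptions for the example.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between row-sample sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 4.0, size=(60, 1))                 # toy 1-D "features"
y = np.sin(X[:, 0]) + 0.01 * rng.standard_normal(60)    # noisy "positions"

K = rbf(X, X) + 1e-4 * np.eye(len(X))                   # kernel matrix + noise variance
alpha = np.linalg.solve(K, y)

X_test = np.array([[1.0], [2.0], [3.0]])
y_pred = rbf(X_test, X) @ alpha                         # GP posterior mean
```

The predictions track the smooth underlying map closely at interior test points, which is the behavior the localization mapping relies on.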
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multi-speaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: i) a multi-speaker linearly constrained minimum variance (LCMV) beamformer (BF), and ii) a subsequent multi-speaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers (as manifested by the respective acoustic transfer functions (ATFs)) and the noise spatial correlation matrix, while the second stage exploits the speakers' power spectral density matrix to reduce the residual noise at the output of the first stage. The output vector of the multi-speaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multi-speaker case is also derived given the multi-speaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multi-speaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
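The first-stage multi-speaker LCMV beamformer has the familiar closed form W = R^{-1} C (C^H R^{-1} C)^{-1}. A quick numerical check with random surrogates for the ATF matrix and noise covariance (placeholders, not measured quantities) confirms the distortionless constraints W^H C = I.

```python
import numpy as np

rng = np.random.default_rng(5)
M, J = 6, 2                                  # 6 microphones, 2 speakers
C = rng.standard_normal((M, J)) + 1j * rng.standard_normal((M, J))  # ATF surrogate
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)               # Hermitian positive-definite noise covariance

# multi-speaker LCMV: W = R^{-1} C (C^H R^{-1} C)^{-1}
Ri_C = np.linalg.solve(R, C)
W = Ri_C @ np.linalg.inv(C.conj().T @ Ri_C)

# column j of W passes speaker j with unit gain and nulls the other speaker
constraints = W.conj().T @ C                 # should equal the identity matrix
```

Each beamformer output then feeds the second-stage Wiener postfilter, which handles the residual noise the spatial constraints cannot remove.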
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: a) the acoustic impulse response model, b) the spatial filter design criterion, c) the parameter estimation algorithm, and d) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using relative transfer functions (RTFs) for the sources, which are ratios of acoustic transfer functions (ATFs) between any microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, not resorting to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized for ensuring convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we propose to estimate RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone as before, we suggest to use the output signals of LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an interframe spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions.
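Before the DP-RTF refinements, the plain RTF between two channels can be estimated as the ratio of cross- to auto-power spectral densities. The noise-free sketch below verifies this against the true frequency response; the FIR channel and Welch parameters are arbitrary choices for the example, and the paper's interframe spectral subtraction for noisy conditions is not included.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(6)
fs = 16000
x1 = rng.standard_normal(200000)             # reference-microphone signal
h = np.array([1.0, 0.5, -0.3, 0.2])          # toy relative impulse response
x2 = signal.lfilter(h, 1.0, x1)              # second microphone (noise-free)

f, Pxx = signal.welch(x1, fs=fs, nperseg=512)        # auto-PSD of the reference
_, Pxy = signal.csd(x1, x2, fs=fs, nperseg=512)      # cross-PSD between channels
rtf_est = Pxy / Pxx                                  # RTF estimate per frequency bin

_, rtf_true = signal.freqz(h, worN=f, fs=fs)         # ground truth for comparison
```

Averaging over many Welch segments makes the cross/auto ratio converge to the channel's frequency response; in noise, the plain ratio becomes biased, which motivates the DP-RTF construction.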
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array comprises two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending towards more effective noise suppression). Since changes in the array's position correspond to the desired speaker's movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user's head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections, and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation component. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab, with reverberation times set to 0.36 s and 0.61 s, and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix and jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as: high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, that involves smoothness constraints of possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted while new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
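The generalized cross-correlation baseline mentioned above, in its PHAT-weighted form, fits in a few lines; the signal lengths and the integer delay below are toy values chosen for the example.

```python
import numpy as np

def gcc_phat(x1, x2):
    """GCC-PHAT estimate of the delay of x2 relative to x1, in samples."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)                             # phase carries the delay
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT weighting
    cc = np.concatenate([cc[-(n // 2):], cc[:n // 2]])   # reorder lags to -n/2..n/2-1
    return int(np.argmax(np.abs(cc))) - n // 2

rng = np.random.default_rng(7)
s = rng.standard_normal(4096)
delay = 23
x1 = np.concatenate([s, np.zeros(64)])
x2 = np.concatenate([np.zeros(delay), s, np.zeros(64 - delay)])
```

The PHAT normalization discards magnitude and keeps only phase, which sharpens the correlation peak; its sensitivity in strong reverberation is exactly what motivates the learned alternatives above.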
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely-used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adopt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, is merely controlling the white noise gain (WNG) of the beamformer. We also show, that in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
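The affine projection algorithm (APA), which the letter shows the KFB adaptation reduces to in static scenarios, can be sketched for FIR system identification with a colored (AR(1)) input; the filter length, projection order, and step size are illustrative choices, and the scenario is noise-free.

```python
import numpy as np

rng = np.random.default_rng(8)
L, K = 8, 4                           # filter length, projection order
w_true = rng.standard_normal(L)       # unknown system to identify
w = np.zeros(L)                       # adaptive filter estimate
mu, delta = 0.5, 1e-4                 # step size and regularization

# colored input (the setting where APA improves on NLMS): AR(1) noise
u = np.zeros(5000)
for t in range(1, len(u)):
    u[t] = 0.9 * u[t - 1] + rng.standard_normal()

for t in range(L + K, len(u)):
    # stack the K most recent input regressors as rows of X (shape K x L)
    X = np.array([u[t - k - L + 1:t - k + 1][::-1] for k in range(K)])
    d = X @ w_true                    # noise-free desired responses
    e = d - X @ w                     # a priori errors
    # APA update: project the error back through the regularized Gram matrix
    w = w + mu * X.T @ np.linalg.solve(X @ X.T + delta * np.eye(K), e)
```

With K larger than the AR order of the input, the update effectively decorrelates the regressors, and the estimate converges to the true system.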
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques, to reveal the state-of-the-art techniques, and to obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue, and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. Therefore, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source, and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields an overall noise reduction performance very similar to that of the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions the result equals the number of sensors minus the number of null constraints. This result holds, regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
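The stated average-DF result can be sanity-checked numerically in the unconstrained case (zero nulls), where the optimal DF for a look direction d is d^H Γ^{-1} d, with Γ the diffuse-noise coherence matrix. The sketch below builds an empirical Γ from the same sampled directions used for the average, which makes the identity exact; with the analytic sinc coherence it holds in the limit of dense direction sampling (array geometry below is an arbitrary toy choice):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 4000                                   # sensors, sampled directions
pos = rng.standard_normal((M, 3))                # arbitrary 3-D array geometry (in wavelengths)

u = rng.standard_normal((N, 3))                  # uniformly distributed propagation directions
u /= np.linalg.norm(u, axis=1, keepdims=True)

D = np.exp(2j * np.pi * (u @ pos.T))             # row n = steering vector for direction n
Gamma = (D.T @ D.conj()) / N                     # empirical diffuse-noise coherence matrix

# optimal DF for each look direction d (no null constraints): DF = d^H Gamma^{-1} d
Ginv = np.linalg.inv(Gamma)
df = np.einsum('nm,mk,nk->n', D.conj(), Ginv, D).real
print(df.mean())                                 # = M (here 5), for any array geometry
```

The average equals tr(Γ^{-1}·Γ) = M regardless of the sensor positions, which is the zero-null special case of the theorem in the abstract.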
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper, we apply a novel strategy based on ideas of compressed sensing. The relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data, but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained by finding its sparsest representation in the time domain: that is, by computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method, such as the nonstationarity-based estimator by Gannot or blind source separation. Second, the frequencies at which the RTF estimate appears to be accurate are determined. Third, the RTF is reconstructed by solving a weighted l1 convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted, showing that, in most situations, the proposed method improves upon the conventional estimators used as its first step.
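The third step above is a weighted l1 program. As a toy illustration (synthetic sparse impulse response, hypothetical sizes, and plain ISTA in place of the paper's SpaRSA variant, both of which solve the same convex program), one can recover a sparse time-domain response from a subset of "reliable" DFT coefficients:

```python
import numpy as np

def ista_weighted_l1(F, y, lam, w, n_iter=1000):
    """Solve min_x 0.5*||F x - y||^2 + lam * sum(w * |x|) for a real, sparse x
    with plain ISTA (gradient step followed by soft-thresholding)."""
    L = np.linalg.norm(F, 2) ** 2                      # Lipschitz constant of the gradient
    x = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = (F.conj().T @ (F @ x - y)).real         # gradient w.r.t. the real-valued x
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(2)
n, k = 128, 4
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # sparse impulse response

Fdft = np.fft.fft(np.eye(n))                           # full DFT matrix
reliable = rng.choice(n, 60, replace=False)            # frequencies where the RTF is trusted
x_hat = ista_weighted_l1(Fdft[reliable], Fdft[reliable] @ x0, lam=0.5, w=np.ones(n))
print(np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))  # small relative error
```

The weights `w` allow down-weighting taps that are believed active, mirroring the weighted l1 formulation in the abstract; here they are left uniform.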
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that includes the direct path and some early reflections, and a late reverberant component that includes all the late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation and ambient noise is presented. A multi-microphone minimum mean square error estimator is used to obtain a spatially filtered version of the early speech component. The estimator is constructed as a minimum variance distortionless response (MVDR) beamformer (BF) followed by a postfilter (PF). Three unique design features characterize the proposed method. First, the MVDR BF is implemented in a special structure, named the nonorthogonal generalized sidelobe canceller (NO-GSC). Compared with the more conventional orthogonal GSC structure, the new structure allows for a simpler implementation of the GSC blocks for various MVDR constraints. Second, in contrast to earlier works, relative early transfer functions (RETFs) are used in the MVDR criterion rather than either the entire RTFs or only the direct path of the desired speech signal. An estimator of the RETFs is proposed as well. Third, the late reverberation and noise are processed by both the beamforming stage and the PF stage. Since the relative power of the noise and the late reverberation varies with the frame index, a computationally efficient method for the required matrix inversion is proposed to circumvent this cumbersome mathematical operation. The algorithm was evaluated and compared with two alternative multichannel algorithms and one single-channel algorithm using simulated data and data recorded in a room with a reverberation time of 0.5 s for various source-microphone array distances (1-4 m) and several signal-to-noise levels.
The processed signals were tested using two commonly used objective measures, namely perceptual …
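For background on the GSC structure referenced in the abstract above: the classical (orthogonal) GSC realizes the MVDR beamformer exactly, which the following sketch verifies numerically with random data. Note that the paper's NO-GSC is a nonorthogonal variant of this structure, which this sketch does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 5
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)     # RTF / steering vector
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)                               # noise covariance (Hermitian PSD)

# direct MVDR solution: w = R^{-1} h / (h^H R^{-1} h)
Ri_h = np.linalg.solve(R, h)
w_mvdr = Ri_h / (h.conj() @ Ri_h)

# GSC form: fixed beamformer w_q, blocking matrix B (B^H h = 0), noise canceller w_a
w_q = h / (h.conj() @ h)                                     # satisfies h^H w_q = 1
Q, _ = np.linalg.qr(h.reshape(-1, 1), mode='complete')
B = Q[:, 1:]                                                 # orthonormal basis of null(h^H)
w_a = np.linalg.solve(B.conj().T @ R @ B, B.conj().T @ R @ w_q)
w_gsc = w_q - B @ w_a

print(np.allclose(w_gsc, w_mvdr))                            # True: the two forms coincide
```

The GSC turns the constrained MVDR problem into an unconstrained least-squares problem for `w_a`, which is what makes adaptive implementations convenient.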
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
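The E-step of the scheme above applies a Kalman filter to extract the clean signal. A minimal scalar sketch of that filtering step, using a stand-in AR(1) signal model rather than the paper's time-varying acoustic system:

```python
import numpy as np

rng = np.random.default_rng(4)
T, a, q, r = 2000, 0.95, 0.1, 1.0          # length, AR(1) coefficient, process var, noise var

# simulate a "clean signal" s and its noisy observation x
s = np.zeros(T)
for t in range(1, T):
    s[t] = a * s[t - 1] + rng.normal(scale=np.sqrt(q))
x = s + rng.normal(scale=np.sqrt(r), size=T)

# Kalman filter: in a recursive EM scheme this E-step extracts the clean sample,
# after which the M-step would refresh the (here assumed known) system parameters
s_hat, P = np.zeros(T), 1.0
for t in range(1, T):
    s_pred, P_pred = a * s_hat[t - 1], a * a * P + q      # predict
    K = P_pred / (P_pred + r)                             # Kalman gain
    s_hat[t] = s_pred + K * (x[t] - s_pred)               # update with the new observation
    P = (1 - K) * P_pred

print(np.mean((s_hat - s) ** 2), np.mean((x - s) ** 2))   # filtered error < raw error
```

In the paper's setting the state and the acoustic system are both unknown; the M-step re-estimates the system from the filter output, which this scalar sketch omits.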
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets), and their different priorities for communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
The problem of blind and online speaker localization and separation using multiple microphones is addressed based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a single speaker and does not require the entire frequency range. In contrast, the RTF estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin. Accordingly, a different statistical model is used for each of the two tasks. The first REM model is applied under the assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG) model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are estimated using these bins. The second REM model is applied under the assumption that the speakers are concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate the speakers. As a result of the concurrent-speakers assumption, a more precise TF map of the speakers' activity is obtained. The RTFs are estimated using the outputs of the MCWF beamformer (BF), which are constructed using the DOAs obtained in the previous stage. Next, the speech signals are separated using a linearly constrained minimum variance (LCMV) BF that utilizes the estimated RTFs. The algorithm is evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the estimated DOAs and of the separation capabilities demonstrates significant improvement w.r.t. a baseline DOA estimation and speaker separation algorithm.
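In batch form and one dimension, the MoG-based E/M recursions used in the first stage look as follows (synthetic DOA-like per-bin features with hypothetical speaker positions; the paper's recursive, multichannel version differs in detail):

```python
import numpy as np

rng = np.random.default_rng(5)
# synthetic per-bin features: each TF bin is dominated by one of two speakers,
# whose (hypothetical) DOAs are -40 and +20 degrees
true_mu = np.array([-40.0, 20.0])
z = rng.integers(0, 2, size=3000)
obs = true_mu[z] + rng.normal(scale=5.0, size=3000)

# EM for a two-component MoG (unknown means/weights, shared known variance)
mu, pi, var = np.array([-10.0, 10.0]), np.array([0.5, 0.5]), 25.0
for _ in range(50):
    # E-step: posterior probability that each bin is dominated by each speaker
    ll = -(obs[:, None] - mu) ** 2 / (2 * var) + np.log(pi)
    resp = np.exp(ll - ll.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means and priors
    Nk = resp.sum(axis=0)
    mu = (resp * obs[:, None]).sum(axis=0) / Nk
    pi = Nk / len(obs)

print(np.sort(mu))   # ≈ [-40, 20]
```

The responsibilities `resp` are the soft TF-bin-to-speaker associations: bins with a near-unity responsibility for one component are the "single dominant speaker" bins from which the DOAs are estimated.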
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
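The vertex-detection idea in the second method can be illustrated with the successive projection algorithm, a standard convex-geometry tool for finding simplex vertices (shown here on toy data; the paper's sequential detection procedure may differ in detail):

```python
import numpy as np

def successive_projection(V, k):
    """Find k simplex vertices among the rows of V: repeatedly take the row with
    the largest norm, then project all rows onto the orthogonal complement of
    the chosen vertex direction (successive projection algorithm)."""
    X = V.copy().astype(float)
    idx = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(X, axis=1)))
        idx.append(j)
        u = X[j] / np.linalg.norm(X[j])
        X = X - np.outer(X @ u, u)          # remove the vertex direction
    return idx

rng = np.random.default_rng(6)
# correlation-vector analogue: frames are convex combinations of 3 vertices,
# and a few "single-speaker" frames sit exactly at the vertices
verts = np.array([[10.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 6.0]])
W = rng.dirichlet(np.ones(3) * 0.5, size=200)
V = W @ verts
V[:3] = verts                                # plant the pure (single-speaker) frames

found = sorted(successive_projection(V, 3))
print(found)                                 # → [0, 1, 2]
```

A convex combination can never exceed the norm of its largest vertex, which is why the max-norm row at each step is guaranteed to be a vertex, i.e., a frame dominated by a single speaker.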
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using the factor graph statistical model. In our model, the speakers can be located in one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs and then applying a TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm on both simulated data and real-life measurements recorded in our laboratory.
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides a good noise reduction performance and preserves the binaural cues of the desired source, it does not allow controlling the reduction of the interfering sources, and it distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming at merging the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominant speakers, a spectral mask is computed and is utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex-Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes:
a) the acoustic impulse response model,
b) the spatial filter design criterion,
c) the parameter estimation algorithm,
d) optional postfiltering.
We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using the relative transfer functions (RTFs) of the sources, which are the ratios of the acoustic transfer functions (ATFs) between each microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, which do not resort to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized to ensure convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we propose to estimate the RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone as before, we use the output signals of the LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single-microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes: the first is experimentally shown to provide separation results comparable with contemporary HMM-based speech separation approaches, and the second allows a reduced computational burden.
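At its core, the diffusion framework embeds the data with the leading eigenvectors of a row-stochastic affinity operator. A minimal sketch on toy data standing in for log-spectra (illustrative only, not the paper's FHMM estimation schemes):

```python
import numpy as np

rng = np.random.default_rng(7)
# two well-separated clusters standing in for the log-spectra of two "states"
X = np.vstack([rng.normal(0.0, 0.3, (40, 10)), rng.normal(3.0, 0.3, (40, 10))])

# diffusion framework: Gaussian affinities -> row-stochastic diffusion operator
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / np.median(d2))                # kernel scale set to the median distance
P = W / W.sum(axis=1, keepdims=True)           # Markov (diffusion) matrix

# the leading nontrivial eigenvector is the first diffusion coordinate
eigval, eigvec = np.linalg.eig(P)
order = np.argsort(-eigval.real)
psi1 = eigvec[:, order[1]].real                # skip the trivial constant eigenvector

# the 1-D diffusion coordinate separates the two clusters
print(abs(psi1[:40].mean() - psi1[40:].mean()))  # large vs. within-cluster spread
```

In the paper's setting, such diffusion coordinates of the mixture's log-spectra provide the low-dimensional geometry from which the HMM/FHMM parameters are recovered non-iteratively.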
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending towards more effective noise suppression). Since changes in the array's position correspond to the desired speaker's movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user's head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints for both the desired and the interfering sources. Besides its capabilities to reduce interference and noise, it also enables preserving the binaural cues of both the desired and interfering sources, hence making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., based on the amount of interference and noise reduction and the presence of estimation errors of the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions, the result equals the number of sensors minus the number of null constraints. This result holds regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
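The two-stage decomposition rests on a matrix identity: the multichannel MMSE (Wiener) estimator of the speech vector equals a multispeaker LCMV beamformer followed by a Wiener postfilter on its outputs. A small numerical check of this identity, with random placeholders instead of measured ATFs and PSDs:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 6, 2   # microphones, speakers

H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))   # ATFs
Phi_s = np.diag(rng.uniform(0.5, 2.0, N)).astype(complex)            # speakers' PSD matrix
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = B @ B.conj().T + np.eye(M)                                   # noise PSD matrix

# Direct MMSE (multichannel Wiener) estimator of the speech vector
W_mmse = Phi_s @ H.conj().T @ np.linalg.inv(H @ Phi_s @ H.conj().T + Phi_v)

# Stage 1: multispeaker LCMV beamformer
A = H.conj().T @ np.linalg.solve(Phi_v, H)             # H^H Phi_v^{-1} H
W_lcmv = np.linalg.solve(A, H.conj().T @ np.linalg.inv(Phi_v))
# Stage 2: multispeaker Wiener postfilter on the beamformer outputs
G = Phi_s @ np.linalg.inv(Phi_s + np.linalg.inv(A))

print(np.allclose(W_mmse, G @ W_lcmv))  # True: MMSE = postfilter after LCMV
```

The identity also reflects the sufficiency statement in the abstract: the LCMV outputs carry all the information the MMSE estimator needs.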
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown
and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the
reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator, which jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator, which blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao Bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals, shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
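As a hedged illustration of the Fisher scoring iteration used by such estimators, the toy example below fits two unknown scalar PSDs of a Gaussian covariance that is linear in the parameters (a diffuse-coherence component plus sensor noise). The sinc coherence model and all numeric values are assumptions for the sketch, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)
M, T = 4, 20000                    # microphones, time frames (snapshots)

# Known spatial structure: a sinc diffuse-coherence matrix (assumed model)
# for the noise field, identity for the sensor noise; unknown scalar PSDs.
pos = np.arange(M) * 0.1
Gamma = np.sinc(2.0 * np.abs(pos[:, None] - pos[None, :]))
A = [Gamma, np.eye(M)]
theta_true = np.array([2.0, 0.5])  # (diffuse PSD, sensor-noise PSD)

R_true = theta_true[0] * A[0] + theta_true[1] * A[1]
x = np.linalg.cholesky(R_true) @ rng.standard_normal((M, T))
S = x @ x.T / T                    # sample covariance of the snapshots

theta = np.array([1.0, 1.0])       # initial guess
for _ in range(30):                # Fisher scoring iterations
    R = theta[0] * A[0] + theta[1] * A[1]
    Ri = np.linalg.inv(R)
    # Score and Fisher information for a covariance linear in theta
    score = np.array([0.5 * T * np.trace(Ri @ Ak @ Ri @ S - Ri @ Ak) for Ak in A])
    FIM = 0.5 * T * np.array([[np.trace(Ri @ Ak @ Ri @ Al) for Al in A] for Ak in A])
    theta = np.maximum(theta + np.linalg.solve(FIM, score), 1e-6)  # keep PSDs positive

print(theta)                       # close to theta_true
```

Each iteration replaces the Hessian of Newton's method by the Fisher information, which is why no closed-form maximizer is needed.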
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable for blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from the common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency
aliasing of the signals and the common-zeros problem of CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we only use the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
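The time-domain cross-relation that the paper starts from states that x1 * h2 = x2 * h1, since both sides equal s * h1 * h2 and convolution commutes; the difficulty noted above is that near-common zeros of long impulse responses make this relation ill-conditioned in practice. A toy check with short random filters (lengths chosen arbitrarily for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.standard_normal(1000)      # source signal
h1 = rng.standard_normal(16)       # two toy room impulse responses
h2 = rng.standard_normal(16)

x1 = np.convolve(s, h1)            # microphone signals
x2 = np.convolve(s, h2)

# Cross-relation: x1 * h2 = s * h1 * h2 = x2 * h1
lhs = np.convolve(x1, h2)
rhs = np.convolve(x2, h1)
print(np.allclose(lhs, rhs))       # True
```

Blind identification exploits this: with h1, h2 unknown, the relation becomes a homogeneous linear system in the filter coefficients, which the paper solves in the STFT domain on the much shorter CTFs.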
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) of the two ML estimators of the reverberation PSD, the Cramér-Rao Bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation components. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times of 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that includes the direct path and some early reflections, and a late reverberant component that includes all the late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation and ambient noise is presented. A multi-microphone minimum mean square error estimator is used to obtain a spatially filtered version of the early speech component. The estimator is constructed as a minimum variance distortionless response (MVDR) beamformer (BF) followed by a postfilter (PF). Three unique design features characterize the proposed method. First, the MVDR BF is implemented in a special structure, named the nonorthogonal generalized sidelobe canceller (NO-GSC). Compared with the more conventional orthogonal GSC structure, the new structure allows for a simpler implementation of the GSC blocks for various MVDR constraints. Second, in contrast to earlier works, relative early transfer functions (RETFs) are used in the MVDR criterion rather than either the entire RTFs or only the direct-path of the desired speech signal. An estimator of the RETFs is proposed as well. Third, the late reverberation and noise are processed by both the beamforming stage and the PF stage. Since the relative power of the noise and the late reverberation varies with the frame index, a computationally efficient method for the required matrix inversion is proposed to circumvent the cumbersome mathematical operation. The algorithm was evaluated and compared with two alternative multichannel algorithms and one single-channel algorithm using simulated data and data recorded in a room with a reverberation time of 0.5 s for various source-microphone array distances (1-4 m) and several signal-to-noise levels.
The processed signals were tested using two commonly used objective measures, namely perceptual …
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
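The E-step described above applies a Kalman filter to extract a new clean-signal sample given the current system estimate. As a minimal stand-in (a scalar random-walk state and toy noise variances, not the paper's acoustic model), a single Kalman filtering pass looks like:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 500
q, r = 0.01, 1.0                   # process / observation noise variances (toy values)

# Simulate a slowly varying "clean" state and its noisy observation
s = np.cumsum(np.sqrt(q) * rng.standard_normal(T))
y = s + np.sqrt(r) * rng.standard_normal(T)

# Scalar Kalman filter: plays the role of the E-step, producing one clean
# sample per frame; in the full scheme the M-step would then update q, r
s_hat, P = 0.0, 1.0
err_kf, err_raw = 0.0, 0.0
for t in range(T):
    P = P + q                      # predict
    K = P / (P + r)                # Kalman gain
    s_hat = s_hat + K * (y[t] - s_hat)   # update with the new observation
    P = (1.0 - K) * P
    err_kf += (s_hat - s[t]) ** 2
    err_raw += (y[t] - s[t]) ** 2

print(err_kf < err_raw)            # True: filtering beats the raw observations
```

In the recursive EM scheme, each such filtered sample immediately feeds the maximization step that refreshes the acoustic-system estimate, which is what enables online operation and tracking of moving speakers.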