In the last decade, the signal processing (SP) community has witnessed a paradigm shift from model-based to data-driven methods. Machine learning (ML) methodologies, deep learning in particular, are nowadays widely used in all SP fields, e.g., audio, speech, image, video, multimedia, and multimodal/multisensor processing. Many data-driven methods also incorporate domain knowledge to improve problem modeling, especially when computational burden, training-data scarcity, and memory size are important constraints.
In this paper, we propose a data-driven approach for multiple-speaker tracking in reverberant enclosures. The method comprises two stages. The first stage performs single-source localization using semi-supervised learning on multiple manifolds. The second stage uses unsupervised time-varying maximum likelihood estimation for tracking. The feature vectors used by both stages are the relative transfer functions (RTFs), which are known to be related to source positions. The number of sources is assumed to be known, while the microphone positions are unknown. In the training stage, a large database of RTFs is given. A small percentage of the data is attributed with exact positions, and the rest is assumed to be unlabelled, i.e., the respective positions are unknown. Then, a nonlinear, manifold-based mapping function between the RTFs and the source positions is inferred. Applying this mapping function to all unlabelled RTFs constructs a dense grid of localized sources. In the test phase, this RTF grid serves as the centroids for a mixture of Gaussians (MoG) model. The MoG parameters are estimated by applying recursive expectation-maximization (EM). The EM procedure relies on the sparsity and intermittency of the speech signals. A comprehensive simulation study, with two overlapping speakers at various reverberation levels, demonstrates the usability of the proposed scheme, achieving a high level of accuracy compared to a baseline method using a simpler propagation model.
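The test-phase estimator described above can be sketched in a few lines. Below is a minimal, hypothetical 1-D example in which EM updates only the mixture weights of a MoG whose centroids are the pre-localized grid points; the 1-D setting, variable names, and fixed shared variance are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def mog_weight_em(x, centroids, sigma=1.0, n_iter=50):
    """Estimate mixture weights of a MoG with fixed centroids via EM.

    The grid of localized sources acts as fixed Gaussian centroids, and
    EM only updates the mixture weights (interpretable as source
    activity) from the observed 1-D features x.
    """
    K = len(centroids)
    w = np.full(K, 1.0 / K)                       # uniform initial weights
    for _ in range(n_iter):
        # E-step: responsibility of each centroid for each sample
        d2 = (x[:, None] - centroids[None, :]) ** 2
        lik = w * np.exp(-0.5 * d2 / sigma**2)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update weights only (centroids stay on the grid)
        w = r.mean(axis=0)
    return w
```

With most samples clustered around one centroid, the corresponding weight dominates after a few iterations.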
This paper addresses the problem of tracking a moving source, e.g., a robot equipped with both receivers and a source, that tracks its own location while simultaneously estimating the locations of multiple plane reflectors. We assume noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that, for the examples tested in simulation, the HCRB is an asymptotically tight bound and the EKF is an optimal estimator. Whether this property holds in general remains an open question.
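The EKF recursion underlying such a tracker can be sketched generically. The following is a textbook predict/update cycle, not the paper's specific kinematic or reflector model; all function and variable names are placeholders.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F, H, Q, R):
    """One predict/update cycle of an extended Kalman filter.

    f, h are the (possibly nonlinear) motion and observation functions;
    F, H return their Jacobians evaluated at the current estimate;
    Q, R are the process and measurement noise covariances.
    """
    # Predict: propagate the state through the motion model
    x_pred = f(x, u)
    Fx = F(x, u)
    P_pred = Fx @ P @ Fx.T + Q
    # Update: linearize the observation model around the prediction
    Hx = H(x_pred)
    S = Hx @ P_pred @ Hx.T + R
    K = P_pred @ Hx.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ Hx) @ P_pred
    return x_new, P_new
```

For a linear model, the recursion reduces to the standard Kalman filter.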
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about the times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo-labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker at its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they can provide accurate estimates at SNRs 10 dB lower than those required for TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods to mapping an indoor environment.
Ad hoc acoustic networks comprising multiple nodes, each consisting of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple-speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops, and hearing aids, using them collaboratively for acoustic processing is a promising platform for emerging applications. These devices make up an acoustic sensor network comprising nodes, i.e., distributed devices equipped with microphone arrays, a communication unit, and a processing unit. Algorithms for speaker separation and localization using such a network require precise knowledge of the nodes’ locations and orientations. To acquire this knowledge, a recently introduced approach proposed a combined direction of arrival (DoA) and time difference of arrival (TDoA) target function for off-line calibration with dedicated recordings. This paper proposes an extension of this approach to a novel online method with two new features: First, by employing an evolutionary algorithm on incremental measurements, it is online and fast enough for real-time application. Second, by using the sparse spike representation computed in a cochlear model for TDoA estimation, the amount of information shared between the nodes by transmission is reduced while the accuracy is increased. The proposed approach is able to calibrate an acoustic sensor network online during a meeting in a reverberant conference room.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, often based on statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
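As an illustration of how an estimated late-reverberation PSD feeds into the MWF, here is a minimal single-bin sketch under a rank-1 speech model and a spatially diffuse late-reverb model. The variable names (rtf, phi_s, phi_late, gamma_diff) are assumptions for illustration, not the notation of the surveyed estimators.

```python
import numpy as np

def mwf_weights(rtf, phi_s, phi_late, gamma_diff, mu=1.0):
    """Multichannel Wiener filter for one frequency bin.

    rtf        : relative transfer function of the source (M,)
    phi_s      : speech PSD at the reference microphone
    phi_late   : estimated late-reverberation PSD
    gamma_diff : diffuse-field coherence matrix (M, M)
    mu         : noise-reduction trade-off parameter
    """
    phi_x = phi_s * np.outer(rtf, np.conj(rtf))   # rank-1 speech covariance
    phi_v = phi_late * gamma_diff                 # late-reverb covariance
    # Wiener solution producing the speech estimate at reference mic 0
    w = np.linalg.solve(phi_x + mu * phi_v, phi_x[:, 0])
    return w
```

The quality of phi_late, supplied by one of the PSD estimators compared in the paper, directly determines how well the filter suppresses late reverberation.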
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes:
a) the acoustic impulse response model
b) the spatial filter design criterion
c) the parameter estimation algorithm
d) optional postfiltering.
We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
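One of the four axes, the spatial filter design criterion, can be illustrated with the classical MVDR solution. This is a minimal sketch for a single frequency bin, assuming a known steering vector and noise covariance; it is one representative criterion, not a summary of all techniques surveyed.

```python
import numpy as np

def mvdr_weights(steering, noise_cov):
    """Minimum variance distortionless response (MVDR) spatial filter.

    Minimizes the output noise power w^H R w subject to a distortionless
    response toward the steering vector: w^H d = 1.
    """
    Rn_inv_d = np.linalg.solve(noise_cov, steering)
    return Rn_inv_d / (np.conj(steering) @ Rn_inv_d)
```

For spatially white noise the MVDR weights reduce to a delay-and-sum beamformer, which makes the distortionless constraint easy to verify.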
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign designed to evaluate such speech enhancement (SE) and ASR techniques, to reveal the state-of-the-art techniques, and to obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue, and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that sources do not become simultaneously active. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP results in a null steering toward the RTF of interest, by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal-to-noise ratio (SNR) and signal-to-interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.
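The linear-algebra core of an oblique projection, on which BOP relies, can be sketched as follows. This is the standard construction of a projector onto one subspace along another, not the paper's full BOP-SRI procedure; matrix names A and B are generic placeholders.

```python
import numpy as np

def oblique_projection(A, B):
    """Oblique projector onto range(A) along range(B).

    Applied to measurements, the projector passes components lying in
    range(A) unchanged and nulls components lying in range(B), which is
    the kind of null steering BOP exploits.
    """
    # Orthogonal projector onto the complement of range(B)
    PB_perp = np.eye(B.shape[0]) - B @ np.linalg.pinv(B)
    # E = A (A^H PB_perp A)^{-1} A^H PB_perp, via the pseudoinverse
    return A @ np.linalg.pinv(PB_perp @ A) @ PB_perp
```

By construction, E @ A = A and E @ B = 0, even when range(A) and range(B) are not orthogonal.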
The challenge of blindly resynchronizing the data acquisition processes in a wireless acoustic sensor network (WASN) is addressed in this paper. The sampling rate offset (SRO) is precisely modeled as a time scaling. The applicability of a wideband correlation processor for estimating the SRO, even in a reverberant and multiple source environment, is presented. An explicit expression for the ambiguity function, which in our case involves time scaling of the received signals, is derived by applying truncated band-limited interpolation. We then propose the recursive band-limited interpolation (RBI) algorithm for recursive SRO estimation. A complete resynchronization scheme utilizing the RBI algorithm, in parallel with the SRO compensation module, is presented. The resulting resynchronization method operates in the time domain in a sequential manner and is thus capable of tracking a potentially time-varying SRO. We compared the performance of the proposed RBI algorithm to other available methods in a simulation study. The importance of resynchronization in a beamforming application is demonstrated by both a simulation study and experiments with a real WASN. Finally, we present an experimental study evaluating the expected SRO level between typical data acquisition devices.
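The time-scaling model of the SRO and its compensation by truncated band-limited interpolation can be sketched as follows. This is a naive offline version for illustration, not the recursive RBI algorithm; the parameter names and the fixed kernel half-length are assumptions.

```python
import numpy as np

def compensate_sro(x, sro_ppm, half_len=16):
    """Resample a signal to undo a sampling rate offset (SRO).

    The SRO is modeled as a time scaling t -> (1 + sro_ppm * 1e-6) * t;
    each output sample is interpolated at the rescaled fractional
    position with a truncated sinc (band-limited) kernel.
    """
    scale = 1.0 + sro_ppm * 1e-6
    t = np.arange(len(x)) * scale          # rescaled fractional times
    y = np.zeros(len(x))
    for i, ti in enumerate(t):
        k0 = int(np.floor(ti))
        k = np.arange(k0 - half_len, k0 + half_len + 1)
        valid = (k >= 0) & (k < len(x))
        y[i] = np.sum(x[k[valid]] * np.sinc(ti - k[valid]))
    return y
```

For typical SRO levels of tens to hundreds of ppm, the fractional offset grows slowly along the signal, which is what makes a recursive, tracking formulation attractive.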
Distributed acoustic tracking estimates the trajectories of source positions using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can then be probabilistically triangulated by exploiting the spatial diversity of all nodes.
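The reliability-weighted triangulation idea can be sketched as a weighted least-squares intersection of bearing lines. In this minimal 2-D example the weights stand in for a CDR-based reliability measure; node positions, azimuth-only DoAs, and variable names are illustrative assumptions.

```python
import numpy as np

def triangulate_doa(nodes, doas, weights):
    """Weighted least-squares triangulation from per-node DoA estimates.

    Each node at position p contributes a bearing line with direction
    u = (cos(theta), sin(theta)); the solution minimizes the weighted
    squared distance to all bearing lines.
    """
    dim = nodes.shape[1]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for p, theta, w in zip(nodes, doas, weights):
        u = np.array([np.cos(theta), np.sin(theta)])
        proj = np.eye(dim) - np.outer(u, u)   # projector onto line normal
        A += w * proj
        b += w * proj @ p
    return np.linalg.solve(A, b)
```

Down-weighting a node whose DoA is unreliable (low coherent-to-diffuse ratio) pulls the solution toward the bearings of the more reliable nodes.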
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method based on human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of multi-path conditions. The proposed method is evaluated using both simulated data and real-life recordings.
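TDoA measurements of the kind used here are commonly obtained with a generalized cross-correlation. Below is a minimal full-band GCC-PHAT sketch; the paper uses subband TDoAs and a filtering front-end, which this example does not reproduce.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, max_lag):
    """TDoA estimation with the GCC-PHAT cross-correlation.

    The phase transform whitens the cross-spectrum so the correlation
    peak depends only on the time shift. Returns the lag (in samples) of
    x1 relative to x2 (positive if x1 is delayed).
    """
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    cc = np.fft.irfft(cross, n)
    # Rearrange so lags run from -max_lag to +max_lag
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return np.argmax(np.abs(cc)) - max_lag
```

For broadband signals the PHAT weighting sharpens the correlation peak, which is what makes GCC-PHAT a common choice in mildly reverberant conditions.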
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets) and their different priorities regarding communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
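The closed-form LCMV solution that these distributed algorithms reproduce at each node can be sketched for a single frequency bin. This is a minimal centralized version under assumed variable names, not any of the three distributed realizations.

```python
import numpy as np

def lcmv_weights(C, f, noise_cov):
    """Linearly constrained minimum variance (LCMV) beamformer weights.

    Solves min_w w^H R w subject to C^H w = f, where the columns of C
    hold the steering vectors (e.g., RTFs) of the speakers and f the
    desired responses toward them.
    """
    Rinv_C = np.linalg.solve(noise_cov, C)                 # R^{-1} C
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)
```

With the responses in f set to 1 for desired speakers and 0 for interferers, the beamformer extracts one speaker while nulling the others.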
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides good noise reduction performance and preserves the binaural cues of the desired source, it does not allow controlling the reduction of the interfering sources, and it distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming to merge the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N, and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints on both the desired and the interfering sources. Besides its capability to reduce interference and noise, it also enables preserving the binaural cues of both the desired and interfering sources, hence making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., the desired amount of interference and noise reduction and the presence of estimation errors in the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To this end, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance as the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using the factor graph statistical model. In our model, the speakers can be located in one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs, and then applying a TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm in both simulated data and real-life measurements recorded in our laboratory.
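The final masking step, assigning each TF bin to one speaker, can be sketched as follows. This is a minimal illustration assuming a bin-to-speaker association map (here called assoc) is already available from the inference stage; it does not include the beamforming or LBP components.

```python
import numpy as np

def separate_by_mask(stft_mix, assoc, n_speakers):
    """Separate sources with binary time-frequency (TF) masking.

    assoc[k, l] holds the speaker index estimated for TF bin (k, l);
    each speaker's STFT keeps only the bins associated with it, under
    the usual W-disjoint orthogonality assumption for speech.
    """
    outs = []
    for s in range(n_speakers):
        mask = (assoc == s).astype(float)   # binary mask for speaker s
        outs.append(stft_mix * mask)
    return outs
```

Because the masks are disjoint and exhaustive, the per-speaker STFTs sum back to the mixture.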
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with time-varying audio channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
Distributed acoustic tracking estimates the trajectories of source positions using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can thereby be probabilistically triangulated by exploiting the spatial diversity of all nodes.
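The triangulation step admits a compact weighted least-squares form. The sketch below is a toy illustration of reliability-weighted triangulation from per-node DoA estimates; the node layout, weights, and function names are our own constructions, not taken from the letter:

```python
import numpy as np

def triangulate(nodes, doas, weights):
    """Weighted least-squares intersection of bearing lines.

    nodes   : (K, 2) array of known node positions
    doas    : (K,) DoA estimates [rad], one per node
    weights : (K,) reliability weights (e.g., coherent-to-diffuse ratios)
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, theta, w in zip(nodes, doas, weights):
        u = np.array([np.cos(theta), np.sin(theta)])
        P = np.eye(2) - np.outer(u, u)  # projector onto the line's normal space
        A += w * P
        b += w * P @ p
    return np.linalg.solve(A, b)

# Synthetic example: three nodes observing one source with exact bearings
nodes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
src = np.array([2.0, 1.5])
doas = np.array([np.arctan2(*(src - p)[::-1]) for p in nodes])
pos = triangulate(nodes, doas, np.array([1.0, 0.5, 0.2]))
```

With exact bearings the weighted solution recovers the source position regardless of the weights; the weights matter once the DoAs are noisy, down-weighting unreliable nodes.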
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities: high-dimensional acoustic features representing the full reverberant acoustic channels, and low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process, used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops and hearing aids, using them collaboratively for acoustic processing is a promising platform for emerging applications. These devices make up an acoustic sensor network comprised of nodes, i.e., distributed devices equipped with microphone arrays, a communication unit and a processing unit. Algorithms for speaker separation and localization using such a network require precise knowledge of the nodes' locations and orientations. To acquire this knowledge, a recently introduced approach proposed a combined direction of arrival (DoA) and time difference of arrival (TDoA) target function for off-line calibration with dedicated recordings. This paper proposes an extension of this approach to a novel online method with two new features: First, by employing an evolutionary algorithm on incremental measurements, it is online and fast enough for real-time application. Second, by using the sparse spike representation computed in a cochlear model for TDoA estimation, the amount of information shared between the nodes by transmission is reduced while the accuracy is increased. The proposed approach is able to calibrate an acoustic sensor network online during a meeting in a reverberant conference room.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adapt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, merely controls the white noise gain (WNG) of the beamformer. We also show that, in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
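The LCMP beamformer referenced above has the closed form w = R⁻¹C(CᴴR⁻¹C)⁻¹g. A minimal numerical sketch follows; the array size, look/null directions, and noise covariance are illustrative assumptions, not the letter's setup:

```python
import numpy as np

def lcmp_weights(R, C, g):
    """LCMP weights: w = R^{-1} C (C^H R^{-1} C)^{-1} g."""
    Ri_C = np.linalg.solve(R, C)
    return Ri_C @ np.linalg.solve(C.conj().T @ Ri_C, g)

N = 6  # sensors of a half-wavelength ULA (assumed geometry)
steer = lambda th: np.exp(-1j * np.pi * np.arange(N) * np.sin(th))
C = np.column_stack([steer(0.0), steer(np.pi / 4)])  # look + null direction
g = np.array([1.0, 0.0])                             # distortionless + null

rng = np.random.default_rng(0)
X = rng.standard_normal((N, 200)) + 1j * rng.standard_normal((N, 200))
R = X @ X.conj().T / 200 + np.eye(N)  # sample covariance with loading
w = lcmp_weights(R, C, g)
```

By construction the weights satisfy the linear constraints exactly: unit response towards the look direction and a spatial null at the interferer.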
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
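The Kalman-based E-step recursion can be illustrated with a scalar stand-in: a Kalman filter tracking a slowly varying random-walk coefficient (a toy surrogate for the time-varying acoustic system) from noisy observations. All signals and parameters below are synthetic assumptions, not the paper's multichannel model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, q, r = 500, 1e-4, 0.1  # steps, process variance, observation variance
# True coefficient: a slow random walk around 1.0
h = np.cumsum(np.sqrt(q) * rng.standard_normal(T)) + 1.0
y = h + np.sqrt(r) * rng.standard_normal(T)  # noisy observations

x, P = 0.0, 1.0  # prior mean and variance
for t in range(T):
    P += q                  # predict step (random-walk model)
    K = P / (P + r)         # Kalman gain
    x += K * (y[t] - x)     # update with the innovation
    P *= (1 - K)            # posterior variance
```

The posterior variance settles near sqrt(q*r), so the online estimate follows the drifting coefficient far more closely than any single raw observation would.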
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominant speakers, a spectral mask is computed and utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
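The convex-geometry step can be sketched on synthetic data: the leading eigenvectors of the frame correlation matrix embed the frames in a simplex whose vertices are single-speaker frames, and a successive projection pass (a standard vertex-detection tool, used here as a stand-in for the papers' exact procedure) recovers those frames. The activity matrix below is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
J, T = 3, 30  # speakers, time frames
# Speaker-activity probabilities: mostly mixed frames...
P = rng.dirichlet(np.ones(J) * 2.0, size=T)
# ...plus three single-speaker ("vertex") frames at indices 0, 10, 20
P[0], P[10], P[20] = np.eye(J)

R = P @ P.T                    # frame correlation matrix
_, vecs = np.linalg.eigh(R)
E = vecs[:, -J:]               # rows: simplex embedding of the frames

def spa(E, J):
    """Successive projection: pick the max-norm row, deflate, repeat."""
    X, idx = E.copy(), []
    for _ in range(J):
        k = int(np.argmax(np.linalg.norm(X, axis=1)))
        idx.append(k)
        v = X[k] / np.linalg.norm(X[k])
        X -= np.outer(X @ v, v)  # project out the found vertex direction
    return sorted(idx)

vertices = spa(E, J)
```

Because every mixed frame is a convex combination of the single-speaker frames, the maximum-norm row at each deflation step is always a simplex vertex, so the three planted pure frames are recovered.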
Objective: The present study implements an automatic method of assessing arousal in vocal data as well as dynamic system models to explore intrapersonal and interpersonal affect dynamics within psychotherapy and to determine whether these dynamics are associated with treatment outcomes. Method: The data of 21,133 mean vocal arousal observations were extracted from 279 therapy sessions in a sample of 30 clients treated by 24 therapists. Before and after each session, clients self-reported their well-being level, using the Outcome Rating Scale. Results: Both clients’ and therapists’ vocal arousal showed intrapersonal dampening. Specifically, although both therapists and clients departed from their baseline, their vocal arousal levels were “pulled” back to these baselines. In addition, both clients and therapists exhibited interpersonal dampening. Specifically, both the clients’ and the therapists’ levels of arousal were “pulled” toward the other party’s arousal level, and clients were “pulled” by their therapists’ vocal arousal toward their own baseline. These dynamics exhibited a linear change over the course of treatment: whereas interpersonal dampening decreased over time, there was an increase in intrapersonal dampening over time. In addition, higher levels of interpersonal dampening were associated with better session outcomes. Conclusions: These findings demonstrate the advantages of using automatic vocal measures to capture nuanced intrapersonal and interpersonal affect dynamics in psychotherapy and demonstrate how these dynamics are associated with treatment gains.
The IEEE Audio and Acoustic Signal Processing Technical Committee (AASP TC) is one of 13 TCs in the IEEE Signal Processing Society. Its mission is to support, nourish, and lead scientific and technological development in all areas of AASP. These areas are currently seeing increased levels of interest and significant growth, providing a fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning (ML), from consumer electronics devices to blue-sky research, this scope encompasses countless technical challenges and many hot topics. The TC has roughly 30 elected volunteer members drawn equally from leading academic and industrial organizations around the world, unified by the common aim of offering their expertise in the service of the scientific community.
The gain achieved by a superdirective beamformer operating in a diffuse noise field is significantly higher than the gain attainable with conventional delay-and-sum weights. A classical result states that for a compact linear array consisting of N sensors which receives a plane-wave signal from the endfire direction, the optimal superdirective gain approaches N². It has been noted that in the near-field regime higher gains can be attained. The gain can increase, in theory, without bound for increasing wavelength or decreasing source-receiver distance. We aim to address the phenomenon of near-field superdirectivity in a comprehensive manner. We derive the optimal performance for the limiting case of an infinitesimal-aperture array receiving a spherical-wave signal. This is done with the aid of a sequence of linear transformations. The resulting gain expression is a polynomial, which depends on the number of sensors employed, the wavelength, and the source-receiver distance. The resulting gain curves are optimal and outperform weights corresponding to other superdirectivity methods. The practical case of a finite-aperture array is discussed. We present conditions for which the gain of such an array would approach that predicted by the theory of the infinitesimal case. The white noise gain (WNG) metric of robustness is shown to increase in the near-field regime.
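The classical endfire limit can be checked numerically. Assuming the standard diffuse-field coherence sin(kd)/(kd) and an endfire plane-wave steering vector (array spacing and sizes below are illustrative choices), the optimal gain dᴴΓ⁻¹d approaches N² as the aperture shrinks:

```python
import numpy as np

def superdirective_gain(N, spacing_wl):
    """Optimal diffuse-noise gain of an endfire ULA, spacing in wavelengths."""
    k_d = 2 * np.pi * spacing_wl           # inter-sensor spacing in kd units
    pos = np.arange(N) * k_d
    # Diffuse-field coherence sin(kd)/(kd); np.sinc(x) = sin(pi x)/(pi x)
    Gamma = np.sinc((pos[:, None] - pos[None, :]) / np.pi)
    d = np.exp(-1j * pos)                  # endfire plane-wave steering vector
    return float(np.real(d.conj() @ np.linalg.solve(Gamma, d)))

G2 = superdirective_gain(2, 0.005)  # small aperture: expect gain near 4
G3 = superdirective_gain(3, 0.005)  # small aperture: expect gain near 9
```

For comparison, delay-and-sum weights on such a compact array achieve almost no diffuse-noise gain, since the noise itself is nearly coherent across the tiny aperture.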
This paper addresses the problem of tracking a moving source, e.g., a robot, equipped with both receivers and a source, that is tracking its own location and simultaneously estimating the locations of multiple plane reflectors. We assume a noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates a successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that for the examples tested in simulation, the HCRB is an asymptotically tight bound and that the EKF is an optimal estimator. Whether this property is true in general remains an open question.
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g., the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is smaller than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator, which jointly estimates the speech, reverberation and noise PSDs; the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator that jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator that blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
A recursive maximum-likelihood algorithm (RML) is proposed that can be used when both the observations and the hidden data have continuous values and are statistically dependent between different time samples. The algorithm recursively approximates the probability density functions of the observed and hidden data by analytically computing the integrals with respect to the state variables, where the parameters are updated using gradient steps. A full convergence proof is given, based on the ordinary differential equation approach, which shows that the algorithm converges to a local minimum of the Kullback-Leibler divergence between the true and the estimated parametric probability density functions, a result that is useful even for a misspecified parametric model. Compared to other RML algorithms proposed in the literature, this contribution extends the state-space model and provides a theoretical analysis in a non-trivial statistical model that was not analyzed so far. We further extend the RML analysis to constrained parameter estimation problems. Two examples, including nonlinear state-space models, are given to highlight this contribution.
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSEs of the two estimators indeed follow the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints for both the desired and the interfering sources. Besides its capabilities to reduce interference and noise, it also enables preserving the binaural cues of both the desired and interfering sources, hence making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., the desired amount of interference and noise reduction and the presence of estimation errors in the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To address this, in this paper we propose two extensions of the binaural MWF, namely the binaural MWF with RTF preservation (MWF-RTF), aiming to preserve the RTF of the interfering source, and the binaural MWF with interference rejection (MWF-IR), aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance to the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look and null directions. In this contribution, we show that, when the optimal DF is averaged over all look directions, the result equals the number of sensors minus the number of null constraints. This result holds regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
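The stated identity is easy to verify numerically in a toy 2-D setting of our own construction (grid-based isotropic noise, random sensor positions); the sketch uses the Schur-complement form of the null-constrained optimal DF:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 6, 2, 360                      # sensors, null constraints, grid size
pos = rng.uniform(0, 2, size=(N, 2))     # arbitrary sensor positions (wavelengths)
grid = 2 * np.pi * np.arange(M) / M      # dense grid of candidate directions

def steer(th):
    u = np.array([np.cos(th), np.sin(th)])
    return np.exp(-2j * np.pi * pos @ u)  # plane-wave steering, unit wavelength

D = np.column_stack([steer(th) for th in grid])  # N x M steering matrix
Gamma = D @ D.conj().T / M               # grid-based isotropic noise covariance

B = D[:, [0, 90]]                        # null directions: 0 and 90 degrees
Gi_B = np.linalg.solve(Gamma, B)
# Optimal null-constrained DF is d^H Q d with the deflated inverse Q:
Q = np.linalg.inv(Gamma) - Gi_B @ np.linalg.solve(B.conj().T @ Gi_B, Gi_B.conj().T)
df = np.real(np.einsum('nm,nk,km->m', D.conj(), Q, D))  # DF per look direction
avg_df = df.mean()
```

Averaging dᴴQd over the same grid that defines Gamma gives tr(QGamma) = N - K exactly, matching the claimed identity for any sensor layout.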
This paper presents a new dataset of measured multichannel Room Impulse Responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflectors position estimation. The dataset is provided with software utilities to easily access, manipulate and visualize the data as well as baseline methods for echo-related tasks.
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they provide accurate estimates at SNRs 10 dB lower than the commonly used TOA extraction from room impulse responses. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping of an indoor environment.
The problem of blind and online speaker localization and separation using multiple microphones is addressed
based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is
proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer
function (RTF) estimation. The DOA estimation task uses only the time frequency (TF) bins dominated by a
single speaker while the entire frequency range is not required to accomplish this task. In contrast, the RTF
estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin.
Accordingly, a different statistical model is used for the two tasks. The first REM model is applied under the
assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG)
model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are
estimated using these bins. The second REM model is applied under the assumption that the speakers are
concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate
the speakers. As a result of the assumption of the concurrent speakers, a more precise TF map of the speakers’
activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which are
constructed using the DOAs obtained in the previous stage. Next, using the linearly constrained minimum
variance (LCMV)-BF that utilizes the estimated RTFs, the speech signals are separated. The algorithm is
evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the
estimated DOAs and the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA
estimation and speaker separation algorithm.
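As an illustration of the MoG-based E-step described above, the following minimal Python sketch (a toy with assumed scalar features and unit variances, not the paper's exact statistical model) computes the posterior probability that each TF bin is dominated by each of K candidate speakers:

```python
import numpy as np

def mog_responsibilities(features, means, variances, weights):
    """Posterior probability that each TF bin (one scalar spatial
    feature per bin) belongs to each of K mixture components.
    features: (N,); means, variances, weights: (K,)."""
    features = np.asarray(features, dtype=float)[:, None]   # (N, 1)
    # Gaussian likelihood of each bin under each component
    lik = np.exp(-0.5 * (features - means) ** 2 / variances)
    lik /= np.sqrt(2.0 * np.pi * variances)
    post = weights * lik                                    # (N, K)
    return post / post.sum(axis=1, keepdims=True)

resp = mog_responsibilities([0.1, 2.9],
                            means=np.array([0.0, 3.0]),
                            variances=np.array([1.0, 1.0]),
                            weights=np.array([0.5, 0.5]))
# Each row sums to one; bins near a component mean are attributed to it.
```

Bins whose responsibility is concentrated on one component would then be treated as single-speaker dominated and used for DOA estimation.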
This paper addresses the problem of a moving agent, e.g., a robot, equipped with both receivers and a source, that tracks its own location and simultaneously estimates the locations of multiple plane reflectors. We assume noisy knowledge of the robot’s movement. We formulate this problem, also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that, for the examples tested in simulation, the HCRB is an asymptotically tight bound and the EKF is an optimal estimator. Whether this property holds in general remains an open question.
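The EKF recursion at the heart of the approach above can be sketched in miniature. The following Python example (a generic predict-update cycle with an assumed 2-D constant-velocity state and a nonlinear range observation, not the paper's hybrid SLAM formulation) shows the linearization via the observation Jacobian:

```python
import numpy as np

def ekf_step(x, P, z, F, Q, R):
    """One EKF predict + update for state x with covariance P,
    range measurement z, linear dynamics F, noises Q and R."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Nonlinear range observation h(x) = ||position|| and its Jacobian
    r = np.hypot(x[0], x[1])
    H = np.array([[x[0] / r, x[1] / r, 0.0, 0.0]])
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + (K @ np.atleast_1d(z - r)).ravel()
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 0.1
F = np.block([[np.eye(2), dt * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])
x, P = np.array([3.0, 4.0, 0.0, 0.0]), np.eye(4)
x, P = ekf_step(x, P, z=5.2, F=F, Q=0.01 * np.eye(4), R=np.array([[0.1]]))
# The update pulls the predicted range (5.0) towards the measurement (5.2).
```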
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target tracking, we apply a non-Bayesian approach, which makes no assumptions on the target trajectories other than a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach and is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated in an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the common causal REM.
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple-speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested in an extensive simulation study.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, which are often based in statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
A recursive maximum-likelihood (RML) algorithm is proposed that can be used when both the observations and the hidden data have continuous values and are statistically dependent between different time samples. The algorithm recursively approximates the probability density functions of the observed and hidden data by analytically computing the integrals with respect to the state variables, where the parameters are updated using gradient steps. A full convergence proof is given, based on the ordinary differential equation approach, which shows that the algorithm converges to a local minimum of the Kullback-Leibler divergence between the true and the estimated parametric probability density functions; a result which is useful even for a misspecified parametric model. Compared to other RML algorithms proposed in the literature, this contribution extends the state-space model and provides a theoretical analysis of a non-trivial statistical model that has not been analyzed so far. We further extend the RML analysis to constrained parameter estimation problems. Two examples, including nonlinear state-space models, are given to highlight this contribution.
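The flavor of such recursive gradient-based ML updates can be conveyed by a toy example (an illustration only, not the paper's algorithm): online estimation of a Gaussian mean with known unit variance, where the gradient of the instantaneous log-likelihood reduces to the innovation and the step size diminishes over time:

```python
import numpy as np

# Toy recursive ML: estimate the mean of N(2, 1) from a stream of
# samples. The score of log N(x; theta, 1) w.r.t. theta is (x - theta),
# so each gradient step nudges theta towards the new sample.
rng = np.random.default_rng(0)
theta = 0.0
for n, x in enumerate(rng.normal(loc=2.0, scale=1.0, size=5000), start=1):
    step = 1.0 / n                  # diminishing step size
    theta += step * (x - theta)     # gradient step on the log-likelihood
# theta converges towards the true mean 2.0
```

With this particular step-size schedule the recursion coincides with the running sample mean, which makes the convergence easy to verify; the RML algorithms above handle far richer state-space dependence, but the update structure is analogous.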
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which obviously increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions.
In this study, we propose to combine a method based on human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm.
Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of the multi-path conditions. The proposed
method is evaluated using both simulated data and real-life recordings.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we
define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step, the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic and unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the
multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
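The minimum variance distortionless response beamformer invoked in the M-step above has a well-known closed form: it minimizes output noise power subject to a distortionless constraint in the steering direction. A minimal sketch (the covariance and steering vector are toy values chosen for illustration):

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d): pass the steering
    direction undistorted while minimizing noise power at the output."""
    Rinv_d = np.linalg.solve(R_noise, steering)
    return Rinv_d / (steering.conj() @ Rinv_d)

d = np.ones(4, dtype=complex)            # toy broadside steering, 4 mics
R = np.eye(4) + 0.5 * np.ones((4, 4))    # toy spatially correlated noise
w = mvdr_weights(R, d)
# Distortionless constraint: w^H d equals 1 by construction.
```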
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information
corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend the DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test to be used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections, and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation components. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times of 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington’s scheme, is a Newton-based recursion. In this work we also extend Titterington’s method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappé and Moulines’ scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.
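The Cappé and Moulines variant mentioned above smooths the E-step sufficient statistics with a diminishing step size before each M-step. A toy Python sketch (a two-component Gaussian mixture with known means and variances, so that only the component weights are estimated; the data-generation setup is an assumption for the example):

```python
import numpy as np

# Recursive EM for mixture weights: for each new sample, compute its
# posterior (E-step), blend it into a running sufficient statistic with
# step size 1/n (stochastic approximation), and re-maximize (M-step).
means, var = np.array([0.0, 4.0]), 1.0
weights = np.array([0.5, 0.5])
s = weights.copy()                        # running sufficient statistic

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 1500),
                       rng.normal(4.0, 1.0, 500)])
rng.shuffle(data)

for n, x in enumerate(data, start=1):
    lik = np.exp(-0.5 * (x - means) ** 2 / var)
    post = weights * lik
    post /= post.sum()                    # E-step for the new sample
    gamma = 1.0 / n                       # diminishing step size
    s = (1.0 - gamma) * s + gamma * post  # smooth the statistics
    weights = s                           # M-step: weights = smoothed stats
# weights approach roughly [0.75, 0.25], the true mixing proportions
```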
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities corresponding
to high-dimensional acoustic features representing the full reverberant acoustic channels as well as low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the
observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
The problem of single-source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed from measurements collected in advance, and consists of a limited number of labelled measurements, attached with corresponding positions, and a larger number of unlabelled measurements from unknown locations. Further information about the enclosure
characteristics or the microphone positions is not required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters
of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation
and noise levels.
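The Gaussian-process mapping from measurement-based features to positions described above can be illustrated on toy one-dimensional data. The RBF kernel and its hyperparameters here are assumptions for the example, not the paper's covariance function:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between row-vector sample sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Toy labelled training set: one acoustic feature per sample, mapped
# to a known source position (here a linear relation, for clarity).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
X_test = np.array([[1.5]])

K = rbf_kernel(X_train, X_train) + 1e-6 * np.eye(len(X_train))  # jitter
k_star = rbf_kernel(X_test, X_train)
y_pred = k_star @ np.linalg.solve(K, y_train)   # GP posterior mean
# y_pred lands near 1.5, between the two closest training positions.
```

In the paper's setting the kernel additionally encodes relations between microphone pairs, and its parameters are fitted by maximizing the ML criterion rather than fixed by hand.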
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, which involves smoothness constraints on possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted as new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
In this study, we present a deep neural network-based online multi-speaker localization algorithm based on a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.
Localization in reverberant environments remains an open challenge. Recently, supervised learning approaches have demonstrated very promising results in addressing reverberation. However, even with large data volumes, the number of labels available for supervised learning in such environments is usually small. We propose to address this issue with a semi-supervised learning (SSL) approach, based on deep generative modeling. Our chosen deep generative model, the variational autoencoder (VAE), is trained to generate the phase of relative transfer functions (RTFs) between microphones. In parallel, a direction of arrival (DOA) classifier network based on RTF-phase is also trained. The joint generative and discriminative model, termed VAE-SSL, is trained using labeled and unlabeled RTF-phase sequences. In learning to generate and classify the sequences, the VAE-SSL extracts the physical causes of the RTF-phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. This facilitates effective end-to-end operation of the VAE-SSL, which requires minimal preprocessing of RTF-phase. VAE-SSL is compared with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully supervised CNNs. The approaches are compared using data from two real acoustic environments – one of which was recently obtained at the Technical University of Denmark specifically for our study. We find that VAE-SSL can outperform the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples which capture the physics of the acoustic environment. Thus, the generative modeling in VAE-SSL provides a means of interpreting the learned representations.
To the best of our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling.
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT), and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, obtaining an estimate of the STFT magnitude of the source signal. Experiments regarding both speech enhancement and automatic speech recognition are conducted, which demonstrate that the proposed method can effectively suppress reverberation, even for the difficult case of a moving speaker.
In this study, we present a deep neural network-based online multi-speaker localization algorithm based on a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. Elaborated experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.
This paper addresses the problem of tracking a moving source, e.g., a robot, equipped with both receivers and a source, that is tracking its own location and simultaneously estimating the locations of multiple plane reflectors. We assume a noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates a successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which shows that the EKF approaches the hybrid Cramér-Rao bound (CRB) (HCRB) as the number of observation increases. This result implies that for the examples tested in simulation, the HCRB is an asymptotically tight bound and that the EKF is an optimal estimator. Whether this property is true in general remains an open question.
Localization in reverberant environments remains an open challenge. Recently, supervised learning approaches have demonstrated very promising results in addressing reverberation. However, even with large data volumes, the number of labels available for supervised learning in such environments is usually small. We propose to address this issue with a semi-supervised learning (SSL) approach based on deep generative modeling. Our chosen deep generative model, the variational autoencoder (VAE), is trained to generate the phase of relative transfer functions (RTFs) between microphones. In parallel, a direction of arrival (DOA) classifier network based on the RTF phase is also trained. The joint generative and discriminative model, dubbed VAE-SSL, is trained using labeled and unlabeled RTF-phase sequences. In learning to generate and classify the sequences, the VAE-SSL extracts the physical causes of the RTF phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. This facilitates effective end-to-end operation of the VAE-SSL, which requires minimal preprocessing of the RTF phase. VAE-SSL is compared with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully supervised CNNs. The approaches are compared using data from two real acoustic environments, one of which was recently obtained at the Technical University of Denmark specifically for our study. We find that VAE-SSL can outperform the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples which capture the physics of the acoustic environment. Thus, the generative modeling in VAE-SSL provides a means of interpreting the learned representations.
To the best of our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling.
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target tracking, we apply a non-Bayesian approach, which makes no assumptions about the target trajectories, except for a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach and is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated in an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the common causal REM.
This paper develops a semi-supervised algorithm to address the challenging multi-source localization problem in a noisy and reverberant environment, using a spherical harmonics domain source feature, the relative harmonic coefficients. We present a comprehensive study of this source feature, including (i) an illustration confirming its sole dependence on the source position, (ii) a feature estimator in the presence of noise, and (iii) a feature selector exploiting its inherent directivity over space. Source features at varied spherical harmonic modes, each representing a unique characterization of the soundfield, are fused by Multi-Mode Gaussian Process modeling. Based on the unifying model, we then formulate the mapping function revealing the underlying relationship between the source feature(s) and position(s) using a Bayesian inference approach. The issue of overlapped components is addressed by a pre-processing technique performing overlapped frame detection, which in turn reduces this challenging problem to single-source localization. We highlight that this data-driven method has strong potential for practical implementation because only a limited number of labeled measurements is required. We evaluate the proposed algorithm using simulated recordings of multiple speakers in diverse environments, and extensive results confirm improved performance in comparison with state-of-the-art methods. Additional assessments using real-life recordings further prove the effectiveness of the method, even in unfavorable circumstances with severe source overlapping.
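The Gaussian-process mapping idea underlying such semi-supervised localization can be caricatured in one dimension. Everything below is a synthetic placeholder (an RBF kernel, a scalar feature, a linear feature-to-position relation); it only shows how a handful of labeled feature/position pairs yield a posterior-mean prediction at unlabeled features.

```python
import numpy as np

# One-dimensional caricature of GP-based localization: labeled (feature,
# position) pairs train a GP; its posterior mean predicts positions for
# unlabeled features. Kernel, data, and the linear ground truth are synthetic.
def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

feat_train = np.linspace(0.0, 1.0, 8)       # labeled features
pos_train = 2.0 * feat_train + 0.3          # their (synthetic) positions

feat_test = np.array([0.25, 0.75])          # unlabeled features
K = rbf(feat_train, feat_train) + 1e-6 * np.eye(8)   # jitter for stability
k_star = rbf(feat_test, feat_train)
pos_est = k_star @ np.linalg.solve(K, pos_train)     # GP posterior mean
```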
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using a factor graph statistical model. In our model, the speakers can be located at one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs and then applying TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm on both simulated data and real-life measurements recorded in our laboratory.
Ad hoc acoustic networks comprising multiple nodes, each consisting of several microphones, are addressed. Owing to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers' locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (outdoors and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
Distributed acoustic tracking estimates the trajectories of sound sources using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can then be probabilistically triangulated by exploiting the spatial diversity of all nodes.
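The triangulation step can be illustrated with a weighted least-squares sketch: each node contributes a bearing line (node position plus DoA unit vector) and a scalar reliability weight standing in for the coherent-to-diffuse ratio, and the source estimate is the point minimizing the weighted squared distance to all lines. The geometry and weights below are synthetic.

```python
import numpy as np

# Weighted least-squares triangulation sketch: each node supplies a bearing
# line (position + DoA unit vector) and a reliability weight standing in for
# the coherent-to-diffuse ratio; the estimate minimizes the weighted squared
# distance to all lines.
def triangulate(nodes, doas, weights):
    A, b = np.zeros((2, 2)), np.zeros(2)
    for p, u, w in zip(nodes, doas, weights):
        proj = np.eye(2) - np.outer(u, u)   # projector orthogonal to bearing
        A += w * proj
        b += w * proj @ p
    return np.linalg.solve(A, b)

src = np.array([2.0, 3.0])                  # synthetic ground truth
nodes = [np.array([0.0, 0.0]), np.array([5.0, 0.0]), np.array([0.0, 6.0])]
doas = [(src - p) / np.linalg.norm(src - p) for p in nodes]  # exact bearings
weights = [1.0, 0.5, 2.0]                   # mock reliability weights
est = triangulate(nodes, doas, weights)
```

With exact bearings the lines are concurrent and the weighted solution recovers the source regardless of the weights; with noisy bearings the weights tilt the estimate toward the reliable nodes.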
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities corresponding
to high-dimensional acoustic features representing the full reverberant acoustic channels as well as low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the
observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, whose influence naturally increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method inspired by human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase difference measurements modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of multipath conditions. The proposed method is evaluated using both simulated data and real-life recordings.
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test and used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The problem of single source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed by prerecorded measurements collected in advance, and consists of a limited number of labelled measurements, attached with their corresponding positions, and a larger number of unlabelled measurements from unknown locations. No further information about the enclosure characteristics or the microphone positions is required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters
of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation
and noise levels.
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an interframe spectral subtraction algorithm is proposed, which enables estimation of the noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most acoustic conditions.
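The PSD-ratio principle behind RTF-type features (though not the full DP-RTF estimator, which additionally isolates the direct path and subtracts noise) can be shown in a few lines: with a true per-frequency channel ratio, the frame-averaged cross-PSD divided by the auto-PSD recovers that ratio even though each frame's source spectrum is random. All signals are synthetic and noiseless.

```python
import numpy as np

rng = np.random.default_rng(1)

# PSD-ratio sketch (not the full DP-RTF estimator): with a true per-frequency
# channel ratio h, the frame-averaged cross-PSD divided by the auto-PSD
# recovers h, even though each frame's source spectrum is random.
F, T = 5, 200
h_true = rng.standard_normal(F) + 1j * rng.standard_normal(F)
s = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))  # source STFT
x1, x2 = s, h_true[:, None] * s            # two noiseless channels

phi_11 = np.mean(x1 * np.conj(x1), axis=1)  # auto-PSD of the reference
phi_21 = np.mean(x2 * np.conj(x1), axis=1)  # cross-PSD
h_est = phi_21 / phi_11
```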
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, which involves smoothness constraints on possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted as new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
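The generalized cross-correlation baseline mentioned above has a compact PHAT-weighted form: whitening the cross-spectrum keeps only its phase, so the inverse FFT peaks at the inter-microphone delay. The signals and the 5-sample delay below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# GCC-PHAT sketch: the phase transform whitens the cross-spectrum, so its
# inverse FFT peaks at the inter-microphone delay. Delay and signals synthetic.
N, delay = 1024, 5
x1 = rng.standard_normal(N)
x2 = np.roll(x1, delay)                     # second mic: same signal, delayed

X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
cross = X2 * np.conj(X1)
gcc = np.fft.irfft(cross / np.abs(cross))   # PHAT weighting keeps phase only
est_delay = int(np.argmax(gcc))
```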
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (such as ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington’s scheme, is a Newton-based recursion. In this work we also extend Titterington’s method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappé and Moulines’ scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.
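The recursive EM flavor in the spirit of Cappé and Moulines' scheme can be illustrated on a toy problem (this is not the constrained multi-speaker tracker of the paper): the E-step sufficient statistics of a two-component 1-D Gaussian mixture are smoothed with a decaying step size, and the M-step re-solves the component means in closed form. All data and parameters are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy recursive EM in the spirit of Cappé and Moulines' scheme: E-step
# sufficient statistics are smoothed with a decaying step size, and the
# M-step re-solves the parameters in closed form. Here the parameters are
# the two means of a 1-D Gaussian mixture (a stand-in for speaker locations);
# responsibilities use a unit-variance model for simplicity.
true_mu = np.array([-3.0, 4.0])
mu = np.array([-1.0, 1.0])                  # initial means
s0 = np.array([0.5, 0.5])                   # smoothed responsibility mass
s1 = mu * s0                                # smoothed weighted observations

for n in range(2000):
    x = true_mu[rng.integers(2)] + 0.3 * rng.standard_normal()
    # E-step: responsibilities of the current observation.
    logp = -0.5 * (x - mu) ** 2
    r = np.exp(logp - logp.max())
    r /= r.sum()
    # Stochastic approximation of the sufficient statistics.
    gamma = 1.0 / (n + 2)
    s0 = (1 - gamma) * s0 + gamma * r
    s1 = (1 - gamma) * s1 + gamma * r * x
    # M-step: closed-form update of the means.
    mu = s1 / s0
```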
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides good noise reduction performance and preserves the binaural cues of the desired source, it does not allow control of the reduction of the interfering sources and distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming to merge the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown
and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator, that jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator, that blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao Bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals, shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with time-varying audio channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
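The MVDR interpretation mentioned in the abstract rests on the textbook beamformer w = R⁻¹d / (dᴴR⁻¹d), which gives a distortionless response toward the steering vector d while minimizing output power. The sketch below uses toy values for d and the noise covariance, not quantities estimated by the VEM algorithm.

```python
import numpy as np

# Textbook MVDR weights, w = R^{-1} d / (d^H R^{-1} d): distortionless toward
# the steering vector d, minimum output power otherwise. d and the noise
# covariance here are toy values, not VEM estimates.
M = 3
d = np.exp(1j * 0.4 * np.arange(M))                    # toy steering vector
Rn = np.eye(M, dtype=complex) + 0.5 * np.ones((M, M))  # toy noise covariance
Rn_inv_d = np.linalg.solve(Rn, d)
w = Rn_inv_d / (d.conj() @ Rn_inv_d)                   # MVDR weights
```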
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable to blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common-zeros problem of CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we use only the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
The reverberation power spectral density (PSD) is often required by dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) of the two ML estimators of the reverberation PSD, the Cramér-Rao Bounds (CRBs) for the two estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multi-speaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: i) a multi-speaker linearly constrained minimum variance (LCMV) beamformer (BF), and ii) a subsequent multi-speaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers (as manifested by the respective acoustic transfer functions (ATFs)) and the noise spatial correlation matrix, while the second stage exploits the speakers' power spectral density matrix to reduce the residual noise at the output of the first stage. The output vector of the multi-speaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multi-speaker case is also derived given the multi-speaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multi-speaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
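The first stage of the decomposition above, the multi-speaker LCMV beamformer, reduces to a short linear-algebra sketch: with a constraint matrix C holding the two speakers' steering vectors and desired responses g = [1, 0], the weights w = R⁻¹C (CᴴR⁻¹C)⁻¹ g pass speaker 1 undistorted and null speaker 2. The steering vectors and white-noise covariance below are synthetic.

```python
import numpy as np

# LCMV sketch: with constraint matrix C holding two steering vectors and
# desired responses g = [1, 0], the weights w = R^{-1} C (C^H R^{-1} C)^{-1} g
# pass speaker 1 with unit gain and null speaker 2. Values are synthetic.
M = 4
a1 = np.ones(M, dtype=complex)                  # steering vector, speaker 1
a2 = np.exp(1j * np.pi * np.arange(M))          # steering vector, speaker 2
C = np.stack([a1, a2], axis=1)                  # (M, 2) constraint matrix
g = np.array([1.0, 0.0], dtype=complex)         # keep speaker 1, null speaker 2
Rn = np.eye(M, dtype=complex)                   # noise covariance (white here)

Rn_inv_C = np.linalg.solve(Rn, C)
w = Rn_inv_C @ np.linalg.solve(C.conj().T @ Rn_inv_C, g)   # LCMV weights
```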
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes:
a) the acoustic impulse response model,
b) the spatial filter design criterion,
c) the parameter estimation algorithm,
d) optional postfiltering.
We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using relative transfer functions (RTFs) of the sources, which are ratios of acoustic transfer functions (ATFs) between any microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, without resorting to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized to ensure convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we estimate RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone, we suggest using the output signals of LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
In this paper, we present a single-microphone speech enhancement algorithm. A hybrid approach is proposed, merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm is executed in two phases: the training phase, which does not recur, and the test phase. First, the noise-free speech log-power spectral density is modeled as an MoG, representing the phoneme-based diversity in the speech signal. A DNN is then trained on a phoneme-labeled database of clean speech signals for phoneme classification, with mel-frequency cepstral coefficients as the input features. In the test phase, a noisy utterance of an untrained speaker is processed. Given the phoneme classification results of the noisy speech utterance, a speech presence probability (SPP) is obtained using both the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while, simultaneously, the noise estimate is updated. The discriminative DNN maintains the continuity of the speech, and the generative phoneme-based MoG preserves the speech spectral structure. An extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms and show that it obtains a significant improvement over previous methods in terms of speech quality measures. Finally, we analyze the contribution of all components of the proposed algorithm, indicating their combined importance.
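The SPP-controlled attenuation mentioned above can be illustrated with a simple spectral gain. The exponential gain rule below is a common choice in SPP-driven enhancers and is an assumption made here for illustration, not necessarily the paper's exact rule:

```python
import numpy as np

def spp_gain(noisy_stft, spp, g_min=0.1):
    """SPP-controlled attenuation: leave bins with high speech presence
    probability untouched, attenuate the rest towards the floor g_min.

    The gain g_min**(1 - spp) equals 1 where spp = 1 and g_min where spp = 0.
    """
    gain = g_min ** (1.0 - spp)
    return gain * noisy_stft

# Toy STFT magnitudes with per-bin speech presence probabilities.
X = np.array([[1.0 + 0j, 2.0], [3.0, 4.0]])
p = np.array([[1.0, 0.0], [0.5, 1.0]])
Y = spp_gain(X, p, g_min=0.1)
print(Y)
```

Bins deemed speech-present (spp = 1) pass unchanged, while speech-absent bins are attenuated by 20 dB here; the spectral floor g_min avoids musical-noise artifacts from zeroing bins outright.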
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user’s voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses’ temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending itself to more effective noise suppression). Since changes in the array’s position correspond to the desired speaker’s movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user’s head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation components. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times set to 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely-used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adapt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, merely controls the white noise gain (WNG) of the beamformer. We also show that, in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
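The affine projection algorithm (APA), to which the KFB is shown to reduce in static scenarios, has a simple recursion: w ← w + μ U (UᴴU + δI)⁻¹ (d − Uᴴw). A hedged numpy sketch follows; the system-identification toy setup, the use of independent regressor columns, and all parameter values are illustrative assumptions:

```python
import numpy as np

def apa_update(w, U, d, mu=1.0, delta=1e-12):
    """One affine projection algorithm (APA) step.

    U : (L, P) matrix whose columns are the last P input (regressor) vectors
    d : (P,)   corresponding desired samples
    """
    e = d - U.conj().T @ w                          # a priori errors
    return w + mu * U @ np.linalg.solve(
        U.conj().T @ U + delta * np.eye(U.shape[1]), e)

# Toy system identification: recover an unknown 4-tap filter h from
# noiseless data, using P = 2 regressors per step. (For simplicity the
# regressors are independent random vectors, not a shifted signal.)
rng = np.random.default_rng(1)
L, P = 4, 2
h = rng.standard_normal(L)
w = np.zeros(L)
X = rng.standard_normal((L, 400))
for k in range(P, 400):
    U = X[:, k - P:k]
    d = U.T @ h
    w = apa_update(w, U, d)
print(np.linalg.norm(w - h))
```

With step size μ = 1 each update projects the weight error onto the orthogonal complement of the current regressor span, so in the noiseless case the error shrinks monotonically to (numerical) zero.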
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To this end, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance as the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions the result equals the number of sensors minus the number of null constraints. This result holds, regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
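The stated result, that the optimal DF averaged over all look directions equals the number of sensors minus the number of null constraints, can be checked numerically in a discretized setting. The sketch below replaces the continuous diffuse field by a finite set of random steering vectors, an assumption made purely for illustration; the averaging identity still holds exactly in this discrete setting:

```python
import numpy as np

rng = np.random.default_rng(2)
M, G, K = 5, 200, 2
# Random "steering vectors" standing in for the array response of each of
# G discrete directions; no structure is assumed, matching the theorem.
D = (rng.standard_normal((M, G)) + 1j * rng.standard_normal((M, G))) / np.sqrt(2)
Gamma = D @ D.conj().T / G                  # diffuse-noise coherence (discrete)
N = D[:, :K]                                # K fixed null directions

Gi_D = np.linalg.solve(Gamma, D)            # Gamma^{-1} d for every direction
NGN = N.conj().T @ Gi_D[:, :K]              # N^H Gamma^{-1} N

# Optimal DF with nulls, via the Schur-complement form
# DF(d0) = d0^H G^{-1} d0 - b^H (N^H G^{-1} N)^{-1} b,  b = N^H G^{-1} d0.
df = []
for g in range(G):
    d0, gid0 = D[:, g], Gi_D[:, g]
    b = N.conj().T @ gid0
    df.append((d0.conj() @ gid0 - b.conj() @ np.linalg.solve(NGN, b)).real)
print(np.mean(df))                          # equals M - K up to round-off
```

The average is exactly M − K here by the same trace identity that underlies the continuous result: summing dᴴΓ⁻¹d over the direction set gives G·tr(I_M), and the null-projection term contributes G·K.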
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper, we apply a novel strategy based on ideas of compressed sensing. The relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data, but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained through finding its sparsest representation in the time-domain: that is, through computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method, such as the nonstationarity-based estimator by Gannot or blind source separation. Second, frequencies are determined for which the RTF estimate appears to be accurate. Third, the RTF is reconstructed through solving a weighted l1 convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted. It shows that, in most situations, the proposed method is capable of improving upon many of the conventional estimators used as the first step.
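The third step, completing an RTF measured only at reliable frequencies by seeking the sparsest relative impulse response, is an instance of l1-regularized recovery from partial DFT samples. The sketch below uses plain (unweighted) ISTA rather than the weighted SpaRSA variant proposed in the paper, and a short synthetic sparse response; all sizes and the regularization weight are illustrative:

```python
import numpy as np

def soft(x, t):
    """Complex soft-thresholding (proximal operator of the l1 norm)."""
    mag = np.abs(x)
    return np.where(mag > t, (1 - t / np.maximum(mag, 1e-12)) * x, 0)

# Toy setup: a sparse "relative impulse response" observed only through an
# incomplete set of DFT (RTF) samples -- a stand-in for the frequencies at
# which the RTF estimate is deemed reliable.
n = 64
x_true = np.zeros(n)
x_true[[3, 10, 25]] = [1.0, -0.8, 0.5]
F = np.fft.fft(np.eye(n)) / np.sqrt(n)       # normalized DFT matrix
rng = np.random.default_rng(3)
idx = rng.choice(n, size=48, replace=False)  # "reliable" frequency bins
A, y = F[idx], F[idx] @ x_true

# ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1; step size 1 is valid
# since A is a row-subset of a unitary matrix, so ||A||_2 <= 1.
lam, x = 5e-3, np.zeros(n, dtype=complex)
for _ in range(3000):
    x = soft(x + A.conj().T @ (y - A @ x), lam)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(rel_err)
```

The sparse impulse response is recovered accurately from the incomplete RTF; the paper's weighted formulation additionally scales the data-fidelity term per frequency according to the reliability of each RTF sample.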
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets), and their different priorities for communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes. The first is experimentally shown to provide separation results comparable with contemporary HMM-based speech separation approaches; the second allows a reduced computational burden.
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on both real, single-channel, and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
The problem of blind and online speaker localization and separation using multiple microphones is addressed
based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is
proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer
function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a
single speaker; the entire frequency range is not required to accomplish this task. In contrast, the RTF
estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin.
Accordingly, a different statistical model is used for the two tasks. The first REM model is applied under the
assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG)
model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are
estimated using these bins. The second REM model is applied under the assumption that the speakers are
concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate
the speakers. As a result of the concurrent-speaker assumption, a more precise TF map of the speakers’
activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which are
constructed using the DOAs obtained in the previous stage. Next, using the linearly constrained minimum
variance (LCMV)-BF that utilizes the estimated RTFs, the speech signals are separated. The algorithm is
evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the
estimated DOAs and the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA
estimation and speaker separation algorithm.
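The first-stage statistical model, a MoG over TF-bin features identifying bins dominated by a single speaker, can be sketched in its batch form; the recursive (REM) variant replaces the batch sums with recursively updated sufficient statistics. The one-dimensional "DOA-like" feature and all values below are synthetic illustrations, not the paper's multichannel model:

```python
import numpy as np

# Stand-in data: a scalar "DOA-like" feature per TF bin, clustered around
# each speaker's true direction (20 and 60 degrees) plus noise.
rng = np.random.default_rng(4)
true_doas = np.array([20.0, 60.0])
obs = np.concatenate([true_doas[0] + 3 * rng.standard_normal(500),
                      true_doas[1] + 3 * rng.standard_normal(500)])

# Batch EM for a 2-component MoG over the TF-bin features.
mu = np.array([10.0, 70.0])                  # initial "DOA" guesses
var = np.array([25.0, 25.0])
pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: posterior probability that each bin belongs to each speaker
    # (the 1/sqrt(2*pi) constant cancels in the normalization).
    lik = pi * np.exp(-0.5 * (obs[:, None] - mu) ** 2 / var) / np.sqrt(var)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate the weights, means (DOAs), and variances
    nk = post.sum(axis=0)
    pi, mu = nk / len(obs), (post * obs[:, None]).sum(axis=0) / nk
    var = (post * (obs[:, None] - mu) ** 2).sum(axis=0) / nk
print(np.round(np.sort(mu), 1))
```

The posteriors `post` play the role of the TF activity map: bins with a posterior close to one for some component are the single-speaker-dominated bins used for the DOA estimates.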
This paper presents a new dataset of measured multichannel Room Impulse Responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate and visualize the data, as well as baseline methods for echo-related tasks.
The gain achieved by a superdirective beamformer operating in a diffuse noise-field is significantly higher than the gain attainable with conventional delay-and-sum weights. A classical result states that for a compact linear array consisting of N sensors which receives a plane-wave signal from the endfire direction, the optimal superdirective gain approaches N^2. It has been noted that in the near-field regime higher gains can be attained. The gain can increase, in theory, without bound for increasing wavelength or decreasing source-receiver distance. We aim to address the phenomenon of near-field superdirectivity in a comprehensive manner. We derive the optimal performance for the limiting case of an infinitesimal-aperture array receiving a spherical-wave signal. This is done with the aid of a sequence of linear transformations. The resulting gain expression is a polynomial, which depends on the number of sensors employed, the wavelength, and the source-receiver distance. The resulting gain curves are optimal and outperform weights corresponding to other superdirectivity methods. The practical case of a finite aperture array is discussed. We present conditions for which the gain of such an array would approach that predicted by the theory of the infinitesimal case. The white noise gain (WNG) metric of robustness is shown to increase in the near-field regime.
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of an additional short latency. Unlike classical target tracking, we apply a non-Bayesian approach, which does not make assumptions with respect to the target trajectories, except for assuming a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach. The new method is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated using an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the regular causal REM.
This paper develops a semi-supervised algorithm to address the challenging multi-source localization problem in a noisy and reverberant environment, using a spherical harmonics domain source feature, the relative harmonic coefficients. We present comprehensive research of this source feature, including (i) an illustration confirming its sole dependence on the source position, (ii) a feature estimator in the presence of noise, and (iii) a feature selector exploiting its inherent directivity over space. Source features at varied spherical harmonic modes, representing unique characterizations of the soundfield, are fused by Multi-Mode Gaussian Process modeling. Based on the unifying model, we then formulate the mapping function revealing the underlying relationship between the source feature(s) and position(s) using a Bayesian inference approach. The issue of overlapped components is addressed by a pre-processing technique performing overlapped frame detection, which in turn reduces this challenging problem to single source localization. It is highlighted that this data-driven method has a strong potential to be implemented in practice because only a limited number of labeled measurements is required. We evaluate the proposed algorithm using simulated recordings between multiple speakers in diverse environments, and extensive results confirm improved performance in comparison with state-of-the-art methods. Additional assessments using real-life recordings further prove the effectiveness of the method, even under unfavorable circumstances with severe source overlapping.
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides a good noise reduction performance and preserves the binaural cues of the desired source, it does not allow control of the reduction of the interfering sources and distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming at merging the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications, which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they can provide accurate estimates at an SNR 10 dB lower than TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping of an indoor environment.
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominant speakers, a spectral mask is computed and is utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require knowledge of the speech, reverberation, and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: the first is a non-blocking-based estimator, which jointly estimates the PSDs of the speech, reverberation, and noise components; the second is a blocking-based estimator, which blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare the two methods, the corresponding Cramér-Rao bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals shows that the proposed estimators outperform competing estimators and greatly reduce the effect of reverberation and noise.
Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that sources do not become simultaneously active. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP results in a null steering toward the RTF of interest, by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal to noise ratio (SNR) and signal to interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, and for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model, but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT), and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, obtaining an estimate of the STFT magnitude of the source signal. Experiments regarding both speech enhancement and automatic speech recognition are conducted, which demonstrate that the proposed method can effectively suppress reverberation, even in the difficult case of a moving speaker.
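The MINT idea above, designing multichannel inverse filters whose combined output is a unit impulse, can be sketched in its basic time-domain least-squares form. This is a toy illustration with short random channels, not the paper's online STFT-magnitude variant; the helper `conv_matrix` and the filter lengths are choices made here for the example.

```python
import numpy as np

def conv_matrix(h, n_cols):
    """Convolution (Toeplitz) matrix: conv_matrix(h, n) @ g == np.convolve(h, g)."""
    H = np.zeros((len(h) + n_cols - 1, n_cols))
    for j in range(n_cols):
        H[j:j + len(h), j] = h
    return H

def mint_inverse_filters(h1, h2, n_g):
    """Least-squares MINT: find g1, g2 with h1*g1 + h2*g2 ~ a unit impulse."""
    H = np.hstack([conv_matrix(h1, n_g), conv_matrix(h2, n_g)])
    d = np.zeros(len(h1) + n_g - 1)
    d[0] = 1.0                                  # target: a delta (exact equalization)
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return g[:n_g], g[n_g:]

rng = np.random.default_rng(0)
h1, h2 = rng.standard_normal(8), rng.standard_normal(8)  # toy 2-channel impulse responses
g1, g2 = mint_inverse_filters(h1, h2, n_g=7)
eq = np.convolve(h1, g1) + np.convolve(h2, g2)           # should be close to a delta
```

With two channels of length L and inverse filters of length L-1 the system is square, so random channels (almost surely free of common zeros) are equalized essentially exactly.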
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with a time-varying acoustic channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state-space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational expectation-maximization (VEM) algorithm which uses a variant of the multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable to blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common-zeros problem of the CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we only use the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
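The speaker-counting step via eigenvalue decay can be illustrated on synthetic data. Here a surrogate frame-correlation matrix is built directly from randomly drawn per-frame speaker probabilities (standing in for the real correlation matrix between time frames); the 1% eigenvalue threshold is an arbitrary choice for the toy.

```python
import numpy as np

rng = np.random.default_rng(1)
J, T = 3, 200                                   # 3 speakers, 200 time frames
# synthetic per-frame speaker probabilities: each frame dominated by one speaker
P = rng.dirichlet(alpha=[0.2] * J, size=T)      # shape (T, J), rows sum to 1
W = P @ P.T                                     # surrogate frame-correlation matrix
evals = np.linalg.eigvalsh(W)[::-1]             # eigenvalues in descending order
n_src = int(np.sum(evals > 0.01 * evals[0]))    # count via eigenvalue decay
```

Since W has rank J by construction, exactly J eigenvalues stand out; on real data the decay is softer, which is where the threshold choice matters.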
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities: high-dimensional acoustic features representing the full reverberant acoustic channels, and low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
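A minimal version of the resulting EKF can be sketched with a hand-coded TDOA observation model. This toy uses a static source and a plain random-walk propagation model in place of the learned, data-driven one; the microphone geometry, noise levels, and step count are all illustrative assumptions.

```python
import numpy as np

c = 343.0                                    # speed of sound [m/s]
pairs = [((0.0, 0.0), (4.0, 0.0)),           # two microphone pairs [m]
         ((0.0, 0.0), (0.0, 4.0))]

def h(x):
    """Observation model: TDOA of each microphone pair for source position x."""
    return np.array([(np.linalg.norm(x - np.array(a))
                      - np.linalg.norm(x - np.array(b))) / c for a, b in pairs])

def H_jac(x):
    """Jacobian of h at x: difference of unit vectors, scaled by 1/c."""
    rows = []
    for a, b in pairs:
        da, db = x - np.array(a), x - np.array(b)
        rows.append((da / np.linalg.norm(da) - db / np.linalg.norm(db)) / c)
    return np.array(rows)

rng = np.random.default_rng(2)
x_true = np.array([2.5, 1.0])                # static source position
x_est, P = np.array([2.2, 1.3]), np.eye(2)   # deliberately offset initial guess
Q = 1e-6 * np.eye(2)                         # random-walk propagation noise
R = (1e-5) ** 2 * np.eye(2)                  # TDOA measurement noise (10 us std)

for _ in range(50):
    z = h(x_true) + 1e-5 * rng.standard_normal(2)
    P = P + Q                                # predict (random-walk propagation)
    Hk = H_jac(x_est)
    S = Hk @ P @ Hk.T + R
    K = P @ Hk.T @ np.linalg.inv(S)
    x_est = x_est + K @ (z - h(x_est))       # measurement update
    P = (np.eye(2) - K @ Hk) @ P
```

The estimate converges toward the true source position because the two non-parallel microphone pairs make the position locally observable from the TDOA readings.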
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which obviously increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method based on human hearing mechanisms and a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of the multi-path conditions. The proposed method is evaluated using both simulated data and real-life recordings.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic and unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex-Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
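The E-step/M-step alternation at the heart of such algorithms can be shown in its simplest form: EM for a two-component one-dimensional Gaussian mixture. This is a generic stand-in, not the paper's multichannel model, and all data values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic data: two Gaussian clusters standing in for two latent sources
x = np.concatenate([rng.normal(-2.0, 0.5, 400), rng.normal(3.0, 0.8, 600)])

mu, sig, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(100):
    # E-step: posterior probability of each component for every sample
    pdf = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    gamma = pdf / pdf.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances from the posteriors
    Nk = gamma.sum(axis=0)
    w = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sig = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
```

With well-separated clusters the estimated means settle near the true values of -2 and 3, and the weights near the 0.4/0.6 class proportions.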
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test to be used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) of the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The problem of single source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed by prerecorded measurements collected in advance, and consists of a limited number of labelled measurements, attached with corresponding positions, and a larger number of unlabelled measurements from unknown locations. Further information about the enclosure characteristics or the microphone positions is not required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation and noise levels.
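The core Gaussian-process regression step, a posterior mean under an RBF covariance, can be sketched for a single "view"; the paper's pairwise-microphone covariance construction and ML hyperparameter optimization are omitted, and the kernel length-scale, noise level, and toy feature-to-position map below are arbitrary assumptions for the example.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between row-sample sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 4.0, size=(60, 1))                 # toy 1-D "features"
y = np.sin(X[:, 0]) + 0.01 * rng.standard_normal(60)    # noisy "positions"

K = rbf(X, X) + 1e-4 * np.eye(len(X))                   # kernel matrix + noise variance
alpha = np.linalg.solve(K, y)

X_test = np.array([[1.0], [2.0], [3.0]])
y_pred = rbf(X_test, X) @ alpha                         # GP posterior mean
```

The predictions track the smooth underlying map closely at interior test points, which is the behavior the localization mapping relies on.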
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multi-speaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: i) a multi-speaker linearly constrained minimum variance (LCMV) beamformer (BF), and ii) a subsequent multi-speaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers (as manifested by the respective acoustic transfer functions (ATFs)) and the noise spatial correlation matrix, while the second stage exploits the speakers' power spectral density matrix to reduce the residual noise at the output of the first stage. The output vector of the multi-speaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multi-speaker case is also derived given the multi-speaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multi-speaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
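The first-stage multi-speaker LCMV beamformer has the familiar closed form W = R^{-1} C (C^H R^{-1} C)^{-1}. A quick numerical check with random surrogates for the ATF matrix and noise covariance (placeholders, not measured quantities) confirms the distortionless constraints W^H C = I.

```python
import numpy as np

rng = np.random.default_rng(5)
M, J = 6, 2                                  # 6 microphones, 2 speakers
C = rng.standard_normal((M, J)) + 1j * rng.standard_normal((M, J))  # ATF surrogate
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)               # Hermitian positive-definite noise covariance

# multi-speaker LCMV: W = R^{-1} C (C^H R^{-1} C)^{-1}
Ri_C = np.linalg.solve(R, C)
W = Ri_C @ np.linalg.inv(C.conj().T @ Ri_C)

# column j of W passes speaker j with unit gain and nulls the other speaker
constraints = W.conj().T @ C                 # should equal the identity matrix
```

Each beamformer output then feeds the second-stage Wiener postfilter, which handles the residual noise the spatial constraints cannot remove.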
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: a) the acoustic impulse response model, b) the spatial filter design criterion, c) the parameter estimation algorithm, and d) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using relative transfer functions (RTFs) for the sources, which are ratios of acoustic transfer functions (ATFs) between any microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, not resorting to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized for ensuring convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we propose to estimate RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone as before, we suggest to use the output signals of LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an interframe spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions.
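Before the DP-RTF refinements, the plain RTF between two channels can be estimated as the ratio of cross- to auto-power spectral densities. The noise-free sketch below verifies this against the true frequency response; the FIR channel and Welch parameters are arbitrary choices for the example, and the paper's interframe spectral subtraction for noisy conditions is not included.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(6)
fs = 16000
x1 = rng.standard_normal(200000)             # reference-microphone signal
h = np.array([1.0, 0.5, -0.3, 0.2])          # toy relative impulse response
x2 = signal.lfilter(h, 1.0, x1)              # second microphone (noise-free)

f, Pxx = signal.welch(x1, fs=fs, nperseg=512)        # auto-PSD of the reference
_, Pxy = signal.csd(x1, x2, fs=fs, nperseg=512)      # cross-PSD between channels
rtf_est = Pxy / Pxx                                  # RTF estimate per frequency bin

_, rtf_true = signal.freqz(h, worN=f, fs=fs)         # ground truth for comparison
```

Averaging over many Welch segments makes the cross/auto ratio converge to the channel's frequency response; in noise, the plain ratio becomes biased, which motivates the DP-RTF construction.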
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array comprises two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending towards more effective noise suppression). Since changes in the array's position correspond to the desired speaker's movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user's head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections, and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation component. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab, with reverberation times set to 0.36 s and 0.61 s, and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix and jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as: high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, that involves smoothness constraints of possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted while new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
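The generalized cross-correlation baseline mentioned above, in its PHAT-weighted form, fits in a few lines; the signal lengths and the integer delay below are toy values chosen for the example.

```python
import numpy as np

def gcc_phat(x1, x2):
    """GCC-PHAT estimate of the delay of x2 relative to x1, in samples."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)                             # phase carries the delay
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT weighting
    cc = np.concatenate([cc[-(n // 2):], cc[:n // 2]])   # reorder lags to -n/2..n/2-1
    return int(np.argmax(np.abs(cc))) - n // 2

rng = np.random.default_rng(7)
s = rng.standard_normal(4096)
delay = 23
x1 = np.concatenate([s, np.zeros(64)])
x2 = np.concatenate([np.zeros(delay), s, np.zeros(64 - delay)])
```

The PHAT normalization discards magnitude and keeps only phase, which sharpens the correlation peak; its sensitivity in strong reverberation is exactly what motivates the learned alternatives above.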
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely-used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adopt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, is merely controlling the white noise gain (WNG) of the beamformer. We also show, that in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
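The affine projection algorithm (APA), which the letter shows the KFB adaptation reduces to in static scenarios, can be sketched for FIR system identification with a colored (AR(1)) input; the filter length, projection order, and step size are illustrative choices, and the scenario is noise-free.

```python
import numpy as np

rng = np.random.default_rng(8)
L, K = 8, 4                           # filter length, projection order
w_true = rng.standard_normal(L)       # unknown system to identify
w = np.zeros(L)                       # adaptive filter estimate
mu, delta = 0.5, 1e-4                 # step size and regularization

# colored input (the setting where APA improves on NLMS): AR(1) noise
u = np.zeros(5000)
for t in range(1, len(u)):
    u[t] = 0.9 * u[t - 1] + rng.standard_normal()

for t in range(L + K, len(u)):
    # stack the K most recent input regressors as rows of X (shape K x L)
    X = np.array([u[t - k - L + 1:t - k + 1][::-1] for k in range(K)])
    d = X @ w_true                    # noise-free desired responses
    e = d - X @ w                     # a priori errors
    # APA update: project the error back through the regularized Gram matrix
    w = w + mu * X.T @ np.linalg.solve(X @ X.T + delta * np.eye(K), e)
```

With K larger than the AR order of the input, the update effectively decorrelates the regressors, and the estimate converges to the true system.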
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques, to reveal the state-of-the-art techniques, and to obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue, and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. Therefore, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source, and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields an overall noise reduction performance very similar to that of the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions the result equals the number of sensors minus the number of null constraints. This result holds, regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
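The stated average-DF result can be sanity-checked numerically in the unconstrained case (zero nulls), where the optimal DF for a look direction d is d^H Γ^{-1} d, with Γ the diffuse-noise coherence matrix. The sketch below builds an empirical Γ from the same sampled directions used for the average, which makes the identity exact; with the analytic sinc coherence it holds in the limit of dense direction sampling (array geometry below is an arbitrary toy choice):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 4000                                   # sensors, sampled directions
pos = rng.standard_normal((M, 3))                # arbitrary 3-D array geometry (in wavelengths)

u = rng.standard_normal((N, 3))                  # uniformly distributed propagation directions
u /= np.linalg.norm(u, axis=1, keepdims=True)

D = np.exp(2j * np.pi * (u @ pos.T))             # row n = steering vector for direction n
Gamma = (D.T @ D.conj()) / N                     # empirical diffuse-noise coherence matrix

# optimal DF for each look direction d (no null constraints): DF = d^H Gamma^{-1} d
Ginv = np.linalg.inv(Gamma)
df = np.einsum('nm,mk,nk->n', D.conj(), Ginv, D).real
print(df.mean())                                 # = M (here 5), for any array geometry
```

The average equals tr(Γ^{-1}·Γ) = M regardless of the sensor positions, which is the zero-null special case of the theorem in the abstract.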
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper, we apply a novel strategy based on ideas of compressed sensing. The relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data, but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained by finding its sparsest representation in the time domain: that is, by computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method, such as the nonstationarity-based estimator by Gannot or blind source separation. Second, the frequencies at which the RTF estimate appears to be accurate are determined. Third, the RTF is reconstructed by solving a weighted l1 convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted, showing that, in most situations, the proposed method improves upon the conventional estimators used as its first step.
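The third step above is a weighted l1 program. As a toy illustration (synthetic sparse impulse response, hypothetical sizes, and plain ISTA in place of the paper's SpaRSA variant, both of which solve the same convex program), one can recover a sparse time-domain response from a subset of "reliable" DFT coefficients:

```python
import numpy as np

def ista_weighted_l1(F, y, lam, w, n_iter=1000):
    """Solve min_x 0.5*||F x - y||^2 + lam * sum(w * |x|) for a real, sparse x
    with plain ISTA (gradient step followed by soft-thresholding)."""
    L = np.linalg.norm(F, 2) ** 2                      # Lipschitz constant of the gradient
    x = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = (F.conj().T @ (F @ x - y)).real         # gradient w.r.t. the real-valued x
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(2)
n, k = 128, 4
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # sparse impulse response

Fdft = np.fft.fft(np.eye(n))                           # full DFT matrix
reliable = rng.choice(n, 60, replace=False)            # frequencies where the RTF is trusted
x_hat = ista_weighted_l1(Fdft[reliable], Fdft[reliable] @ x0, lam=0.5, w=np.ones(n))
print(np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))  # small relative error
```

The weights `w` allow down-weighting taps that are believed active, mirroring the weighted l1 formulation in the abstract; here they are left uniform.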
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that includes the direct path and some early reflections, and a late reverberant component that includes all the late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation and ambient noise is presented. A multi-microphone minimum mean square error estimator is used to obtain a spatially filtered version of the early speech component. The estimator is constructed as a minimum variance distortionless response (MVDR) beamformer (BF) followed by a postfilter (PF). Three unique design features characterize the proposed method. First, the MVDR BF is implemented in a special structure, named the nonorthogonal generalized sidelobe canceller (NO-GSC). Compared with the more conventional orthogonal GSC structure, the new structure allows for a simpler implementation of the GSC blocks for various MVDR constraints. Second, in contrast to earlier works, relative early transfer functions (RETFs) are used in the MVDR criterion rather than either the entire RTFs or only the direct path of the desired speech signal. An estimator of the RETFs is proposed as well. Third, the late reverberation and noise are processed by both the beamforming stage and the PF stage. Since the relative power of the noise and the late reverberation varies with the frame index, a computationally efficient method for the required matrix inversion is proposed to circumvent this cumbersome mathematical operation. The algorithm was evaluated and compared with two alternative multichannel algorithms and one single-channel algorithm using simulated data and data recorded in a room with a reverberation time of 0.5 s for various source-microphone array distances (1-4 m) and several signal-to-noise levels.
The processed signals were tested using two commonly used objective measures, namely perceptual …
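For background on the GSC structure referenced in the abstract above: the classical (orthogonal) GSC realizes the MVDR beamformer exactly, which the following sketch verifies numerically with random data. Note that the paper's NO-GSC is a nonorthogonal variant of this structure, which this sketch does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 5
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)     # RTF / steering vector
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)                               # noise covariance (Hermitian PSD)

# direct MVDR solution: w = R^{-1} h / (h^H R^{-1} h)
Ri_h = np.linalg.solve(R, h)
w_mvdr = Ri_h / (h.conj() @ Ri_h)

# GSC form: fixed beamformer w_q, blocking matrix B (B^H h = 0), noise canceller w_a
w_q = h / (h.conj() @ h)                                     # satisfies h^H w_q = 1
Q, _ = np.linalg.qr(h.reshape(-1, 1), mode='complete')
B = Q[:, 1:]                                                 # orthonormal basis of null(h^H)
w_a = np.linalg.solve(B.conj().T @ R @ B, B.conj().T @ R @ w_q)
w_gsc = w_q - B @ w_a

print(np.allclose(w_gsc, w_mvdr))                            # True: the two forms coincide
```

The GSC turns the constrained MVDR problem into an unconstrained least-squares problem for `w_a`, which is what makes adaptive implementations convenient.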
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
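The E-step of the scheme above applies a Kalman filter to extract the clean signal. A minimal scalar sketch of that filtering step, using a stand-in AR(1) signal model rather than the paper's time-varying acoustic system:

```python
import numpy as np

rng = np.random.default_rng(4)
T, a, q, r = 2000, 0.95, 0.1, 1.0          # length, AR(1) coefficient, process var, noise var

# simulate a "clean signal" s and its noisy observation x
s = np.zeros(T)
for t in range(1, T):
    s[t] = a * s[t - 1] + rng.normal(scale=np.sqrt(q))
x = s + rng.normal(scale=np.sqrt(r), size=T)

# Kalman filter: in a recursive EM scheme this E-step extracts the clean sample,
# after which the M-step would refresh the (here assumed known) system parameters
s_hat, P = np.zeros(T), 1.0
for t in range(1, T):
    s_pred, P_pred = a * s_hat[t - 1], a * a * P + q      # predict
    K = P_pred / (P_pred + r)                             # Kalman gain
    s_hat[t] = s_pred + K * (x[t] - s_pred)               # update with the new observation
    P = (1 - K) * P_pred

print(np.mean((s_hat - s) ** 2), np.mean((x - s) ** 2))   # filtered error < raw error
```

In the paper's setting the state and the acoustic system are both unknown; the M-step re-estimates the system from the filter output, which this scalar sketch omits.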
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets), and their different priorities for communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
The problem of blind and online speaker localization and separation using multiple microphones is addressed based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a single speaker and does not require the entire frequency range. In contrast, the RTF estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin. Accordingly, a different statistical model is used for each of the two tasks. The first REM model is applied under the assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG) model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are estimated using these bins. The second REM model is applied under the assumption that the speakers are concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate the speakers. As a result of the concurrent-speakers assumption, a more precise TF map of the speakers' activity is obtained. The RTFs are estimated using the outputs of the MCWF beamformer (BF), which are constructed using the DOAs obtained in the previous stage. Next, the speech signals are separated using a linearly constrained minimum variance (LCMV) BF that utilizes the estimated RTFs. The algorithm is evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the estimated DOAs and of the separation capabilities demonstrates significant improvement w.r.t. a baseline DOA estimation and speaker separation algorithm.
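In batch form and one dimension, the MoG-based E/M recursions used in the first stage look as follows (synthetic DOA-like per-bin features with hypothetical speaker positions; the paper's recursive, multichannel version differs in detail):

```python
import numpy as np

rng = np.random.default_rng(5)
# synthetic per-bin features: each TF bin is dominated by one of two speakers,
# whose (hypothetical) DOAs are -40 and +20 degrees
true_mu = np.array([-40.0, 20.0])
z = rng.integers(0, 2, size=3000)
obs = true_mu[z] + rng.normal(scale=5.0, size=3000)

# EM for a two-component MoG (unknown means/weights, shared known variance)
mu, pi, var = np.array([-10.0, 10.0]), np.array([0.5, 0.5]), 25.0
for _ in range(50):
    # E-step: posterior probability that each bin is dominated by each speaker
    ll = -(obs[:, None] - mu) ** 2 / (2 * var) + np.log(pi)
    resp = np.exp(ll - ll.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means and priors
    Nk = resp.sum(axis=0)
    mu = (resp * obs[:, None]).sum(axis=0) / Nk
    pi = Nk / len(obs)

print(np.sort(mu))   # ≈ [-40, 20]
```

The responsibilities `resp` are the soft TF-bin-to-speaker associations: bins with a near-unity responsibility for one component are the "single dominant speaker" bins from which the DOAs are estimated.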
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
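The vertex-detection idea in the second method can be illustrated with the successive projection algorithm, a standard convex-geometry tool for finding simplex vertices (shown here on toy data; the paper's sequential detection procedure may differ in detail):

```python
import numpy as np

def successive_projection(V, k):
    """Find k simplex vertices among the rows of V: repeatedly take the row with
    the largest norm, then project all rows onto the orthogonal complement of
    the chosen vertex direction (successive projection algorithm)."""
    X = V.copy().astype(float)
    idx = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(X, axis=1)))
        idx.append(j)
        u = X[j] / np.linalg.norm(X[j])
        X = X - np.outer(X @ u, u)          # remove the vertex direction
    return idx

rng = np.random.default_rng(6)
# correlation-vector analogue: frames are convex combinations of 3 vertices,
# and a few "single-speaker" frames sit exactly at the vertices
verts = np.array([[10.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 6.0]])
W = rng.dirichlet(np.ones(3) * 0.5, size=200)
V = W @ verts
V[:3] = verts                                # plant the pure (single-speaker) frames

found = sorted(successive_projection(V, 3))
print(found)                                 # → [0, 1, 2]
```

A convex combination can never exceed the norm of its largest vertex, which is why the max-norm row at each step is guaranteed to be a vertex, i.e., a frame dominated by a single speaker.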
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using the factor graph statistical model. In our model, the speakers can be located in one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs and then applying a TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm on both simulated data and real-life measurements recorded in our laboratory.
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides a good noise reduction performance and preserves the binaural cues of the desired source, it does not allow controlling the reduction of the interfering sources, and it distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming at merging the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals, as well as the results of a perceptual listening test, show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominant speakers, a spectral mask is computed and is utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed deterministic unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex-Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes:
a) the acoustic impulse response model,
b) the spatial filter design criterion,
c) the parameter estimation algorithm,
d) optional postfiltering.
We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using the relative transfer functions (RTFs) of the sources, which are the ratios of the acoustic transfer functions (ATFs) between each microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, which do not resort to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized to ensure convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we propose to estimate the RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone as before, we use the output signals of the LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single-microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes: the first is experimentally shown to provide separation results comparable with contemporary HMM-based speech separation approaches, and the second allows a reduced computational burden.
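At its core, the diffusion framework embeds the data with the leading eigenvectors of a row-stochastic affinity operator. A minimal sketch on toy data standing in for log-spectra (illustrative only, not the paper's FHMM estimation schemes):

```python
import numpy as np

rng = np.random.default_rng(7)
# two well-separated clusters standing in for the log-spectra of two "states"
X = np.vstack([rng.normal(0.0, 0.3, (40, 10)), rng.normal(3.0, 0.3, (40, 10))])

# diffusion framework: Gaussian affinities -> row-stochastic diffusion operator
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / np.median(d2))                # kernel scale set to the median distance
P = W / W.sum(axis=1, keepdims=True)           # Markov (diffusion) matrix

# the leading nontrivial eigenvector is the first diffusion coordinate
eigval, eigvec = np.linalg.eig(P)
order = np.argsort(-eigval.real)
psi1 = eigvec[:, order[1]].real                # skip the trivial constant eigenvector

# the 1-D diffusion coordinate separates the two clusters
print(abs(psi1[:40].mean() - psi1[40:].mean()))  # large vs. within-cluster spread
```

In the paper's setting, such diffusion coordinates of the mixture's log-spectra provide the low-dimensional geometry from which the HMM/FHMM parameters are recovered non-iteratively.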
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending towards more effective noise suppression). Since changes in the array's position correspond to the desired speaker's movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user's head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints for both the desired and the interfering sources. Besides its capabilities to reduce interference and noise, it also enables preserving the binaural cues of both the desired and interfering sources, hence making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., based on the amount of interference and noise reduction and the presence of estimation errors of the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions, the result equals the number of sensors minus the number of null constraints. This result holds regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
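The two-stage decomposition rests on a matrix identity: the multichannel MMSE (Wiener) estimator of the speech vector equals a multispeaker LCMV beamformer followed by a Wiener postfilter on its outputs. A small numerical check of this identity, with random placeholders instead of measured ATFs and PSDs:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 6, 2   # microphones, speakers

H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))   # ATFs
Phi_s = np.diag(rng.uniform(0.5, 2.0, N)).astype(complex)            # speakers' PSD matrix
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = B @ B.conj().T + np.eye(M)                                   # noise PSD matrix

# Direct MMSE (multichannel Wiener) estimator of the speech vector
W_mmse = Phi_s @ H.conj().T @ np.linalg.inv(H @ Phi_s @ H.conj().T + Phi_v)

# Stage 1: multispeaker LCMV beamformer
A = H.conj().T @ np.linalg.solve(Phi_v, H)             # H^H Phi_v^{-1} H
W_lcmv = np.linalg.solve(A, H.conj().T @ np.linalg.inv(Phi_v))
# Stage 2: multispeaker Wiener postfilter on the beamformer outputs
G = Phi_s @ np.linalg.inv(Phi_s + np.linalg.inv(A))

print(np.allclose(W_mmse, G @ W_lcmv))  # True: MMSE = postfilter after LCMV
```

The identity also reflects the sufficiency statement in the abstract: the LCMV outputs carry all the information the MMSE estimator needs.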
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown
and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the
reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator, which jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator, which blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao Bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals, shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
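As a hedged illustration of the Fisher scoring iteration used by such estimators, the toy example below fits two unknown scalar PSDs of a Gaussian covariance that is linear in the parameters (a diffuse-coherence component plus sensor noise). The sinc coherence model and all numeric values are assumptions for the sketch, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)
M, T = 4, 20000                    # microphones, time frames (snapshots)

# Known spatial structure: a sinc diffuse-coherence matrix (assumed model)
# for the noise field, identity for the sensor noise; unknown scalar PSDs.
pos = np.arange(M) * 0.1
Gamma = np.sinc(2.0 * np.abs(pos[:, None] - pos[None, :]))
A = [Gamma, np.eye(M)]
theta_true = np.array([2.0, 0.5])  # (diffuse PSD, sensor-noise PSD)

R_true = theta_true[0] * A[0] + theta_true[1] * A[1]
x = np.linalg.cholesky(R_true) @ rng.standard_normal((M, T))
S = x @ x.T / T                    # sample covariance of the snapshots

theta = np.array([1.0, 1.0])       # initial guess
for _ in range(30):                # Fisher scoring iterations
    R = theta[0] * A[0] + theta[1] * A[1]
    Ri = np.linalg.inv(R)
    # Score and Fisher information for a covariance linear in theta
    score = np.array([0.5 * T * np.trace(Ri @ Ak @ Ri @ S - Ri @ Ak) for Ak in A])
    FIM = 0.5 * T * np.array([[np.trace(Ri @ Ak @ Ri @ Al) for Al in A] for Ak in A])
    theta = np.maximum(theta + np.linalg.solve(FIM, score), 1e-6)  # keep PSDs positive

print(theta)                       # close to theta_true
```

Each iteration replaces the Hessian of Newton's method by the Fisher information, which is why no closed-form maximizer is needed.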
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable for blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from the common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency
aliasing of the signals and the common-zeros problem of CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we only use the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
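The time-domain cross-relation that the paper starts from states that x1 * h2 = x2 * h1, since both sides equal s * h1 * h2 and convolution commutes; the difficulty noted above is that near-common zeros of long impulse responses make this relation ill-conditioned in practice. A toy check with short random filters (lengths chosen arbitrarily for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.standard_normal(1000)      # source signal
h1 = rng.standard_normal(16)       # two toy room impulse responses
h2 = rng.standard_normal(16)

x1 = np.convolve(s, h1)            # microphone signals
x2 = np.convolve(s, h2)

# Cross-relation: x1 * h2 = s * h1 * h2 = x2 * h1
lhs = np.convolve(x1, h2)
rhs = np.convolve(x2, h1)
print(np.allclose(lhs, rhs))       # True
```

Blind identification exploits this: with h1, h2 unknown, the relation becomes a homogeneous linear system in the filter coefficients, which the paper solves in the STFT domain on the much shorter CTFs.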
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) of the two ML estimators of the reverberation PSD, the Cramér-Rao Bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation components. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times of 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that includes the direct path and some early reflections, and a late reverberant component that includes all the late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation and ambient noise is presented. A multi-microphone minimum mean square error estimator is used to obtain a spatially filtered version of the early speech component. The estimator is constructed as a minimum variance distortionless response (MVDR) beamformer (BF) followed by a postfilter (PF). Three unique design features characterize the proposed method. First, the MVDR BF is implemented in a special structure, named the nonorthogonal generalized sidelobe canceller (NO-GSC). Compared with the more conventional orthogonal GSC structure, the new structure allows for a simpler implementation of the GSC blocks for various MVDR constraints. Second, in contrast to earlier works, relative early transfer functions (RETFs) are used in the MVDR criterion rather than either the entire RTFs or only the direct-path of the desired speech signal. An estimator of the RETFs is proposed as well. Third, the late reverberation and noise are processed by both the beamforming stage and the PF stage. Since the relative power of the noise and the late reverberation varies with the frame index, a computationally efficient method for the required matrix inversion is proposed to circumvent the cumbersome mathematical operation. The algorithm was evaluated and compared with two alternative multichannel algorithms and one single-channel algorithm using simulated data and data recorded in a room with a reverberation time of 0.5 s for various source-microphone array distances (1-4 m) and several signal-to-noise levels.
The processed signals were tested using two commonly used objective measures, namely perceptual …
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
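The E-step described above applies a Kalman filter to extract a new clean-signal sample given the current system estimate. As a minimal stand-in (a scalar random-walk state and toy noise variances, not the paper's acoustic model), a single Kalman filtering pass looks like:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 500
q, r = 0.01, 1.0                   # process / observation noise variances (toy values)

# Simulate a slowly varying "clean" state and its noisy observation
s = np.cumsum(np.sqrt(q) * rng.standard_normal(T))
y = s + np.sqrt(r) * rng.standard_normal(T)

# Scalar Kalman filter: plays the role of the E-step, producing one clean
# sample per frame; in the full scheme the M-step would then update q, r
s_hat, P = 0.0, 1.0
err_kf, err_raw = 0.0, 0.0
for t in range(T):
    P = P + q                      # predict
    K = P / (P + r)                # Kalman gain
    s_hat = s_hat + K * (y[t] - s_hat)   # update with the new observation
    P = (1.0 - K) * P
    err_kf += (s_hat - s[t]) ** 2
    err_raw += (y[t] - s[t]) ** 2

print(err_kf < err_raw)            # True: filtering beats the raw observations
```

In the recursive EM scheme, each such filtered sample immediately feeds the maximization step that refreshes the acoustic-system estimate, which is what enables online operation and tracking of moving speakers.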