Abstract—In this paper we propose a data-driven approach for multiple speaker tracking in reverberant enclosures. The method comprises two stages. The first stage performs single-source localization using semi-supervised learning on multiple manifolds. The second stage uses unsupervised time-varying maximum likelihood estimation for tracking. The feature vectors, used by both stages, are the relative transfer functions (RTFs), which are known to be related to source positions. The number of sources is assumed to be known, while the microphone positions are unknown. In the training stage, a large database of RTFs is given. A small percentage of the data is attributed with exact positions
and the rest is assumed to be unlabelled, i.e. the respective position is unknown. Then, a nonlinear, manifold-based, mapping function between the RTFs and the source positions is inferred. Applying this mapping function to all unlabelled RTFs constructs a dense grid of localized sources. In the test phase, this RTF grid serves as the centroids for a Mixture of Gaussians (MoG) model. The MoG parameters are estimated by applying recursive expectation-maximization (EM). The EM procedure relies on the sparsity and intermittency of the speech signals. A comprehensive
simulation study, with two overlapping speakers at various reverberation levels, demonstrates the usability of the proposed scheme, achieving a high level of accuracy compared to a baseline method using a simpler propagation model.
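As a rough illustration of the test-phase recursion, the sketch below updates the weights of a MoG over a fixed grid of centroids, one feature vector at a time. The shared component variance, the forgetting factor, and the function interface are illustrative assumptions; the actual algorithm in the paper is more elaborate.

```python
import numpy as np

def recursive_em_mog_weights(features, centroids, sigma=0.1, gamma=0.9):
    """Recursively update the weights of a MoG whose centroids are a fixed grid
    of localized sources, one test-phase feature vector at a time.
    features: iterable of (D,) vectors; centroids: (K, D) grid; gamma: forgetting
    factor; sigma: assumed shared component standard deviation."""
    K = centroids.shape[0]
    weights = np.full(K, 1.0 / K)                  # uniform prior over grid points
    for x in features:
        # E-step: responsibility of each grid point for the current sample
        d2 = np.sum((centroids - x) ** 2, axis=1)
        log_lik = -0.5 * d2 / sigma ** 2
        post = weights * np.exp(log_lik - log_lik.max())
        post /= post.sum()
        # recursive M-step: exponentially weighted update of the mixture weights
        weights = gamma * weights + (1.0 - gamma) * post
    return weights                                 # peaks indicate active source positions
```

Peaks of the returned weight vector point to the grid positions currently occupied by speakers; the sparsity and intermittency of speech are what allow distinct peaks to emerge over time.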
The problem of blind and online speaker localization and separation using multiple microphones is addressed
based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is
proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer
function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a single speaker, and hence does not require the entire frequency range. In contrast, the RTF
estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin.
Accordingly, a different statistical model is used for the two tasks. The first REM model is applied under the
assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG)
model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are
estimated using these bins. The second REM model is applied under the assumption that the speakers are
concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate
the speakers. As a result of the concurrent-speaker assumption, a more precise TF map of the speakers’
activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which are
constructed using the DOAs obtained in the previous stage. Next, using the linearly constrained minimum
variance (LCMV)-BF that utilizes the estimated RTFs, the speech signals are separated. The algorithm is
evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the
estimated DOAs and the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA
estimation and speaker separation algorithm.
This paper presents a new dataset of measured multichannel Room Impulse Responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate and visualize the data, as well as with baseline methods for echo-related tasks.
In this study, we present a deep neural network-based online multi-speaker localization algorithm that uses a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is assumed to be dominated by a single speaker and hence associated with a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.
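The following sketch shows, under the W-disjoint orthogonality assumption, how per-TF-bin DOA decisions (such as those produced by a network) can be turned into masks and applied to a reference STFT; the network itself is omitted, and the tolerance and interface are illustrative assumptions.

```python
import numpy as np

def separate_by_doa_masks(stft_ref, doa_per_bin, target_doas, tol_deg=10.0):
    """Assign each TF bin to the closest target DOA and apply the resulting
    binary masks to a reference-microphone STFT (angle wrap-around ignored).
    stft_ref, doa_per_bin: (F, T) arrays; target_doas: list of DOAs in degrees."""
    separated = []
    for doa in target_doas:
        mask = (np.abs(doa_per_bin - doa) <= tol_deg).astype(float)
        separated.append(mask * stft_ref)          # masked STFT of one speaker
    return separated
```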
The gain achieved by a superdirective beamformer operating in a diffuse noise field is significantly higher than the gain attainable with conventional delay-and-sum weights. A classical result states that for a compact linear array consisting of N sensors which receives a plane-wave signal from the endfire direction, the optimal superdirective gain approaches N². It has been noted that in the near-field regime higher gains can be attained. The gain can increase, in theory, without bound for increasing wavelength or decreasing source-receiver distance. We aim to address the phenomenon of near-field superdirectivity in a comprehensive manner. We derive the optimal performance for the limiting case of an infinitesimal-aperture array receiving a spherical-wave signal. This is done with the aid of a sequence of linear transformations. The resulting gain expression is a polynomial, which depends on the number of sensors employed, the wavelength, and the source-receiver distance. The resulting gain curves are optimal and outperform weights corresponding to other superdirectivity methods. The practical case of a finite-aperture array is discussed. We present conditions for which the gain of such an array would approach that predicted by the theory of the infinitesimal case. The white noise gain (WNG) metric of robustness is shown to increase in the near-field regime.
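A minimal numerical sketch of the quantity under study, assuming the standard spherically isotropic (diffuse) coherence model and the optimal beamformer whose weights are proportional to the inverse coherence matrix times the steering vector; the closed-form polynomial expressions for the infinitesimal-aperture limit derived in the paper are not reproduced here.

```python
import numpy as np

def diffuse_noise_gain(freq, positions, source_pos=None, c=343.0):
    """Gain of the optimal beamformer w ~ Gamma^{-1} d in a diffuse noise field,
    defined as output SNR over the SNR at the first (reference) sensor:
    G = d^H Gamma^{-1} d / |d_0|^2.  positions: (N, 3) sensor coordinates;
    source_pos=None gives a far-field plane wave from the endfire (+x) direction,
    otherwise a near-field spherical wave from the given point."""
    k = 2 * np.pi * freq / c
    if source_pos is None:
        d = np.exp(-1j * k * positions[:, 0])                      # plane wave
    else:
        r = np.linalg.norm(np.asarray(source_pos) - positions, axis=1)
        d = np.exp(-1j * k * r) / r                                # spherical wave
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    gamma = np.sinc(2 * freq * dist / c)                           # diffuse coherence
    num = np.real(d.conj() @ np.linalg.solve(gamma + 1e-9 * np.eye(len(d)), d))
    return num / np.abs(d[0]) ** 2
```

For a short endfire line array the far-field value approaches N², and moving the source into the near field raises the gain further, which is the phenomenon analyzed in the paper (the small diagonal loading only keeps the numerics stable at very small apertures).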
This paper addresses the problem of a moving agent, e.g., a robot, equipped with both receivers and a source, that tracks its own location and simultaneously estimates the locations of multiple plane reflectors. We assume a noisy knowledge of the robot’s movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot’s own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the hybrid Cramér-Rao bound (HCRB) as the number of observations increases. This result implies that for the examples tested in simulation, the HCRB is an asymptotically tight bound and that the EKF is an optimal estimator. Whether this property holds in general remains an open question.
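For reference, a generic extended Kalman filter step of the kind employed here is sketched below; the specific motion model, regulated kinematics, and echo-based observation model of the paper are abstracted into user-supplied functions.

```python
import numpy as np

def ekf_step(x, P, u, z, f, F_jac, h, H_jac, Q, R):
    """One extended Kalman filter iteration for a state x stacking, e.g., the
    robot position and the plane-reflector parameters.  f/h are the motion and
    observation models, F_jac/H_jac their Jacobians, Q/R the process and
    measurement noise covariances, u the (noisy) odometry input, z the new
    observation (e.g., echo TOAs with known labels)."""
    x_pred = f(x, u)                               # predict with the kinematic model
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    H = H_jac(x_pred)                              # linearize the observation model
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - h(x_pred))           # correct with the innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```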
Localization in reverberant environments remains an open challenge. Recently, supervised learning approaches have demonstrated very promising results in addressing reverberation. However, even with large data volumes, the number of labels available for supervised learning in such environments is usually small. We propose to address this issue with a semi-supervised learning (SSL) approach, based on deep generative modeling. Our chosen deep generative model, the variational autoencoder (VAE), is trained to generate the phase of relative transfer functions (RTFs) between microphones. In parallel, a direction of arrival (DOA) classifier network based on RTF-phase is also trained. The joint generative and discriminative model, deemed VAE-SSL, is trained using labeled and unlabeled RTF-phase sequences. In learning to generate and classify the sequences, the VAE-SSL extracts the physical causes of the RTF-phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. This facilitates effective <italic>end-to-end</italic> operation of the VAE-SSL, which requires minimal preprocessing of RTF-phase. VAE-SSL is compared with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully supervised CNNs. The approaches are compared using data from two real acoustic environments – one of which was recently obtained at Technical University of Denmark specifically for our study. We find that VAE-SSL can outperform the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples which capture the physics of the acoustic environment. Thus, the generative modeling in VAE-SSL provides a means of interpreting the learned representations. To the best of our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling.
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived by formulating a linear programming (LP) problem that maximizes the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target tracking, we apply a non-Bayesian approach, which does not make assumptions with respect to the target trajectories, except for assuming a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach. The new method is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated using an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the common causal REM.
Objective: The present study implements an automatic method of assessing arousal in vocal data as well as dynamic system models to explore intrapersonal and interpersonal affect dynamics within psychotherapy and to determine whether these dynamics are associated with treatment outcomes. Method: The data of 21,133 mean vocal arousal observations were extracted from 279 therapy sessions in a sample of 30 clients treated by 24 therapists. Before and after each session, clients self-reported their well-being level, using the Outcome Rating Scale. Results: Both clients’ and therapists’ vocal arousal showed intrapersonal dampening. Specifically, although both therapists and clients departed from their baseline, their vocal arousal levels were “pulled” back to these baselines. In addition, both clients and therapists exhibited interpersonal dampening. Specifically, both the clients’ and the therapists’ levels of arousal were “pulled” toward the other party’s arousal level, and clients were “pulled” by their therapists’ vocal arousal toward their own baseline. These dynamics exhibited a linear change over the course of treatment: whereas interpersonal dampening decreased over time, there was an increase in intrapersonal dampening over time. In addition, higher levels of interpersonal dampening were associated with better session outcomes. Conclusions: These findings demonstrate the advantages of using automatic vocal measures to capture nuanced intrapersonal and interpersonal affect dynamics in psychotherapy and demonstrate how these dynamics are associated with treatment gains.
This paper develops a semi-supervised algorithm to address the challenging multi-source localization problem in a noisy and reverberant environment, using a spherical harmonics domain source feature of the relative harmonic coefficients. We present a comprehensive study of this source feature, including (i) an illustration confirming its sole dependence on the source position, (ii) a feature estimator in the presence of noise, (iii) a feature selector exploiting its inherent directivity over space. Source features at varied spherical harmonic modes, representing unique characterizations of the soundfield, are fused by Multi-Mode Gaussian Process modeling. Based on the unifying model, we then formulate the mapping function revealing the underlying relationship between the source feature(s) and position(s) using a Bayesian inference approach. The issue of overlapped components is addressed by a pre-processing technique that performs overlapped frame detection, which in turn reduces this challenging problem to single-source localization. It is highlighted that this data-driven method has a strong potential to be implemented in practice because only a limited number of labeled measurements is required. We evaluate the proposed algorithm using simulated recordings of multiple speakers in diverse environments, and extensive results confirm improved performance in comparison with state-of-the-art methods. Additional assessments using real-life recordings further prove the effectiveness of the method, even under unfavorable circumstances with severe source overlapping.
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using the factor graph statistical model. In our model, the speakers can be located in one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are used in turn for separating the sources, by directing a beamformer towards the estimated DOAs, and then applying a TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm in both simulated data and real-life measurements recorded in our
laboratory.
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides a good noise reduction performance and preserves the binaural cues of the desired source, it does not allow controlling the reduction of the interfering sources and distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming at merging the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals as well as the results of a perceptual listening test show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
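A minimal sketch of the partial-noise-estimation idea described above: an MVDR output, re-referenced to each ear, is mixed with the corresponding noisy reference signal. The additional interference constraints of the BLCMV are omitted, and the single mixing parameter eta is a simplified stand-in for the design parameters discussed in the paper.

```python
import numpy as np

def bmvdr_n(Y, d, Rn, ref_left, ref_right, eta=0.2):
    """MVDR outputs for the left/right reference microphones, each mixed with the
    corresponding noisy reference signal (partial noise estimation).
    Y: (M, T) STFT bin over time, d: (M,) RTF/steering vector, Rn: (M, M) noise
    covariance, eta in [0, 1] trades noise reduction against binaural-cue
    preservation of the background noise."""
    Rn_inv_d = np.linalg.solve(Rn, d)
    w0 = Rn_inv_d / (d.conj() @ Rn_inv_d)          # unit response to the RTF d
    outputs = []
    for ref in (ref_left, ref_right):
        w_ref = w0 * np.conj(d[ref])               # distortionless w.r.t. microphone `ref`
        s_hat = w_ref.conj() @ Y                   # binaural MVDR output, shape (T,)
        outputs.append((1.0 - eta) * s_hat + eta * Y[ref])
    return outputs
```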
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOA) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they can provide accurate estimates at an SNR 10 dB lower than TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping an indoor environment.
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Due to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: one global, based on a broad frequency band, and one local, based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominating speakers, a spectral mask is computed and is utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown
and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require the knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: The first is a non-blocking-based estimator, which jointly estimates the PSDs of the speech, reverberation and noise components. The second MLE is a blocking-based estimator, which blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao Bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals shows that the proposed estimators outperform competing estimators, and greatly reduce the effect of reverberation and noise.
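Because the likelihood is maximized iteratively, a generic Fisher scoring loop of the kind referred to above is sketched below; the score and Fisher information of the speech/reverberation/noise model are assumed to be supplied by the caller.

```python
import numpy as np

def fisher_scoring(theta0, score_fn, fim_fn, n_iter=20, step=1.0, floor=1e-10):
    """Generic Fisher scoring loop: theta <- theta + step * FIM(theta)^{-1} * score(theta).
    score_fn/fim_fn are assumed to return the score vector and Fisher information
    matrix of the chosen speech/reverberation/noise PSD model; the PSDs are kept
    non-negative by flooring."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        s = score_fn(theta)
        fim = fim_fn(theta)
        theta = theta + step * np.linalg.solve(fim, s)
        theta = np.maximum(theta, floor)
    return theta
```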
Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that sources do not become active simultaneously. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP results in a null steering toward the RTF of interest, by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal-to-noise ratio (SNR) and signal-to-interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, which are often based in statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, and for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, and
using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT), and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, obtaining an estimate of the STFT magnitude of the source signal. Experiments regarding both speech enhancement and automatic speech recognition are conducted, which demonstrate that the proposed method can effectively suppress reverberation, even for the difficult case of a moving speaker.
This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform speech separation and enhancement in the short-time Fourier transform domain using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, the CTF has far fewer taps. Consequently, it requires a lower computational cost and is sometimes more robust against filter perturbations. We propose three methods: i) For the multisource case, the multi-channel inverse filtering method, i.e. the multiple input/output inverse theorem (MINT), is exploited in the CTF domain; ii) A beamforming-like multichannel inverse filtering method applying single-source MINT and using power minimization, which is suitable whenever the source CTFs are not all known; and iii) A basis pursuit method, where the sources are recovered by minimizing their ℓ1-norm to impose spectral sparsity, while the ℓ2-norm fitting cost between the microphone signals and the mixing model is constrained to be lower than a tolerance. The noise can be reduced by setting this tolerance at the noise power level. Experiments under various acoustic conditions are carried out to evaluate and compare the three proposed methods. Comparison with four baseline methods (beamforming-based, two time-domain inverse filters, and time-domain Lasso) shows the applicability of the proposed methods.
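Method iii) can be illustrated with a small convex-optimization sketch (using cvxpy as an assumed dependency, with a generic mixing matrix H standing in for the CTF convolution model): the ℓ1-norm of the source coefficients is minimized subject to an ℓ2 fitting constraint.

```python
import numpy as np
import cvxpy as cp

def basis_pursuit_denoise(X, H, tol):
    """Recover source coefficients s by minimizing their l1-norm subject to an
    l2 fitting constraint against a generic mixing matrix H (standing in for the
    CTF convolution model).  X: (M,) stacked observations for one frequency,
    H: (M, N) mixing matrix, tol: tolerance set around the noise power level."""
    s = cp.Variable(H.shape[1], complex=True)
    problem = cp.Problem(cp.Minimize(cp.norm1(s)),
                         [cp.norm2(H @ s - X) <= tol])
    problem.solve()
    return s.value
```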
A recursive maximum-likelihood algorithm (RML) is proposed that can be used when both the observations and the hidden data have continuous values and are statistically dependent between different time samples. The algorithm recursively approximates the probability density functions of the observed and hidden data by analytically computing the integrals with respect to the state variables, where the parameters are updated using gradient steps. A full convergence proof is given, based on the ordinary differential equation approach, which shows that the algorithm converges to a local minimum of the Kullback-Leibler divergence between the true and the estimated parametric probability density functions; a result which is useful even for a misspecified parametric model. Compared to other RML algorithms proposed in the literature, this contribution extends the state-space model and provides a theoretical analysis in a non-trivial statistical model that has not been analyzed before. We further extend the RML analysis to constrained parameter estimation problems. Two examples, including nonlinear state-space models, are given to highlight this contribution.
The IEEE Audio and Acoustic Signal Processing Technical Committee (AASP TC) is one of 13 TCs in the IEEE Signal Processing Society. Its mission is to support, nourish, and lead scientific and technological development in all areas of AASP. These areas are currently seeing increased levels of interest and significant growth, providing a fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning (ML), from consumer electronics devices to blue-sky research, this scope encompasses countless technical challenges and many hot topics. The TC has roughly 30 elected volunteer members drawn equally from leading academic and industrial organizations around the world, unified by the common aim of offering their expertise in the service of the scientific community.
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with a time-varying audio channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state-space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of the multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable for blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from the common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common zeros problem of CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we use only the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
Blind source separation (BSS) is addressed, using a novel data-driven approach, based on a well-established probabilistic model. The proposed method is specifically designed for separation of multichannel audio mixtures. The algorithm relies on spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered by the eigenvalue decay, and the eigenvectors form a simplex of the speakers’ probabilities. Time frames dominated by each of the speakers are identified exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental
study, showing high separation capabilities in various reverberation conditions.
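The core simplex idea can be sketched as follows: the time frames are embedded with the leading eigenvectors of the frame-correlation matrix and the simplex vertices are located with a successive-projection-style search, which is a simplified stand-in for the convex-geometry procedure of the paper.

```python
import numpy as np

def find_dominant_frames(corr, n_speakers):
    """Embed the time frames with the leading eigenvectors of the frame-correlation
    matrix and locate the simplex vertices (frames dominated by a single speaker)
    with a successive-projection-style search.  corr: (T, T) correlation matrix."""
    _, eigvecs = np.linalg.eigh(corr)
    E = eigvecs[:, -n_speakers:]                   # (T, n_speakers) simplex embedding
    vertices, R = [], E.copy()
    for _ in range(n_speakers):
        idx = int(np.argmax(np.linalg.norm(R, axis=1)))   # farthest remaining frame
        vertices.append(idx)
        v = R[idx] / np.linalg.norm(R[idx])
        R = R - np.outer(R @ v, v)                 # project out the found direction
    return vertices                                # one dominated frame per speaker
```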
Distributed acoustic tracking estimates the trajectories of source positions using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from the direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can thereby be probabilistically triangulated by exploiting the spatial diversity of all nodes.
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities corresponding
to high-dimensional acoustic features representing the full reverberant acoustic channels as well as low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the
observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which obviously increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions.
In this study, we propose to combine a method based on human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm.
Rather than using phase difference measurements that are modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of the multi-path conditions. The proposed
method is evaluated using both simulated data and real-life recordings.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (differently defined for each alternative) is estimated in the E-step, while in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is assumed to be deterministic and unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the
multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend the DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test to be used for source localization. Experiments carried out using both simulation data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér-Rao Bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The problem of single source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed by prerecorded measurements collected in advance, and consists of a limited number of labelled measurements, annotated with corresponding positions, and a larger number of unlabelled measurements from unknown locations. Further information about the enclosure
characteristics or the microphone positions is not required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters
of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation
and noise levels.
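A minimal sketch of the underlying regression step is given below, with a plain RBF covariance standing in for the multi-microphone covariance function of the paper and without the recursive adaptation mechanism.

```python
import numpy as np

def gp_predict_positions(feat_train, pos_train, feat_test, length=1.0, noise=1e-3):
    """Gaussian-process regression from measurement-based features to source
    positions with a plain RBF covariance.  feat_*: (n, D) feature matrices,
    pos_train: (n_train, 2 or 3) labelled positions; returns posterior-mean
    positions for the test features."""
    def rbf(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-0.5 * np.maximum(d2, 0.0) / length**2)
    K = rbf(feat_train, feat_train) + noise * np.eye(len(feat_train))
    K_star = rbf(feat_test, feat_train)
    return K_star @ np.linalg.solve(K, pos_train)
```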
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum
mean square error (MMSE) estimator for the multi-speaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: i) a multi-speaker linearly constrained minimum variance (LCMV)
beamformer (BF), and ii) a subsequent multi-speaker Wiener postfilter. The first stage separates and enhances the signals of
the individual speakers by utilizing the spatial characteristics of the speakers (as manifested by the respective acoustic transfer
functions (ATFs)) and the noise spatial correlation matrix, while the second stage exploits the speakers’ power spectral density
matrix to reduce the residual noise at the output of the first stage. The output vector of the multi-speaker LCMV BF is proven to be
the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral
amplitude estimator for the multi-speaker case is also derived given the multi-speaker LCMV BF outputs. The performance
evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically
verified that the multi-speaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when
compared with the single-speaker postfilter.
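The two-stage decomposition can be written compactly, as sketched below under the assumption of known ATFs, noise covariance, and speakers' PSD matrix for a single frequency bin; the log-spectral-amplitude variant is omitted.

```python
import numpy as np

def mmse_two_stage(Y, A, Rn, Phi_s):
    """Multi-speaker LCMV beamformer followed by a multi-speaker Wiener postfilter.
    Y: (M, T) microphone STFT bin, A: (M, K) ATF/RTF matrix of the K speakers,
    Rn: (M, M) noise covariance, Phi_s: (K, K) speakers' PSD matrix."""
    Rn_inv_A = np.linalg.solve(Rn, A)
    B = A.conj().T @ Rn_inv_A                      # K x K
    W_lcmv = Rn_inv_A @ np.linalg.inv(B)           # stage i): LCMV weights (M x K)
    Z = W_lcmv.conj().T @ Y                        # separated but still noisy outputs
    Rn_out = np.linalg.inv(B)                      # residual noise covariance at the outputs
    G = Phi_s @ np.linalg.inv(Phi_s + Rn_out)      # stage ii): multi-speaker Wiener postfilter
    return G @ Z                                   # MMSE estimates of the K speakers
```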
Speech enhancement and separation are core problems in audio signal processing, with commercial applications
in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those
of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting
with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated
and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between
these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established
and recent techniques according to four transverse axes:
a) the acoustic impulse response model
b) the spatial filter design criterion
c) the parameter estimation algorithm
d) optional postfiltering.
We conclude this overview paper by providing a list
of software and data resources and by discussing perspectives and
future trends in the field.
As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops and hearing aids, using them collaboratively for acoustic processing is a promising platform for emerging applications. These devices make up an acoustic sensor network comprised of nodes, i.e. distributed devices equipped with microphone arrays, a communication unit and a processing unit. Algorithms for speaker separation and localization using such a network require a precise knowledge of the nodes’ locations and orientations. To acquire this knowledge, a recently introduced approach proposed a combined direction of arrival (DoA) and time difference of arrival (TDoA) target function for off-line calibration with dedicated recordings. This paper proposes an extension of this approach to a novel online method with two new features: First, by employing an evolutionary algorithm on incremental measurements, it is online and fast enough for real-time application. Second, by using the sparse spike representation computed in a cochlear model for TDoA estimation, the amount of information shared between the nodes by transmission is reduced while the accuracy is increased. The proposed approach is able to calibrate an acoustic sensor network online during a meeting in a reverberant conference room.
The challenge of blindly resynchronizing the data acquisition processes in a wireless acoustic sensor network (WASN) is addressed in this paper. The sampling rate offset (SRO) is precisely modeled as a time scaling. The applicability of a wideband correlation processor for estimating the SRO, even in a reverberant and multiple-source environment, is presented. An explicit expression for the ambiguity function, which in our case involves time scaling of the received signals, is derived by applying truncated band-limited interpolation. We then propose the recursive band-limited interpolation (RBI) algorithm for recursive SRO estimation. A complete resynchronization scheme utilizing the RBI algorithm, in parallel with the SRO compensation module, is presented. The resulting resynchronization method operates in the time domain in a sequential manner and is thus capable of tracking a potentially time-varying SRO. We compare the performance of the proposed RBI algorithm to other available methods in a simulation study. The importance of resynchronization in a beamforming application is demonstrated by both a simulation study and experiments with a real WASN. Finally, we present an experimental study evaluating the expected SRO level between typical data acquisition devices.
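A coarse sketch of the time-scaling view of the SRO: one channel is rescaled for a grid of candidate offsets and the candidate maximizing the correlation with the reference channel is selected. This is a simplified stand-in for the wideband correlation processor and the RBI recursion, not the algorithm itself.

```python
import numpy as np

def estimate_sro_ppm(x_ref, x_async, candidates_ppm):
    """Rescale the asynchronous channel in time for a grid of candidate sampling
    rate offsets (in parts per million) and return the candidate maximizing the
    cross-correlation with the reference channel."""
    n = np.arange(len(x_async), dtype=float)
    best_ppm, best_score = 0.0, -np.inf
    for ppm in candidates_ppm:
        scaled = np.interp(n * (1.0 + ppm * 1e-6), n, x_async)  # time scaling
        score = np.max(np.abs(np.correlate(x_ref, scaled, mode="full")))
        if score > best_score:
            best_ppm, best_score = ppm, score
    return best_ppm
```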
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using relative transfer functions (RTFs) for the sources, which are ratios of acoustic transfer functions (ATFs) between any microphone and a reference microphone. The latter are usually estimated by methods which rely on single-talk time segments where only a single source is active and on reliable knowledge of the source activity. Two novel algorithms for estimation of RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, which do not resort to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized for ensuring convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we propose to estimate RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone as before, we suggest using the output signals of LCMV-BFs from the previous iteration as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes. The first is experimentally shown to provide separation results comparable to contemporary HMM-based speech separation approaches, while the second allows a reduced computational burden.
In this paper, we present a single-microphone speech enhancement algorithm. A hybrid approach is proposed merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm is executed in two phases, the training phase, which does not recur, and the test phase. First, the noise-free speech log-power spectral density is modeled as an MoG, representing the phoneme-based diversity in the speech signal. A DNN is then trained with a phoneme-labeled database of clean speech signals for phoneme classification with mel-frequency cepstral coefficients as the input features. In the test phase, a noisy utterance of untrained speech is processed. Given the phoneme classification results of the noisy speech utterance, a speech presence probability (SPP) is obtained using both the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while, simultaneously, the noise estimate is updated. The discriminative DNN maintains the continuity of the speech and the generative phoneme-based MoG preserves the speech spectral structure. An extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms. We show that we obtain a significant improvement over previous methods in terms of speech quality measures. Finally, we analyze the contribution of all components of the proposed algorithm, indicating their combined importance.
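The generic mechanism of SPP-controlled attenuation with a simultaneously updated noise estimate can be sketched as follows; the gains and the SPP here are simplified placeholders, whereas the paper derives them from its MoG and DNN models.

```python
import numpy as np

def spp_enhance(noisy_psd, spp, alpha=0.9, g_min=0.1):
    """Apply SPP-controlled attenuation to a noisy power spectrogram while
    recursively updating a noise estimate.  noisy_psd, spp: (F, T) arrays of
    noisy power spectra and speech presence probabilities."""
    noise_psd = noisy_psd[:, 0].copy()
    enhanced = np.empty_like(noisy_psd)
    for t in range(noisy_psd.shape[1]):
        # update the noise estimate mainly where speech is likely absent
        target = (1.0 - spp[:, t]) * noisy_psd[:, t] + spp[:, t] * noise_psd
        noise_psd = alpha * noise_psd + (1.0 - alpha) * target
        wiener = np.maximum(1.0 - noise_psd / np.maximum(noisy_psd[:, t], 1e-12), 0.0)
        gain = spp[:, t] * wiener + (1.0 - spp[:, t]) * g_min     # SPP-controlled gain
        enhanced[:, t] = gain * noisy_psd[:, t]
    return enhanced
```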
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an interframe spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions.
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm whose purpose is to acquire the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (which lends itself to more effective noise suppression). Since the array moves together with the wearer, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user's head. Consequently, the algorithm must be capable of rapid adaptation to such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections, and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberation component. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab with reverberation times of 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that a significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix and jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as high reverberation or low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, which involves smoothness constraints on possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted as new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
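The following sketch illustrates the manifold-regularization idea in its simplest batch form, assuming a Gaussian-kernel graph over labeled and unlabeled acoustic features and a quadratic data-fit term. The paper's algorithm is adapted online and differs in its details; all names and parameter values here are illustrative.

```python
import numpy as np

def manifold_regression(features, labels, labeled_idx, sigma=1.0, lam=0.1):
    """Semi-supervised regression with graph-Laplacian (manifold) regularization.

    features: (n, d) acoustic feature vectors, labeled and unlabeled.
    labels:   (m,) known positions for the labeled subset.
    labeled_idx: indices of the labeled samples within `features`.
    Returns predicted values for all n samples.
    """
    n = features.shape[0]
    # Gaussian-kernel affinity and graph Laplacian over all samples.
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W
    # Diagonal selector of labeled samples and the corresponding target vector.
    J = np.zeros((n, n))
    y = np.zeros(n)
    J[labeled_idx, labeled_idx] = 1.0
    y[labeled_idx] = labels
    # Closed-form solution of  min_f ||J(f - y)||^2 + lam * f^T L f.
    return np.linalg.solve(J + lam * L, J @ y)
```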
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints for both the desired and the interfering sources. Besides its capability to reduce interference and noise, it also enables preservation of the binaural cues of both the desired and the interfering sources, making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., the desired amount of interference and noise reduction and the presence of estimation errors in the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adapt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, merely controls the white noise gain (WNG) of the beamformer. We also show that, in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, an evaluation campaign designed to evaluate such speech enhancement (SE) and ASR techniques, to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real recordings, both single-channel and multichannel. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To this end, in this paper we propose two extensions of the binaural MWF: the binaural MWF with RTF preservation (MWF-RTF), aiming to preserve the RTF of the interfering source, and the binaural MWF with interference rejection (MWF-IR), aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields an overall noise reduction performance very similar to that of the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and its ability to suppress diffuse noise, which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF depends upon the specific look and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look and null directions. In this contribution we show that when the optimal DF is averaged over all look directions, the result equals the number of sensors minus the number of null constraints. This result holds regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (such as ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper, we apply a novel strategy based on ideas of compressed sensing. The relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data, but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained by finding its sparsest representation in the time domain, that is, by computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method, such as the nonstationarity-based estimator by Gannot or through blind source separation. Second, frequencies are determined for which the RTF estimate appears to be accurate. Third, the RTF is reconstructed by solving a weighted l1 convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted; it shows that, in most situations, the proposed method improves upon the conventional estimators used in the first step.
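A minimal sketch of the third step is given below, under the simplifying assumption that the reliable RTF samples lie on DFT bins of the sought impulse response. It uses a plain ISTA iteration for the weighted-l1 program rather than the SpaRSA variant proposed in the paper, and all names and parameter values are illustrative.

```python
import numpy as np

def reconstruct_rir(rtf_vals, freq_idx, n_taps, weights, lam=1e-2, n_iter=500):
    """Recover a sparse relative impulse response from an RTF known only at
    the bins in `freq_idx`, via ISTA for  0.5*||F h - r||^2 + lam*||w * h||_1.
    """
    # Partial DFT matrix mapping time-domain taps to the reliable bins.
    F = np.exp(-2j * np.pi * np.outer(freq_idx, np.arange(n_taps)) / n_taps)
    step = 1.0 / np.linalg.norm(F, 2) ** 2        # 1 / Lipschitz constant of F^H F
    h = np.zeros(n_taps)
    for _ in range(n_iter):
        grad = np.real(F.conj().T @ (F @ h - rtf_vals))   # gradient of the quadratic term
        z = h - step * grad
        thr = step * lam * weights
        h = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # weighted soft threshold
    return h
```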
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that includes the direct path and some early reflections, and a late reverberant component that includes all the late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation and ambient noise is presented. A multi-microphone minimum mean square error estimator is used to obtain a spatially filtered version of the early speech component. The estimator is constructed as a minimum variance distortionless response (MVDR) beamformer (BF) followed by a postfilter (PF). Three unique design features characterize the proposed method. First, the MVDR BF is implemented in a special structure, named the nonorthogonal generalized sidelobe canceller (NO-GSC). Compared with the more conventional orthogonal GSC structure, the new structure allows for a simpler implementation of the GSC blocks for various MVDR constraints. Second, in contrast to earlier works, relative early transfer functions (RETFs) are used in the MVDR criterion rather than either the entire RTFs or only the direct path of the desired speech signal. An estimator of the RETFs is proposed as well. Third, the late reverberation and noise are processed by both the beamforming stage and the PF stage. Since the relative power of the noise and the late reverberation varies with the frame index, a computationally efficient method for the required matrix inversion is proposed to circumvent this cumbersome mathematical operation. The algorithm was evaluated and compared with two alternative multichannel algorithms and one single-channel algorithm using simulated data and data recorded in a room with a reverberation time of 0.5 s for various source-microphone array distances (1-4 m) and several signal-to-noise levels. The processed signals were tested using two commonly used objective measures, namely perceptual …
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets) and their different priorities for communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington’s scheme, is a Newton-based recursion. In this work we also extend Titterington’s method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappé and Moulines’ scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.
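For reference, the two recursions have the following general, unconstrained forms; this is only a sketch in common notation, and the constrained variants developed in the paper differ in the details.

```latex
% Titterington-type recursion (Newton-like), with step size \gamma_{n+1} and
% complete-data Fisher information I_c:
\theta_{n+1} = \theta_n + \gamma_{n+1}\, I_c^{-1}(\theta_n)\,
               \nabla_{\theta} \log f\!\left(y_{n+1};\theta_n\right).

% Cappe--Moulines recursion: stochastic approximation of the E-step on the
% complete-data sufficient statistics, followed by the usual M-step mapping:
\hat{s}_{n+1} = \hat{s}_n + \gamma_{n+1}\!\left(
                \mathrm{E}\!\left[\, S(x_{n+1},y_{n+1}) \mid y_{n+1};\theta_n \right]
                - \hat{s}_n \right),
\qquad
\theta_{n+1} = \bar{\theta}\!\left(\hat{s}_{n+1}\right).
```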
The beampattern of an array consisting of N elements is determined by the beampatterns of the individual elements, their placement, and the weights assigned to them. For each look direction, it is possible to design weights that maximize the array directivity factor (DF). For the case of an array of omnidirectional elements using optimal weights, it has been shown that the average DF over all look directions equals the number of elements. The validity of this theorem is not dependent on array geometry. We generalize this theorem by means of an alternative proof. The chief contributions of this letter are: (a) a compact and direct proof, (b) generalization to arrays containing directional elements (such as cardioids and dipoles), and (c) generalization to arbitrary wave propagation models. A discussion of the theorem’s ramifications on array processing is provided.
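The averaging theorem lends itself to a simple numerical check. The sketch below assumes omnidirectional sensors, free-field steering vectors, and a spherically isotropic noise field, with illustrative names and parameter values; it averages the optimal DF of a maximum-DF beamformer with null constraints over random look directions, and by the theorem the average should approach the number of sensors minus the number of nulls (e.g., about 4 for a five-sensor array with one null).

```python
import numpy as np

def average_optimal_df(mic_pos, null_dirs, freq=2000.0, c=343.0, n_trials=2000):
    """Monte Carlo average of the optimal directivity factor over look directions.

    mic_pos: (N, 3) sensor positions in meters; null_dirs: list of unit vectors.
    """
    rng = np.random.default_rng(0)
    k = 2 * np.pi * freq / c
    # Spherically-diffuse coherence between omni sensors: sin(k d)/(k d).
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    Gamma = np.sinc(k * dist / np.pi)              # np.sinc(x) = sin(pi x)/(pi x)

    def steer(u):                                  # far-field steering vector
        return np.exp(-1j * k * mic_pos @ u)

    dfs = []
    for _ in range(n_trials):
        u = rng.normal(size=3)
        u /= np.linalg.norm(u)                     # uniform look direction on the sphere
        C = np.column_stack([steer(u)] + [steer(v) for v in null_dirs])
        g = np.zeros(C.shape[1]); g[0] = 1.0       # distortionless look, zeros at nulls
        Gi_C = np.linalg.solve(Gamma, C)
        w = Gi_C @ np.linalg.solve(C.conj().T @ Gi_C, g)   # max-DF (LCMV) weights
        dfs.append(1.0 / np.real(w.conj() @ Gamma @ w))    # DF, since w^H d = 1
    return np.mean(dfs)                                    # ~ N - len(null_dirs)
```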
Beamforming with wireless acoustic sensor networks (WASNs) has recently drawn the attention of the research community. As the number of microphones grows, it is difficult, and in some applications impossible, to determine their layout beforehand. A common practice in analyzing the expected performance is to utilize statistical considerations. In the current contribution, we consider applying the speech distortion weighted multi-channel Wiener filter (SDW-MWF) to enhance a desired source propagating in a reverberant enclosure where the microphones are randomly located with a uniform distribution. Two noise fields are considered, namely, multiple coherent interference signals and a diffuse sound field. Utilizing the statistics of the acoustic transfer function (ATF), we derive a statistical model for two important criteria of the beamformer (BF): the signal-to-interference ratio (SIR) and the white noise gain (WNG). Moreover, we propose reliability functions, which determine the probability that the SIR and the WNG exceed predefined levels. We verify the proposed model with an extensive simulative study.
Signal processing methods have significantly changed over the last several decades. Traditional methods were usually based on parametric statistical inference and linear filters. These frameworks have helped to develop efficient algorithms that have often been suitable for implementation on digital signal processing (DSP) systems. Over the years, DSP systems have advanced rapidly, and their computational capabilities have been substantially increased. This development has enabled contemporary signal processing algorithms to incorporate more computations. Consequently, we have recently experienced a growing interaction between signal processing and machine-learning approaches, e.g., Bayesian networks, graphical models, and kernel-based methods, whose computational burden is usually high.
This paper proposes a distributed multiple constraints generalized sidelobe canceller (GSC) for speech enhancement in an N-node fully connected wireless acoustic sensor network (WASN) comprising M microphones. Our algorithm is designed to operate in reverberant environments with constrained speakers (including both desired and competing speakers). Rather than broadcasting M microphone signals, a significant communication bandwidth reduction is obtained by performing local beamforming at the nodes and utilizing only N + P transmission channels. Each node processes its own microphone signals together with the N + P transmitted signals. The GSC-form implementation, by separating the constraints and the minimization, enables adaptation of the BF during speech-absent time segments, and relaxes the requirement of other distributed LCMV-based algorithms to re-estimate the sources' RTFs after each iteration. We provide a full convergence proof of the proposed structure to the centralized GSC-beamformer (BF). An extensive experimental study of both narrowband and (wideband) speech signals verifies the theoretical analysis.
A transient is an abrupt or impulsive sound followed by decaying oscillations, e.g., keyboard typing and door knocking. Such sounds often arise as interference in everyday applications, e.g., hearing aids, hands-free accessories, mobile phones, and conference-room devices. In this paper, we present an algorithm for single-channel transient interference suppression. The main component of the proposed algorithm is the estimation of the spectral variance of the interference. We propose a statistical model of the transient interference and combine it with non-local filtering. We exploit the unique spectral structure of the transients along with their impulsive temporal nature to distinguish them from speech. Particular attention is given to handling both short- and long-duration transients. Experimental results show that the proposed algorithm enables significant transient suppression for a variety of transient types.
In this paper, we present a supervised graph-based framework for sequential processing and apply it to the problem of transient interference suppression. Transients typically consist of an initial peak followed by decaying short-duration oscillations. Such sounds, e.g., keyboard typing and door knocking, often arise as interference in everyday applications: hearing aids, hands-free accessories, mobile phones, and conference-room devices. We describe a graph construction using a noisy speech signal and training recordings of typical transients. The main idea is to capture the transient interference structure, which may emerge from the construction of the graph. The graph parametrization is then viewed as a data-driven model of the transients and utilized to define a filter that extracts the transients from noisy speech measurements. Unlike previous transient interference suppression studies, in this work the graph is constructed in advance from training recordings. Then, the graph is extended to newly acquired measurements, providing a sequential filtering framework for noisy speech.
We address the application of the linearly constrained minimum variance (LCMV) beamformer in sensor networks. In signal processing applications, it is common to have a redundancy in the number of nodes, fully covering the area of interest. Here we consider suboptimal LCMV beamformers utilizing only a subset of the available sensors for signal enhancement applications. Scenarios with multiple desired and interfering sources in multipath environments are considered. We assume that an oracle entity determines the group of sensors participating in the spatial filtering, denoted as the active sensors. The oracle is also responsible for updating the constraint set according to either sensor or source activity and dynamics. Any update of the active sensors or of the constraint set necessitates recalculation of the beamformer and increases the power consumption. As power is a most valuable resource in sensor networks, it is important to derive efficient update schemes. In this paper, we derive procedures for adding or removing either an active sensor or a constraint from an existing LCMV beamformer. Closed-form, as well as generalized sidelobe canceller (GSC)-form, implementations are derived. These procedures use the previous beamformer to save calculations in the updating process. We analyze the computational burden of the proposed procedures and show that it is much lower than that of the straightforward calculation of the corresponding beamformers.
Modeling natural and artificial systems has played a key role in various applications and has long been a task that has drawn enormous effort. In this work, instead of exploring predefined models, we aim to identify the system degrees of freedom implicitly. This approach circumvents the dependency on a specific predefined model for a specific task or system, and enables a generic data-driven method to characterize a system based solely on its output observations. We claim that each system can be viewed as a black box controlled by several independent parameters. Moreover, we assume that the perceptual characterization of the system output is determined by these independent parameters. Consequently, by recovering the independent controlling parameters, we in fact find a generic model for the system. In this work, we propose a supervised algorithm to recover the controlling parameters of natural and artificial linear systems. The proposed algorithm relies on nonlinear independent component analysis using diffusion kernels and spectral analysis. Applying the proposed algorithm to both synthetic and practical examples shows accurate recovery of the controlling parameters.
A vector-sensor consisting of a monopole sensor collocated with orthogonally oriented dipole sensors is used for direction of arrival (DOA) estimation in the presence of an isotropic noise-field or internal device noise. A maximum likelihood (ML) DOA estimator is derived and subsequently shown to be a special case of DOA estimation by means of a search for the direction of maximum steered response power (SRP). The problem of SRP maximization with respect to a vector-sensor can be solved with a computationally inexpensive algorithm. The ML estimator achieves asymptotic efficiency and thus outperforms existing estimators with respect to the mean square angular error (MSAE) measure. The beampattern associated with the ML estimator is shown to be identical to that used by the minimum power distortionless response beamformer for the purpose of signal enhancement.
The Kalman filter is one of the most widely applied tools in the statistical signal processing field, especially in the context of causal online applications [1]. This article presents an introduction to the Kalman filter; the desired signal and its corresponding measurements are modeled, the Kalman filter is formulated and presented with an intuitive explanation of the involved equations, applications of the filter are given in the context of speech processing, and examples of two popular applications in speech enhancement and speaker tracking are provided.
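For concreteness, a textbook predict/update loop of the kind reviewed in the article is sketched below; the state-space matrices and variable names are generic placeholders rather than any specific speech-processing model.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, x0, P0):
    """Standard Kalman filter.

    y: (T, m) measurements; A, C: state-transition and measurement matrices;
    Q, R: process and measurement noise covariances; x0, P0: initial state
    mean and covariance. Returns the filtered state estimates.
    """
    x, P = x0.copy(), P0.copy()
    xs = []
    for yt in y:
        # Predict
        x = A @ x
        P = A @ P @ A.T + Q
        # Update
        S = C @ P @ C.T + R                      # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)           # Kalman gain
        x = x + K @ (yt - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        xs.append(x.copy())
    return np.asarray(xs)
```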
Particle filtering has been shown to be an effective approach to solving the problem of acoustic source localization in reverberant environments. In a reverberant environment, the direct arrival of the single source is accompanied by multiple spurious arrivals. A multiple-hypothesis model associated with these arrivals can be used to alleviate the unreliability often attributed to the acoustic source localization problem. Until recently, this multiple-hypothesis approach was only applied to bootstrap-based particle filter schemes. Recently, the extended Kalman particle filter (EPF) scheme, which allows for an improved tracking capability, was proposed for the localization problem. The EPF scheme utilizes a global extended Kalman filter (EKF) which strongly depends on prior knowledge of the correct hypotheses. Due to this, the extension of the multiple-hypothesis model to this scheme is not trivial. In this paper, the EPF scheme is adapted to the multiple-hypothesis model to track a single acoustic source in reverberant environments. Our work is supported by an extensive experimental study using both simulated data and data recorded in our acoustic lab. Various algorithms and array constellations were evaluated. The results demonstrate the superiority of the proposed algorithm in both tracking and switching scenarios. It is further shown that splitting the array into several sub-arrays improves the robustness of the estimated source location.
Enhancement of speech signals for hands-free communication systems has attracted significant research efforts in the last few decades. Still, many aspects and applications remain open and require further research. One of the important open problems is single-channel transient noise reduction. In this paper, we present a novel approach for transient noise reduction that relies on non-local (NL) neighborhood filters. In particular, we propose an algorithm for the enhancement of a speech signal contaminated by repeating transient noise events. We assume that the time duration of each reoccurring transient event is relatively short compared to speech phonemes and model the speech source as an auto-regressive (AR) process. The proposed algorithm consists of two stages. In the first stage, we estimate the power spectral density (PSD) of the transient noise by employing an NL neighborhood filter. In the second stage, we utilize the optimally modified log spectral amplitude (OM-LSA) estimator for denoising the speech using the noise PSD estimate from the first stage. Based on a statistical model for the measurements and a diffusion interpretation of NL filtering, we obtain further insight into the algorithm behavior. In particular, for a given transient noise, we determine whether estimation of the noise PSD is feasible using our approach, how to properly set the algorithm parameters, and what the expected performance of the algorithm is. An experimental study shows good results in enhancing speech signals contaminated by transient noise, such as typical household noises, construction sounds, keyboard typing, and metronome clacks.
An acoustic vector sensor provides measurements of both the pressure and particle velocity of a sound field in which it is placed. These measurements are vectorial in nature and can be used for the purpose of source localization. A straightforward approach towards determining the direction of arrival (DOA) utilizes the acoustic intensity vector, which is the product of pressure and particle velocity. The accuracy of an intensity vector based DOA estimator in the presence of noise has been analyzed previously. In this paper, the effects of reverberation upon the accuracy of such a DOA estimator are examined. It is shown that particular realizations of reverberation differ from an ideal isotropically diffuse field, and induce an estimation bias which is dependent upon the room impulse responses (RIRs). The limited knowledge available pertaining to the RIRs is expressed statistically by employing the diffuse qualities of reverberation to extend Polack's statistical RIR model. Expressions for evaluating the typical bias magnitude as well as its probability distribution are derived.
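The intensity-based estimator discussed above can be sketched in a few lines. The names are illustrative, the sign convention assumes velocity sensors oriented along the positive axes, and the reverberation-induced bias analyzed in the paper is of course not modeled here.

```python
import numpy as np

def intensity_doa(p, v):
    """Intensity-based DOA estimate from an acoustic vector sensor.

    p: (T,) pressure samples; v: (T, 3) collocated particle-velocity samples.
    Returns azimuth and elevation in degrees.
    """
    intensity = np.mean(p[:, None] * v, axis=0)   # time-averaged acoustic intensity
    ix, iy, iz = -intensity                       # intensity points away from the source
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return np.degrees(azimuth), np.degrees(elevation)
```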
We consider a bidirectional time division duplex (TDD) multiple-input multiple-output (MIMO) communication system with a time-varying channel and additive white Gaussian noise (AWGN). A blind bidirectional channel tracking algorithm, based on the projection approximation subspace tracking (PAST) algorithm, is applied in both terminals. The resulting singular value decomposition (SVD) of the channel matrix is then used to approximately diagonalize the channel. The proposed method is applied to an orthogonal frequency-division multiplexing (OFDM)-MIMO setting with a typical indoor time-domain reflection model. The computational cost of the proposed algorithm, compared with other state-of-the-art algorithms, is relatively small. The Kalman filter is utilized for establishing a benchmark for the obtainable performance of the proposed tracking algorithm. The performance degradation relative to full channel state information (CSI), due to the application of the tracking algorithm, is evaluated in terms of the average effective rate and the outage probability and compared with alternative tracking algorithms. The obtained results are also compared with a benchmark obtained by the Kalman filter with known input signal and channel characteristics. It is shown that the expected degradation in performance of frequency-domain algorithms (which do not exploit the smooth frequency response of the channel) is only minor compared with time-domain algorithms in a range of reasonable signal-to-noise ratio (SNR) levels. The bidirectional frequency-domain tracking algorithm proposed in this paper is shown to attain communication rates close to the benchmark and to outperform a competing algorithm. The paper is concluded by evaluating the proposed blind tracking method in terms of the outage probability and the symbol error rate (SER) versus SNR for binary phase shift keying (BPSK) and 4-quadrature amplitude modulation (4-QAM) constellations.
The minimum variance distortionless response (MVDR) beamformer, also known as Capon's beamformer, is widely studied in the area of speech enhancement. The MVDR beamformer can be used for both speech dereverberation and noise reduction. This paper provides new insights into the MVDR beamformer. Specifically, the local and global behavior of the MVDR beamformer is analyzed, and novel forms of the MVDR filter are derived and discussed. In earlier works it was observed that there is a tradeoff between the amount of speech dereverberation and noise reduction when the MVDR beamformer is used. Here, the tradeoff between speech dereverberation and noise reduction is analyzed thoroughly. The local and global behavior, as well as the tradeoff, is analyzed for different noise fields, such as a mixture of coherent and non-coherent noise fields, entirely non-coherent noise fields, and diffuse noise fields. It is shown that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only. The amount of noise reduction that is sacrificed when complete dereverberation is required depends on the direct-to-reverberation ratio of the acoustic impulse response between the source and the reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction. When both speech dereverberation and noise reduction are desired, the results also demonstrate that the amount of noise reduction that is sacrificed decreases when the number of microphones increases.
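As a reminder of the closed form being analyzed (in standard notation; the choice of the steering vector, full acoustic transfer function versus its direct-path component only, is what governs the dereverberation/noise-reduction tradeoff discussed above):

```latex
% MVDR (Capon) beamformer at frequency \omega, with noise PSD matrix
% \Phi_{vv}(\omega) and steering / transfer-function vector d(\omega):
\mathbf{w}_{\mathrm{MVDR}}(\omega) =
  \frac{\boldsymbol{\Phi}_{vv}^{-1}(\omega)\,\mathbf{d}(\omega)}
       {\mathbf{d}^{H}(\omega)\,\boldsymbol{\Phi}_{vv}^{-1}(\omega)\,\mathbf{d}(\omega)},
\qquad \text{subject to } \mathbf{w}^{H}\mathbf{d} = 1 .
```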
In speech communication systems, the received microphone signals are degraded by room reverberation and ambient noise that decrease the fidelity and intelligibility of the desired speaker. Reverberant speech can be separated into two components, viz. early speech and late reverberant speech. Recently, various algorithms have been developed to suppress late reverberant speech. One of the main challenges is to develop an estimator for the so-called late reverberant spectral variance (LRSV), which is required by most of these algorithms. In this letter, a statistical reverberation model is proposed that takes the energy contribution of the direct path into account. This model is then used to derive a more general LRSV estimator, which in a particular case reduces to an existing LRSV estimator. Experimental results show that the developed estimator is advantageous when the source-microphone distance is smaller than the critical distance.
In this paper, we propose a convolutive transfer function generalized sidelobe canceller (CTF-GSC), which is an adaptive beamformer designed for multichannel speech enhancement in reverberant environments. Using a complete system representation in the short-time Fourier transform (STFT) domain, we formulate a constrained minimization problem of the total output noise power, subject to the constraint that the signal component of the output is the desired signal, up to some prespecified filter. Then, we employ the generalized sidelobe canceller (GSC) structure to transform the problem into an equivalent unconstrained form by decoupling the constraint and the minimization. The CTF-GSC is obtained by applying a convolutive transfer function (CTF) approximation to the GSC scheme, which is more accurate and less restrictive than the multiplicative transfer function (MTF) approximation. Experimental results demonstrate that the proposed beamformer outperforms the transfer function GSC (TF-GSC) in reverberant environments and achieves both improved noise reduction and reduced speech distortion.
In many practical environments we wish to extract several desired speech signals, which are contaminated by nonstationary and stationary interfering signals. The desired signals may also be subject to distortion imposed by the acoustic room impulse responses (RIRs). In this paper, a linearly constrained minimum variance (LCMV) beamformer is designed for extracting the desired signals from multimicrophone measurements. The beamformer satisfies two sets of linear constraints. One set is dedicated to maintaining the desired signals, while the other set is chosen to mitigate both the stationary and nonstationary interferences. Unlike classical beamformers, which approximate the RIRs as delay-only filters, we take into account the entire RIR [or its respective acoustic transfer function (ATF)]. The LCMV beamformer is then reformulated in a generalized sidelobe canceler (GSC) structure, consisting of a fixed beamformer (FBF), blocking matrix (BM), and adaptive noise canceler (ANC). It is shown that for a spatially white noise field, the beamformer reduces to an FBF satisfying the constraint sets, without power minimization. It is shown that the application of the adaptive ANC contributes to interference reduction, but only when the constraint sets are not completely satisfied. We show that relative transfer functions (RTFs), which relate the desired speech sources and the microphones, and a basis for the interference subspace suffice for constructing the beamformer. The RTFs are estimated by applying the generalized eigenvalue decomposition (GEVD) procedure to the power spectral density (PSD) matrices of the received signals and the stationary noise. A basis for the interference subspace is estimated by collecting eigenvectors, calculated in segments where nonstationary interfering sources are active and the desired sources are inactive. The rank of the basis is then reduced by the application of the orthogonal triangular decomposition (QRD). This procedure relaxes the common requirement for nonoverlapping activity periods of the interference sources. A comprehensive experimental study in both simulated and real environments demonstrates the performance of the proposed beamformer.
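In standard notation (a sketch with symbols as commonly defined, not exactly as in the paper), the constrained beamformer and its GSC decomposition read:

```latex
% LCMV weights with constraint matrix C (desired-source RTFs and an
% interference-subspace basis), response vector g, and noise PSD matrix \Phi_{vv}:
\mathbf{w}_{\mathrm{LCMV}} =
  \boldsymbol{\Phi}_{vv}^{-1}\mathbf{C}
  \left(\mathbf{C}^{H}\boldsymbol{\Phi}_{vv}^{-1}\mathbf{C}\right)^{-1}\mathbf{g},
\qquad
\mathbf{w}_{\mathrm{GSC}} = \mathbf{w}_{0} - \mathbf{B}\,\mathbf{w}_{a},
% where w_0 is a fixed beamformer satisfying C^H w_0 = g, the blocking matrix B
% spans the null space of C^H, and w_a is the adaptive noise canceller.
```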
In this paper, we present a relative transfer function (RTF) identification method for speech sources in reverberant environments. The proposed method is based on the convolutive transfer function (CTF) approximation, which enables representing a linear convolution in the time domain as a linear convolution in the short-time Fourier transform (STFT) domain. Unlike the restrictive and commonly used multiplicative transfer function (MTF) approximation, which becomes accurate only when the length of a time frame increases relative to the length of the impulse response, the CTF approximation enables representation of long impulse responses using short time frames. We develop an unbiased RTF estimator that exploits the nonstationarity and presence probability of the speech signal, and derive an analytic expression for the estimator variance. Experimental results show that the proposed method is advantageous compared to common RTF identification methods in various acoustic environments, especially when identifying long RTFs typical of real rooms.
Noise fields encountered in real-life scenarios can often be approximated as spherical or cylindrical noise fields. The characteristics of the noise field can be described by a spatial coherence function. For simulation purposes, researchers in the signal processing community often require sensor signals that exhibit a specific spatial coherence function. In addition, they often require a specific type of noise such as temporally correlated noise, babble speech that comprises a mixture of mutually independent speech fragments, or factory noise. Existing algorithms are unable to generate sensor signals such as babble speech and factory noise observed in an arbitrary noise field. In this paper an efficient algorithm is developed that generates multisensor signals under a predefined spatial coherence constraint. The benefit of the developed algorithm is twofold. Firstly, there are no restrictions on the spatial coherence function. Secondly, to generate M sensor signals the algorithm requires only M mutually independent noise signals. The performance evaluation shows that the developed algorithm is able to generate a more accurate spatial coherence between the generated sensor signals compared to the so-called image method that is frequently used in the signal processing community.
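A simplified sketch of the idea is given below: M mutually independent noises are mixed per STFT bin through a symmetric square root of a spherically isotropic coherence matrix. This is not the paper's exact algorithm, which covers arbitrary coherence functions and other mixing-matrix choices; all names and parameter values are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def coherent_noise(noise, mic_dist, fs=16000, c=343.0, nfft=512):
    """Generate M sensor signals with an (approximately) prescribed
    spherically diffuse spatial coherence from M independent noises.

    noise: (M, T) mutually independent noise signals (e.g. babble fragments).
    mic_dist: (M, M) inter-sensor distance matrix in meters.
    """
    f, _, Z = stft(noise, fs=fs, nperseg=nfft)          # Z: (M, F, frames)
    out = np.zeros_like(Z)
    for k, fk in enumerate(f):
        # Target coherence at this frequency: sin(2*pi*f*d/c) / (2*pi*f*d/c).
        Gamma = np.sinc(2 * fk * mic_dist / c)
        vals, vecs = np.linalg.eigh(Gamma)
        mix = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
        out[:, k, :] = mix @ Z[:, k, :]                 # mix channels per bin
    _, y = istft(out, fs=fs, nperseg=nfft)
    return y
```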
Hands-free devices are often used in a noisy and reverberant environment. Therefore, the received microphone signal contains not only the desired near-end speech signal but also interferences such as room reverberation that is caused by the near-end source, background noise, and a far-end echo signal that results from the acoustic coupling between the loudspeaker and the microphone. These interferences degrade the fidelity and intelligibility of the near-end speech. In the last two decades, postfilters have been developed that can be used in conjunction with a single-microphone acoustic echo canceller to enhance the near-end speech. In previous works, spectral enhancement techniques have been used to suppress residual echo and background noise for single-microphone acoustic echo cancellers. However, dereverberation of the near-end speech was not addressed in this context. Recently, practically feasible spectral enhancement techniques to suppress reverberation have emerged. In this paper, we derive a novel spectral variance estimator for the late reverberation of the near-end speech. Residual echo will be present at the output of the acoustic echo canceller when the acoustic echo path cannot be completely modeled by the adaptive filter. A spectral variance estimator for the so-called late residual echo that results from the deficient length of the adaptive filter is derived. Both estimators are based on a statistical reverberation model. The model parameters depend on the reverberation time of the room, which can be obtained using the estimated acoustic echo path. A novel postfilter is developed which suppresses late reverberation of the near-end speech, residual echo and background noise, and maintains a constant residual background noise level. Experimental results demonstrate the beneficial use of the developed system for reducing reverberation, residual echo, and background noise.
Full-duplex hands-free man/machine interfaces often suffer from directional nonstationary interference, such as a competing speaker, as well as stationary interferences which may comprise both directional and nondirectional signals. The transfer-function generalized sidelobe canceller (TF-GSC) exploits the nonstationarity of the speech signal to enhance it when the undesired interfering signals are stationary. Unfortunately, the assumptions leading to the derivation of the TF-GSC are violated when a nonstationary interference is present. In this paper, we propose an adaptive beamformer, based on the TF-GSC, that is suitable for cancelling nonstationary interferences in noisy reverberant environments. We modify two of the TF-GSC components to enable suppression of the nonstationary undesired signal. A modified fixed beamformer (FBF) is designed to block the nonstationary interfering signal while maintaining the desired speech signal. A modified blocking matrix (BM) is designed to block both the desired signal and the nonstationary interference. We introduce a novel method for updating the blocking matrix in double-talk scenarios, which exploits the nonstationarity of both the desired and interfering speech signals. Experimental results demonstrate the performance of the proposed algorithm in noisy and reverberant environments and show its superiority over the original TF-GSC.
In this work, we evaluate the performance of a recently proposed adaptive beamformer, namely Dual source Transfer-Function Generalized Sidelobe Canceller (DTF-GSC). The DTF-GSC is useful for enhancing a speech signal received by an array of microphones in a noisy and reverberant environment. We demonstrate the applicability of the DTF-GSC in some representative reverberant and non-reverberant environments under various noise field conditions. The performance is evaluated based on the power spectral density (PSD) deviation imposed on the desired signal at the beamformer output, the achievable noise reduction, and the interference reduction. We show that the resulting expressions for the PSD deviation and noise reduction depend on the actual acoustical environment, the noise field, and the estimation accuracy of the relative transfer functions (RTFs), defined as the ratio between each acoustical transfer function (ATF) and a reference ATF. The achievable interference reduction is generally independent of the noise field. Experimental results demonstrate the sensitivity of the system’s performance to array misalignments.
Man-machine interaction requires an acoustic interface for providing full-duplex hands-free communication. The transfer-function generalized sidelobe canceller (TF-GSC) is an adaptive beamformer suitable for enhancing a speech signal received by an array of microphones in a noisy and reverberant environment. When an echo signal is also present in the microphone output signals, cascade schemes of acoustic echo cancellation and TF-GSC can be employed for suppressing both interferences. However, the performance obtainable by cascade schemes is generally insufficient. An acoustic echo canceller (AEC) that precedes the adaptive beamformer suffers from the noise component at its input. Acoustic echo cancellation following the adaptive beamformer lacks robustness due to time variations in the echo path affecting beamformer adaptation. In this paper, we introduce an echo transfer-function generalized sidelobe canceller (ETF-GSC), which combines the TF-GSC with an acoustic echo canceller. The proposed scheme consists of a primary TF-GSC for dealing with the noise interferences, and a secondary modified TF-GSC for dealing with the echo cancellation. The secondary TF-GSC includes an echo canceller embedded within a replica of the primary TF-GSC components. We show that using this structure, the problems encountered in the cascade schemes can be appropriately avoided. Experimental results demonstrate improved performance of the ETF-GSC compared to cascade schemes in noisy and reverberant environments.
The advantages of optics, which include processing speed and information throughput, modularity, and versatility, can be incorporated into one of the most interesting and widely applicable topics of digital communication: Viterbi decoding. We aim to accelerate the processing rate and capabilities of Viterbi decoders applied to convolutional codes, speech recognition, and intersymbol interference (ISI) mitigation problems. The suggested configuration for realizing the decoder is based upon fast optical switches. The configuration is very modular and can easily be scaled to Viterbi decoders based upon state machines with a larger number of states and a deeper trellis diagram.
A dual-step approach for speaker localization based on a microphone array is addressed in this paper. In the first stage, which is not the main concern of this paper, the time difference between arrivals of the speech signal at each pair of microphones is estimated. These readings are combined in the second stage to obtain the source location. In this paper, we focus on the second stage of the localization task and propose to exploit the speaker's smooth trajectory for improving the current position estimate. Three localization schemes, which use the temporal information, are presented. The first is a recursive form of the Gauss method. The other two are extensions of the Kalman filter to the nonlinear problem at hand, namely, the extended Kalman filter and the unscented Kalman filter. These methods are compared with other algorithms, which do not make use of the temporal information. An extensive experimental study demonstrates the advantage of using the spatial-temporal methods. To gain some insight into the obtainable performance of the localization algorithm, an approximate analytical evaluation, verified by an experimental study, is conducted. This study shows that in common TDOA-based localization scenarios, where the microphone array has a small interelement spread relative to the source position, the elevation and azimuth angles can be accurately estimated, whereas the Cartesian coordinates, as well as the range, are poorly estimated.
Determining the spatial position of a speaker is of growing interest in video conference scenarios, where automated camera steering and tracking are required. Speaker localization can be achieved with a dual-step approach. In the preliminary stage, a microphone array is used to extract the time difference of arrival (TDOA) of the speech signal. These readings are then used by the second stage for the actual localization. In this work we present novel, frequency-domain approaches for TDOA calculation in a reverberant and noisy environment. Our methods are based on the quasi-stationarity of speech, the stationarity of the noise, and the fact that the speech and the noise are uncorrelated. The mathematical derivations in this work are followed by an extensive experimental study involving both static and tracking scenarios.
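For orientation only, a classical time-domain GCC-PHAT TDOA estimator is sketched below as background for the first-stage TDOA extraction; it is not the method proposed in the paper, whose frequency-domain estimators additionally exploit speech quasi-stationarity and noise stationarity. Names are illustrative.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Classical GCC-PHAT TDOA estimate between two microphone signals.

    Returns the delay in seconds; positive when the signal reaches
    microphone 1 after microphone 2.
    """
    n = len(x1) + len(x2)                        # zero-pad to avoid wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.maximum(np.abs(cross), 1e-12)    # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    lag = int(np.argmax(cc))
    if lag > n // 2:                             # map circular lag to signed lag
        lag -= n
    return lag / fs
```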
In speech enhancement applications, microphone array postfiltering allows additional reduction of noise components at a beamformer output. Among microphone array structures, the recently proposed general transfer function generalized sidelobe canceller (TF-GSC) has shown impressive noise reduction abilities in a directional noise field, while still maintaining low speech distortion. However, in a diffuse noise field, less significant noise reduction is obtainable. The performance is further degraded when the noise signal is nonstationary. In this contribution we propose three postfiltering methods for improving the performance of microphone arrays. The first two are based on single-channel speech enhancers, making use of recently proposed algorithms applied at the beamformer output. The third is a multichannel speech enhancer which exploits noise-only components constructed within the TF-GSC structure. This work concentrates on the assessment of the proposed postfiltering structures. An extensive experimental study, which consists of both objective and subjective evaluation in various noise fields, demonstrates the advantage of the multichannel postfiltering compared to the single-channel techniques.
In recent work, we considered a microphone array located in a reverberant room, where general transfer functions (TFs) relate the source signal and the microphones, for enhancing a speech signal contaminated by interference. It was shown that it is sufficient to use the ratios between the different TFs, rather than the TFs themselves, in order to implement the suggested algorithm. An unbiased estimate of the TF ratios was obtained by exploiting the nonstationarity of the speech signal. In this correspondence, we present an analysis of a distortion indicator, namely the power spectral density (PSD) deviation, imposed on the desired signal by our newly suggested transfer function generalized sidelobe canceller (TF-GSC) algorithm. It is well known that for speech signals, PSD deviation between the reconstructed signal and the original one is the main contributor to speech quality degradation. As we are mainly dealing with speech signals, we analyze the PSD deviation rather than the regular waveform distortion. The resulting expression depends on the TFs involved, the noise field, and the quality of the estimation of the TF ratios. For the latter dependency, we provide an approximate analysis of the estimation procedure, which is based on the signal's nonstationarity, and explore its dependence on the actual speech signal and on the signal-to-noise ratio (SNR) level. The theoretical expression is then used to establish an empirical evaluation of the PSD deviation for several TFs of interest, various noise fields, and a wide range of SNR levels. It is shown that only a minor amount of PSD deviation is imposed on the beamformer output. The analysis presented in this correspondence is in good agreement with the actual performance presented in the former TF-GSC paper.
A novel approach for multimicrophone speech dereverberation is presented. The method is based on the construction of the null subspace of the data matrix in the presence of colored noise, using the generalized singular-value decomposition (GSVD) technique, or the generalized eigenvalue decomposition (GEVD) of the respective correlation matrices. The special Sylvester structure of the filtering matrix, related to this subspace, is exploited for deriving a total least squares (TLS) estimate for the acoustical transfer functions (ATFs). Other less robust but computationally more efficient methods are derived based on the same structure and on the QR decomposition (QRD). A preliminary study of the incorporation of the subspace method into a subband framework proves to be efficient, although some problems remain open. Speech reconstruction is achieved by virtue of the matched filter beamformer (MFBF). An experimental study supports the potential of the proposed methods.
We present a novel approach for real-time multichannel speech enhancement in environments of nonstationary noise and time-varying acoustical transfer functions (ATFs). The proposed system integrates adaptive beamforming, ATF identification, soft signal detection, and multichannel postfiltering. The noise canceller branch of the beamformer and the ATF identification are adaptively updated online, based on hypothesis test results. The noise canceller is updated only during stationary noise frames, and the ATF identification is carried out only when desired source components have been detected. The hypothesis testing is based on the nonstationarity of the signals and the transient power ratio between the beamformer primary output and its reference noise signals. Following the beamforming and the hypothesis testing, estimates for the signal presence probability and for the noise power spectral density are derived. Subsequently, an optimal spectral gain function that minimizes the mean square error of the log-spectral amplitude (LSA) is applied. Experimental results demonstrate the usefulness of the proposed system in nonstationary noise environments.
We address the problem of cancelling a stationary noise component from its static mixtures with a nonstationary signal of interest. Two different approaches, both based on second-order statistics, are considered. The first is the blind source separation (BSS) approach which aims at estimating the mixing parameters via approximate joint diagonalization of estimated correlation matrices. Proper exploitation of the nonstationary nature of the desired signal, in contrast to the stationarity of the noise, allows parameterization of the joint diagonalization problem in terms of a nonlinear weighted least squares (WLS) problem. The second approach is a denoising approach, which translates into direct estimation of just one of the mixing coefficients via solution of a linear WLS problem, followed by the use of this coefficient to create a noise-only signal to be properly eliminated from the mixture. Under certain assumptions, the BSS approach is asymptotically optimal, yet computationally more intensive, since it involves an iterative nonlinear WLS solution, whereas the second approach only requires a closed-form linear WLS solution. We analyze and compare the performance of the two approaches and provide some simulation results which confirm our analysis. Comparison to other methods is also provided.
We present a spectral-domain speech enhancement algorithm. The new algorithm is based on a mixture model for the short time spectrum of the clean speech signal, and on a maximum assumption in the production of the noisy speech spectrum. In the past this model was used in the context of noise robust speech recognition. In this paper we show that this model is also effective for improving the quality of speech signals corrupted by additive noise. The computational requirements of the algorithm can be significantly reduced, essentially without paying performance penalties, by incorporating a dual codebook scheme with tied variances. Experiments, using recorded speech signals and actual noise sources, show that in spite of its low computational requirements, the algorithm shows improved performance compared to alternative speech enhancement algorithms.
We consider a sensor array located in an enclosure, where arbitrary transfer functions (TFs) relate the source signal and the sensors. The array is used for enhancing a signal contaminated by interference. Constrained minimum power adaptive beamforming, which has been suggested by Frost (1972) and, in particular, the generalized sidelobe canceler (GSC) version, which has been developed by Griffiths and Jim (1982), are the most widely used beamforming techniques. These methods rely on the assumption that the received signals are simple delayed versions of the source signal. The good interference suppression attained under this assumption is severely impaired in complicated acoustic environments, where arbitrary TFs may be encountered. In this paper, we consider the arbitrary TF case. We propose a GSC solution, which is adapted to the general TF case. We derive a suboptimal algorithm that can be implemented by estimating the TF ratios, instead of estimating the TFs. The TF ratios are estimated by exploiting the nonstationarity characteristics of the desired signal. The algorithm is applied to the problem of speech enhancement in a reverberant room. The discussion is supported by an experimental study using speech and noise signals recorded in an actual room acoustics environment.
Speech quality and intelligibility might significantly deteriorate in the presence of background noise, especially when the speech signal is subject to subsequent processing. In particular, speech coders and automatic speech recognition (ASR) systems that were designed or trained to act on clean speech signals might be rendered useless in the presence of background noise. Speech enhancement algorithms have therefore attracted a great deal of interest. In this paper, we present a class of Kalman filter-based algorithms with some extensions, modifications, and improvements of previous work. The first algorithm employs the estimate-maximize (EM) method to iteratively estimate the spectral parameters of the speech and noise parameters. The enhanced speech signal is obtained as a byproduct of the parameter estimation algorithm. The second algorithm is a sequential, computationally efficient, gradient descent algorithm. We discuss various topics concerning the practical implementation of these algorithms. Extensive experimental study using real speech and noise signals is provided to compare these algorithms with alternative speech enhancement algorithms, and to compare the performance of the iterative and sequential algorithms.
Objective: Affective flexibility, the capacity to respond to life’s varying environmental changes in a dynamic and adaptive manner, is considered a central aspect of psychological health in many psychotherapeutic approaches. The present study examined whether affective two-dimensional (i.e., arousal and valence) temporal variability extracted from voice and facial expressions would be associated with positive changes over the course of psychotherapy, at the session, client, and treatment levels.
Method: 22,741 mean vocal arousal and facial expression valence observations were extracted from 137 therapy sessions in a sample of 30 clients treated for major depressive disorder by nine therapists. Before and after each session, the clients self-reported their level of well-being on the Outcome Rating Scale. Session-level affective temporal variability was assessed as the mean square of successive differences (MSSD) between consecutive two-dimensional affective measures. Results: Session outcome was positively associated with temporal variability at the session level (i.e., within clients, between sessions) and at the client level (i.e., between clients). Importantly, these associations held when controlling for average session- and client-level valence scores. In addition, the expansion of temporal variability throughout treatment was associated with steeper positive session outcome trajectories over the course of treatment.
Conclusions: The continuous assessment of both vocal and facial affective expressions and the ability to extract measures of affective temporal variability from within-session data may enable therapists to better respond and modulate clients’ affective flexibility; however, further research is necessary to determine whether there is a causal link between affective temporal variability and psychotherapy outcomes.
Introduction: To date, studies focusing on the connection between psychological functioning and autonomic nervous system (ANS) activity have usually adopted the one-dimensional model of autonomic balance, according to which activation of one branch of the ANS is accompanied by an inhibition of the other. However, the sympathetic and parasympathetic branches also activate independently; thus, co-activation and co-inhibition may occur, which is demonstrated by a two-dimensional model of ANS activity. Here, we apply such models to assess how markers of the autonomic space relate to several critical psychological constructs: emotional contagion (EC), general anxiety, and positive and negative affect (PA and NA). We also examined gender differences in those psychophysiological relations.
Methods: In the present study, we analyzed data from 408 healthy students, who underwent a 5-min group baseline period as part of their participation in several experiments and completed self-reported questionnaires. Electrocardiogram (ECG), electrodermal activity (EDA), and respiration were recorded. Respiratory sinus arrhythmia (RSA), pre-ejection period (PEP), as well as cardiac autonomic balance (CAB) and regulation (CAR) and cross-system autonomic balance (CSAB) and regulation (CSAR), were calculated.
Results: Notably, two-dimensional models were more suitable for predicting and describing most psychological constructs. Gender differences were found in psychological and physiological aspects as well as in psychophysiological relations. Women’s EC scores were negatively correlated with sympathetic activity and positively linked to parasympathetic dominance. Men’s PA and NA scores were positively associated with sympathetic activity. PA in men also had a positive link to an overall activation of the ANS, and a negative link to parasympathetic dominance.
Discussion: The current results expand our understanding of the psychological aspects of the autonomic space model and psychophysiological associations. Gender differences and strengths and weaknesses of alternative physiological models are discussed.
In the last decade, the signal processing (SP) community has witnessed a paradigm shift from model-based to data-driven methods. Machine learning (ML)—more specifically, deep learning—methodologies are nowadays widely used in all SP fields, e.g., audio, speech, image, video, multimedia, and multimodal/multisensor processing, to name a few. Many data-driven methods also incorporate domain knowledge to improve problem modeling, especially when computational burden, training data scarceness, and memory size are important constraints.
In this paper we propose a data-driven approach for multiple speaker tracking in reverberant enclosures. The speakers are uttering, possibly overlapping, speech signals while moving in the environment. The method comprises two stages. The first stage executes a single source localization using semi-supervised learning on multiple manifolds. The second stage, which is unsupervised, uses time-varying maximum likelihood estimation for tracking. The feature vectors, used by both stages, are the relative transfer functions (RTFs), which are known to be related to source positions. The number of sources is assumed to be known while the microphone positions are unknown. In the training stage, a large database of RTFs is given. A small percentage of the data is attributed with exact positions (namely, labelled data) and the rest is assumed to be unlabelled, i.e. the respective position is unknown. Then, a nonlinear, manifold-based, mapping function between the RTFs and the source positions is inferred. Applying this mapping function to all unlabelled RTFs constructs a dense grid of localized sources. In the test phase, this RTF grid serves as the centroids for a Mixture of Gaussians (MoG) model. The MoG parameters are estimated by applying a recursive variant of the expectation-maximization (EM) procedure that relies on the sparsity and intermittency of the speech signals. We present a comprehensive simulation study in various reverberation levels, including static and dynamic scenarios, with two or three (partially) overlapping speakers. For the dynamic case we provide simulations with several speaker trajectories, including intersecting sources. The proposed scheme outperforms baseline methods that use a simpler propagation model in terms of localization accuracy and tracking capabilities.
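A minimal sketch of the spirit of the second (tracking) stage, under strong simplifying assumptions: the grid centroids are treated as fixed feature vectors, each frame contributes a single feature vector, only the mixture weights are updated recursively, and the names, variance and step size below are illustrative rather than the paper's actual quantities.

# Minimal sketch (simplified): recursive EM update of MoG weights over a fixed
# grid of centroids; the dominant weights indicate the active source positions.
import numpy as np

def recursive_em_weights(features, centroids, var=1.0, gamma=0.05):
    K = centroids.shape[0]
    w = np.full(K, 1.0 / K)                      # initial mixture weights
    for f in features:                           # one feature vector per time frame
        d2 = np.sum((centroids - f) ** 2, axis=1)
        resp = w * np.exp(-0.5 * d2 / var)       # E-step: responsibilities
        resp /= resp.sum() + 1e-12
        w = (1 - gamma) * w + gamma * resp       # M-step: recursive weight update
    return w

# Toy example: 100 frames drawn near two of ten grid points
grid = np.random.randn(10, 4)
frames = np.vstack([grid[2] + 0.1 * np.random.randn(50, 4),
                    grid[7] + 0.1 * np.random.randn(50, 4)])
print(np.argsort(recursive_em_weights(frames, grid))[-2:])  # likely {2, 7}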
Audio signal processing has passed many landmarks in its development as a research topic. Many are well known, such as the development of the phonograph in the second half of the 19th century and technology associated with digital telephony that burgeoned in the late 20th century and is still a hot topic in multiple guises. Interestingly, the development of audio technology has been fueled not only by advancements in the capabilities of technology but also by high consumer expectations and customer engagement. From surround sound movie theaters to the latest in-ear devices, people love sound and soon build new audio technology into their daily lives as an essential and expected feature.
Direction-of-arrival (DOA) estimation for multiple simultaneous speakers in reverberant environments is still one of the challenging tasks in the audio signal processing field. A recent approach addresses this problem using a spherical harmonics domain feature named relative harmonic coefficients (RHC). Based on a bin-wise operation across the STFT (short-time Fourier transform) domain, this method detects the direct-path RHC in the first stage, followed by single source localization in the second stage. However, the method is computationally expensive as each STFT bin requires an exhaustive grid search over the two-dimensional (2-D) directional space. In this paper, we propose a significantly more computationally efficient alternative that decouples the azimuth and elevation 2-D search into two separate one-dimensional (1-D) searches. The proposed multi-speaker localization algorithm comprises two main steps, responsible for: (i) achieving a joint direct-path RHC detection and decoupled DOA estimation using 1-D search; and (ii) counting the number of speakers and estimating their DOAs based on the estimates from direct-path dominated STFT bins. Experiments using both simulated and real-life reverberant recordings confirm the significant computational complexity reduction while achieving competitive localization accuracy, compared to the baseline approaches. Although our proposed method performs in an unsupervised manner, it proves to be applicable even under unfavorable acoustic environments with a high reverberation level (e.g., T60=1 second).
The design of binaural multi-microphone speech enhancement algorithms can be viewed as a multi-criteria problem, as there are several requirements to be met. The objective is not only to extract the target speaker without distortion, but also to suppress interfering sources (e.g., competing speakers) and ambient background noise, while preserving the auditory impression of the complete acoustic scene. Such a multi-objective problem (MOP) can be solved using a Pareto frontier, which provides a useful trade-off between the different criteria. In this paper, we propose a unified Pareto optimization framework, which is achieved by defining a generalized mean squared error (MSE) cost function, derived from a MOP. The solution to the multi-criteria problem is thus grounded in a solid mathematical foundation. The MSE cost function consists of a weighted sum of speech distortion (SD), partial interference reduction (IR), and partial noise reduction (NR) terms with scaling parameters that control the amount of IR and NR. The filter minimizing this generalized cost function, denoted Pareto optimal binaural multichannel Wiener filter (Pareto-BMWF), constitutes a generalization of various binaural MWF-based and binaural MVDR-based beamformers. This solution is optimal for any set of parameters. The improved speech enhancement capabilities are experimentally demonstrated using real-signal recordings when estimation errors are present, and the binaural cue preservation capabilities are analyzed.
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multispeaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: first, a multispeaker linearly constrained minimum variance (LCMV) beamformer (BF); and second, a subsequent multispeaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers [as manifested by the respective acoustic transfer functions (ATFs)] and the noise power spectral density (PSD) matrix, while the second stage exploits the speakers’ PSD matrix to reduce the residual noise at the output of the first stage. The output vector of the multispeaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multispeaker case is also derived given the multispeaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multispeaker postfilters are beneficial in terms of signal-to-interference plus noise ratio improvement when compared with the single-speaker postfilter.
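A minimal numpy sketch of the first stage only, namely a narrowband multispeaker LCMV beamformer for a single frequency bin with the ATFs and noise PSD matrix assumed known; this illustrates the standard LCMV formula rather than the paper's full two-stage estimator, and the array sizes below are arbitrary.

# Illustrative narrowband multispeaker LCMV beamformer for a single frequency bin:
# W = inv(Phi_vv) @ A @ inv(A^H inv(Phi_vv) A), assuming known ATFs and noise PSD.
import numpy as np

def lcmv_weights(A, Phi_vv):
    """A: (mics x speakers) ATF matrix, Phi_vv: (mics x mics) noise PSD matrix."""
    Pinv = np.linalg.inv(Phi_vv)
    num = Pinv @ A
    return num @ np.linalg.inv(A.conj().T @ num)   # (mics x speakers)

rng = np.random.default_rng(0)
M, Q = 6, 2                                          # microphones, speakers
A = rng.normal(size=(M, Q)) + 1j * rng.normal(size=(M, Q))
Phi_vv = np.eye(M) + 0.1 * np.ones((M, M))           # simple noise covariance
W = lcmv_weights(A, Phi_vv)
print(np.allclose(W.conj().T @ A, np.eye(Q)))        # distortionless constraints hold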
A spherical harmonics domain source feature called relative harmonic coefficients (RHC) has recently been applied to address the source direction-of-arrival (DOA) estimation problem. This paper presents a compact evaluation and comparison between two existing RHC based DOA estimators: (i) a method using a full grid search over the two-dimensional (2-D) directional space, (ii) a decoupled estimator which uses one-dimensional (1-D) search to separately localize the source's elevation and azimuth. We also propose a new estimator using a gradient descent search over the 2-D directional grid space. Extensive experiments in both simulated and real-life environments are conducted to examine and analyze the performance of all the underlying DOA estimators. Two objective metrics, namely localization accuracy and algorithm complexity, are adopted for the evaluation and comparison of all estimators.
In this paper, we consider the problem of acoustic source localization by acoustic sensor networks (ASNs) using a promising, learning-based technique that adapts to the acoustic environment. In particular, we look at the scenario when a node in the ASN is displaced from its position during training. As the mismatch between the ASN used for learning the localization model and the one after a node displacement leads to erroneous position estimates, a displacement has to be detected and the displaced nodes need to be identified. We propose a method that considers the disparity in position estimates made by leave-one-node-out (LONO) sub-networks and uses a Markov random field (MRF) framework to infer the probability of each LONO position estimate being aligned, misaligned or unreliable while accounting for the noise inherent to the estimator. This probabilistic approach is advantageous over naïve detection methods, as it outputs a normalized value that encapsulates conditional information provided by each LONO sub-network on whether the reading is in misalignment with the overall network. Experimental results confirm that the performance of the proposed method is consistent in identifying compromised nodes in various acoustic conditions.
In this paper, we present a multiple-speaker direction of arrival (DOA) tracking algorithm with a microphone array that utilizes the recursive EM (REM) algorithm proposed by Cappé and Moulines. In our model, all sources can be located in one of a predefined set of candidate DOAs. Accordingly, the received signals from all microphones are modeled as Mixture of Gaussians (MoG) vectors in which each speaker is associated with a corresponding Gaussian. The localization task is then formulated as a maximum likelihood (ML) problem, where the MoG weights and the power spectral density (PSD) of the speakers are the unknown parameters. The REM algorithm is then utilized to estimate the ML parameters in an online manner, facilitating multiple source tracking. By using Fisher-Neyman factorization, the outputs of the minimum variance distortionless response (MVDR)-beamformer (BF) are shown to be sufficient statistics for estimating the parameters of the problem at hand. With that, the terms for the E-step are significantly simplified to a scalar form. An experimental study demonstrates the benefits of using the proposed algorithm on both a simulated data-set and real recordings from the acoustic source localization and tracking (LOCATA) data-set.
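The following hedged sketch only illustrates the notion of a bank of MVDR beamformers steered towards candidate DOAs (here with a simple far-field uniform linear array model), whose output powers form a crude localization spectrum; it does not reproduce the REM recursion or the MoG model of the paper, and all array parameters are illustrative.

# Illustrative sketch: a bank of MVDR beamformers steered to candidate DOAs for a
# uniform linear array; output powers serve as a crude localization spectrum.
import numpy as np

def mvdr_spectrum(R, candidates_deg, n_mics, d=0.08, f=1000.0, c=343.0):
    powers = []
    Rinv = np.linalg.inv(R)
    for theta in np.deg2rad(candidates_deg):
        tau = d * np.arange(n_mics) * np.sin(theta) / c
        a = np.exp(-2j * np.pi * f * tau)            # far-field steering vector
        w = Rinv @ a / (a.conj() @ Rinv @ a)         # MVDR weights
        powers.append(np.real(w.conj() @ R @ w))     # output power at this DOA
    return np.array(powers)

# Toy example: covariance of a single source at 30 degrees plus sensor noise
n_mics, d, f, c = 8, 0.08, 1000.0, 343.0
tau = d * np.arange(n_mics) * np.sin(np.deg2rad(30.0)) / c
a30 = np.exp(-2j * np.pi * f * tau)
R = np.outer(a30, a30.conj()) + 0.01 * np.eye(n_mics)
cands = np.arange(-90, 91, 5)
print(cands[np.argmax(mvdr_spectrum(R, cands, n_mics))])  # close to 30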
In this study we present a mixture of deep experts (MoDE) neural-network architecture for single microphone speech enhancement. Our architecture comprises a set of deep neural networks (DNNs), each of which is an 'expert' in a different speech spectral pattern, such as a phoneme. A gating DNN is responsible for the latent variables, which are the weights assigned to each expert's output given a speech segment. The experts estimate a mask from the noisy input and the final mask is then obtained as a weighted average of the experts' estimates, with the weights determined by the gating DNN. A soft spectral attenuation, based on the estimated mask, is then applied to enhance the noisy speech signal. As a byproduct, we obtain a reduction in computational complexity at test time. We show that the experts' specialization allows better robustness to unfamiliar noise types.
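A toy sketch of the mixture-of-deep-experts combination step: a gating function weights the masks produced by several experts and the final mask is their weighted average; the toy experts and gate below merely stand in for trained DNNs and are not the paper's networks.

# Minimal sketch of the mixture-of-experts combination: a gating function weights the
# masks produced by several expert functions; shapes and the toy experts are illustrative.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mode_mask(noisy_spec, experts, gate):
    """noisy_spec: (frames x freq); experts: list of mask functions; gate: weight function."""
    masks = np.stack([ex(noisy_spec) for ex in experts], axis=0)    # (E x frames x freq)
    weights = softmax(gate(noisy_spec), axis=-1)                    # (frames x E)
    return np.einsum('te,etf->tf', weights, masks)                  # weighted-average mask

# Toy experts and gate standing in for trained DNNs
experts = [lambda s: s / (s + 1.0),
           lambda s: s ** 2 / (s ** 2 + 1.0)]
gate = lambda s: np.stack([s.mean(axis=1), -s.mean(axis=1)], axis=1)
spec = np.abs(np.random.randn(10, 257))
enhanced = mode_mask(spec, experts, gate) * spec                    # soft spectral attenuation
print(enhanced.shape)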
We propose a semi-supervised localization approach based on deep generative modeling with variational autoencoders (VAE). Localization in reverberant environments remains a challenge, which machine learning (ML) has shown promise in addressing. Even with large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. We address this issue by performing semi-supervised learning (SSL) with convolutional VAEs. The VAE is trained to generate the phase of relative transfer functions (RTFs), in parallel with a DOA classifier, on both labeled and unlabeled RTF samples. The VAE-SSL approach is compared with SRP-PHAT and fully-supervised CNNs. We find that VAE-SSL can outperform both SRP-PHAT and the CNN in label-limited scenarios.
A data-driven approach for multiple speakers localization in reverberant enclosures is presented. The approach combines semi-supervised learning on multiple manifolds with unsupervised maximum likelihood estimation. The relative transfer functions (RTFs) are used in both stages of the proposed algorithm as feature vectors, which are known to be related to source positions. The microphone positions are not known. In the training stage, a nonlinear, manifold-based, mapping between RTFs and source locations is inferred using single-speaker utterances. The inference procedure utilizes two RTF datasets: A small set of RTFs with their associated position labels; and a large set of unlabelled RTFs. This mapping is used to generate a dense grid of localized sources that serve as the centroids of a Mixture of Gaussians (MoG) model, used in the test stage of the algorithm to cluster RTFs extracted from multiple-speakers utterances. Clustering is performed by applying the expectation-maximization (EM) procedure that relies on the sparsity and intermittency of the speech signals. A preliminary experimental study, with either two or three overlapping speakers in various reverberation levels, demonstrates that the proposed scheme achieves high localization accuracy compared to a baseline method using a simpler propagation model.
Traditional source direction-of-arrival (DOA) estimation algorithms generally localize the elevation and azimuth simultaneously, requiring an exhaustive search over the two-dimensional (2-D) space. By contrast, this paper presents two decoupled source DOA estimation algorithms using a recently introduced source feature called the relative harmonic coefficients. They are capable of recovering the source's elevation and azimuth separately, since the elevation and azimuth components in the relative harmonic coefficients are decoupled. The proposed algorithms are highlighted by a large reduction of computational complexity, thus enabling their direct application to sound source tracking. Simulation results, using both a static and a moving sound source, confirm that the proposed methods are computationally efficient while achieving competitive localization accuracy.
We introduce a database of multi-channel recordings performed in an acoustic lab with adjustable reverberation time. The recordings provide detailed information about room acoustics for positions of a source within a confined area. In particular, the main positions correspond to 4104 vertices of a cube-shaped dense grid within a 46 × 36 × 32 cm volume. The database can serve for simulations of real-world situations and as a tool for detailed analyses of beampatterns of spatial processing methods. It can also be used for training and testing of mathematical models of the acoustic field.
This paper presents a fully Bayesian hierarchical model for blind audio source separation in a noisy environment. Our probabilistic approach is based on Gaussian priors for the speech signals, Gamma hyperpriors for the speech precisions and a Gamma prior for the noise precision. The time-varying acoustic channels are modelled with a linear-Gaussian state-space model. The inference is carried out using a variational Expectation-Maximization (VEM) algorithm, leading to a variant of the multi-speaker multichannel Wiener filter (MCWF) to separate and enhance the audio sources, and a Kalman smoother to infer the acoustic channels. The VEM speech estimator can be decomposed into two stages: A multi-speaker linearly constrained minimum variance (LCMV) beamformer followed by a variational multi-speaker postfilter. The proposed algorithm is evaluated in a static scenario using recorded room impulse responses (RIRs) with two reverberation levels, showing superior performance compared to competing methods.
The problem of multi-microphone blind audio source separation in a noisy environment is addressed. The estimation of the acoustic signals and the associated parameters is carried out using the expectation-maximization algorithm. Two separation algorithms are developed using either a deterministic representation or a stochastic Gaussian distribution for modelling the speech signals. Under the deterministic model, the speech sources are estimated in the M-step by applying in parallel multiple minimum variance distortionless response (MVDR) beamformers, while under the stochastic model, the speech signals are estimated in the E-step by applying in parallel multiple multichannel Wiener filters (MCWF). In the simulation study, we generated a large dataset of microphone signals by convolving speech signals, with overlapping activity patterns, with measured acoustic impulse responses. It is shown that the proposed methods outperform a baseline method in terms of speech quality and intelligibility.
In this paper we propose a fully Bayesian hierarchical model for multi-speaker direction of arrival (DoA) estimation and separation in noisy environments, utilizing the W-disjoint orthogonality property of the speech sources. Our probabilistic approach employs a mixture of Gaussians formulation with centroids associated with a grid of candidate speakers’ DoAs. The hierarchical Bayesian model is established by attributing priors to the various parameters. We then derive a variational Expectation-Maximization algorithm that estimates the DoAs by selecting the most probable candidates, and separates the speakers using a variant of the multichannel Wiener filter that takes into account the responsibility of each candidate in describing the received data. The proposed algorithm is evaluated using real room impulse responses from a freely-available database, in terms of both DoA estimates accuracy and separation scores. It is shown that the proposed method outperforms competing methods.
In this contribution, a novel maximum likelihood (ML) based direction of arrival (DOA) estimator for concurrent speakers in a noisy reverberant environment is presented. The DOA estimation task is formulated in the short-time Fourier transform (STFT) domain in two stages. In the first stage, a single local DOA per time-frequency (TF) bin is selected, using the W-disjoint orthogonality property of the speech signal in the STFT domain. The local DOA is obtained as the maximum of the narrow-band likelihood localization spectrum at each TF bin. In addition, for each local DOA, a confidence measure is calculated, determining the confidence in the local estimate. In the second stage, the wide-band localization spectrum is calculated using a weighted histogram of the local DOA estimates with the confidence measures as weights. Finally, the wide-band DOA estimation is obtained by selecting the peaks in the wide-band localization spectrum. The results of our experimental study demonstrate the benefit of the proposed algorithm in a reverberant environment as compared with the classical steered response power phase transform (SRP-PHAT) algorithm.
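As an illustration of the second stage only, the sketch below forms a wide-band localization spectrum as a confidence-weighted histogram of per-bin local DOA estimates and reads off its largest peaks; the candidate grid, synthetic local estimates and confidences are illustrative and not taken from the paper.

# Illustrative second stage: a confidence-weighted histogram of per-bin local DOA
# estimates forms a wide-band localization spectrum whose peaks give the DOAs.
import numpy as np

def wideband_spectrum(local_doas_deg, confidences, candidates_deg):
    spectrum = np.zeros(len(candidates_deg))
    idx = np.argmin(np.abs(local_doas_deg[:, None] - candidates_deg[None, :]), axis=1)
    np.add.at(spectrum, idx, confidences)            # weighted histogram
    return spectrum

candidates = np.arange(0, 181, 5)
# Toy per-bin estimates scattered around two speakers at 40 and 120 degrees
local = np.concatenate([40 + 3 * np.random.randn(300), 120 + 3 * np.random.randn(300)])
conf = np.random.rand(600)
spec = wideband_spectrum(local, conf, candidates)
print(candidates[np.argsort(spec)[-2:]])             # close to 40 and 120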
In speech enhancement, the use of supervised algorithms in the form of deep neural networks (DNNs) has become tremendously popular in recent years. The target function of the DNN (and the associated estimators) is often either a masking function applied to the noisy spectrum, or the clean log-spectrum. In this work, we show that both cost functions, when used separately, are unsuitable for dealing with narrowband noise, and propose a new composite estimator in the log-spectrum domain. The new technique relies on a single DNN that outputs both a masking function and an estimated log-spectrum. Both outputs are used for the composite enhancement. The proposed estimator demonstrates superior performance for speech utterances contaminated by additive narrowband noise, while maintaining the enhancement quality of the baseline algorithms for wideband noise.
In this study we propose a deep clustering algorithm that extends the k-means algorithm. Each cluster is represented by an autoencoder instead of a single centroid vector. Each data point is associated with the autoencoder which yields the minimal reconstruction error. The optimal clustering is found by learning a set of autoencoders that minimize the global reconstruction mean-square error loss. The network architecture is a simplified version of a previous method that is based on mixture-of-experts. The proposed method is evaluated on standard image corpora and performs on par with state-of-the-art methods which are based on much more complicated network architectures.
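A deliberately simplified stand-in for the idea, replacing each deep autoencoder with a one-dimensional linear autoencoder (a single PCA direction): points are assigned to the cluster whose autoencoder reconstructs them with minimal error, and each autoencoder is then refit. This is only a linear caricature of the proposed method; the function and parameter names are illustrative.

# Simplified linear stand-in for autoencoder-based clustering: each cluster is a
# one-dimensional linear "autoencoder"; points go to the cluster with the smallest
# reconstruction error, then each cluster's direction and center are refit.
import numpy as np

def k_linear_autoencoders(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    dirs = rng.normal(size=(k, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    for _ in range(iters):
        # reconstruction of x by cluster j: center_j + <x - center_j, d_j> d_j
        resid = X[:, None, :] - centers[None, :, :]
        coef = np.einsum('nkd,kd->nk', resid, dirs)
        rec_err = np.linalg.norm(resid - coef[:, :, None] * dirs[None, :, :], axis=2)
        labels = rec_err.argmin(axis=1)
        for j in range(k):                              # refit each cluster's "autoencoder"
            pts = X[labels == j]
            if len(pts) > 1:
                centers[j] = pts.mean(axis=0)
                _, _, vt = np.linalg.svd(pts - centers[j], full_matrices=False)
                dirs[j] = vt[0]
    return labels

X = np.vstack([np.random.randn(100, 5) + 4, np.random.randn(100, 5) - 4])
print(np.bincount(k_linear_autoencoders(X, 2)))        # two clusters of about 100 each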
This paper presents an unsupervised multi-source localization algorithm using a recently introduced feature called the relative harmonic coefficients. We derive a closed-form expression of the feature and briefly summarize its unique properties. We then exploit this feature to develop a single-source frame/bin detector which simplifies the challenging problem of multiple source localization into a single source localization problem. We show that the underlying method is suitable for localization using overlapped, disjoint as well as simultaneous multi-source recordings. Experimental results in both simulated and real-life reverberant environments confirm improved localization accuracy of the proposed method in comparison with the existing state-of-the-art approach.
Speech signals captured by a microphone mounted to a smart soundbar or speaker are inherently contaminated by echoes. Modern smart devices are usually characterized by low computational capabilities and low memory resources; in these cases, a low-complexity acoustic echo canceller (AEC) may be preferred even though a tolerable degradation in the cancellation occurs. In principle, devices with multiple loudspeakers need an individual AEC for each loudspeaker because the transfer function (TF) from each loudspeaker to the microphone must be estimated. In this paper, we present a normalized least mean square (NLMS) algorithm for the multi-loudspeaker case using relative loudspeaker transfer functions (RLTFs). In each iteration, the RLTFs between each loudspeaker and the reference loudspeaker are estimated first, and then the primary TF between the reference loudspeaker and the microphone. Assuming loudspeakers that are close to each other, the RLTFs can be estimated using fewer coefficients w.r.t. the primary TF, yielding a reduction of 3:4 in computational complexity and 1:2 in memory usage. The algorithm is evaluated using both simulated and real room impulse responses (RIRs) of two loudspeakers with a reverberation time set to 0.3 s and several distances between the loudspeakers.
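For orientation, the sketch below implements a standard single-loudspeaker NLMS echo canceller, the kind of adaptive filter the proposed multi-loudspeaker RLTF scheme builds upon; the filter length, step size and toy echo path are illustrative, and the RLTF decomposition itself is not reproduced.

# Baseline sketch: standard single-loudspeaker NLMS echo cancellation.
import numpy as np

def nlms_aec(far_end, mic, n_taps=64, mu=0.5, eps=1e-8):
    h = np.zeros(n_taps)                               # adaptive echo-path estimate
    err = np.zeros(len(mic))
    buf = np.zeros(n_taps)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        y_hat = h @ buf                                # estimated echo
        err[n] = mic[n] - y_hat                        # echo-cancelled output
        h += mu * err[n] * buf / (buf @ buf + eps)     # NLMS update
    return err, h

# Toy example: microphone signal is the far-end signal through a short unknown echo path
far = np.random.randn(8000)
true_path = np.exp(-0.2 * np.arange(32))
mic = np.convolve(far, true_path)[:8000]
err, h = nlms_aec(far, mic)
print(np.mean(err[-1000:] ** 2))                       # small residual echo power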
This paper proposes to use directional spectra as new features for manifold-learning-based acoustic source localization. We claim that directional spectra not only contain directional information, but are also discriminative for different positions in a reverberant enclosure. We use these proposed features to build a manifold-learning-based localization algorithm which is applied to single-array localization as well as to Acoustic Sensor Network (ASN) localization. The performance of the proposed algorithm is benchmarked by comprehensive experiments carried out in a simulated environment, with comparison to a blind approach based on triangulation as well as to Gaussian Process Regression (GPR)-based localization.
We present a multi-microphone multi-speaker direction of arrival (DOA) tracking algorithm. In the proposed algorithm, the DOA values are discretized to a set of candidate DOAs. Accordingly, and following the W-disjoint orthogonality (WDO) property of the speech signal, each time-frequency (TF) bin in the short-time Fourier transform (STFT) domain is associated with a single DOA candidate. The conditional probability of each TF observation given its corresponding DOA association is modeled as a multivariate complex-Gaussian distribution, with the power spectral density (PSD) of each source as an unknown parameter. By applying the Fisher-Neyman factorization, it can be shown that this conditional probability is proportional to the signal-to-noise ratio (SNR) at the outputs of minimum variance distortionless response (MVDR)-beamformers (BFs) directed towards all candidate DOAs. We model these observations as either a frequency-wise parallel Hidden Markov Model (HMM) or as a coupled HMM with coupling between adjacent frequency bins. The posterior probability of these associations is inferred by applying an extended forward-backward (FB) algorithm, and the actual DOAs can be inferred from this posterior. An experimental study demonstrates the benefits of the proposed algorithm using both a simulated dataset and real recordings drawn from the acoustic source localization and tracking (LOCATA) dataset.
In this paper we propose a Deep Autoencoder Mixture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data-point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture of autoencoder network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.
Sound source localization is a cumbersome task in challenging reverberation conditions. Recently, there has been growing interest in developing learning-based localization methods. In this approach, acoustic features are extracted from the measured signals and then given as input to a model that maps them to the corresponding source positions. Typically, a massive dataset of labeled samples from known positions is required to train such models. Here, we present a novel weakly-supervised deep-learning localization method that exploits only a few labeled (anchor) samples with known positions, together with a larger set of unlabeled samples, for which we only know their relative physical ordering. We design an architecture that uses a stochastic combination of triplet-ranking loss for the unlabeled samples and physical loss for the anchor samples, to learn a nonlinear deep embedding that maps acoustic features to an azimuth angle of the source. The combined loss can be optimized effectively using a standard gradient-based approach. Evaluating the proposed approach on simulated data, we demonstrate its significant improvement over two previous learning-based approaches for various reverberation levels, while maintaining consistent performance with varying sizes of labeled data.
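A hedged sketch of the two loss terms only: a margin-based triplet-ranking loss for unlabeled samples with known relative ordering, and a squared loss tying anchor embeddings to their known azimuths. The scalar embedding g(.) below is a toy stand-in for the deep network and is not the paper's architecture; the margin value is arbitrary.

# Illustrative loss terms for weakly-supervised localization: a triplet-ranking loss
# for ordered unlabeled samples and a squared "physical" loss for labeled anchors.
import numpy as np

def triplet_ranking_loss(g_a, g_b, g_c, margin=0.1):
    # sample b is physically closer to a than c is; its embedding should be too
    return max(0.0, np.abs(g_a - g_b) - np.abs(g_a - g_c) + margin)

def anchor_loss(g_anchor, azimuth_deg):
    return (g_anchor - azimuth_deg) ** 2               # embedding tied to known azimuth

g = lambda feature: float(np.sum(feature))             # toy stand-in for the network
f1, f2, f3 = np.array([0.1]), np.array([0.2]), np.array([0.9])
print(triplet_ranking_loss(g(f1), g(f2), g(f3)))       # 0.0: ordering already respected
print(anchor_loss(g(f1), 15.0))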
The times of arrival (TOAs) of acoustic echoes are a prerequisite in, e.g., room geometry estimation and localization of acoustic reflectors, which can be an enabling technology for autonomous robots and drones. However, solving these problems using TOAs alone introduces the difficult problem of echo labeling. Moreover, it is typically suggested to estimate the TOAs by estimating the room impulse response and finding its peaks, but this approach is vulnerable to noise (e.g., ego noise). We therefore propose an expectation-maximization (EM) method for estimating both the TOAs and the directions of arrival (DOAs) of acoustic echoes using a loudspeaker and a uniform circular array (UCA). Our results show that this approach is more robust against noise compared to the traditional peak-finding approach. Moreover, they show that the TOA and DOA information can be combined to estimate wall positions directly without considering echo labeling.
A concurrent speaker direction of arrival (DOA) estimator in a reverberant environment is presented. The reverberation phenomenon, if not properly addressed, is known to degrade the performance of DOA estimators. In this paper, we investigate a variational Bayesian (VB) inference framework for clustering time-frequency (TF) bins to candidate angles. The received microphone signals are modelled as a sum of anechoic speech and the reverberation component. Our model relies on a Gaussian prior for the speech signal and a Gamma prior for the speech precision. The noise covariance matrix is modelled by a time-invariant full-rank coherence matrix multiplied by a time-varying gain, also with a Gamma prior. The benefits of the presented model are verified in a simulation study using measured room impulse responses.
The scenario of a mixture of two speakers captured by a microphone array in a noisy and reverberant environment is considered. If the problems of source separation and dereverberation are treated separately, performance degradation may result. It is well-known that the performance of blind source separation (BSS) algorithms degrades in the presence of reverberation, unless reverberation effects are properly addressed (leading to the so-called convolutive BSS algorithms). Similarly, the performance of common dereverberation algorithms will severely degrade if an interference signal is also captured by the same microphone array. The aim of the proposed method is to jointly separate and dereverberate the two speech sources, by extending the Kalman expectation-maximization for dereverberation (KEMD) algorithm, previously proposed by the authors. A statistical model is attributed to this scenario, using the convolutive transfer function (CTF) approximation, and the expectation-maximization (EM) scheme is applied to obtain a maximum likelihood (ML) estimate of the parameters. In the expectation step, the separated clean signals are extracted from the observed data by the application of a Kalman filter, utilizing the parameters that were estimated in the previous iteration. The maximization step updates the parameter estimates according to the E-step output. Simulation results show that the proposed method improves both the separation of the signals and their overall quality.
In this paper, we present a multi-microphone speech separation algorithm based on masking inferred from the speakers' directions of arrival (DOAs). According to the W-disjoint orthogonality property of speech signals, each time-frequency (TF) bin is dominated by a single speaker. This TF bin can therefore be associated with a single DOA. In our procedure, we apply a deep neural network (DNN) with a U-net architecture to infer the DOA of each TF bin from a concatenated set of the spectra of the microphone signals. Separation is obtained by multiplying the reference microphone by the masks associated with the different DOAs. Our proposed deep direction estimation for speech separation (DDESS) method is inspired by the recent advances in deep clustering methods. Unlike already established methods that apply the clustering in a latent embedded space, in our approach the embedding is closely associated with the spatial information, as manifested by the different speakers' directions of arrival.
This paper investigates localization of an arbitrary number of simultaneously active speakers in an acoustic enclosure. We propose an algorithm capable of estimating the number of speakers, using reliability information to obtain robust estimation results in adverse acoustic scenarios and estimating individual probability distributions describing the position of each speaker using convex geometry tools. To this end, we start from an established algorithm for localization of acoustic sources based on the EM algorithm. There, the estimation of the number of sources as well as the handling of reverberation have not been addressed sufficiently. We show improvement in the localization of a higher number of sources and in the robustness in adverse conditions including interference from competing speakers, reverberation and noise.
In this paper, the problem of speech dereverberation in a noiseless scenario is addressed in a hierarchical Bayesian framework. Our probabilistic approach relies on a Gaussian model for the early speech signal combined with a multichannel Gaussian model for the relative early transfer function (RETF). The late reverberation is modelled as a Gaussian additive interference, and the speech and reverberation precisions are modelled with Gamma distribution. We derive a variational Expectation-Maximization (VEM) algorithm which uses a variant of the multichannel Wiener filter (MCWF) to infer the early speech component while suppressing the late reverberation. The proposed algorithm was evaluated using real room impulse responses (RIRs) recorded in our acoustic lab with a reverberation time set to 0.36 s and 0.61 s. It is shown that a significant improvement is obtained with respect to the reverberant signal, and that the proposed algorithm outperforms a baseline algorithm. In terms of channel alignment, a superior channel estimate is demonstrated.
Voice activity detection (VAD), namely determining whether a speech signal is active or inactive, and single talk detection (STD), namely detecting that only one speaker is active, are important building blocks in many speech processing applications. A speaker-localization stage (such as the steered response power (SRP)) is often concurrently implemented on the same device. In this paper, the spatial properties of the SRP are utilized for improving the performance of both the voice activity detector (VAD) and the STD. We propose to measure the entropy at the SRP output and compare it with the typical entropy of noise-only frames. This feature utilizes spatial information and may therefore become advantageous in nonstationary noise environments. The STD can then be implemented by determining local minimum values of the entropy measure of the SRP. The proposed VAD was tested for a single speaker in two cases: directional background noise with a changing level, and a background music source. The proposed STD was tested using real recordings of two concurrent speakers.
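The sketch below illustrates the entropy feature itself: the SRP map of a frame is normalized to a distribution and its entropy is computed, so that a frame with a dominant direction yields low entropy while a diffuse (noise-only) frame yields high entropy; the SRP values here are synthetic and the detection thresholds of the paper are not reproduced.

# Illustrative sketch: entropy of a normalized SRP map as a spatial activity feature.
import numpy as np

def srp_entropy(srp_map, eps=1e-12):
    p = np.clip(srp_map, 0, None)
    p = p / (p.sum() + eps)                             # normalize to a distribution
    return -np.sum(p * np.log(p + eps))

directions = 72
diffuse = np.ones(directions) + 0.05 * np.random.rand(directions)      # noise-only frame
peaked = np.ones(directions)                                            # single active speaker
peaked[20] = 50.0
print(srp_entropy(diffuse) > srp_entropy(peaked))                       # True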
A direction of arrival (DOA) estimator for concurrent speakers in a reverberant environment is presented. The DOA estimation task is formulated in the short-time Fourier transform (STFT) domain in two stages. In the first stage, a single narrow-band DOA per time-frequency (T-F) bin is selected, since the speech sources are assumed to exhibit disjoint activity in the STFT domain. The narrow-band DOA is obtained as the maximum of the narrow-band steered response power phase transform (SRP-PHAT) localization spectrum at that T-F bin. In addition, for each narrow-band DOA, a quality measure is calculated, which provides the confidence in the estimated decision. In the second stage, the wide-band localization spectrum is calculated using a weighted histogram of the narrow-band DOAs with the quality measures as weights. Finally, the wide-band DOA estimation is obtained by selecting the peaks in the wide-band localization spectrum. The results of our experimental study demonstrate the benefit of the proposed algorithm as compared to the wide-band SRP-PHAT algorithm in a reverberant environment.
Speech enhancement and source separation are well-known challenges in the context of hands-free communication and automatic speech recognition. The multichannel Wiener filter (MCWF) that satisfies the minimum mean square error (MMSE) criterion, is a fundamental speech enhancement tool. However, it can suffer from speech distortion, especially when the noise level is high. The speech distortion weighted multichannel Wiener filter (SDW-MWF) was therefore proposed to control the tradeoff between noise reduction and speech distortion for the single-speaker case. In this paper, we generalize this estimator and propose a method for controlling this tradeoff in the multi-speaker case. The proposed estimator is decomposed into two successive stages: 1) a multi-speaker linearly constrained minimum variance (LCMV), which is solely determined by the spatial characteristics of the speakers; and 2) a multi-speaker Wiener postfilter (PF), which is responsible for reducing the residual noise. The proposed PF consists of several controlling parameters that can almost independently control the tradeoff between the distortion of each speaker and the total noise reduction.
Adaptive beamforming is widely used for speech enhancement in telephony and speech recognition applications. We focus on scenarios with a single desired speaker in non-stationary environmental noise. Many modern beamformers are designed using the desired speaker transfer function (TF), or the respective relative transfer function (RTF). If the relative source position is fixed, tracking the RTF can be avoided. On top of reducing the computational complexity, this may also prevent the beamformer from enhancing competing sources. In this work, to target such applications, we propose a technique for obtaining a spatially robust generalized sidelobe canceler (GSC) beamformer with controlled white noise gain (WNG). The proposed implementation introduces robustness to mismatch between the assumed and actual RTFs while maintaining a sufficiently large WNG. It allows for high flexibility in shaping the desired response, while maintaining low computational complexity.
Multi-microphone, DNN-based, speech enhancement and speaker separation/extraction algorithms have recently gained increasing popularity. The enhancement capabilities of a spatial processor can be very high, provided that all its building blocks are accurately estimated. Data-driven estimation approaches can be very attractive since they do not rely on accurate statistical models, which are usually unavailable. However, training a DNN with multi-microphone data is a challenging task, due to inevitable differences between the train and test phases. In this work, we present an estimation procedure for controlling a linearly-constrained minimum variance (LCMV) beamformer for speaker extraction and noise reduction. We propose an attention-based DNN for speaker diarization that is applicable to the task at hand. In the proposed scheme, each microphone signal propagates through a dedicated DNN and an attention mechanism selects the most informative microphone. This approach has the potential of mitigating the mismatch between training and test phases and can therefore lead to improved speaker extraction performance.
The relative transfer function (RTF) is a generalization of the delay-based array manifold, which is applicable to reverberant environments with multiple reflections. Beamformers that utilize RTF are known to outperform the simpler beamforming techniques that use delay-based steering vectors. Adopting established models of the acoustic transfer functions and utilizing recent contributions which derive the probability distribution of the ratio of independent complex-Gaussian random variables, we derive a probability distribution model for the RTF. The model is verified and compared to the empirical distribution in multiple Monte-Carlo experiments.
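A small Monte Carlo sketch of the modelling assumption only: two acoustic transfer functions at a single frequency are drawn as independent complex Gaussians and the empirical distribution of their ratio (the RTF) is inspected. The paper's closed-form density is not reproduced here, and the means and variances below are arbitrary.

# Monte Carlo sketch: empirical distribution of the ratio of two independent
# complex-Gaussian transfer functions (a stand-in for empirical RTF statistics).
import numpy as np

rng = np.random.default_rng(1)
n = 100000
a = 1.0 + 0.3 * (rng.normal(size=n) + 1j * rng.normal(size=n))    # TF at reference mic
b = 0.8 + 0.3 * (rng.normal(size=n) + 1j * rng.normal(size=n))    # TF at second mic
rtf = b / a                                                        # empirical RTF samples
print(np.mean(rtf), np.percentile(np.abs(rtf), [5, 50, 95]))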
Separation of underdetermined speech mixtures, where the number of speakers is greater than the number of microphones, is a challenging task. Due to the intermittent behaviour of human conversations, typically, the instantaneous number of active speakers does not exceed the number of microphones, namely the mixture is locally (over-)determined. This scenario is addressed in this paper using a dual stage approach: diarization followed by separation. The diarization stage is based on spectral decomposition of the correlation matrix between different time frames. Specifically, the spectral gap reveals the overall number of speakers, and the computed eigenvectors form a simplex of the activity of the speakers across time. In the separation stage, the diarization results are utilized for estimating the mixing acoustic channels, as well as for constructing an unmixing scheme for extracting the individual speakers. The performance is demonstrated in a challenging scenario with six speakers and only four microphones. The proposed method shows perfect recovery of the overall number of speakers, close to perfect diarization accuracy, and high separation capabilities in various reverberation conditions.
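A toy sketch of the speaker-counting step: the correlation matrix between time frames is eigen-decomposed and the spectral gap (the largest ratio between consecutive eigenvalues) indicates the number of speakers; the frame features below are synthetic stand-ins, and the simplex-based diarization and separation stages are not reproduced.

# Illustrative speaker counting via the spectral gap of the frame correlation matrix.
import numpy as np

def count_speakers(frame_features, max_count=10):
    W = frame_features @ frame_features.T                 # frame-to-frame correlation
    eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
    gaps = eigvals[:-1] / (eigvals[1:] + 1e-12)
    return int(np.argmax(gaps[:max_count]) + 1)           # largest gap among leading values

# Toy data: 200 frames, each dominated by one of 3 latent speakers
rng = np.random.default_rng(0)
speakers = rng.normal(size=(3, 64))
activity = rng.integers(0, 3, size=200)
frames = speakers[activity] + 0.05 * rng.normal(size=(200, 64))
print(count_speakers(frames))                              # expected: 3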
Speech signals, captured by a microphone array mounted to a smart loudspeaker device, can be contaminated by ambient noise. In this paper, we present an online multichannel algorithm, based on the recursive EM (REM) procedure, to suppress ambient noise and enhance the speech signal. In the E-step of the proposed algorithm, a multichannel Wiener filter (MCWF) is applied to enhance the speech signal. The MCWF parameters, that is, the power spectral density (PSD) of the anechoic speech, the steering vector, and the PSD matrix of the noise, are estimated in the M-step. The proposed algorithm is specifically suitable for online applications since it uses only past and current observations and requires no iterations. To evaluate the proposed algorithm we used two sets of measurements. In the first set, static scenarios were generated by convolving speech utterances with real room impulse responses (RIRs) recorded in our acoustic lab with reverberation time set to 0.16 s and several signal to directional noise ratio (SDNR) levels. The second set was used to evaluate dynamic scenarios by using real recordings acquired by CEVA “smart and connected” development platform. Two practical use cases were evaluated: 1) estimating the steering vector with a known noise PSD matrix and 2) estimating the noise PSD matrix with a known steering vector. In both use cases, the proposed algorithm outperforms baseline multichannel denoising algorithms.
Application of the linearly constrained minimum variance (LCMV) beamformer (BF) to speaker extraction tasks in real-life scenarios necessitates a sophisticated control mechanism to facilitate the estimation of the noise spatial cross-power spectral density (cPSD) matrix and the relative transfer function (RTF) of all sources of interest. We propose a deep neural network (DNN)-based multichannel concurrent speakers detector (MCCSD) that utilizes all available microphone signals to detect the activity patterns of all speakers. Time frames classified as no active speaker frames will be utilized to estimate the cPSD, while time frames with a single detected speaker will be utilized for estimating the associated RTF. No estimation will take place during concurrent speaker activity. Experimental results show that the multi-channel approach significantly improves its single-channel counterpart.
Speech dereverberation using a single microphone is addressed in this paper. Motivated by the recent success of the fully convolutional networks (FCN) in many image processing applications, we investigate their applicability to enhance the speech signal represented by short-time Fourier transform (STFT) images. We present two variations: a “U-Net” which is an encoder-decoder network with skip connections and a generative adversarial network (GAN) with U-Net as generator, which yields a more intuitive cost function for training. To evaluate our method we used the data from the REVERB challenge, and compared our results to other methods under the same conditions. We have found that our method outperforms the competing methods in most cases.
Estimation of the relative transfer functions (RTFs) vector of a desired speech source is a fundamental problem in the design of data-dependent spatial filters. We present two common estimation methods, namely the covariance-whitening (CW) and the covariance-subtraction (CS) methods. The CW method has been shown in prior work to outperform the CS method. However, thus far its performance has not been analyzed. In this paper, we analyze the performance of the CW and CS methods and show that in the cases of spatially white noise and of uniform powers of desired speech source and coherent interference over all microphones, the CW method is superior. The derivations are validated by comparing them to their empirical counterparts in Monte Carlo experiments. In fact, the CW method outperforms the CS method in all tested scenarios, although there may be rare scenarios for which this is not the case.
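A minimal sketch of the covariance-whitening (CW) estimator for a single frequency bin, assuming the noisy and noise-only covariance matrices are already available: the noisy covariance is whitened by the noise covariance, the principal eigenvector is de-whitened, and the result is normalized to the reference microphone; the toy covariances below are illustrative.

# Minimal covariance-whitening RTF estimator for one frequency bin.
import numpy as np

def cw_rtf(Rx, Rv, ref=0):
    L = np.linalg.cholesky(Rv)                          # Rv = L L^H
    Linv = np.linalg.inv(L)
    Rw = Linv @ Rx @ Linv.conj().T                      # whitened noisy covariance
    eigvals, eigvecs = np.linalg.eigh(Rw)
    h = L @ eigvecs[:, -1]                              # de-whitened principal direction
    return h / h[ref]                                   # normalize to the reference mic

# Toy example: rank-one speech covariance plus noise covariance
rng = np.random.default_rng(0)
M = 5
a = rng.normal(size=M) + 1j * rng.normal(size=M)        # true ATF (up to scale)
Rv = np.eye(M) + 0.2 * np.ones((M, M))
Rx = 3.0 * np.outer(a, a.conj()) + Rv
print(np.allclose(cw_rtf(Rx, Rv), a / a[0]))             # recovers the true RTF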
This paper addresses the localization of an unknown number of acoustic sources in an enclosure. We extend a well-established algorithm for the localization of acoustic sources, which is based on the expectation-maximization (EM) algorithm for clustering phase differences with a Gaussian mixture model. To provide a more appropriate probabilistic model for spherical data, such as directions of arrival or phase differences, the von Mises distribution is used to derive a localization algorithm for multiple simultaneously active sources. Experiments with simulated room impulse responses confirm the superiority of the proposed algorithm over the existing method in terms of localization performance.
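A hedged sketch of a single EM iteration for a mixture of von Mises distributions over scalar angles (e.g., per-bin phase differences) is given below; the concentration update uses a standard closed-form approximation and the structure is illustrative rather than the paper's exact algorithm.

```python
import numpy as np
from scipy.special import i0

def vm_pdf(x, mu, kappa):
    """von Mises density over angles x (radians)."""
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * i0(kappa))

def em_step(x, pi, mu, kappa):
    # E-step: responsibilities of each mixture component for each observation
    r = np.stack([p * vm_pdf(x, m, k) for p, m, k in zip(pi, mu, kappa)])
    r /= r.sum(axis=0, keepdims=True)
    # M-step: weights, circular means and concentrations
    pi = r.mean(axis=1)
    z = r @ np.exp(1j * x)                      # weighted resultant vectors
    mu = np.angle(z)
    Rbar = np.abs(z) / r.sum(axis=1)
    kappa = Rbar * (2.0 - Rbar ** 2) / (1.0 - Rbar ** 2 + 1e-12)  # approximation
    return pi, mu, kappa
```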
This paper addresses the problem of online multiple moving speakers localization in reverberant environments. The direct-path relative transfer function (DP-RTF), as defined by the ratio between the first taps of the convolutive transfer function (CTF) of two microphones, encodes the inter-channel direct-path information and is thus used as a localization feature being robust against reverberation. The CTF estimation is based on the cross-relation method. In this work, the recursive least-square method is proposed to solve the cross-relation problem, due to its relatively low computational cost and its good convergence rate. The DP-RTF feature estimated at each time-frequency bin is assumed to correspond to a single speaker. A complex Gaussian mixture model is used to assign each observed feature to one among several speakers. The recursive expectation-maximization algorithm is adopted to update online the model parameters. The method is evaluated with a new dataset containing multiple moving speakers, where the ground-truth speaker trajectories are recorded with a motion capture system.
In this study we address models with latent variables in the context of neural networks. We analyze a neural network architecture, the mixture of deep experts (MoDE), that models latent variables using the mixture-of-experts paradigm. Learning the parameters of latent variable models is usually done by the expectation-maximization (EM) algorithm. However, it is well known that back-propagation gradient-based algorithms are the preferred strategy for training neural networks. We show that in the case of neural networks with latent variables, the back-propagation algorithm is actually a recursive variant of the EM that is more suitable for training neural networks. To demonstrate the viability of the proposed MoDE network, it is applied to the task of speech presence probability estimation, which is widely applicable to many speech processing problems, e.g. speaker diarization and separation, speech enhancement and noise reduction. Experimental results show the benefits of the proposed architecture over standard fully-connected networks with the same number of parameters.
In this paper, we present a new control mechanism for LCMV beamforming. Application of the LCMV beamformer to speaker separation tasks requires accurate estimates of its building blocks, e.g. the noise spatial cross-power spectral density (cPSD) matrix and the relative transfer function (RTF) of all sources of interest. An accurate classification of the input frames to various speaker activity patterns can facilitate such an estimation procedure. We propose a DNN-based concurrent speakers detector (CSD) to classify the noisy frames. The CSD, trained in a supervised manner using a DNN, classifies noisy frames into three classes: 1) all speakers are inactive – used for estimating the noise spatial cPSD matrix; 2) a single speaker is active – used for estimating the RTF of the active speaker; and 3) more than one speaker is active – discarded for estimation purposes. Finally, using the estimated blocks, the LCMV beamformer is constructed and applied for extracting the desired speaker from a noisy mixture of speakers.
Despite attracting significant research efforts, the problem of source localization in noisy and reverberant environments remains challenging. Novel learning-based methods attempt to solve the problem by modelling the acoustic environment from the observed data. Typically, appropriate feature vectors are defined, and then used for constructing a model, which maps the extracted features to the corresponding source positions. In this paper, we focus on localizing a source using a distributed network with several arrays of unidirectional microphones. We introduce new feature vectors, which utilize the special characteristic of unidirectional microphones, receiving different parts of the reverberated speech. The new features are computed locally for each array, using the power-ratios between its measured signals, and are used to construct a local model, representing the unique view point of each array. The models of the different arrays, conveying distinct and complementing structures, are merged by a Multi-View Gaussian Process (MVGP), mapping the new features to their corresponding source positions. Based on this unifying model, a Bayesian estimator is derived, exploiting the relations conveyed by the covariance terms of the MVGP. The resulting localizer is shown to be robust to noise and reverberation, utilizing a computationally efficient feature extraction.
The multichannel inverse filtering method, i.e. the multiple input/output inverse theorem (MINT), is widely used. However, it is usually performed in the time domain and is based on long room impulse responses; it therefore suffers from high computational complexity and a large number of near-common zeros. In this paper, we propose to perform MINT in the short-time Fourier transform (STFT) domain, in which the time-domain filter is approximated by the convolutive transfer function. The oversampled STFT is used to avoid frequency aliasing, which however leads to a common zero region in the subband frequency response due to the frequency response of the STFT window. A new inverse filtering target function concerning the STFT window is proposed to overcome this problem. In addition, unlike most studies using MINT for single source dereverberation, a multisource MINT is proposed for both source separation and dereverberation.
This paper addresses the problem of blind adaptive beamforming using a hierarchical Bayesian model. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state-space model for the possibly time-varying acoustic channel. Furthermore, we assume a Gamma prior for the ambient noise precision. We present a variational Expectation-Maximization (VEM) algorithm that employs a variant of multi-channel Wiener filter (MCWF) to estimate the sound source and a Kalman smoother to estimate the acoustic channel of the room. It is further shown that the VEM speech estimator can be decomposed into two stages: A multichannel minimum variance distortionless response (MVDR) beamformer and a subsequent single-channel variational postfilter. The proposed algorithm is evaluated in terms of speech quality, for a static scenario with recorded room impulse responses (RIRs). It is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed algorithm outperforms a baseline algorithm. In terms of channel alignment, a superior channel estimate is demonstrated compared to the causal Kalman filter.
Natural conversations are spontaneous exchanges involving two or more people speaking in an intermittent manner. Therefore, one expects such conversations to have intervals in which some of the speakers are silent. Yet, most (multichannel) audio source separation (MASS) methods consider the sound sources to be emitting continuously over the entire duration of the processed mixture. In this paper we propose a probabilistic model for MASS where the sources may have pauses. The activity of the sources is modeled as a hidden state, the diarization state, enabling us to activate/de-activate the sound sources at time frame resolution. We plug the diarization model into the spatial covariance matrix model proposed for MASS in [1], and obtain an improvement in performance over the state of the art when separating mixtures with intermittent speakers.
Deep neural networks (DNNs) have recently become a viable methodology for single microphone speech enhancement. The most common approach is to feed the noisy speech features into a fully-connected DNN to either directly enhance the speech signal or to infer a mask which can be used for the speech enhancement. In this case, one network has to deal with the large variability of the speech signal. Most approaches also disregard the temporal continuity of the speech signal. In this paper, we propose a deep recurrent mixture of experts (DRMoE) architecture that addresses these two issues. In order to reduce the large speech variability, we split the network into a mixture of networks (denoted experts), each of which specializes in a specific and simpler task, together with a gating network. The time-continuity of the speech signal is taken into account by implementing the experts and the gating network as recurrent neural networks (RNNs). An experimental study shows that the proposed algorithm produces higher objective measurement scores than both a single RNN and a deep mixture of experts (DMoE) architecture.
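The following PyTorch sketch illustrates the general structure described above: several recurrent experts and a recurrent gating network whose softmax outputs weight the experts' mask estimates. All sizes, layer choices and names are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DRMoE(nn.Module):
    def __init__(self, n_feats, n_out, n_experts=4, hidden=128):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.GRU(n_feats, hidden, batch_first=True) for _ in range(n_experts)])
        self.expert_out = nn.ModuleList(
            [nn.Linear(hidden, n_out) for _ in range(n_experts)])
        self.gate_rnn = nn.GRU(n_feats, hidden, batch_first=True)
        self.gate_out = nn.Linear(hidden, n_experts)

    def forward(self, x):                       # x: (batch, time, n_feats)
        # Each expert produces a mask-like estimate per frame
        outs = [torch.sigmoid(lin(rnn(x)[0]))
                for rnn, lin in zip(self.experts, self.expert_out)]
        # The gating RNN assigns per-frame weights to the experts
        gates = torch.softmax(self.gate_out(self.gate_rnn(x)[0]), dim=-1)
        return sum(g.unsqueeze(-1) * o
                   for g, o in zip(gates.unbind(dim=-1), outs))
```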
A novel approach to calibrate the geometry of microphones using a single sound event is proposed. A variant of the expectation-maximization algorithm is employed to estimate the spatial coherence matrix of the reverberant sound field directly from the microphone signals. By matching the spatial coherence to theoretical models, the pairwise microphone distances are estimated. From this, the overall geometry is computed. Simulations and lab recordings are used to show that the proposed method outperforms a related approach that assumes a perfectly diffuse sound field.
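As an illustration of the matching step, the sketch below fits the measured inter-microphone coherence to the theoretical diffuse-field model sin(2πfd/c)/(2πfd/c) by a grid search over the distance d. The simple least-squares fit and the variable names are assumptions, not the paper's estimator.

```python
import numpy as np

def estimate_distance(freqs, measured_coh, c=343.0,
                      d_grid=np.arange(0.01, 1.0, 0.005)):
    """freqs: frequencies [Hz]; measured_coh: real part of the estimated
    coherence at those frequencies. Returns the best-fitting distance [m]."""
    # np.sinc(x) = sin(pi x)/(pi x), so np.sinc(2 f d / c) is the diffuse model
    errors = [np.sum((measured_coh - np.sinc(2 * freqs * d / c)) ** 2)
              for d in d_grid]
    return d_grid[int(np.argmin(errors))]
```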
A linearly constrained minimum variance (LCMV) beamformer aims to completely remove interference and optimize the signal-to-noise ratio (SNR). We examine an array geometry consisting of multiple sub-arrays. Our analysis shows that the increased intersensor distance typical of such setups is beneficial for the task of signal separation. Another unique feature of distributed arrays is the necessity of sharing information from different locations, which may pose a burden in terms of power and bandwidth resources. We discuss a scheme with minimalistic transmission requirements involving a preprocessing operation at each sub-array node. Expressions for the penalties due to preprocessing with local parameters are derived and corroborated with computer simulations.
A distortionless speech extraction in a reverberant environment can be achieved by the application of a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this contribution, we consider the RTF identification challenge in a multi-source scenario. We propose a successive RTF identification (SRI) method, based on the sole assumption that the sources become active successively. The proposed algorithm identifies the RTF of the ith speech source assuming that the RTFs of all other sources in the environment and the power spectral density (PSD) matrix of the noise were previously estimated. The proposed RTF identification algorithm is based on the neural network Mix-Max (NN-MM) single microphone speech enhancement algorithm, followed by a least-squares (LS) system identification method. The proposed RTF estimation algorithm is validated by simulation.
The linearly constrained minimum variance (LCMV)-beamformer (BF) is a viable solution for desired source extraction from a mixture of speakers in a noisy environment. The performance in terms of speech distortion, interference cancellation and noise reduction depends on the estimation of a set of parameters. This paper presents a new mechanism to update the parameters of the LCMV-BF. A new speech presence probability (SPP)-based voice activity detector (VAD) controls the noise covariance matrix update, and a speaker position identifier (SPI) procedure controls the relative transfer functions (RTFs) update. A postfilter is then applied to the BF output to further attenuate the residual noise signal. A series of experiments using real-life recordings confirm the speech enhancement capabilities of the proposed algorithm.
A direction of arrival (DOA) estimator for concurrent speakers in a noisy environment with unknown noise power is presented. Spatially colored noise, if not properly addressed, is known to degrade the performance of DOA estimators. In our contribution, the DOA estimation task is formulated as a maximum likelihood (ML) problem, which is solved using the expectation-maximization (EM) procedure. The received microphone signals are modelled as a sum of the speech and noise components. The noise power spectral density (PSD) matrix is modelled by a time-invariant full-rank coherence matrix multiplied by the noise power. The PSDs of the speech and noise components are estimated as part of the EM procedure. The benefit of the presented algorithm in a simulated noisy environment using measured room impulse responses is demonstrated.
Intuitive spoken dialogues are a prerequisite for human-robot interaction. In many practical situations, robots must be able to identify and focus on sources of interest in the presence of interfering speakers. Techniques such as spatial filtering and blind source separation are therefore often used, but rely on accurate knowledge of the source location. In practice, sound emitted in enclosed environments is subject to reverberation and noise. Hence, sound source localization must be robust to both diffuse noise due to late reverberation, as well as spurious detections due to early reflections. For improved robustness against reverberation, this paper proposes a novel approach for sound source tracking that constructively exploits the spatial diversity of a microphone array installed in a moving robot. In previous work, we developed speaker localization methods based on expectation-maximization (EM) and on Bayesian inference. In this paper we propose to combine the EM and Bayesian approaches into one framework for improved robustness against reverberation and noise.
We present a probabilistic model for joint source separation and diarisation of multichannel convolutive speech mixtures. We build upon the framework of local Gaussian model (LGM) with non-negative matrix factorization (NMF). The diarisation is introduced as a temporal labeling of each source in the mix as active or inactive at the short-term frame level. We devise an EM algorithm in which the source separation process is aided by the diarisation state, since the latter indicates the sources actually present in the mixture. The diarisation state is tracked with a Hidden Markov Model (HMM) with emission probabilities calculated from the estimated source signals. The proposed EM has separation performance comparable with a state-of-the-art LGM NMF method, while outperforming a state-of-the-art speaker diarisation pipeline.
Beamforming algorithms in binaural hearing aids are crucial to improve speech understanding in background noise for hearing impaired persons. In this study, we compare and evaluate the performance of two recently proposed minimum variance (MV) beamforming approaches for binaural hearing aids. The binaural linearly constrained MV (BLCMV) beamformer applies linear constraints to maintain the target source and mitigate the interfering sources, taking into account the reverberant nature of sound propagation. The inequality constrained MV (ICMV) beamformer applies inequality constraints to maintain the target source and mitigate the interfering sources, utilizing estimates of the directions of arrival (DOAs) of the target and interfering sources. The similarities and differences between these two approaches are discussed and the performance of both algorithms is evaluated using simulated data and real-world recordings, particularly focusing on the robustness to estimation errors of the relative transfer functions (RTFs) and DOAs. The BLCMV achieves good performance if the RTFs are accurately estimated, while the ICMV shows good robustness to DOA estimation errors.
The problem of source separation, dereverberation and noise reduction using a microphone array is addressed in this paper. The observed speech is modeled by two components, namely the early speech (including the direct path and some early reflections) and the late reverberation. The minimum mean square error (MMSE) estimator of the early speech components of the various speakers is derived, which jointly suppresses the noise and the overall reverberation from all speakers. The overall time-varying level of the reverberation is estimated using two different estimators, an estimator based on a temporal model and an estimator based on a spatial model. The experimental study consists of measured acoustic transfer functions (ATFs) and directional noise with various signal-to-noise ratio levels. The separation, dereverberation and noise reduction performance is examined in terms of perceptual evaluation of speech quality (PESQ) and signal-to-interference plus noise ratio improvement.
Speaker tracking in a reverberant enclosure with an ad hoc network of multiple distributed microphones is addressed in this paper. A set of prerecorded measurements in the enclosure of interest is used to construct a data-driven statistical model. The function mapping the measurement-based features to the corresponding source position represents complex unknown relations, hence it is modelled as a random Gaussian process. The process is defined by a covariance function which encapsulates the relations among the available measurements and the different views presented by the distributed microphones. This model is intertwined with a Kalman filter to capture both the smoothness of the source movement in the time-domain and the smoothness with respect to patterns identified in the set of available prerecorded measurements. Simulation results demonstrate the ability of the proposed method to localize a moving source in reverberant conditions.
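A minimal Gaussian-process regression sketch, assuming an RBF kernel over measurement-based feature vectors and a labelled training set, conveys the flavour of the data-driven mapping; the paper's actual model is a multi-view covariance intertwined with a Kalman filter, which is not reproduced here.

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    """RBF kernel between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_localize(F_train, p_train, F_test, noise=1e-2):
    """F_*: feature vectors (rows); p_train: known source positions (rows).
    Returns the GP posterior mean position for each test feature vector."""
    K = rbf(F_train, F_train) + noise * np.eye(len(F_train))
    Ks = rbf(F_test, F_train)
    return Ks @ np.linalg.solve(K, p_train)
```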
A blind source separation technique for noisy environments is proposed, based on spectral masking and the minimum variance distortionless response (MVDR) beamformer (BF). Formulating the maximum likelihood estimation of the directions of arrival (DOAs) and solving it using the expectation-maximization (EM) procedure enables the extraction of the masks and the associated MVDR BFs as byproducts. The proposed direction of arrival estimator uses an explicit model of the ambient noise, which results in more accurate DOA estimates and good blind source separation. The experimental study demonstrates both the DOA estimation results and the separation capabilities of the proposed method using real room impulse responses in a diffuse noise field.
An important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of the sources, in addition to noise reduction. The binaural multichannel Wiener filter (MWF) preserves the binaural cues of the target but distorts the noise binaural cues. To optimally benefit from binaural unmasking and to preserve the spatial impression for the hearing aid user, two extensions of the binaural MWF have therefore been proposed, namely, the MWF with partial noise estimation (MWF-N) and MWF with interference reduction (MWF-IR). In this paper, the binaural cue preservation of these extensions is analyzed theoretically. Although both extensions are aimed at incorporating the binaural cue preservation of the interferer in the binaural MWF cost function, their properties are different. For the MWF-N, while the binaural cues of the target are preserved, there is a tradeoff between the noise reduction and the preservation of the binaural cues of the interferer component. For the MWF-IR, while the binaural cues of the interferer are preserved, those of the target may be slightly distorted. The theoretical results are validated by simulations using binaural hearing aids, demonstrating the capabilities of these beamformers in a reverberant environment.
Recently, we have presented a semi-supervised approach for sound source localization based on manifold regularization. The idea is to estimate the function that maps each relative transfer function (RTF) to its corresponding position. The estimation is based on an optimization problem which takes into consideration the geometric structure of the RTF samples, which is empirically deduced from prerecorded training measurements. The solution is appropriately constrained to be smooth, meaning that similar RTFs are mapped to close positions. In this paper, we conduct a comprehensive experimental study with real-life recordings to examine the algorithm performance in actual noisy and reverberant conditions. The influence of the amount of training data, as well as of changes in the environmental conditions, is also examined. We show that the algorithm attains accurate localization in such challenging conditions.
The problem of speech enhancement using a distributed microphone array in a dynamic scenario, where the speaker, noise sources and microphone arrays are free to move, is considered. The transfer function generalized sidelobe canceler (TF-GSC) spatial filter [1], which optimizes the minimum variance distortionless response (MVDR) criterion, is used for enhancing the desired speech signal. A novel speech presence probability (SPP) estimator is proposed based on [2]. By using a dual-resolution SPP, the proposed estimator is able to detect noise-dominant frequencies during speech, and thus improve the noise tracking capability. We test the proposed algorithm in real dynamic scenarios, and demonstrate its consistent signal-to-noise ratio (SNR) improvement using a distributed microphone array consisting of 2 devices and 4 microphones.
In this study, we present a new phoneme-based deep neural network (DNN) framework for single microphone speech enhancement. While most speech enhancement algorithms overlook the phoneme structure of the speech signal, our proposed framework comprises a set of phoneme-specific DNNs (pDNNs), one for each phoneme, together with an additional phoneme-classification DNN (cDNN). The cDNN is responsible for determining the posterior probability that a specific phoneme was uttered. Concurrently, each of the pDNNs estimates a phoneme-specific speech presence probability (pSPP). The speech presence probability (SPP) is then calculated as a weighted average of the phoneme-specific pSPPs, with the weights determined by the posterior phoneme probabilities. A soft spectral attenuation, based on the SPP, is then applied to enhance the noisy speech signal. We further propose a compound training procedure, where each pDNN is first pre-trained using the phoneme labeling and the cDNN is trained to classify phonemes. Since these labels are unavailable in the test phase, the entire network is then trained using the noisy utterance, with the cDNN providing the phoneme classification. A series of experiments in different noise types verifies the applicability of the new algorithm to the task of speech enhancement. Moreover, the proposed scheme outperforms other schemes that either do not consider the phoneme structure or use a simpler training methodology.
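The weighted-average rule described above can be written in a few lines; the array names below are illustrative placeholders, not the paper's code.

```python
import numpy as np

def combined_spp(phoneme_post, pspp):
    """phoneme_post: (n_phonemes,) posterior phoneme probabilities from the cDNN;
    pspp: (n_phonemes, n_freq) phoneme-specific SPPs from the pDNNs.
    Returns the (n_freq,) combined speech presence probability."""
    return phoneme_post @ pspp
```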
A novel direction of arrival (DOA) estimator for concurrent speakers in a reverberant environment is presented. Reverberation, if not properly addressed, is known to degrade the performance of DOA estimators. In our contribution, the DOA estimation task is formulated as a maximum likelihood (ML) problem, which is solved using the expectation-maximization (EM) procedure. The received microphone signals are modelled as a sum of anechoic and reverberant components. The reverberant components are modelled by a time-invariant coherence matrix multiplied by a time-varying reverberation power spectral density (PSD). The PSDs of the anechoic speech and reverberant components are estimated as part of the EM procedure. It is shown that the DOA estimates obtained by the proposed algorithm are less affected by reverberation than competing algorithms that ignore the reverberation. An experimental study demonstrates the benefit of the presented algorithm in reverberant environments using measured room impulse responses (RIRs).
A linear array of sensors with small spacing (compared to the wavelength) can be processed with superdirective beamforming. Specifically, when applying minimum variance distortionless response (MVDR) weights designed for a diffuse noise-field, high gains are attainable in theory. A classical result relating to the far-field regime states that the gain with respect to diffuse noise (i.e., the directivity factor) for a source in the endfire direction may approach the number of sensors squared (N^2). However, as the wavelength increases, the beamformer encounters increasingly severe robustness issues. Results pertaining to the near-field regime are less well known. In this paper we analyze MVDR beamforming in a generic dual-microphone array scenario. Our analysis is not restricted to the far-field regime. We derive precise expressions for the directivity factor and the white-noise gain, as well as simplified approximations for the near- and far-field regimes. We show that in the near-field regime the directivity factor approaches infinity as the wavelength increases, and that the white-noise gain depends only on the ratio between the distance from the source and the distance between the sensors. These properties of the beamformer (BF) behave differently than in the far-field regime.
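The quantities analyzed above can be evaluated numerically for the far-field endfire case with a short numpy sketch; the spacing and frequency below are arbitrary example values, and the diffuse-field coherence model is the standard one for omnidirectional sensors.

```python
import numpy as np

c, d, f = 343.0, 0.01, 500.0            # speed of sound [m/s], spacing [m], frequency [Hz]

# Far-field endfire steering vector and diffuse-field coherence matrix
a = np.array([1.0, np.exp(-2j * np.pi * f * d / c)])
coh = np.sinc(2 * f * d / c)            # np.sinc(x) = sin(pi x)/(pi x)
Gamma = np.array([[1.0, coh], [coh, 1.0]])

# MVDR (superdirective) weights for diffuse noise
w = np.linalg.solve(Gamma, a)
w /= a.conj() @ w

DF  = np.abs(w.conj() @ a) ** 2 / np.real(w.conj() @ Gamma @ w)   # directivity factor
WNG = np.abs(w.conj() @ a) ** 2 / np.real(w.conj() @ w)           # white-noise gain
print(f"directivity factor ~ {DF:.2f}, white-noise gain ~ {WNG:.2e}")
```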
The statistical likelihood ratio test is a widely used voice activity detection (VAD) method, in which the likelihood ratio of the current temporal frame is compared with a threshold. Typically, a fixed threshold is used, which is not suitable for varying noise types. In this paper, an adaptive threshold is proposed as a function of the local statistics of the likelihood ratio. This threshold represents the upper bound of the likelihood ratio for the non-speech frames, whereas it generally remains lower than the likelihood ratio for the speech frames. As a result, a high non-speech hit rate can be achieved, while keeping the speech hit rate as high as possible.
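A minimal sketch of one possible adaptive rule, assuming the threshold tracks the mean and spread of the likelihood ratio over frames currently judged as non-speech; this illustrates the idea and is not the paper's exact statistic.

```python
import numpy as np

def adaptive_lr_vad(lr_track, alpha=0.95, beta=3.0):
    """lr_track: per-frame (log-)likelihood ratios. Returns a boolean VAD decision per frame."""
    mean, var, vad = lr_track[0], 1.0, []
    for lr in lr_track:
        thr = mean + beta * np.sqrt(var)        # adaptive upper bound for non-speech LR
        speech = lr > thr
        vad.append(speech)
        if not speech:                          # update local statistics on non-speech frames
            mean = alpha * mean + (1 - alpha) * lr
            var = alpha * var + (1 - alpha) * (lr - mean) ** 2
    return np.array(vad)
```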
The estimation accuracy of the late reverberation power spectral density (PSD) is of paramount importance in single-channel frequency-domain dereverberation algorithms. In this domain, the reverberant signal can be modeled by the convolution of an early speech component and a relative convolutive transfer function (RCTF). In this work, the RCTF coefficients are modeled by a first-order Markov chain, which is well-suited to model time-varying scenarios. The RCTF coefficients are estimated online by a Kalman filter and are then used to compute the late reverberation PSD, which is used in a spectral enhancement filter to achieve dereverberation and noise reduction. It is shown that the proposed reverberation PSD estimator yields similar performance to other estimators, which impose a model on the reverberant tail and which depend on additional information like the reverberation time and the direct-to-reverberation ratio.
Speaker localization algorithms often assume static locations for all sensors. This assumption simplifies the models used, since all acoustic transfer functions are linear time-invariant. In many applications this assumption is not valid. In this paper we address the localization challenge with moving microphone arrays. We propose two algorithms to find the speaker position. The first approach is a batch algorithm based on the maximum likelihood criterion, optimized via expectation-maximization iterations. The second approach is a particle filter for sequential Bayesian estimation. The performance of both approaches is evaluated and compared for simulated reverberant audio data from a microphone array with two sensors.
In acoustic conditions with reverberation and coherent sources, various spatial filtering techniques, such as the linearly constrained minimum variance (LCMV) beamformer, require accurate estimates of the relative transfer functions (RTFs) between the sensors with respect to the desired speech source. However, the time-domain support of these RTFs may affect the estimation accuracy in several ways. First, short RTFs justify the multiplicative transfer function (MTF) assumption when the length of the signal time frames is limited. Second, they require fewer parameters to be estimated, hence reducing the effect of noise and model errors. In this paper, a spherical microphone array based framework for RTF estimation is presented, where the signals are transformed to the spherical harmonics (SH)-domain. The RTF time-domain supports are studied under different acoustic conditions, showing that SH-domain RTFs are shorter compared to conventional space-domain RTFs.
Various dereverberation and noise reduction algorithms require power spectral density estimates of the anechoic speech, reverberation, and noise. In this work, we derive a novel multichannel estimator for the power spectral densities (PSDs) of the reverberation and the speech that is also suitable for noisy environments. The speech and reverberation PSDs are estimated from all the entries of the received signals' power spectral density (PSD) matrix. The Frobenius norm of a general error matrix is minimized to find the best-fitting PSDs. Experimental results show that the proposed estimator provides accurate estimates of the PSDs and outperforms competing estimators. Moreover, when used in a multi-microphone noise reduction and dereverberation algorithm, the estimated reverberation and speech PSDs are shown to provide improved performance measures as compared with the competing estimators.
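One plausible reading of the Frobenius-norm fit, assuming a known steering vector, reverberation coherence matrix and noise PSD matrix, reduces to a small linear least-squares problem; the sketch below is illustrative and not the paper's derivation.

```python
import numpy as np

def fit_psds(R, a, Gamma, Phi_v):
    """R: observed PSD matrix; a: steering vector; Gamma: reverberation coherence
    matrix; Phi_v: noise PSD matrix. Returns (phi_speech, phi_reverb)."""
    B = (R - Phi_v).reshape(-1)
    A = np.stack([np.outer(a, a.conj()).reshape(-1), Gamma.reshape(-1)], axis=1)
    # Least-squares fit of R - Phi_v ~= phi_s * a a^H + phi_r * Gamma (Frobenius norm)
    phi, *_ = np.linalg.lstsq(A, B, rcond=None)
    return np.maximum(phi.real, 0.0)            # clip to nonnegative PSD values
```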
In addition to interference and noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of both the target and the undesired sound sources. For directional sources, this can be achieved by preserving the relative transfer function (RTF). The recently proposed binaural minimum variance distortionless response (BMVDR) beamformer preserves the RTF of the target, but typically distorts the RTF of the interfering sources. Recently, two extensions of the BMVDR beamformer were proposed preserving the RTFs of both the target and the interferer, namely, the binaural linearly constrained minimum variance (BLCMV) and the BMVDR-RTF beamformers. In this paper, we generalize the BMVDR-RTF to trade off interference reduction and noise reduction. Three special cases of the proposed beamformer are examined, either maximizing the signal-to-interference-and-noise ratio (SINR), the signal-to-noise ratio (SNR), or the signal-to-interference ratio (SIR). Experimental validations in an office environment validate our theoretical results.
A multitude of multi-microphone speech enhancement methods is available. In this paper, we focus our attention on the well-known minimum variance distortionless response (MVDR) beamformer, due to its ability to maintain a distortionless response toward the desired speaker while minimizing the output noise power. We explore two alternatives for constructing the steering vectors toward the desired speech source. One uses only the direct path of the speech propagation in the form of delay-only filters, while the other uses the entire room impulse response (RIR). All beamforming methods require some control information to accomplish the task of enhancing a desired speech signal. In this paper, an acoustic event detection method using biologically-inspired features is employed. It can interpret the auditory scene by detecting the presence of different auditory objects. This is employed to control the estimation procedures used by the beamformer. The resulting system provides a blind method of speech enhancement that can improve intelligibility independently of any additional information. Experiments with real recordings show the practical applicability of the method. A significant gain in fwSNRseg is achieved. Compared to using the direct path only, the use of the entire RIR proves beneficial.
An estimate of the power spectral density (PSD) of the late reverberation is often required by dereverberation algorithms. In this work, we derive a novel multichannel maximum likelihood (ML) estimator for the PSD of the reverberation that can be applied in noisy environments. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. As a closed-form solution for the maximum likelihood estimator is unavailable, a Newton method for maximizing the ML criterion is derived. Experimental results show that the proposed estimator provides an accurate estimate of the PSD, and outperforms competing estimators. Moreover, when used in a multi-microphone dereverberation and noise reduction algorithm, the best performance in terms of the log-spectral distance is achieved when employing the proposed PSD estimator.
Recently, an extension of the binaural multichannel Wiener filter (BMWF), referred to as BMWF-IRo, was presented in which an interference rejection constraint was added to the BMWF cost function. Although the BMWF-IRo aims to entirely suppress the interfering source, residual interfering sources (as well as unconstrained noise sources) are undesirably perceived as impinging on the array from the desired source direction. In this paper, we propose two extensions of the BMWF-IRo that address this issue by preserving the spatial impression of the interfering source. In the first extension, the binaural cues of the interfering source are preserved, while those of the desired source may be slightly distorted. In the second extension, the binaural cues of both the desired and interfering sources are preserved. Simulation results show that the noise reduction performance of both proposed extensions is comparable to that of the BMWF-IRo.
In this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application to multichannel audio source separation (MASS). The source signal is modeled with the local Gaussian model (LGM) and we propose to model its variance with an inverse-Gamma distribution, whose scale parameter is factorized as a rank-1 model. We discuss the interest of this approach and evaluate it in a MASS task with underdetermined convolutive mixtures. For this aim, we derive a variational EM algorithm for parameter estimation and source inference. The proposed model shows a benefit in source separation performance compared to a state-of-the-art LGM NMF-based technique.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and an interfering source, e.g., competing speaker, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering source. To this end, in this paper we propose an extension of the binaural MWF, i.e. the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source. Analytical expressions for the performance of the binaural MWF and the MWF-RTF in terms of noise reduction and binaural cue preservation are derived, using which their performance is thoroughly compared. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions, showing that the MWF-RTF yields a better performance than the binaural MWF in terms of the signal-to-interference ratio and binaural cue preservation of the interfering source, while the overall noise reduction performance is slightly degraded.
Sound source localization is addressed by a novel Bayesian approach using a data-driven geometric model. The goal is to recover the target function that associates each acoustic sample, formed from the measured signals, with its corresponding position. The estimation is derived by maximizing the posterior probability of the target function, computed on the basis of acoustic samples from known locations (labelled data) as well as acoustic samples from unknown locations (unlabelled data). To form the posterior probability we use a manifold-based prior, which relies on the geometric structure of the manifold from which the acoustic samples are drawn. The proposed method is shown to be analogous to a recently presented semi-supervised localization approach based on manifold regularization. Simulation results demonstrate the robustness of the method in noisy and reverberant environments.
We report on a recently-recorded database for use in processing of ad hoc microphone constellations. Twenty-four microphones were positioned in various locations at a central table in a large room, and their outputs were recorded while 4 target talkers at the table both read from a list of sentences in a constrained way and also maintained a natural conversation for several minutes. This was done in the quiet and in the presence of 8, 24, and 56 other simultaneous talkers surrounding the central table at various distances. We also recorded without the 4 target talkers active in each of these conditions, and used a loudspeaker to measure impulse responses to the microphones from various positions in the room. We provide details of the recording setup and demonstrate use of this database via an application of linearly constrained minimum variance beam-forming. The database will become available to researchers in the field.
A dereverberation method for a single speaker using binaural hearing aids is proposed. Thanks to binaural cues, listeners are capable of localizing sound sources even in reverberant enclosures. Since the aim of dereverberation algorithms is the reduction of sound reflections, they alter the binaural cues of the reverberant signal. A recently proposed algorithm estimates both the early speech component and the room impulse response (RIR) in an online fashion. In this paper, we develop a binaural extension of this algorithm which enables a tradeoff between the amount of dereverberation and the preservation of the binaural cues of the reverberant signal. The method is tested using a database of binaural RIRs with different reverberation levels and source-listener distances. It is shown that the proposed method enables a tradeoff between improvement in the frequency-weighted signal-to-noise ratio (WSNR) scores and the preservation of the cues.
An estimate of the power spectral density (PSD) of the late reverberation is often required by dereverberation algorithms. In this work, we derive a novel multichannel maximum likelihood (ML) estimator for the PSD of the reverberation that can be applied in noisy environments. The direct path is first blocked by a blocking matrix and the output is considered as the observed data. Then, the ML criterion for estimating the reverberation PSD is stated. As a closed-form solution for the maximum likelihood estimator (MLE) is unavailable, a Newton method for maximizing the ML criterion is derived. Experimental results show that the proposed estimator provides an accurate estimate of the PSD and outperforms competing estimators. Moreover, when used in a multi-microphone noise reduction and dereverberation algorithm, the estimated reverberation PSD is shown to provide improved performance measures as compared with the competing estimators.
This paper addresses the problem of separation of moving sound sources. We propose a probabilistic framework based on the complex Gaussian model combined with non-negative matrix factorization. The properties associated with moving sources are modeled using time-varying mixing filters described by a stochastic temporal process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the mixing filters. The sound sources are separated by means of Wiener filters, built from the estimators provided by the proposed VEM algorithm. Preliminary experiments with simulated data show that, while for static sources we obtain results comparable with the baseline method [1], in the case of moving sources our method outperforms a piece-wise version of the baseline method.
A network of microphone pairs is utilized for the joint task of localizing and separating multiple concurrent speakers. The recently presented incremental distributed expectation-maximization (IDEM) algorithm addresses the first task, namely detection and localization. Here we extend this algorithm to address the second task, namely blind separation of the speech sources. We show that the proposed algorithm, denoted distributed algorithm for localization and separation (DALAS), is capable of separating speakers in a reverberant enclosure without a priori information on their number and locations. In the first stage of the proposed algorithm, the IDEM algorithm is applied to blindly detect the active sources and estimate their locations. In the second stage, the location estimates are utilized for selecting the most useful node of microphones for the subsequent separation stage. Separation is finally obtained by utilizing the hidden variables of the IDEM algorithm to construct masks for each source in the relevant node.
We propose a natural way to generalize relative transfer functions (RTFs) to more than one source. We first prove that such a generalization is not possible using a single multichannel spectro-temporal observation, regardless of the number of microphones. We then introduce a new transform for multichannel multi-frame spectrograms, i.e., containing several channels and time frames in each time-frequency bin. This transform allows a natural generalization which satisfies the three key properties of RTFs, namely, they can be directly estimated from observed signals, they capture spatial properties of the sources and they do not depend on emitted signals. Through simulated experiments, we show how this new method can localize multiple simultaneously active sound sources using short spectro-temporal windows, without relying on source separation.
The relative transfer function (RTF), i.e. the ratio of acoustic transfer functions between two sensors, can be used for sound source localization/beamforming based on a microphone array. The RTF is usually defined with respect to a unique reference sensor. Choosing the reference sensor may be a difficult task, especially in dynamic acoustic environments and setups. In this paper we propose to use a locally normalized RTF, in short local-RTF, as an acoustic feature to characterize the source direction. The local-RTF takes a neighboring sensor as the reference channel for a given sensor. The estimated local-RTF vector can thus avoid the adverse effects of a noisy unique reference and has a smaller estimation error than conventional RTF estimators. We propose two estimators for the local-RTF and concatenate the values across sensors and frequencies to form a high-dimensional vector which is utilized for source localization. Experiments with real-world signals show the interest of this approach.
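A minimal sketch of a neighbor-referenced RTF estimate for one frequency bin is shown below; the simple cross-/auto-PSD ratio is a generic least-squares estimator used for illustration and is not necessarily one of the two estimators proposed in the paper.

```python
import numpy as np

def local_rtf(X):
    """X: complex STFT frames of shape (mics, frames) for one frequency bin.
    Returns a local-RTF value per sensor, each referenced to its neighbor."""
    M = X.shape[0]
    h = np.ones(M, dtype=complex)
    for m in range(1, M):
        ref = X[m - 1]                              # neighboring sensor as reference
        h[m] = (X[m] @ ref.conj()) / (ref @ ref.conj())   # cross-PSD / auto-PSD ratio
    return h
```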
In wireless acoustic sensor networks (WASNs), sampling rate offsets (SROs) between nodes are inevitable, and are recognized as one of the challenges that must be resolved for coherent array processing. A simplified free-space propagation model is considered, with a single desired source impinging on the WASN from the far field and contaminated by diffuse noise. In this paper, we analyze the theoretical performance of a fixed superdirective beamformer (SDBF) in the presence of SROs. The SDBF performance loss due to SROs is manifested as a distortion of the nominal beampattern and an excess noise power at the output of the beamformer. We also propose an iterative algorithm for SRO estimation. The theoretical results are validated by simulation.
The construction of a meaningful metric between acoustic responses which respects the source locations, is addressed. By comparing three alternative distance measures, we verify the existence of the acoustic manifold and give an insight into its nonlinear structure. From such a geometric view point, we demonstrate the limitations of linear approaches to infer physical adjacencies. Instead, we introduce the diffusion framework, which combines local and global processing in order to find an intrinsic nonlinear embedding of the data on a low-dimensional manifold. We present the diffusion distance which is related to the geodesic distance on the manifold. In particular, simulation results demonstrate the ability of the diffusion distance to organize the samples according to the source direction of arrival (DOA).
This paper addresses the problem of relative transfer function (RTF) estimation in the presence of stationary noise. We propose an RTF identification method based on segmental power spectral density (PSD) matrix subtraction. First, the multichannel microphone signals are divided into segments corresponding to speech-plus-noise activity and noise-only periods. Then, the subtraction of the two segmental PSD matrices leads to an almost noise-free PSD matrix by reducing the stationary noise component and preserving the non-stationary speech component. This noise-free PSD matrix is used for single-speaker RTF identification by eigenvalue decomposition. Experiments are performed in the context of sound source localization to evaluate the efficiency of the proposed method.
Microphone array processing utilizes the spatial separation between the desired speaker and the interference signal for speech enhancement. The transfer functions (TFs) relating the speaker component at a reference microphone with all other microphones, denoted the relative TFs (RTFs), play an important role in beamforming design criteria such as the minimum variance distortionless response (MVDR) and the speech distortion weighted multichannel Wiener filter (SDW-MWF). Two common methods for estimating the RTF are surveyed here, namely, the covariance subtraction (CS) and the covariance whitening (CW) methods. We analyze the performance of the CS method theoretically and empirically validate the results of the analysis through extensive simulations. Furthermore, an empirical comparison of the methods' performance in various scenarios clearly shows that the CW method outperforms the CS method.
Speech signals are often contaminated by both room reverberation and ambient noise. In this contribution, we propose a nested generalized sidelobe canceller (GSC) beamforming structure, comprising inner and outer GSC beamformers (BFs), which decouples the speech dereverberation and noise reduction operations. The BFs are implemented in the short-time Fourier transform (STFT) domain. Two alternative reverberation models are adopted. In the first, used in the inner GSC, reverberation is assumed to comprise a coherent early component and a late reverberant component. In the second, used in the outer GSC, the influence of the entire acoustic transfer function (ATF) is modeled as a convolution along the frame index in each frequency. Unlike other BF designs for this problem that must be updated in each time frame, the proposed BF is time-invariant in static scenarios. Experiments with both simulated and recorded environments verify the effectiveness of the proposed structure.
In this paper we consider an acoustic scenario with a desired source and a directional interference picked up by hearing devices in a noisy and reverberant environment. We present an extension of the binaural multichannel Wiener filter (BMWF), obtained by adding an interference rejection constraint to its cost function, in order to combine the advantages of spatial and spectral filtering while mitigating directional interferences. We prove that this algorithm can be decomposed into the binaural linearly constrained minimum variance (BLCMV) algorithm followed by a single-channel Wiener post-filter. The proposed algorithm yields improved interference rejection capabilities compared with the BMWF. Moreover, by utilizing the spectral information on the sources, it demonstrates better SNR measures compared with the BLCMV.
An efficient implementation of a three-dimensional audio rendering system (3D-ARS) over headphones is presented and its ability to render natural spatial sound is analyzed. In its most straightforward implementation, spatial rendering is achieved by convolving a monophonic signal with the head-related transfer function (HRTF). Several methods have been proposed in the literature to improve the naturalness of the spatial sound and the ability of the headphones’ wearer to localize sound sources. Among these methods, externalization by incorporating room reflections, personalization to the anthropometric attributes of the user, and the introduction of head movements are known to yield improved performance. This work provides a unified and flexible platform incorporating the various optional components, together with software tools to statistically analyze their contribution. A preliminary statistical analysis suggests that the additional components indeed contribute to the overall localization ability of the user.
The sampling rate offset (SRO) phenomenon in wireless acoustic sensor networks (WASNs) is considered in this work. The use of different clock sources in each node results in a drift between the nodes’ signals. The aim of this work is to estimate these SROs and to re-synchronize the network, enabling coherent multi-microphone processing. First, the link between the SRO and the Doppler effect is derived. Then, a wideband correlation processor for SRO estimation, which is equivalent to the continuous wavelet transform (CWT), is proposed. Finally, node synchronization is achieved by re-sampling the signals at each node. An experimental study using an actual WASN demonstrates the ability of the proposed algorithm to re-synchronize the network and to regain the performance loss due to SRO.
The widely linear model has recently been used in signal processing applications due to its ability to achieve better performance than conventional linear filtering for non-circular complex random variables (CRVs) and improper quaternion random variables (QRVs). In this paper, we study the time-domain widely linear quaternion-model-based minimum variance distortionless response beamformer (WL-QMVDR) for a single acoustic vector sensor (AVS) and analyze its performance through the use of beampatterns. We verify by simulation that the estimated output of the WL-QMVDR is identical to that of the conventional linear-model-based MVDR beamformer when applied to an AVS in the non-reverberant and ideal sensor response scenario.
In this paper we describe a new multichannel room impulse response database. The impulse responses are measured in a room with a configurable reverberation level, resulting in three different acoustic scenarios with reverberation times (RT60) of 160 ms, 360 ms and 610 ms. The measurements were carried out in recording sessions of several source positions on a spatial grid (angle range of -90° to 90° in 15° steps, with 1 m and 2 m distance from the microphone array). The signals in all sessions were captured by three microphone array configurations. The database is accompanied by software utilities to easily access and manipulate the data. Besides the description of the database, we demonstrate its use in a spatial source separation task.
An algorithm for multichannel speech dereverberation is proposed that simultaneously estimates the clean signal, the linear prediction (LP) parameters of speech, and the acoustic parameters of the room. The received signals are processed in short segments to reduce the algorithm latency, and several expectation-maximization (EM) iterations are carried out on each segment to improve the signal estimation. In the expectation step, the fixed-lag Kalman smoother (FLKS) is applied to extract the clean signal from the data utilizing the estimated parameters. In the maximization step, the LP and room parameters are updated using the output of the FLKS. Experimental results show that multiple EM iterations and the application of the LP model improve the quality of the output signal.
Ad hoc wireless acoustic sensor networks (WASNs) hold great potential for improved performance in speech processing applications, thanks to better coverage and higher diversity of the received signals. We consider a multiple-speaker scenario where each of the WASN nodes, an autonomous system comprising sensing, processing and communication capabilities, is positioned in the near-field of one of the speakers. Each node aims at extracting its nearest speaker while suppressing the other speakers and noise. The ad hoc network is characterized by an arbitrary number of speakers/nodes with an uncontrolled microphone constellation. In this paper we propose a distributed algorithm which shares information between nodes. The algorithm requires each node to transmit a single audio channel in addition to a soft time-frequency (TF) activity mask for its nearest speaker. The TF activity masks are computed as a combination of estimates of a model-based speech presence probability (SPP), the direct-to-reverberant ratio (DRR) and the direction of arrival (DOA) per TF bin. The proposed algorithm, although sub-optimal compared to the centralized solution, is superior to the single-node solution.
Besides noise reduction an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of both desired and undesired sound sources. Recently, the binaural Linearly Constrained Minimum Variance (BLCMV) beamformer has been proposed that aims to preserve the desired speech component and suppress the undesired directional interference component while preserving the binaural cues of both components. Since the performance of the BLCMV beamformer highly depends on the amount of interference rejection determined by the interference rejection parameter, in this paper we propose several performance criteria to optimize the interference rejection parameters for the left and the right hearing aid. Experimental results show how the performance of the BLCMV beamformer is affected by the different optimal parameter combinations.
The challenge of localizing a number of concurrent acoustic sources in reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and develop a distributed expectation-maximization (DEM) procedure, based on the incremental EM (IEM) framework. The algorithm enables localization of the speakers without a central point. Unlike direction search, localization is distributed in nature, since the sensors must be spatially deployed. Taking advantage of the distributed constellation of the sensors, we propose a distributed algorithm that enables multiple processing nodes and considers the communication constraints between them. The proposed DEM has surprising advantages over conventional expectation-maximization (EM) schemes. Firstly, it is less sensitive to initial conditions. Secondly, it converges much faster than the conventional EM. The proposed algorithm is tested by an extensive simulation study.
In signal enhancement applications, a reference signal which provides information about interferences and noise is desired. It can be obtained via a multichannel filter that performs a spatial null in the target position, a so-called target-cancelation filter. The filter must adapt to the target position, which is difficult when noise is active. When the target location is confined to a small area, a solution could be based on preparing a bank of target-cancelation filters for potential positions of the target. In this paper, we propose two methods to learn such banks from noise-free recordings. We show by experiments that learned banks have practical advantages compared to banks that were prepared manually by collecting filters for selected positions.
A randomly positioned microphone array is considered in this work. In many applications, the locations of the array elements are known only up to a certain degree of random mismatch. We derive a novel statistical model for performance analysis of the multi-channel Wiener filter (MWF) beamformer under random mismatch in the sensor locations. We consider the scenario of one desired source and one interfering source arriving from the far field and impinging on a linear array. A theoretical model for predicting the MWF mean squared error (MSE) for a given variation in the sensor locations is developed and verified by simulations. It is postulated that the probability density function (p.d.f.) of the MSE of the MWF follows a Gamma distribution. This claim is verified empirically by simulations.
Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques, and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge, and provides a detailed description of the evaluation framework and benchmark results.
Speaker localization is one of the most prevalent problems in speech processing. Despite significant efforts in the last decades, high reverberation levels still limit the performance of localization algorithms. Furthermore, using conventional localization methods, the information that can be extracted from dual-microphone measurements is restricted to the time difference of arrival (TDOA). Under the far-field regime, this is equivalent to estimating either the azimuth or the elevation angle. A full description of the speaker's coordinates necessitates several microphones. In this contribution we tackle these two limitations by taking a manifold learning perspective on system identification. We present a training-based algorithm, motivated by the concept of diffusion maps, that aims at recovering the fundamental controlling parameters driving the measurements. This approach turns out to be more robust to reverberation, and capable of recovering the speech source location using merely two microphone signals.
Speech extraction in a reverberant enclosure using a linearly-constrained minimum variance (LCMV) beamformer usually requires reliable estimates of the relative transfer functions (RTFs) of the desired source to all microphones. In this contribution, a geometrically constrained (GC)-TRINICON concept for RTF estimation is proposed. This approach is applicable in challenging multiple-speaker scenarios and in underdetermined situations, where the number of simultaneously active sources exceeds the number of available microphones. As a most practically relevant and distinctive feature, this concept does not require any voice-activity-based control mechanism. It only requires coarse reference information on the target direction of arrival (DoA). The proposed GC-TRINICON method is compared to a recently proposed subspace method for RTF estimation relying on voice-activity control. Experimental results confirm the effectiveness of GC-TRINICON in realistic conditions.
This report proposes a novel variant of the generalized sidelobe canceler. It assumes that a set of prepared relative transfer functions (RTFs) is available for several potential positions of a target source within a confined area. The key problem here is to select the correct RTF at any time, even when the exact position of the target is unknown and interfering sources are present. We propose to select the RTF based on the ℓp-norm, p ≤ 1, measured at the blocking matrix output in the frequency domain. Subsequent experiments show that this approach significantly outperforms previously proposed methods for selection when the target and interferer signals are speech signals.
In this paper, we present a method for transient interference suppression. The main idea is to learn the intrinsic geometric structure of the transients instead of relying on estimates of noise statistics. The transient interference structure is captured via a parametrization of a graph constructed from the measurements. This parametrization is viewed as an empirical model for transients and is used for building a filter that extracts transients from noisy speech. We present a model-based supervised algorithm, in which the graph-based empirical model is constructed in advance from training recordings, and then extended to new incoming measurements. This paper extends previous studies and presents a new Bayesian approach for empirical model extension that takes into account both the structure of the transients and the dynamics of the speech signals.
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown. In this paper, a multi-microphone algorithm that simultaneously estimates the acoustic system and the clean signal is proposed. An expectation-maximization (EM) scheme is employed to iteratively obtain the maximum likelihood (ML) estimates of the acoustic parameters. In the expectation step, the Kalman smoother is applied to extract the clean signal from the data utilizing the estimated parameters. In the maximization step, the parameters are updated according to the output of the Kalman smoother. Experimental results show significant dereverberation capabilities of the proposed algorithm with only low speech distortion.
Identification of a relative transfer function (RTF) between two microphones is an important component of multichannel hands-free communication systems in reverberant and noisy environments. In this paper, we present an RTF identification method on manifolds for supervised generalized sidelobe canceler beamformers. We propose to learn the manifold of typical RTFs in a specific room using a novel extendable kernel method, which relies on common manifold learning approaches. Then, we exploit the extendable learned model and propose a supervised identification method that relies on both the a priori learned geometric structure and the measured signals. Experimental results show significant improvements over a competing method that relies merely on the measurements, especially in noisy conditions.
The optimal weights for a beamformer that provide maximum directivity are often found to be severely lacking in terms of robustness. Although an ideal implementation of the beamformer with these weights provides high directivity, minor perturbations of the weights or of the sensor placement cause severe degradation. Therefore, a robustness constraint is often imposed during the beamformer's design stage. The classical method of diagonal loading is commonly used for this purpose. There are known results in this field which pertain to an array consisting of sensors with identical directivity patterns and orientations. We extend these results to account for sensors with nonidentical directivity patterns, and for sensors which share placement errors. We show that in such cases, modification of the classical loading scheme to incorporate nonidentical diagonal elements and off-diagonal elements is beneficial.
The scenario of P speakers received by an M microphone array in a reverberant enclosure is considered. We extend the single source speech distortion weighted multichannel Wiener filter (SDW-MWF) to deal with multiple speakers. The mean squared error (MSE) is extended by introducing P weights, each controlling the distortion of one of the sources. The P weights enable further control in the design of the beamformer (BF). Two special cases of the proposed BF are the SDW-MWF and the linearly constrained minimum variance (LCMV)-BF. We provide a theoretical analysis for the performance of the proposed BF. Finally, we exemplify the ability of the proposed method to control the tradeoff between noise reduction (NR) and distortion levels of various speakers in an experimental study.
In this contribution, two different disciplines for designing microphone array beamformers are explored. On the one hand, a fixed beamformer based on numerical near-field optimization is employed. On the other hand, an adaptive beamformer algorithm based on the linearly constrained minimum variance (LCMV) method is applied. For the evaluation, an audio database of microphone array impulse responses and audio recordings (speech and noise) was created. Different acoustic scenarios were constructed, consisting of various audio sources (desired speaker, interfering speaker and directional noise) distributed around the microphone array at different angles and distances. The algorithms were compared based on both objective measures (signal-to-noise ratio, signal-to-interference ratio and speech distortion) and subjective tests (assessment of sonograms and informal listening tests).
In many cases hearing impaired persons suffer from hearing loss in both ears, necessitating two hearing apparatuses. In such cases, the applied speech enhancement algorithms should be capable of preserving the so-called binaural cues. In this paper, a binaural extension of the linearly constrained minimum variance (LCMV) beamformer is proposed. The proposed algorithm, denoted binaural linearly constrained minimum variance (BLCMV) beamformer, is capable of extracting desired speakers while suppressing interfering speakers. The BLCMV maintains the binaural cues of both the desired and the interference sources in the constrained space. The ability to preserve the binaural cues makes the BLCMV beamformer particularly suitable for hearing aid applications. It is further proposed to obtain a reduced-complexity implementation by sharing common blocks in both sides of the hearing aid device. The performance of the proposed method, in terms of imposed distortion, interference cancellation and cue preservation, is verified by an extensive experimental study using signals recorded by a dummy head in an actual room.
A speech enhancement algorithm in a noisy and reverberant enclosure for a wireless acoustic sensor network (WASN) is derived. The proposed algorithm is structured as a two-stage beamformer (BF) scheme, where the outputs of the first stage are transmitted in the network. Designing the second-stage BF requires estimating the desired signal components in the transmitted signals. The contribution here is twofold. First, in spatially static scenarios, the first-stage BFs are designed to maintain a fixed response towards the desired signal, as opposed to competing algorithms, in which the response changes and must be repeatedly estimated. Second, the proposed algorithm is implemented in a generalized sidelobe canceler (GSC) form, separating the treatment of the desired speech and the interferences and enabling a simple time-recursive implementation of the algorithm. A comprehensive experimental study demonstrates the equivalent performance of the centralized GSC and of the proposed algorithm for both narrowband and speech signals.
Modern high performance speech processing applications incorporate large microphone arrays. Complicated scenarios comprising multiple sources motivate the use of the linearly constrained minimum variance (LCMV) beamformer (BF) and specifically its efficient generalized sidelobe canceler (GSC) implementation. The complexity of applying the GSC is dominated by the blocking matrix (BM). A common approach for constructing the BM is to use a projection matrix to the null-subspace of the constraints. The latter BM is denoted as the eigen-space BM, and requires M² complex multiplications, where M is the number of microphones. In the current contribution, a novel systematic scheme for constructing a multiple constraints sparse BM is presented. The sparsity of the proposed BM substantially reduces the complexity to K × (M − K) complex multiplications, where K is the number of constraints. A theoretical analysis of the signal leakage and of the blocking ability of the proposed sparse BM and of the eigen-space BM is derived. It is proven analytically, and tested for narrowband signals and for speech signals, that the blocking abilities of the sparse and of the eigen-space BMs are equivalent.
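For reference, the sketch below shows the eigen-space BM discussed above, i.e. the projection onto the null-subspace of the K constraint vectors, whose application costs on the order of M² complex multiplications per frequency bin; the sparse construction that reduces this to K × (M − K) multiplications is the paper's contribution and is not reproduced here.

```python
# Minimal sketch of the eigen-space blocking matrix (the baseline discussed above).
import numpy as np

def eigenspace_blocking_matrix(C):
    """C: M x K matrix of constraint (steering/RTF) vectors."""
    M = C.shape[0]
    # Projection matrix onto the subspace orthogonal to the columns of C.
    return np.eye(M) - C @ np.linalg.pinv(C)

rng = np.random.default_rng(1)
M, K = 8, 2
C = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
B = eigenspace_blocking_matrix(C)
x = C[:, 0]                                  # snapshot containing only a constrained source
print(np.linalg.norm(B @ x))                 # ~0: the constrained source is blocked
```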
Speech quality might significantly deteriorate in the presence of interference. Multi-microphone measurements can be utilized to enhance speech quality and intelligibility only if room acoustics is taken into consideration. The vital role of the acoustic transfer function (ATF) between the sources and the microphones is demonstrated in two important cases: the minimum variance distortionless response (MVDR) and the linearly constrained minimum variance (LCMV) beamformers. The LCMV deals with the more general case of multiple desired speakers. It is argued that the MVDR beamformer exhibits a tradeoff between the amount of speech dereverberation and noise reduction. The level of noise reduction, sacrificed when complete dereverberation is required, is shown to depend on the direct-to-reverberation ratio. When the reverberation level is tolerable, practical beamformers can be designed by substituting the ATFs with their corresponding relative transfer functions (RTFs). As no dereverberation is performed by these beamformers, a higher level of noise reduction can be achieved. In comparison with the ATFs, the RTFs exhibit shorter impulse responses. Moreover, since non-blind procedures can be adopted, accurate RTF estimates might be obtained. Three such RTF estimation methods are discussed. Finally, a comprehensive experimental study in real acoustical environments demonstrates the benefits of using the proposed beamformers.
Beamforming methods for speech enhancement in wireless acoustic sensor networks (WASNs) have recently attracted the attention of the research community. One of the major obstacles in implementing speech processing algorithms in a WASN is the sampling rate offsets between the nodes. As nodes utilize individual clock sources, sampling rate offsets are inevitable and may cause severe performance degradation. In this paper, a blind procedure for estimating the sampling rate offsets is derived. The procedure is applicable to speech-absent time segments with slowly time-varying interference statistics. The proposed procedure is based on the phase drift of the coherence between two signals sampled at different sampling rates. Resampling the signals with the Lagrange polynomial interpolation method compensates for the sampling rate offsets. An extensive experimental study, utilizing the transfer function generalized sidelobe canceller (TFGSC), exemplifies the problem and its solution.
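As an illustration of the compensation step, the following sketch resamples a signal by a factor close to one using third-order Lagrange interpolation; the offset value is assumed known here, whereas in the paper it is estimated blindly from the coherence phase drift.

```python
# Minimal sketch of sampling-rate-offset compensation by 3rd-order Lagrange
# interpolation (the offset of 100 ppm below is an assumed, known value).
import numpy as np

def lagrange_resample(x, ratio, order=3):
    """Resample x by a factor `ratio` (close to 1) using Lagrange interpolation."""
    half, y, m = order // 2, [], 0
    while True:
        t = m / ratio                         # fractional read position in the input
        n0 = max(int(np.floor(t)) - half, 0)  # first sample of the interpolation support
        if n0 + order >= len(x):
            break
        taps = np.arange(n0, n0 + order + 1)
        w = np.ones(order + 1)
        for i, ni in enumerate(taps):         # Lagrange basis weights at position t
            for j, nj in enumerate(taps):
                if i != j:
                    w[i] *= (t - nj) / (ni - nj)
        y.append(np.dot(w, x[taps]))
        m += 1
    return np.array(y)

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # 1 s test tone
y = lagrange_resample(x, 1 + 100e-6)               # compensate a +100 ppm offset
print(len(x), len(y))
```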
Recently, we introduced a method to recover the controlling parameters of linear systems using diffusion kernels. In this paper, we apply our approach to the problem of source localization in a reverberant room using measurements from a single microphone. Prior recordings of signals from various known locations in the room are required for training and calibration. The proposed algorithm relies on a computation of a diffusion kernel with a specially-tailored distance measure. Experimental results in a real reverberant environment demonstrate accurate recovery of the source location.
A vector-sensor consisting of a monopole sensor collocated with orthogonally oriented dipole sensors can be used for direction-of-arrival (DOA) estimation. A method is proposed to estimate the DOA based on the direction of maximum power. Algorithms mentioned in earlier works are shown to be special cases of the proposed method. An iterative algorithm based on the principle of gradient ascent is presented for the solution of the maximum power problem. The proposed maximum-power method is shown to approach the Cramér-Rao lower bound (CRLB) with a suitable choice of parameter.
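A hypothetical illustration of the maximum-power principle follows: for a 2D vector sensor under a simple plane-wave model, the power of a (pressure plus steered-velocity) output is maximized over the azimuth by gradient ascent; the signal model, normalization and step size are assumptions and this is not the paper's exact estimator.

```python
# Hypothetical sketch of a maximum-power azimuth search for a 2D acoustic vector
# sensor (illustrative signal model; not the estimator derived in the paper).
import numpy as np

rng = np.random.default_rng(2)
N, theta_true = 16000, np.deg2rad(55.0)
p = rng.standard_normal(N)                                     # pressure (white test signal)
vx = p * np.cos(theta_true) + 0.05 * rng.standard_normal(N)    # particle velocity, x
vy = p * np.sin(theta_true) + 0.05 * rng.standard_normal(N)    # particle velocity, y

theta, mu = 0.0, 0.1                                           # initial azimuth, step size
for _ in range(200):
    y = p + np.cos(theta) * vx + np.sin(theta) * vy            # steered output
    grad = 2 * np.mean(y * (-np.sin(theta) * vx + np.cos(theta) * vy))
    theta += mu * grad                                         # gradient ascent on the power

print(np.rad2deg(theta))                                       # close to 55 degrees
```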
Recently we have presented a novel approach for transient noise reduction that relies on non-local (NL) filtering. In this paper, we modify and extend our approach to support clustering and suppression of a few transient noise types simultaneously, by introducing two novel concepts. We observe that voiced speech spectral components are slowly varying compared to transient noise. Thus, by applying an algorithm for noise power spectral density (PSD) estimation, configured to track faster variations than pseudo-stationary noise, the PSD of speech components may be estimated. In addition, we utilize diffusion maps to embed the measurements into a new do main. We obtain a new representation which enables clustering of different transient noise types. The new representation is incorporated into a NL filter as a better affinity metric for averaging over transient instances. Experimental results show that the proposed algorithm enables clustering and suppression of multiple transient interferences.
A randomly distributed microphone array is considered in this work. In many applications exact design of the array is impractical. The performance of these arrays, characterized by a large number of microphones deployed in vast areas, cannot be analyzed by traditional deterministic methods. We therefore derive a novel statistical model for performance analysis of the MWF beamformer. We consider the scenario of one desired source and one interfering source arriving from the far-field and impinging on a uniformly distributed linear array. A theoretical model for the MMSE is developed and verified by simulations. The applicability of the proposed statistical model for speech signals is discussed.
A sensitivity analysis of two distortionless beamformers is presented in this paper. Specifically, two well-known variants, namely the minimum power distortionless response (MPDR) and minimum variance distortionless response (MVDR) beamformers, are considered. In our scenario, which is typical of many modern communication systems, waves emitted by multiple point sources are received by an antenna array. An analytical expression for the signal to interference and noise ratio (SINR) improvement obtained by both beamformers under steering errors is derived. These expressions are experimentally evaluated and compared with the robust Capon beamformer (RCB), a robust variant of the MPDR beamformer. We show that the MVDR beamformer, which uses the noise correlation matrix in its minimization criterion, is more robust to steering errors than its counterparts, which use the received signal correlation matrix. Furthermore, even if the noise correlation matrix is erroneously estimated due to steering errors in the interference direction, the MVDR advantage is still maintained for a reasonable range of steering errors. These conclusions conform with Cox's findings. Only the line-of-sight propagation regime is considered in the current contribution. Ongoing research extends this work to fading channels.
An acoustic vector sensor provides measurements of both the pressure and particle velocity of the sound field in which it is placed. These measurements are vectorial in nature and can be used for the purpose of source localization. A straightforward approach towards determining the direction of arrival (DOA) utilizes the acoustic intensity vector, which is the product of pressure and particle velocity. The accuracy of an intensity-vector-based DOA estimator in the presence of sensor noise or reverberation has been analyzed previously for the case of a white source signal. In this paper, the effects of reverberation upon the accuracy of such a DOA estimator in the presence of a colored source signal are examined. The analysis is done with the aid of an extension to Polack's statistical room impulse response model which accounts for particle velocity as well as acoustic pressure. It is shown that signal coloration brings about a degradation in performance.
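For context, a minimal sketch of the intensity-vector DOA estimator analyzed above is given below, using broadband time-domain averaging; the synthetic particle velocity is defined relative to the source azimuth, so the arctangent of the averaged intensity components returns the DOA directly (sign conventions may differ in practice).

```python
# Minimal sketch of intensity-based DOA estimation from vector-sensor data.
import numpy as np

def intensity_doa(p, vx, vy):
    """Azimuth estimate (radians) from pressure p and particle velocity (vx, vy)."""
    ix = np.mean(p * vx)            # x-component of the time-averaged intensity
    iy = np.mean(p * vy)            # y-component of the time-averaged intensity
    return np.arctan2(iy, ix)

rng = np.random.default_rng(3)
theta = np.deg2rad(30.0)
p = rng.standard_normal(48000)
vx = p * np.cos(theta) + 0.1 * rng.standard_normal(48000)
vy = p * np.sin(theta) + 0.1 * rng.standard_normal(48000)
print(np.rad2deg(intensity_doa(p, vx, vy)))   # close to 30 degrees
```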
In this contribution a novel reduced-bandwidth iterative binaural MVDR beamformer is proposed. The proposed method reduces the bandwidth requirement between hearing aids to a single channel, regardless of the number of microphones. The algorithm is proven to converge to the optimal binaural MVDR in the case of a rank-1 desired source correlation matrix. Comprehensive simulations of narrow-band and speech signals demonstrate the convergence and the optimality of the algorithm.
We present a single-microphone speech enhancement algorithm that models the log-spectrum of the noise-free speech signal by a multidimensional Gaussian mixture. The proposed estimator is based on an earlier study which uses the single-dimensional mixture-maximum (MIXMAX) model for the speech signal. The experimental study shows that there is only a marginal difference between the proposed extension and the original algorithm in terms of both objective and subjective performance measures.
One of the problems with blind system identification in subbands is that the subband systems can only be identified correctly up to an arbitrary scale factor. This scale factor ambiguity is the same across all channels but can differ between the subbands and therefore, limits the usability of such estimates. In this contribution, a method that uses multiple filterbanks is proposed that utilizes overlapping passband regions between these filterbanks to find scalar correction factors that make the scale factor ambiguity uniform across all subbands. Simulation results are provided, showing that the proposed method accurately identifies and corrects for these scale factors at the cost of an increased computational burden.
In this paper we introduce a novel algorithm for extracting desired speech signals uttered by moving speakers contaminated by competing speakers and stationary noise in a reverberant environment. The proposed beamformer uses eigenvectors spanning the desired and interference signals subspaces. It relaxes the common requirement on the activity patterns of the various sources. A novel mechanism for tracking the desired and interferences subspaces is proposed, based on the projection approximation subspace tracking (deflation) (PASTd) procedure and on a union of subspaces procedure. This contribution extends previously proposed methods to deal with multiple speakers in dynamic scenarios.
Recently, we have presented a transient noise reduction algorithm for speech signals that relies on non-local diffusion filtering. By exploiting the repetitive nature of transient noises we proposed a simple and efficient algorithm, which enabled suppression of various noise types. In this paper, we incorporate a modified diffusion operator in order to obtain a more robust algorithm and further enhancement of the speech. We demonstrate the performance of the modified algorithm and compare it with a competing solution. We show that the proposed algorithm enables improved suppression of various transient interferences without any further computational burden.
In theory the linearly constrained minimum variance (LCMV) beamformer can achieve perfect dereverberation and noise cancellation when the acoustic transfer functions (ATFs) between all sources (including interferences) and the microphones are known. However, blind estimation of the ATFs remains a difficult task. In this paper the noise reduction of the LCMV beamformer is analyzed and compared with the noise reduction of the minimum variance distortionless response (MVDR) beamformer. In addition, it is shown that the constraint of the LCMV can be modified such that we only require relative transfer functions rather than ATFs to achieve perfect cancellation of coherent interferences. Finally, we evaluate the noise reduction performance achieved by the LCMV and MVDR beamformers for two coherent sources: one desired and one undesired.
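As a reference for the comparison above, the following sketch computes the MVDR and LCMV weights for a single frequency bin, assuming that the steering vectors/RTFs and the noise covariance are given; the blind estimation of the ATFs, which the text identifies as the difficult part, is not shown.

```python
# Minimal per-frequency-bin sketch of the MVDR and LCMV weight computations.
import numpy as np

def mvdr_weights(Rn, d):
    """MVDR: minimize noise power subject to w^H d = 1."""
    Rinv_d = np.linalg.solve(Rn, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def lcmv_weights(Rn, C, g):
    """LCMV: minimize noise power subject to C^H w = g (g selects desired/undesired)."""
    Rinv_C = np.linalg.solve(Rn, C)
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, g)

rng = np.random.default_rng(4)
M = 6
Rn = 0.1 * np.eye(M)                                        # noise covariance (assumed white)
d = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # desired-source RTF
i = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # interference RTF
w_mvdr = mvdr_weights(Rn, d)
w_lcmv = lcmv_weights(Rn, np.stack([d, i], axis=1), np.array([1.0, 0.0]))
print(abs(w_mvdr.conj() @ d))                               # ~1 (distortionless)
print(abs(w_lcmv.conj() @ d), abs(w_lcmv.conj() @ i))       # ~1 and ~0 (interference nulled)
```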
Recently, we have presented a transfer-function generalized sidelobe canceler (TF-GSC) beamformer in the short time Fourier transform domain, which relies on a convolutive transfer function approximation of relative transfer functions between distinct sensors. In this paper, we combine a delay-and-sum beamformer with the TF-GSC structure in order to suppress the speech signal reflections captured at the sensors in reverberant environments. We demonstrate the performance of the proposed beamformer and compare it with the TF-GSC. We show that the proposed algorithm enables suppression of reverberations and further noise reduction compared with the TF-GSC beamformer.
The minimum variance distortionless response (MVDR) beamformer can be used for both speech dereverberation and noise reduction. In this paper we analyse the tradeoff between the amount of speech dereverberation and noise reduction achieved by the MVDR beamformer. We show that the amount of noise reduction that is sacrificed when desiring both speech dereverberation and noise reduction depends on the direct-to-reverberation ratio of the acoustic transfer function between the desired source and a reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction.
In speech communication systems the received microphone signals are degraded by room reverberation and ambient noise. This signal degradation can decrease the fidelity and intelligibility of the desired speaker. Reverberant speech can be separated into two components, viz. an early speech component and a late reverberant speech component. Reverberation suppression algorithms, that are feasible in practice, have been developed to suppress late reverberant speech or in other words to estimate the early speech component. The main challenge is to develop an estimator for the so-called late reverberant spectral variance (LRSV). In this contribution a generalized statistical reverberation model is proposed that can be used to estimate the LRSV. Novel and existing estimators can be derived from this model. One novel estimator is a so-called backward estimator that uses an estimate of the early speech component to obtain an estimate of the LRSV. Advantages and possible disadvantages of the estimators are discussed, and experimental results using simulated reverberant speech are presented.
In speech communication systems the received microphone signals are often degraded by competing speakers, noise signals and room reverberation. Microphone arrays are commonly utilized to enhance the desired speech signal. In this paper two important design criteria, namely the minimum variance distortionless response (MVDR) and the linearly constrained minimum variance (LCMV) beamformers, are explored. These structures differ in their treatment of the interference sources. Experimental results using a simulated reverberant enclosure are used for comparing the two strategies. It is shown that the LCMV beamformer outperforms the MVDR beamformer provided that the acoustic environment is time-invariant.
Recently, a relative transfer function (RTF) identification method based on the convolutive transfer function (CTF) approximation was developed. This method is adapted to speech sources in reverberant environments and exploits the non-stationarity and presence probability of the speech signal. In this paper, we present experimental results that demonstrate the advantages and robustness of the proposed method. Specifically, we show the robustness of this method to the environment and to a variety of recorded noise signals.
A family of approaches for multi-microphone speech dereverberation in colored noise environments, which uses the eigen-decomposition of the data correlation matrix, is studied in this paper. A recently proposed method shows that the room impulse responses (RIRs), relating the speech source and the microphones, are embedded in the null subspace of the received signals. In cases where the channel order is overestimated, a closed-form algorithm for extracting the RIR is proposed. A variant, in which the subspace method is incorporated into a subband framework, is given as well. In the last stage of the proposed method, the desired signal is reconstructed, using the estimated RIRs, by applying either the matched filter beamformer (MBF) or the multi-channel inverse filter theorem (MINT) algorithms. The emphasis of the current work is a comprehensive experimental study of the eigen-decomposition based dereverberation methods and the required channel inversion algorithms. This study supports the potential of the presented method, and provides insight into its limitations.
Acoustic echo arises due to acoustic coupling between the loudspeaker and the microphone of a communication device. Acoustic echo cancellation and suppression techniques are used to reduce the acoustic echo. In this work we propose to first cancel the early echo, which is related to the early part of the echo path, and subsequently suppress the late echo, which is related to the later part of the echo path. The identification of the echo path is carried out in the Short-Time Fourier Transform (STFT) domain, where a trade-off is facilitated between distortion of the near-end speech, residual echo, convergence rate, and robustness to echo path changes. Experimental results demonstrate that the system achieves high echo and noise reduction while maintaining low distortion of the near-end speech. In addition, it is shown that the proposed system is more robust to echo path changes compared to an acoustic canceller alone.
In this paper, we develop a dual-microphone speech dereverberation algorithm for noisy environments, which is aimed at suppressing late reverberation and background noise. The spectral variance of the late reverberation is obtained with adaptively-estimated direct path compensation. A Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) model is used to estimate the spectral variance of the desired signal, which includes the direct sound and early reverberation. Experimental results demonstrate the advantage of the proposed algorithm compared to a decision-directed-based algorithm.
In this paper we derive performance bounds for tracking a time-varying OFDM multiple-input multiple-output (MIMO) communication channel in the presence of additive white Gaussian noise (AWGN). We discuss two channel tracking schemes. The first tracks the filter coefficients directly in the time domain, while the second separately tracks each tone in the frequency domain. The Kalman filter, with known channel statistics, is utilized for evaluating the performance bounds. It is shown that the time-domain tracking scheme, which exploits the sparseness of the channel impulse response, outperforms the computationally more efficient frequency-domain tracking scheme, which does not exploit the smooth frequency response of the channel.
In this paper, we consider a multiuser detection scheme for space division multiple access communication systems. Sequential interference cancellation (SIC) procedures are subject to performance degradation when the antenna array is only partially calibrated. We propose to incorporate robust beamforming algorithms into the SIC procedure to compensate for the array misalignment. We show by a simulation study that the proposed combination outperforms conventional SIC procedures for various degrees of array misalignment, different SNR values, several array configurations, and two modulation constellations (namely, QPSK and 16-QAM).
Speech signals recorded with a distant microphone usually contain reverberation, which degrades the fidelity and intelligibility of speech, and the recognition performance of automatic speech recognition systems. In this paper we propose a speech dereverberation system which uses two microphones. A generalized sidelobe canceller (GSC) type of structure is used to enhance the desired speech signal. The GSC structure is used to create two signals. The first signal is the output of a standard delay and sum beamformer, and the second signal is a reference signal which is constructed such that the direct speech signal is blocked. We propose to utilize the reverberation which is present in the reference signal to enhance the output of the delay and sum beamformer. The power envelope of the reference signal and the power envelope of the output of the delay and sum beamformer are used to estimate the residual reverberation in the output of the delay and sum beamformer. The output of the delay and sum beamformer is then enhanced using a spectral enhancement technique. The proposed method only requires an estimate of the direction of arrival of the desired speech source. Experiments using simulated room impulse responses are presented and show significant reverberation reduction while keeping the speech distortion low.
A bidirectional multiple-input multiple-output (MIMO) time varying channel is considered. The projection approximation subspace tracking (PAST) algorithm is used on both terminals in order to track the singular value decomposition of the channel matrix. Simulations using an autoregressive channel model and also a sampled MIMO indoor channel are performed, and the expected capacity degradation due to the estimation error is evaluated.
In this paper we present an algorithm for robust speech enhancement based on an Optimal Modified Minimum Mean-Square Error Log-Spectral Amplitude (OM-LSA) estimator for multiple interferences. In the original OM-LSA one interference was taken into account. However, there are many situations where multiple interferences are present. Since the human ear is more sensitive to a small amount of residual non-stationary interference than to a stationary interference we would like to reduce the non-stationary interference signal down to the residual noise level of the stationary interference. Possible applications for the proposed algorithm are joint speech dereverberation and noise reduction, and joint residual echo suppression and noise reduction. Additionally, we present two possible methods to estimate the a priori Signal to Noise Ratio of each of the interferences.
In this contribution a digital signal processing educational lab, established at the School of Electrical and Computers Engineering at Bar-Ilan University, Israel, is presented. A unique educational approach is adopted. In this approach sophisticated algorithms can be implemented in an intuitive top-level design using Simulink©. Simultaneously, our approach gives the students the opportunity to conduct hands-on experiments with real signals and hardware, using Texas Instruments (TI) C6713 evaluation boards. By taking this combined approach, we tried to focus the efforts of the students on the DSP problems themselves rather than on the actual programming. A comprehensive ensemble of experiments, which expose the students to a wide spectrum of DSP concepts, is introduced in this paper. The experiments were designed to enable the illustration and demonstration of theoretical aspects, already acquired in several DSP courses in the curriculum.
Speech signals recorded with a distant microphone usually contain reverberation and noise, which degrade the fidelity and intelligibility of speech, and the recognition performance of automatic speech recognition systems. In E. Habets (2005), a multi-microphone speech dereverberation algorithm was presented to suppress late reverberation in a noise-free environment. In this paper we show how an estimate of the late reverberant energy can be obtained from noisy observations. A more sophisticated speech enhancement technique based on the optimally-modified log-spectral amplitude (OM-LSA) estimator is used to suppress the undesired late reverberant signal and noise. The speech presence probability used in the OM-LSA is extended to improve the decision between speech, late reverberation and noise. Experiments using simulated and real acoustic impulse responses are presented and show significant reverberation reduction with little speech distortion.
A novel approach for sub-band based multi-microphone speech dereverberation is presented. In a recent contribution, a method utilizing the null subspace of the spatial-temporal correlation matrix of the received signals (obtained by the generalized eigenvalue decomposition (GEVD) procedure) was presented. The desired acoustic transfer functions (ATFs) are shown to be embedded in these generalized eigenvectors. The special Sylvester structure of the filtering matrix related to this subspace was exploited for deriving a total least squares (TLS) estimate of the ATFs. The high sensitivity of the GEVD procedure to noise, especially when the involved ATFs are very long, and the wide dynamic range of the speech signal make the proposed method problematic in realistic scenarios. In this contribution we suggest incorporating the TLS subspace method into a sub-band structure. The novel method proves to be efficient, although some new problems arise and others remain open. A preliminary experimental study supports the potential of the proposed method.
Determining the spatial position of a speaker finds growing interest in video conferencing scenarios, where automated camera steering and tracking are required. Speaker localization can be achieved with a dual-step approach. In the preliminary stage, a microphone array is used to extract the time difference of arrival (TDOA) of the speech signal. These readings are then used by the second stage for the actual localization. Since the speaker trajectory must be smooth, estimates of nearby speaker positions might be used to improve the current position estimate. However, many methods, although exploiting the spatial information obtained by different microphone pairs, do not exploit this temporal information. In this contribution we present two localization schemes which exploit the temporal information. The first is the well-known extended Kalman filter (EKF). The second is a recursive form of a Gauss method, which we denote Recursive Gauss (RG). An experimental study supports the potential of the proposed methods.
In a series of recent studies, a new approach for applying the Kalman filter to nonlinear systems, referred to as the unscented Kalman filter (UKF), was proposed. In this contribution we apply the UKF to several speech processing problems in which a model with unknown parameters is assumed for the measured signals. We show that the nonlinearity arises naturally in these problems. Preliminary simulation results for artificial signals manifest the potential of the method.
Determining the spatial position of a speaker finds growing interest in video conferencing scenarios, where automated camera steering and tracking are required. As a preliminary step for the localization, a microphone array can be used to extract the time difference of arrival (TDOA) of the speech signal. The direction of arrival of the speech signal is then determined by the relative time delay between each spatially separated microphone pair. In this work we present novel, frequency-domain approaches for TDOA calculation in a reverberant and noisy environment. Our methods are based on the speech quasi-stationarity property, and on the fact that the speech and the noise are uncorrelated. The proposed methods are supported by an extensive experimental study.
Adaptive beamforming techniques are inefficient for eliminating transient noise components that randomly arrive from unpredictable directions. In this paper, we present a real-time transfer function generalized sidelobe canceller (TF-GSC) for such nonstationary noise environments. Hypothesis testing in the spectral domain indicates either absence of transients, presence of an interfering transient, or presence of desired source components. The noise canceller branch of the TF-GSC is updated only during absence of transients, while the identification of the acoustical transfer function is carried out only when desired source components are present. Following the beamforming and the hypothesis testing, estimates for the signal presence probability, the noise power spectral density, and the desired speech log-spectral amplitude are derived. Experimental results demonstrate the usefulness of the proposed approach under nonstationary noise conditions.
In speech enhancement applications, microphone array postfiltering allows additional reduction of noise components at the beamformer output. Among microphone array structures, the recently proposed general transfer function generalized sidelobe canceller (TF-GSC) has shown impressive noise reduction abilities in a directional noise field, while still maintaining low speech distortion. However, in a diffuse noise field, less significant noise reduction is obtainable. The performance is even further degraded when the noise signal is nonstationary. In this contribution we propose three postfiltering methods for improving the performance of microphone arrays. Two of them are based on single-channel speech enhancers, making use of recently proposed algorithms concatenated to the beamformer output. The third is a multichannel speech enhancer which exploits noise-only components constructed within the TF-GSC structure. This work concentrates on the assessment of the proposed postfiltering structures. An extensive experimental study, which consists of both objective and subjective evaluation in various noise fields, demonstrates the advantage of the multichannel postfiltering compared to the single-channel techniques.
The problem of speaker localization is addressed in this work. We present a novel approach for estimating the time difference of arrival (TDOA) of the speech signal to a microphone array, in a reverberant and noisy environment. By estimating acoustical transfer function (ATF) ratios, the TDOA is extracted from a relatively short impulse response. Our approach shows superior performance, compared with the traditional generalized cross correlation (GCC) method.
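For comparison, a minimal sketch of the traditional GCC baseline mentioned above (in its PHAT-weighted form) is given below; the FFT length, sign convention and search range are assumptions.

```python
# Minimal sketch of GCC-PHAT TDOA estimation between two microphone signals.
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    n = 1
    while n < len(x1) + len(x2):
        n *= 2                                       # zero-padded FFT length
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                   # PHAT weighting (whitening)
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else int(max_tau * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # TDOA in seconds

fs = 16000
rng = np.random.default_rng(5)
s = rng.standard_normal(fs)
delay = 12                                            # samples (0.75 ms)
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))    # delayed copy at the second mic
print(gcc_phat_tdoa(x1, x2, fs) * 1000, "ms")         # about -0.75 ms under this sign convention
```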
In recent work we considered the use of a microphone array located in a reverberated room where general acoustic transfer functions (ATFs) relate the source signal and the microphones for enhancing a speech signal contaminated by interference. The resulting frequency-domain algorithm enables dealing with a complicated ATF in the same simple manner as Griffiths & Jim GSC algorithm deals with delay-only arrays. In this contribution a general expression of the enhancer output is derived. This expression is used for evaluating two figures of merit, i.e., noise reduction ability and the amount of distortion imposed. The performance is shown to be dependent on the ATFs involved, the noise field and the quality of estimation of the ATF ratios. Analytical performance evaluation of the method is obtained. It is shown that the proposed method maintains its good performance even in the general ATF case.
Speech quality and intelligibility might significantly deteriorate in the presence of background noise, especially when the speech signal is subject to subsequent processing. In this paper we present a class of Kalman-filter based speech enhancement algorithms with some extensions, modifications, and improvements. The first algorithm employs the estimate-maximize (EM) method to iteratively estimate the spectral parameters of the speech and noise signals. The enhanced speech signal is obtained as a by-product of the parameter estimation algorithm. The second algorithm is a sequential, computationally efficient, gradient descent algorithm. We discuss various topics concerning the practical implementation of these algorithms. An experimental study, using real speech and noise signals, is provided to compare these algorithms with alternative speech enhancement algorithms, and to compare the performance of the iterative and sequential algorithms.
We present a method for acoustic source localization in reverberant environments based on semi-supervised machine learning (ML) with deep generative models. Source localization in the presence of reverberation remains a major challenge, which recent ML techniques have shown promise in addressing. Despite often large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. In semi-supervised learning, ML systems are trained using many examples with only few labels, with the goal of exploiting the natural structure of the data. We use variational autoencoders (VAEs), which are generative neural networks (NNs) that rely on explicit probabilistic representations, to model the latent distribution of reverberant acoustic data. VAEs consist of an encoder NN, which maps complex input distributions to simpler parametric distributions (e.g., Gaussian), and a decoder NN which approximates the training examples. The VAE is trained to generate the phase of relative transfer functions (RTFs) between two microphones in reverberant environments, in parallel with a DOA classifier, on both labeled and unlabeled RTF samples. The performance of this VAE-based approach is compared with conventional and ML-based localization in simulated and real-world scenarios.
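A hypothetical PyTorch sketch of the semi-supervised VAE idea is given below: an encoder maps RTF-phase vectors to a Gaussian latent space, a decoder reconstructs them, and a small DOA classifier is trained in parallel on the labeled subset; the layer sizes, the classifier input and the loss weighting are assumptions rather than the paper's exact architecture.

```python
# Hypothetical sketch of a semi-supervised VAE with a parallel DOA classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RTFVAE(nn.Module):
    def __init__(self, n_freq=256, latent=16, n_doa=37):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent)
        self.logvar = nn.Linear(128, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, n_freq))
        self.cls = nn.Linear(latent, n_doa)           # DOA classifier on the latent code

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar, self.cls(mu)

def loss_fn(x, x_hat, mu, logvar, logits=None, doa=None, beta=1.0, gamma=1.0):
    recon = F.mse_loss(x_hat, x, reduction='mean')                # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp()) # KL regularizer
    loss = recon + beta * kl
    if doa is not None:                                           # supervised term (labeled data)
        loss = loss + gamma * F.cross_entropy(logits, doa)
    return loss

model = RTFVAE()
x_unlab = torch.randn(32, 256)                         # unlabeled RTF-phase batch (dummy data)
x_lab, doa = torch.randn(8, 256), torch.randint(0, 37, (8,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for x, y in [(x_unlab, None), (x_lab, doa)]:
    x_hat, mu, logvar, logits = model(x)
    loss = loss_fn(x, x_hat, mu, logvar, logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
```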
Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained automatic speech recognition (ASR) model serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, is readily perceptible while listening, and is empirically reflected in the substantial reduction of the word error rate (WER) metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer's superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page: https://lipvoicer.github.io
Deep direction of arrival (DOA) models commonly require a perfect match between the array configurations in the training and test stages and consequently cannot be applied to unfamiliar microphone array constellations. In this paper, we present a deep DOA estimation method that circumvents this requirement. In our approach, we first cast the DOA estimation as a classification problem in each time-frequency (TF) bin, thus facilitating the localization of multiple concurrent speakers. We utilize a high-resolution spatial image, based on a narrow-band variant of the steered response power phase transform (SRP-PHAT) processor, as an input feature. The model is trained with simulated data using a single microphone array configuration in various acoustic conditions. In the test stage, the algorithm is applied with unfamiliar microphone array constellations, namely with a different number of microphones and different inter-microphone distances. An elaborate experimental study with real-life room impulse response (RIR) recordings demonstrates the effectiveness of the proposed input feature and the training scheme. Our approach achieves comparable results in familiar microphone array constellations and, more importantly, can accurately estimate the DOA of multiple concurrent speakers even with unfamiliar microphone arrays.
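To illustrate the input feature, the following hypothetical sketch computes a narrow-band SRP-PHAT spatial image per TF bin for an arbitrary planar array; the array layout, grid resolution and alignment convention are illustrative assumptions.

```python
# Hypothetical sketch of a per-TF-bin, narrow-band SRP-PHAT spatial image.
import numpy as np

def srp_phat_map(X, mic_pos, freqs, cand_doas, c=343.0):
    """
    X: (M, T, F) STFT of the microphone signals.
    mic_pos: (M, 2) microphone coordinates [m]; cand_doas: candidate azimuths [rad].
    Returns a (T, F, D) image: per-bin steered response power with PHAT weighting.
    """
    Xn = X / (np.abs(X) + 1e-12)                              # PHAT: keep phase only
    u = np.stack([np.cos(cand_doas), np.sin(cand_doas)])      # (2, D) far-field unit vectors
    tau = -(mic_pos @ u) / c                                  # (M, D) relative delays
    phase = np.exp(2j * np.pi * freqs[None, :, None] * tau[:, None, :])   # (M, F, D) steering
    # align the whitened bins, sum over microphones and take the power
    return np.abs(np.einsum('mtf,mfd->tfd', Xn, phase)) ** 2

# toy usage with random STFT data for a 4-microphone linear array
rng = np.random.default_rng(6)
M, T, F = 4, 10, 257
X = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))
mic_pos = np.array([[0.0, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])
freqs = np.linspace(0, 8000, F)
img = srp_phat_map(X, mic_pos, freqs, np.deg2rad(np.arange(0, 181, 5)))
print(img.shape)                                              # (10, 257, 37)
```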
In this work, we present a two-stage method for speaker extraction under reverberant and noisy conditions. Given a reference signal of the desired speaker, the clean, but still reverberant, desired speaker signal is first extracted from the noisy mixed signal. In the second stage, the extracted signal is further enhanced by joint dereverberation and residual noise and interference reduction. The proposed architecture comprises two sub-networks, one for the extraction task and the second for the dereverberation task. We present a training strategy for this architecture and show that the performance of the proposed method is on par with other state-of-the-art (SOTA) methods when applied to the WHAMR! dataset. Furthermore, we present a new dataset with more realistic adverse acoustic conditions and show that our method outperforms the competing methods when applied to this dataset as well.
We present a study of a neural network-based method for speech emotion recognition that uses audio-only features. In the studied scheme, the acoustic features are extracted from the audio utterances and fed to a neural network that consists of convolutional neural network (CNN) layers, a bidirectional long short-term memory (BLSTM) layer combined with an attention mechanism, and a fully-connected layer. To illustrate and analyze the classification capabilities of the network, we used the t-distributed stochastic neighbor embedding (t-SNE) method. We evaluate our model using the Ryerson audio-visual dataset of emotional speech and song (RAVDESS) and the interactive emotional dyadic motion capture (IEMOCAP) dataset, achieving a weighted accuracy (WA) of 80% and 66%, respectively.
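A hypothetical PyTorch sketch of such an audio-only classifier is shown below, with CNN layers over the time-feature map, a bidirectional LSTM, a simple attention pooling layer and a fully-connected output; the dimensions, number of classes and attention form are assumptions rather than the paper's exact configuration.

```python
# Hypothetical CNN + BLSTM + attention emotion classifier (illustrative dimensions).
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    def __init__(self, n_feats=40, n_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        self.blstm = nn.LSTM(32 * (n_feats // 4), 64, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(128, 1)                    # scalar attention score per frame
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                                # x: (batch, time, n_feats)
        h = self.cnn(x.unsqueeze(1))                     # (batch, C, time, n_feats/4)
        h = h.permute(0, 2, 1, 3).flatten(2)             # (batch, time, C * n_feats/4)
        h, _ = self.blstm(h)                             # (batch, time, 128)
        a = torch.softmax(self.attn(h), dim=1)           # attention weights over time
        ctx = (a * h).sum(dim=1)                         # attention-weighted pooling
        return self.fc(ctx)

logits = EmotionNet()(torch.randn(4, 300, 40))           # 4 utterances, 300 frames each
print(logits.shape)                                      # torch.Size([4, 8])
```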
The interpretation and explanation of decision-making processes of neural networks are becoming a key factor in the deep learning field. Although several approaches have been presented for classification problems, their application to regression models needs to be further investigated. In this manuscript we propose a Grad-CAM-inspired approach for the visual explanation of neural network architectures for regression problems. We apply this methodology to a recent physics-informed approach for Nearfield Acoustic Holography, called the Kirchhoff-Helmholtz-based Convolutional Neural Network (KHCNN) architecture. We focus on the interpretation of KHCNN using vibrating rectangular plates with different boundary conditions and violin top plates with complex shapes. Results highlight the more informative regions of the input that the network exploits to correctly predict the desired output. The devised approach has been validated in terms of the normalized cross-correlation (NCC) and the normalized mean square error (NMSE), using the original input and the filtered one coming from the algorithm.
In the literature, sound source localization under far-field and near-field scenarios is mostly addressed as independent tasks using different approaches. This necessitates the tedious task of detecting the type of sound field, whereas in practice there may not be a clear boundary between the far- and near-field sound fields. In contrast, this paper proposes a multi-channel feature, denoted generalized relative harmonic coefficients (generalized RHC), in the spherical harmonics domain, which can localize both far- and near-field sound sources equally well, without requiring any adjustments. We derive the analytical expression of this feature and summarize its unique properties, which facilitate two single-source direction-of-arrival estimators: (i) one using a full grid search over the directional space; and (ii) a closed-form solution without any grid search. An experimental study in realistic noisy and reverberant environments, under both near-field and far-field conditions, validates the efficacy of the proposed algorithm.
Acoustic reflections are known to limit the capabilities of traditional beamformers (BFs), which are based on a zero-order steering vector, to extract a desired source and to suppress interference signals from noisy measurements. To alleviate these performance limitations, echo-aware BFs, which take into account the acoustic reflections of the source and interfering signals, were introduced more than two decades ago. In this paper, we propose a systematic methodology to analyze the performance of these BFs, highlighting the importance of the acoustic reflections in the BF design. Under this methodology, we redefine beampatterns to consider the entire reflection pattern, while the directions of arrival (DOAs) of the sources are merely used as an indication of the positions of the sources that impinge on the array from a circle around the microphone array. We further define measures of the quality of the BFs, namely the beampattern shape, the width of the main beam, the directivity, the null depth, and the signal-to-interference ratio (SIR) improvement. Using this methodology, we are able to clearly demonstrate the advantages of echo-aware BFs over traditional BFs that only consider the direct arrival of the sources in their design.
The objective of binaural multi-microphone speech enhancement algorithms can be viewed as a multi-criteria design problem, as there are several requirements to be met. When applying distortionless beamforming, it is necessary to suppress interfering sources and ambient background noise, and to extract an undistorted replica of the target source. In the binaural versions, it is also important to preserve the binaural cues of the target and the interference sources. In this paper, we propose a unified Pareto optimization framework for binaural distortionless beamformers, which is achieved by defining a multi-objective problem (MOP) to control the amount of interference suppression and noise reduction simultaneously. The derivation is given for the multi-interference case by introducing separate mean squared error (MSE) cost functions for each of the respective interference sources and the background noise. A Pareto optimal set of solutions is provided for any set of parameters. The performance of the proposed method in a noisy and reverberant environment is presented, demonstrating the impact of the trade-off parameters using real-signal recordings.
In this paper we present a unified time-frequency method for speaker extraction in clean and noisy conditions. Given a mixed signal, along with a reference signal, the common approaches for extracting the desired speaker are applied either in the time domain or in the frequency domain. In our approach, we propose a Siamese-Unet architecture that uses both representations. The Siamese encoders are applied in the frequency domain to infer the embeddings of the noisy and reference spectra, respectively. The concatenated representations are then fed into the decoder to estimate the real and imaginary components of the desired speaker, which are then inverse-transformed to the time domain. The model is trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) loss to exploit the time-domain information. The time-domain loss is also regularized with a frequency-domain loss to preserve the speech patterns. Experimental results demonstrate that the unified approach is not only very easy to train, but also provides superior results as compared with state-of-the-art (SOTA) Blind Source Separation (BSS) methods, as well as with commonly used speaker extraction approaches.
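For reference, a minimal sketch of the SI-SDR training loss mentioned above is given below, computed on time-domain signals and negated so that minimizing it maximizes the SI-SDR.

```python
# Minimal sketch of a negated SI-SDR training loss on time-domain signals.
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """est, ref: (batch, samples) time-domain signals."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    # optimal scaling of the reference (scale invariance)
    alpha = (est * ref).sum(dim=-1, keepdim=True) / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    si_sdr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
print(si_sdr_loss(est, ref))
```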
Accurate direction-of-arrival (DOA) estimation in noisy and reverberant environments is a long-standing challenge in the field of acoustic signal processing. One of the promising research directions utilizes the decomposition of the multi-microphone measurements into the spherical harmonics (SH) domain. This paper presents an evaluation and comparison of learning-based single-source DOA estimation using two recently introduced SH domain features denoted relative harmonic coefficients (RHC) and relative modal coherence (RMC), respectively. Both features were shown to be independent of the time-varying source signal even in reverberant environments, thus facilitating training with synthesized, continuously active, noise signal rather than with speech signal. The inspected features are fed into a convolutional neural network, trained as a DOA classifier. Extensive validations confirm that the RHC-based method outperforms the RMC-based method, especially under unfavorable scenarios with severe noise and reverberation.
The relative harmonic coefficients (RHC), recently introduced as a multi-microphone spatial feature, demonstrate promising performance when applied to direction-of-arrival (DOA) estimation. All existing RHC-based DOA estimators suffer from a resolution limitation due to the inherent grid-based search. In contrast, this paper utilizes the first-order RHC to propose a closed-form DOA estimator by deriving a direction vector which points towards the desired source direction. Two objective metrics, namely localization accuracy and algorithm complexity, are adopted for the evaluation and comparison with existing RHC-based and intensity-based localization approaches, in both simulated and real-life environments.
In this paper, we propose a deep neural network (DNN)-based single-microphone speech enhancement algorithm characterized by a short latency and low computational resources. Many speech enhancement algorithms suffer from low noise reduction capabilities between pitch harmonics, and in severe cases, the harmonic structure may even be lost. Recognizing this drawback, we propose a new weighted loss that emphasizes pitch-dominated frequency bands. For that, we propose a method, applied only at the training stage, to detect these frequency bands. The proposed method is applied to speech signals contaminated by several noise types, and in particular, typical domestic noise drawn from the ESC-50 and DEMAND databases, demonstrating its applicability to 'stay-at-home' scenarios.
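A hypothetical sketch of such a pitch-weighted spectral loss follows: STFT bins near the harmonics of a pitch contour (assumed given here) receive a larger weight in the magnitude MSE; the paper's actual band-detection procedure and weighting are not reproduced, this only illustrates the mechanism.

```python
# Hypothetical pitch-weighted magnitude-spectrum loss (illustrative weighting only).
import torch

def pitch_weighted_loss(est_mag, clean_mag, f0, fs=16000, n_fft=512, w_harm=4.0, bw_hz=50.0):
    """
    est_mag, clean_mag: (batch, frames, n_fft//2 + 1) STFT magnitudes.
    f0: (batch, frames) pitch contour in Hz (0 for unvoiced frames).
    """
    n_bins = est_mag.shape[-1]
    bin_hz = torch.arange(n_bins) * fs / n_fft                  # center frequency per bin
    weights = torch.ones_like(est_mag)
    max_harm = int(fs / 2 / 60.0)                               # assume pitch is at least 60 Hz
    for k in range(1, max_harm + 1):
        harm = k * f0.unsqueeze(-1)                             # (batch, frames, 1)
        near = (harm > 0) & ((bin_hz - harm).abs() < bw_hz)     # bins near the k-th harmonic
        weights = torch.where(near, torch.full_like(weights, w_harm), weights)
    return (weights * (est_mag - clean_mag).pow(2)).mean()

est = torch.rand(2, 100, 257)
clean = torch.rand(2, 100, 257)
f0 = 100.0 + 50.0 * torch.rand(2, 100)                          # dummy pitch contour
print(pitch_weighted_loss(est, clean, f0))
```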