Objective: Affective flexibility, the capacity to respond to life's varying environmental demands in a dynamic and adaptive manner, is considered a central aspect of psychological health in many psychotherapeutic approaches. The present study examined whether affective two-dimensional (i.e., arousal and valence) temporal variability extracted from voice and facial expressions would be associated with positive changes over the course of psychotherapy, at the session, client, and treatment levels.
Method: A total of 22,741 mean vocal arousal and facial expression valence observations were extracted from 137 therapy sessions in a sample of 30 clients treated for major depressive disorder by nine therapists. Before and after each session, the clients self-reported their level of well-being on the Outcome Rating Scale. Session-level affective temporal variability was assessed as the mean square of successive differences (MSSD) between consecutive two-dimensional affective measures. Results: Session outcome was positively associated with temporal variability at the session level (i.e., within clients, between sessions) and at the client level (i.e., between clients). Importantly, these associations held when controlling for average session- and client-level valence scores. In addition, the expansion of temporal variability throughout treatment was associated with steeper positive session outcome trajectories over the course of treatment.
Conclusions: The continuous assessment of both vocal and facial affective expressions and the ability to extract measures of affective temporal variability from within-session data may enable therapists to better respond to and modulate clients' affective flexibility; however, further research is necessary to determine whether there is a causal link between affective temporal variability and psychotherapy outcomes.
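As an illustration of the variability measure described in the Method, the following sketch computes the mean square of successive differences over a two-dimensional (arousal, valence) series. It is a minimal reading of the MSSD computation; the array names and the synthetic session data are purely illustrative.

```python
import numpy as np

def mssd_2d(arousal, valence):
    """Mean square of successive differences (MSSD) of a two-dimensional
    (arousal, valence) affective time series: the squared length of each
    step in the 2-D affect plane, averaged over the session."""
    affect = np.column_stack([arousal, valence])   # shape (T, 2)
    steps = np.diff(affect, axis=0)                # successive differences
    return float(np.mean(np.sum(steps ** 2, axis=1)))

# Hypothetical usage on one session's frame-level estimates
rng = np.random.default_rng(0)
arousal = rng.normal(size=500)    # e.g., vocal arousal per time window
valence = rng.normal(size=500)    # e.g., facial-expression valence per time window
session_variability = mssd_2d(arousal, valence)
```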
Introduction: To date, studies focusing on the connection between psychological functioning and autonomic nervous system (ANS) activity have usually adopted the one-dimensional model of autonomic balance, according to which activation of one branch of the ANS is accompanied by an inhibition of the other. However, the sympathetic and parasympathetic branches also activate independently; thus, co-activation and co-inhibition may occur, as captured by a two-dimensional model of ANS activity. Here, we apply such models to assess how markers of the autonomic space relate to several critical psychological constructs: emotional contagion (EC), general anxiety, and positive and negative affect (PA and NA). We also examine gender differences in those psychophysiological relations.
Methods: In the present study, we analyzed data from 408 healthy students, who underwent a 5-min group baseline period as part of their participation in several experiments and completed self-report questionnaires. Electrocardiogram (ECG), electrodermal activity (EDA), and respiration were recorded. Respiratory sinus arrhythmia (RSA), pre-ejection period (PEP), as well as cardiac autonomic balance (CAB) and regulation (CAR) and cross-system autonomic balance (CSAB) and regulation (CSAR), were calculated.
Results: Notably, two-dimensional models were more suitable for predicting and describing most psychological constructs. Gender differences were found in psychological and physiological aspects as well as in psychophysiological relations. Women’s EC scores were negatively correlated with sympathetic activity and positively linked to parasympathetic dominance. Men’s PA and NA scores were positively associated with sympathetic activity. PA in men also had a positive link to an overall activation of the ANS, and a negative link to parasympathetic dominance.
Discussion: The current results expand our understanding of the psychological aspects of the autonomic space model and psychophysiological associations. Gender differences and strengths and weaknesses of alternative physiological models are discussed.
In the last decade, the signal processing (SP) community has witnessed a paradigm shift from model-based to data-driven methods. Machine learning (ML)—more specifically, deep learning—methodologies are nowadays widely used in all SP fields, e.g., audio, speech, image, video, multimedia, and multimodal/multisensor processing. Many data-driven methods also incorporate domain knowledge to improve problem modeling, especially when computational burden, training data scarcity, and memory size are important constraints.
In this paper we propose a data-driven approach for multiple speaker tracking in reverberant enclosures. The speakers are uttering, possibly overlapping, speech signals while moving in the environment. The method comprises two stages. The first stage performs single-source localization using semi-supervised learning on multiple manifolds. The second stage, which is unsupervised, uses time-varying maximum likelihood estimation for tracking. The feature vectors, used by both stages, are the relative transfer functions (RTFs), which are known to be related to source positions. The number of sources is assumed to be known while the microphone positions are unknown. In the training stage, a large database of RTFs is given. A small percentage of the data is attributed with exact positions (namely, labelled data) and the rest is assumed to be unlabelled, i.e., the respective positions are unknown. Then, a nonlinear, manifold-based, mapping function between the RTFs and the source positions is inferred. Applying this mapping function to all unlabelled RTFs constructs a dense grid of localized sources. In the test phase, this RTF grid serves as the centroids for a Mixture of Gaussians (MoG) model. The MoG parameters are estimated by applying a recursive variant of the expectation-maximization (EM) procedure that relies on the sparsity and intermittency of the speech signals. We present a comprehensive simulation study in various reverberation levels, including static and dynamic scenarios, for both two and three (partially) overlapping speakers. For the dynamic case we provide simulations with several speaker trajectories, including intersecting sources. The proposed scheme outperforms baseline methods that use a simpler propagation model in terms of localization accuracy and tracking capabilities.
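A toy sketch of the core idea behind the unsupervised tracking stage is given below. It assumes fixed, isotropic Gaussian components centred on the localized grid and recursively adapts only the mixture weights; this is a simplification for illustration, and the actual feature space, covariances, and sparsity/intermittency handling of the recursive EM variant in the paper are not modelled here.

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    """Log-density of x under isotropic Gaussians with means mu and variance var."""
    d = mu.shape[-1]
    return -0.5 * (np.sum((x - mu) ** 2, axis=-1) / var + d * np.log(2.0 * np.pi * var))

def recursive_em_weights(observations, centroids, var=0.1, gamma=0.05, n_speakers=2):
    """Toy recursive EM for a MoG whose centroids (the localized grid points)
    are fixed; only the mixture weights are adapted, one observation at a time."""
    K = centroids.shape[0]
    weights = np.full(K, 1.0 / K)                       # uniform initialization
    for x in observations:
        # E-step: responsibility of each grid point for the current frame
        log_post = np.log(weights + 1e-12) + gaussian_logpdf(x, centroids, var)
        log_post -= np.logaddexp.reduce(log_post)
        resp = np.exp(log_post)
        # Recursive M-step: smooth the weights toward the new responsibilities
        weights = (1.0 - gamma) * weights + gamma * resp
    # The dominant components indicate the current speaker positions
    top = np.argsort(weights)[::-1][:n_speakers]
    return centroids[top], weights
```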
Audio signal processing has passed many landmarks in its development as a research topic. Many are well known, such as the development of the phonograph in the second half of the 19th century and technology associated with digital telephony that burgeoned in the late 20th century and is still a hot topic in multiple guises. Interestingly, the development of audio technology has been fueled not only by advancements in the capabilities of technology but also by high consumer expectations and customer engagement. From surround sound movie theaters to the latest in-ear devices, people love sound and soon build new audio technology into their daily lives as an essential and expected feature.
Direction-of-arrival (DOA) estimation for multiple simultaneous speakers in reverberant environments is still one of the challenging tasks in the audio signal processing field. A recent approach addresses this problem using a spherical harmonics domain feature named relative harmonic coefficients (RHC). Based on a bin-wise operation across the short-time Fourier transform (STFT) domain, this method detects the direct-path RHC in the first stage, followed by single source localization in the second stage. However, the method is computationally expensive, as each STFT bin requires an exhaustive grid search over the two-dimensional (2-D) directional space. In this paper, we propose a significantly more computationally efficient alternative that decouples the 2-D azimuth-elevation search into two separate one-dimensional (1-D) searches. The proposed multi-speaker localization algorithm comprises two main steps, responsible for (i) joint direct-path RHC detection and decoupled DOA estimation using 1-D searches, and (ii) counting the number of speakers and estimating their DOAs based on the estimates from the direct-path-dominated STFT bins. Experiments using both simulated and real-life reverberant recordings confirm the significant computational complexity reduction while achieving competitive localization accuracy, compared to the baseline approaches. Although our proposed method performs in an unsupervised manner, it proves to be applicable even under unfavorable acoustic environments with a high reverberation level (e.g., T60 = 1 second).
The objective of binaural multi-microphone speech enhancement algorithms can be viewed as a multi-criteria design problem, as there are several requirements to be met. The objective is not only to extract the target speaker without distortion, but also to suppress interfering sources (e.g., competing speakers) and ambient background noise, while preserving the auditory impression of the complete acoustic scene. Such a multi-objective problem (MOP) can be solved using a Pareto frontier, which provides a useful trade-off between the different criteria. In this paper, we propose a unified Pareto optimization framework, which is achieved by defining a generalized mean squared error (MSE) cost function, derived from a MOP. The solution to the multi-criteria problem is grounded in a solid mathematical foundation. The MSE cost function consists of a weighted sum of speech distortion (SD), partial interference reduction (IR), and partial noise reduction (NR) terms with scaling parameters that control the amount of IR and NR. The filter minimizing this generalized cost function, denoted Pareto optimal binaural multichannel Wiener filter (Pareto-BMWF), constitutes a generalization of various binaural MWF-based and binaural MVDR-based beamformers. This solution is optimal for any set of parameters. The improved speech enhancement capabilities are experimentally demonstrated using real-signal recordings when estimation errors are present, and the binaural cue preservation capabilities are analyzed.
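Schematically, and in illustrative notation rather than the paper's exact formulation, the weighted-sum structure of such a generalized MSE cost can be written as

\[
J(\mathbf{w}) \;=\; \mathbb{E}\big|\mathbf{w}^{H}\mathbf{x}_{s}-x_{s,\mathrm{ref}}\big|^{2}
\;+\;\eta\,\mathbb{E}\big|\mathbf{w}^{H}\mathbf{x}_{i}\big|^{2}
\;+\;\mu\,\mathbb{E}\big|\mathbf{w}^{H}\mathbf{v}\big|^{2},
\]

where the three terms correspond to speech distortion, residual interference, and residual noise, respectively, and the scaling parameters \(\eta,\mu \ge 0\) control the amount of IR and NR; sweeping them traces the Pareto frontier between the competing criteria.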
The problem of blind and online speaker localization and separation using multiple microphones is addressed based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is proposed: 1) multi-speaker direction of arrival (DOA) estimation and 2) multi-speaker relative transfer function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a single speaker, so the entire frequency range is not required for this task. In contrast, the RTF estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin. Accordingly, a different statistical model is used for each of the two tasks. The first REM model is applied under the assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG) model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are estimated using these bins. The second REM model is applied under the assumption that the speakers are concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate the speakers. As a result of the concurrent-speaker assumption, a more precise TF map of the speakers' activity is obtained. The RTFs are estimated using the outputs of the MCWF beamformer (BF), which are constructed using the DOAs obtained in the previous stage. Next, the speech signals are separated using a linearly constrained minimum variance (LCMV) BF that utilizes the estimated RTFs. The algorithm is evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the estimated DOAs and of the separation capabilities demonstrates significant improvement w.r.t. a baseline DOA estimation and speaker separation algorithm.
This paper presents a new dataset of measured multichannel Room Impulse Responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate and visualize the data, as well as baseline methods for echo-related tasks.
In this study, we present a deep neural network-based online multi-speaker localization algorithm using a multi-microphone array. Following the W-disjoint orthogonality principle, each time-frequency (TF) bin in the spectral domain is assumed to be dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.
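As a sketch of the masking byproduct mentioned above, per-bin DOA posteriors can be turned into binary TF masks and applied to a reference-channel STFT. The function and variable names below are hypothetical, and the network producing the posteriors is not shown.

```python
import numpy as np

def separate_by_tf_masks(stft_ref, doa_posteriors, speaker_doas):
    """Toy post-processing: assign each TF bin to its most probable DOA class
    and mask the reference-channel STFT accordingly (one mask per speaker).

    stft_ref:        complex STFT of a reference microphone, shape (F, T)
    doa_posteriors:  per-bin class posteriors from a network, shape (F, T, D)
    speaker_doas:    indices of the DOA classes attributed to the speakers
    """
    hard_assignment = np.argmax(doa_posteriors, axis=-1)          # (F, T)
    separated = []
    for doa in speaker_doas:
        mask = (hard_assignment == doa).astype(float)             # binary TF mask
        separated.append(mask * stft_ref)                         # masked spectrogram
    return separated
```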
The gain achieved by a superdirective beamformer operating in a diffuse noise field is significantly higher than the gain attainable with conventional delay-and-sum weights. A classical result states that for a compact linear array consisting of N sensors which receives a plane-wave signal from the endfire direction, the optimal superdirective gain approaches N². It has been noted that in the near-field regime higher gains can be attained. The gain can increase, in theory, without bound for increasing wavelength or decreasing source-receiver distance. We aim to address the phenomenon of near-field superdirectivity in a comprehensive manner. We derive the optimal performance for the limiting case of an infinitesimal-aperture array receiving a spherical-wave signal. This is done with the aid of a sequence of linear transformations. The resulting gain expression is a polynomial, which depends on the number of sensors employed, the wavelength, and the source-receiver distance. The resulting gain curves are optimal and outperform weights corresponding to other superdirectivity methods. The practical case of a finite aperture array is discussed. We present conditions for which the gain of such an array would approach that predicted by the theory of the infinitesimal case. The white noise gain (WNG) metric of robustness is shown to increase in the near-field regime.
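For reference, the classical far-field gain expressions underlying this discussion can be written in standard notation (not specific to this paper): with steering vector \(\mathbf{d}\) and diffuse-noise spatial coherence matrix \(\boldsymbol{\Gamma}\), the optimal (superdirective) weights and array gain are

\[
\mathbf{w}_{\mathrm{opt}} \propto \boldsymbol{\Gamma}^{-1}\mathbf{d},
\qquad
G \;=\; \frac{\big|\mathbf{w}^{H}\mathbf{d}\big|^{2}}{\mathbf{w}^{H}\boldsymbol{\Gamma}\,\mathbf{w}}\bigg|_{\mathbf{w}=\mathbf{w}_{\mathrm{opt}}}
\;=\; \mathbf{d}^{H}\boldsymbol{\Gamma}^{-1}\mathbf{d},
\]

which, for a compact linear array of \(N\) sensors receiving an endfire plane wave in a diffuse field, approaches \(N^{2}\), well above the gain of conventional delay-and-sum weights; the near-field (spherical-wave) analysis in the paper shows how this plane-wave limit can be exceeded.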
This paper addresses the problem of tracking a moving device, e.g., a robot, which is equipped with both receivers and a source, and which tracks its own location while simultaneously estimating the locations of multiple plane reflectors. We assume noisy knowledge of the robot's movement. We formulate this problem, which is also known as simultaneous localization and mapping (SLAM), as a hybrid estimation problem. We derive the extended Kalman filter (EKF) for both tracking the robot's own location and estimating the room geometry. Since the EKF employs linearization at every step, we incorporate a regulated kinematic model, which facilitates successful tracking. In addition, we consider the echo-labeling problem as solved and beyond the scope of this paper. We then develop the hybrid Cramér-Rao lower bound (HCRB) on the estimation accuracy of both the localization and mapping parameters. The algorithm is evaluated with respect to the bound via simulations, which show that the EKF approaches the HCRB as the number of observations increases. This result implies that for the examples tested in simulation, the HCRB is an asymptotically tight bound and that the EKF is an optimal estimator. Whether this property is true in general remains an open question.
Localization in reverberant environments remains an open challenge. Recently, supervised learning approaches have demonstrated very promising results in addressing reverberation. However, even with large data volumes, the number of labels available for supervised learning in such environments is usually small. We propose to address this issue with a semi-supervised learning (SSL) approach, based on deep generative modeling. Our chosen deep generative model, the variational autoencoder (VAE), is trained to generate the phase of relative transfer functions (RTFs) between microphones. In parallel, a direction of arrival (DOA) classifier network based on RTF-phase is also trained. The joint generative and discriminative model, termed VAE-SSL, is trained using labeled and unlabeled RTF-phase sequences. In learning to generate and classify the sequences, the VAE-SSL extracts the physical causes of the RTF-phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. This facilitates effective end-to-end operation of the VAE-SSL, which requires minimal preprocessing of RTF-phase. VAE-SSL is compared with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully supervised CNNs. The approaches are compared using data from two real acoustic environments, one of which was recently obtained at the Technical University of Denmark specifically for our study. We find that VAE-SSL can outperform the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples which capture the physics of the acoustic environment. Thus, the generative modeling in VAE-SSL provides a means of interpreting the learned representations. To the best of our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling.
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target tracking, we apply a non-Bayesian approach, which makes no assumptions about the target trajectories, apart from a realistic rate of change of the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach. The new method is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated using an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the common causal REM.
Objective: The present study implements an automatic method of assessing arousal in vocal data as well as dynamic system models to explore intrapersonal and interpersonal affect dynamics within psychotherapy and to determine whether these dynamics are associated with treatment outcomes. Method: The data of 21,133 mean vocal arousal observations were extracted from 279 therapy sessions in a sample of 30 clients treated by 24 therapists. Before and after each session, clients self-reported their well-being level, using the Outcome Rating Scale. Results: Both clients’ and therapists’ vocal arousal showed intrapersonal dampening. Specifically, although both therapists and clients departed from their baseline, their vocal arousal levels were “pulled” back to these baselines. In addition, both clients and therapists exhibited interpersonal dampening. Specifically, both the clients’ and the therapists’ levels of arousal were “pulled” toward the other party’s arousal level, and clients were “pulled” by their therapists’ vocal arousal toward their own baseline. These dynamics exhibited a linear change over the course of treatment: whereas interpersonal dampening decreased over time, there was an increase in intrapersonal dampening over time. In addition, higher levels of interpersonal dampening were associated with better session outcomes. Conclusions: These findings demonstrate the advantages of using automatic vocal measures to capture nuanced intrapersonal and interpersonal affect dynamics in psychotherapy and demonstrate how these dynamics are associated with treatment gains.
This paper develops a semi-supervised algorithm to address the challenging multi-source localization problem in a noisy and reverberant environment, using a spherical harmonics domain source feature, the relative harmonic coefficients. We present a comprehensive study of this source feature, including (i) an illustration confirming its sole dependence on the source position, (ii) a feature estimator in the presence of noise, and (iii) a feature selector exploiting its inherent directivity over space. Source features at varied spherical harmonic modes, each representing a unique characterization of the soundfield, are fused via Multi-Mode Gaussian Process modeling. Based on the unifying model, we then formulate the mapping function revealing the underlying relationship between the source feature(s) and position(s) using a Bayesian inference approach. The additional issue of overlapped components is addressed by a pre-processing technique that detects overlapped frames, which in turn reduces the challenging multi-source problem to single-source localization. It is highlighted that this data-driven method has a strong potential to be implemented in practice because only a limited number of labeled measurements is required. We evaluate the proposed algorithm using simulated recordings of multiple speakers in diverse environments, and extensive results confirm improved performance in comparison with state-of-the-art methods. Additional assessments using real-life recordings further prove the effectiveness of the method, even under unfavorable circumstances with severe source overlap.
In this paper, we present an algorithm for direction of arrival (DOA) tracking and separation of multiple speakers with a microphone array using the factor graph statistical model. In our model, the speakers can be located at one of a predefined set of candidate DOAs, and each time-frequency (TF) bin can be associated with a single speaker. Accordingly, by attributing a statistical model to both the DOAs and the associations, as well as to the microphone array observations given these variables, we show that the conditional probability of these variables given the microphone array observations can be modeled as a factor graph. Using the loopy belief propagation (LBP) algorithm, we derive a novel inference scheme which simultaneously estimates both the DOAs and the associations. These estimates are in turn used to separate the sources, by directing a beamformer towards the estimated DOAs and then applying TF masking according to the estimated associations. A comprehensive experimental study demonstrates the benefits of the proposed algorithm on both simulated data and real-life measurements recorded in our laboratory.
Besides reducing undesired sources, i.e., interfering sources and background noise, another important objective of a binaural beamforming algorithm is to preserve the spatial impression of the acoustic scene, which can be achieved by preserving the binaural cues of all sound sources. While the binaural minimum variance distortionless response (BMVDR) beamformer provides a good noise reduction performance and preserves the binaural cues of the desired source, it does not allow controlling the reduction of the interfering sources and it distorts the binaural cues of the interfering sources and the background noise. Hence, several extensions have been proposed. First, the binaural linearly constrained minimum variance (BLCMV) beamformer uses additional constraints, enabling control of the reduction of the interfering sources while preserving their binaural cues. Second, the BMVDR with partial noise estimation (BMVDR-N) mixes the output signals of the BMVDR with the noisy reference microphone signals, enabling control of the binaural cues of the background noise. Aiming to merge the advantages of both extensions, in this paper we propose the BLCMV with partial noise estimation (BLCMV-N). We show that the output signals of the BLCMV-N can be interpreted as a mixture between the noisy reference microphone signals and the output signals of a BLCMV using an adjusted interference scaling parameter. We provide a theoretical comparison between the BMVDR, the BLCMV, the BMVDR-N and the proposed BLCMV-N in terms of noise and interference reduction performance and binaural cue preservation. Experimental results using recorded signals as well as the results of a perceptual listening test show that the BLCMV-N is able to preserve the binaural cues of an interfering source (like the BLCMV), while enabling a trade-off between noise reduction performance and binaural cue preservation of the background noise (like the BMVDR-N).
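In illustrative notation (an assumption for exposition, not the paper's exact equations), the interpretation described above can be sketched for the left output as

\[
z_{L} \;=\; (1-\eta)\,\mathbf{w}_{\mathrm{BLCMV},L}^{H}(\tilde{c})\,\mathbf{y} \;+\; \eta\, y_{\mathrm{ref},L},
\]

where \(\mathbf{y}\) stacks the microphone signals, \(y_{\mathrm{ref},L}\) is the noisy left reference microphone signal, \(\eta\in[0,1]\) is the partial noise estimation parameter, and \(\tilde{c}\) denotes the adjusted interference scaling parameter of the embedded BLCMV; an analogous expression holds for the right output.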
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about the times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications, which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echo labeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods for jointly estimating the TOAs and directions-of-arrival (DOAs) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, they can provide accurate estimates at an SNR 10 dB lower than TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping an indoor environment.
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Owing to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, standard methods for typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers' locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested in an extensive simulation study.
The problem of blind audio source separation (BASS) in noisy and reverberant conditions is addressed by a novel approach, termed Global and LOcal Simplex Separation (GLOSS), which integrates full- and narrow-band simplex representations. We show that the eigenvectors of the correlation matrix between time frames in a certain frequency band form a simplex that organizes the frames according to the speaker activities in the corresponding band. We propose to build two simplex representations: a global one based on a broad frequency band and a local one based on a narrow band. In turn, the two representations are combined to determine the dominant speaker in each time-frequency (TF) bin. Using the identified dominating speakers, a spectral mask is computed and is utilized for extracting each of the speakers using spatial beamforming followed by spectral postfiltering. The performance of the proposed algorithm is demonstrated using real-life recordings in various noisy and reverberant conditions.
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g., the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is smaller than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Hands-free speech systems are subject to performance degradation due to reverberation and noise. Common methods for enhancing reverberant and noisy speech require knowledge of the speech, reverberation and noise power spectral densities (PSDs). Most literature on this topic assumes that the noise PSD matrix is known. However, in many practical acoustic scenarios, the noise PSD is unknown and should be estimated along with the speech and the reverberation PSDs. In this paper, the noise is modelled as a spatially homogeneous sound field, with an unknown time-varying PSD multiplied by a known time-invariant spatial coherence matrix. We derive two maximum likelihood estimators (MLEs) for the various PSDs, including the noise: the first is a non-blocking-based estimator, which jointly estimates the PSDs of the speech, reverberation and noise components; the second MLE is a blocking-based estimator, which blocks the speech signal and estimates the reverberation and noise PSDs. Since a closed-form solution does not exist, both estimators iteratively maximize the likelihood using the Fisher scoring method. In order to compare both methods, the corresponding Cramér-Rao bounds (CRBs) are derived. For both the reverberation and the noise PSDs, it is shown that the non-blocking-based CRB is lower than the blocking-based CRB. Performance evaluation using both simulated and real reverberant and noisy signals shows that the proposed estimators outperform competing estimators and greatly reduce the effect of reverberation and noise.
Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that the sources do not become active simultaneously. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP steers a null toward the RTF of interest by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal-to-noise ratio (SNR) and signal-to-interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, which are often based in statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT) and adaptively estimated based on a gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, obtaining an estimate of the STFT magnitude of the source signal. Experiments regarding both speech enhancement and automatic speech recognition are conducted, which demonstrate that the proposed method can effectively suppress reverberation, even for the difficult case of a moving speaker.
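The magnitude-domain model at the heart of the method can be sketched per frequency bin \(f\) as (illustrative notation)

\[
\big|y(f,t)\big| \;\approx\; \sum_{\tau=0}^{L-1} \big|c(f,\tau)\big|\,\big|s(f,t-\tau)\big|,
\]

i.e., the STFT magnitude of each microphone signal is approximated by a nonnegative convolution between the CTF magnitude \(|c(f,\tau)|\) and the source STFT magnitude \(|s(f,t)|\); the MINT-based inverse filters are then defined and adapted for these magnitudes rather than for the complex-valued CTFs.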
This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform speech separation and enhancement in the short-time Fourier transform domain using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, the CTF has far fewer taps; consequently, it incurs a lower computational cost and is sometimes more robust against filter perturbations. We propose three methods: i) for the multisource case, the multichannel inverse filtering method, i.e., the multiple input/output inverse theorem (MINT), is exploited in the CTF domain; ii) a beamforming-like multichannel inverse filtering method applying single-source MINT and using power minimization, which is suitable whenever the source CTFs are not all known; and iii) a basis pursuit method, where the sources are recovered by minimizing their ℓ1-norm to impose spectral sparsity, while the ℓ2-norm fitting cost between the microphone signals and the mixing model is constrained to be lower than a tolerance. The noise can be reduced by setting this tolerance at the noise power level. Experiments under various acoustic conditions are carried out to evaluate and compare the three proposed methods. Comparisons with four baseline methods (beamforming-based, two time-domain inverse filters, and time-domain Lasso) show the applicability of the proposed methods.
A recursive maximum-likelihood (RML) algorithm is proposed that can be used when both the observations and the hidden data have continuous values and are statistically dependent between different time samples. The algorithm recursively approximates the probability density functions of the observed and hidden data by analytically computing the integrals with respect to the state variables, where the parameters are updated using gradient steps. A full convergence proof is given, based on the ordinary differential equation approach, which shows that the algorithm converges to a local minimum of the Kullback-Leibler divergence between the true and the estimated parametric probability density functions; a result which is useful even for a misspecified parametric model. Compared to other RML algorithms proposed in the literature, this contribution extends the state-space model and provides a theoretical analysis of a non-trivial statistical model that was not analyzed so far. We further extend the RML analysis to constrained parameter estimation problems. Two examples, including nonlinear state-space models, are given to highlight this contribution.
The IEEE Audio and Acoustic Signal Processing Technical Committee (AASP TC) is one of 13 TCs in the IEEE Signal Processing Society. Its mission is to support, nourish, and lead scientific and technological development in all areas of AASP. These areas are currently seeing increased levels of interest and significant growth, providing a fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning (ML), from consumer electronics devices to blue-sky research, this scope encompasses countless technical challenges and many hot topics. The TC has roughly 30 elected volunteer members drawn equally from leading academic and industrial organizations around the world, unified by the common aim of offering their expertise in the service of the scientific community.
We present a fully Bayesian hierarchical approach for multichannel speech enhancement with a time-varying acoustic channel. Our probabilistic approach relies on a Gaussian prior for the speech signal and a Gamma hyperprior for the speech precision, combined with a multichannel linear-Gaussian state-space model for the acoustic channel. Furthermore, we assume a Wishart prior for the noise precision matrix. We derive a variational expectation-maximization (VEM) algorithm which uses a variant of the multichannel Wiener filter (MCWF) to infer the sound source and a Kalman smoother to infer the acoustic channel. It is further shown that the VEM speech estimator can be recast as a multichannel minimum variance distortionless response (MVDR) beamformer followed by a single-channel variational postfilter. The proposed algorithm was evaluated using both simulated and real room environments with several noise types and reverberation levels. Both static and dynamic scenarios are considered. In terms of speech quality, it is shown that a significant improvement is obtained with respect to the noisy signal, and that the proposed method outperforms a baseline algorithm. In terms of channel alignment and tracking ability, a superior channel estimate is demonstrated.
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable to blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common-zeros problem of the CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Hence, we use only the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the ℓ2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance, while the ℓ1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
Blind source separation (BSS) is addressed using a novel data-driven approach based on a well-established probabilistic model. The proposed method is specifically designed for the separation of multichannel audio mixtures. The algorithm relies on a spectral decomposition of the correlation matrix between different time frames. The probabilistic model implies that the column space of the correlation matrix is spanned by the probabilities of the various speakers across time. The number of speakers is recovered from the eigenvalue decay, and the eigenvectors form a simplex of the speakers' probabilities. Time frames dominated by each of the speakers are identified by exploiting convex geometry tools on the recovered simplex. The mixing acoustic channels are estimated utilizing the identified sets of frames, and a linear unmixing is performed to extract the individual speakers. The derived simplexes are visually demonstrated for mixtures of 2, 3 and 4 speakers. We also conduct a comprehensive experimental study, showing high separation capabilities in various reverberation conditions.
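A minimal sketch of the spectral-decomposition step is given below, under simplifying assumptions: generic per-frame feature vectors and a crude energy-based speaker count are used, and the convex-geometry vertex search and channel estimation are omitted.

```python
import numpy as np

def frame_correlation_eigvecs(features, energy_threshold=0.95):
    """Toy sketch: build the correlation matrix between time frames and
    return its leading eigenvectors.

    features: per-frame feature vectors (e.g., normalized spectra), shape (T, D)
    The number of retained eigenvectors is read off the eigenvalue decay.
    """
    F = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    R = F @ F.conj().T                          # (T, T) correlation between frames
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]           # sort eigenpairs in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # crude speaker count: smallest k capturing most of the energy
    k = int(np.searchsorted(np.cumsum(eigvals) / np.sum(eigvals), energy_threshold)) + 1
    return eigvecs[:, :k], eigvals              # rows of eigvecs[:, :k] form the simplex
```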
Distributed acoustic tracking estimates the trajectories of source positions using an acoustic sensor network. As it is often difficult to estimate the source-sensor range from individual nodes, the source positions have to be inferred from direction-of-arrival (DoA) estimates. Due to reverberation and noise, the sound field becomes increasingly diffuse with increasing source-sensor distance, leading to decreased DoA-estimation accuracy. To distinguish between accurate and uncertain DoA estimates, this letter proposes to incorporate the coherent-to-diffuse ratio as a measure of DoA reliability for single-source tracking. It is shown that the source positions can thereby be probabilistically triangulated by exploiting the spatial diversity of all nodes.
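A generic weighted least-squares triangulation of this kind can be sketched as follows (2-D case, hypothetical names; the per-node reliability weights would be derived from the coherent-to-diffuse ratio, and the letter's probabilistic formulation is not reproduced here).

```python
import numpy as np

def triangulate_from_doas(node_positions, doa_angles, weights):
    """Weighted least-squares triangulation of a single source from node-wise
    DoA estimates (2-D). Each node contributes the line through its position
    along its DoA; the weights down-weight unreliable bearings.
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, theta, w in zip(node_positions, doa_angles, weights):
        u = np.array([np.cos(theta), np.sin(theta)])    # unit bearing vector
        P = np.eye(2) - np.outer(u, u)                  # projector orthogonal to the bearing
        A += w * P
        b += w * P @ np.asarray(p, dtype=float)
    return np.linalg.solve(A, b)                        # point closest to all weighted bearing lines
```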
Reduction of late reverberation can be achieved using spatio-spectral filters such as the multichannel Wiener filter (MWF). To compute this filter, an estimate of the late reverberation power spectral density (PSD) is required. In recent years, a multitude of late reverberation PSD estimators have been proposed. In this contribution, these estimators are categorized into several classes, their relations and differences are discussed, and a comprehensive experimental comparison is provided. To compare their performance, simulations in controlled as well as practical scenarios are conducted. It is shown that a common weakness of spatial coherence-based estimators is their performance in high direct-to-diffuse ratio (DDR) conditions. To mitigate this problem, a correction method is proposed and evaluated. It is shown that the proposed correction method can decrease the speech distortion without significantly affecting the reverberation reduction.
The problem of speaker tracking in noisy and reverberant enclosures is addressed. We present a hybrid algorithm, combining traditional tracking schemes with a new learning-based approach. A state-space representation, consisting of propagation and observation models, is learned from signals measured by several distributed microphone pairs. The proposed representation is based on two data modalities: high-dimensional acoustic features representing the full reverberant acoustic channels, and low-dimensional TDOA estimates. The state-space representation is accompanied by a statistical model based on a Gaussian process used to relate the variations of the acoustic channels to the physical variations of the associated source positions, thereby forming a data-driven propagation model for the source movement. In the observation model, the source positions are nonlinearly mapped to the associated TDOA readings. The obtained propagation and observation models establish the basis for employing an extended Kalman filter (EKF). Simulation results demonstrate the robustness of the proposed method in noisy and reverberant conditions.
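For reference, a generic EKF recursion of the kind employed on top of such propagation and observation models can be sketched as follows; all functions, Jacobians, and noise covariances below are placeholders, not the learned models themselves.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """One generic extended Kalman filter iteration.

    x, P : prior state estimate and covariance
    u, z : control input and current observation (e.g., TDOA readings)
    f, h : (possibly nonlinear) propagation and observation functions
    F_jac, H_jac : their Jacobians evaluated at the current estimate
    Q, R : process and measurement noise covariances
    """
    # Predict: propagate the state and covariance through the (linearized) model
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update: linearize the observation model around the prediction
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return x_new, P_new
```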
Localization of acoustic sources has attracted a considerable amount of research attention in recent years. A major obstacle to achieving high localization accuracy is the presence of reverberation, the influence of which increases with the number of active speakers in the room. Human hearing is capable of localizing acoustic sources even in extreme conditions. In this study, we propose to combine a method based on human hearing mechanisms with a modified incremental distributed expectation-maximization (IDEM) algorithm. Rather than using phase-difference measurements modeled by a mixture of complex-valued Gaussians, as proposed in the original IDEM framework, we propose to use time difference of arrival (TDoA) measurements in multiple subbands and to model them by a mixture of real-valued truncated Gaussians. Moreover, we propose to first filter the measurements in order to reduce the effect of multi-path conditions. The proposed method is evaluated using both simulated data and real-life recordings.
The problem of blind separation of speech signals in the presence of noise using multiple microphones is addressed. Blind estimation of the acoustic parameters and of the individual source signals is carried out by applying the expectation-maximization (EM) algorithm. Two models for the speech signals are used, namely an unknown deterministic signal model and a complex-Gaussian signal model. For the two alternatives, we define a statistical model and develop EM-based algorithms to jointly estimate the acoustic parameters and the speech signals. The resulting algorithms are then compared from both theoretical and performance perspectives. In both cases, the latent data (defined differently for each alternative) is estimated in the E-step, whereas in the M-step the two algorithms estimate the acoustic transfer functions of each source and the noise covariance matrix. The algorithms differ in the way the clean speech signals are used in the EM scheme. When the clean signal is modelled as deterministic but unknown, only the a posteriori probabilities of the presence of each source are estimated in the E-step, while their time-frequency coefficients are designated as parameters and are estimated in the M-step using the minimum variance distortionless response beamformer. If the clean speech signals are modelled as complex Gaussian signals, their power spectral densities (PSDs) are estimated in the E-step using the multichannel Wiener filter output. The proposed algorithms were tested using reverberant noisy mixtures of two speech sources in different reverberation and noise conditions.
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A complex-valued Gaussian mixture model (CGMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the CGMM-based objective function, given an observed set of complex-valued binaural features, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. An entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers with respect to the large number of initial candidate source locations. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, encodes inter-channel information corresponding to the direct path of sound propagation and is thus robust to reverberation. In this paper, we extend the DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test and are used for source localization. Experiments carried out using both simulated data and real data recorded with a robotic head confirm the efficiency of the proposed multi-source localization method.
The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function (p.d.f.) of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Since the anechoic speech PSD is usually unknown in advance, it is estimated as well. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér-Rao bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multi-microphone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.
The problem of single-source localization with ad hoc microphone networks in noisy and reverberant enclosures is addressed in this paper. A training set is formed from prerecorded measurements collected in advance, and consists of a limited number of labelled measurements, attached with corresponding positions, and a larger number of unlabelled measurements from unknown locations. No further information about the enclosure characteristics or the microphone positions is required. We propose a Bayesian inference approach for estimating a function that maps measurement-based features to the corresponding positions. The signals measured by the microphones represent different viewpoints, which are combined in a unified statistical framework. For this purpose, the mapping function is modelled by a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived, where the new streaming measurements are used to update the model. Performance is demonstrated for both simulated data and real-life recordings in a variety of reverberation and noise levels.
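A minimal sketch of Gaussian-process regression from features to positions is given below. An RBF kernel and fixed hyperparameters are assumed for illustration; the paper's multi-microphone covariance structure, ML hyperparameter estimation, and recursive adaptation are not included.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of feature vectors."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * np.maximum(d2, 0.0) / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise_var=1e-2, **kern):
    """GP posterior mean: mu_* = K_*n (K_nn + sigma^2 I)^(-1) y.

    X_train: labelled feature vectors, shape (N, D)
    y_train: corresponding positions, shape (N,) or (N, 2)
    X_test:  feature vectors at unknown locations, shape (M, D)
    """
    K = rbf_kernel(X_train, X_train, **kern) + noise_var * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train, **kern)
    alpha = np.linalg.solve(K, y_train)
    return K_star @ alpha          # predicted positions for the test features
```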
The problem of source separation and noise reduction using multiple microphones is addressed. The minimum mean square error (MMSE) estimator for the multi-speaker case is derived and a novel decomposition of this estimator is presented. The MMSE estimator is decomposed into two stages: i) a multi-speaker linearly constrained minimum variance (LCMV) beamformer (BF), and ii) a subsequent multi-speaker Wiener postfilter. The first stage separates and enhances the signals of the individual speakers by utilizing the spatial characteristics of the speakers (as manifested by the respective acoustic transfer functions (ATFs)) and the noise spatial correlation matrix, while the second stage exploits the speakers' power spectral density matrix to reduce the residual noise at the output of the first stage. The output vector of the multi-speaker LCMV BF is proven to be the sufficient statistic for estimating the marginal speech signals in both the classic sense and the Bayesian sense. The log spectral amplitude estimator for the multi-speaker case is also derived given the multi-speaker LCMV BF outputs. The performance evaluation was conducted using measured ATFs and directional noise with various signal-to-noise ratio levels. It is empirically verified that the multi-speaker postfilters are beneficial in terms of signal-to-interference-plus-noise ratio improvement when compared with the single-speaker postfilter.
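Schematically, and in illustrative notation, the first stage of this decomposition is the matrix LCMV beamformer

\[
\mathbf{W}_{\mathrm{LCMV}} \;=\; \boldsymbol{\Phi}_{vv}^{-1}\mathbf{A}\left(\mathbf{A}^{H}\boldsymbol{\Phi}_{vv}^{-1}\mathbf{A}\right)^{-1},
\]

where the columns of \(\mathbf{A}\) are the speakers' ATFs and \(\boldsymbol{\Phi}_{vv}\) is the noise spatial correlation matrix. Its output vector \(\mathbf{W}_{\mathrm{LCMV}}^{H}\mathbf{y}\) contains one distortionless estimate per speaker plus residual noise with covariance \((\mathbf{A}^{H}\boldsymbol{\Phi}_{vv}^{-1}\mathbf{A})^{-1}\), to which the multi-speaker Wiener postfilter of the second stage is applied.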
Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial pre-processing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this article, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: a) the acoustic impulse response model, b) the spatial filter design criterion, c) the parameter estimation algorithm, and d) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops and hearing aids, using them collaboratively for acoustic processing is a promising platform for emerging applications. These devices make up an acoustic sensor network comprised of nodes, i.e., distributed devices each equipped with a microphone array, a communication unit and a processing unit. Algorithms for speaker separation and localization using such a network require precise knowledge of the nodes' locations and orientations. To acquire this knowledge, a recently introduced approach proposed a combined direction of arrival (DoA) and time difference of arrival (TDoA) target function for off-line calibration with dedicated recordings. This paper proposes an extension of this approach to a novel online method with two new features: First, by employing an evolutionary algorithm on incremental measurements, it is online and fast enough for real-time application. Second, by using the sparse spike representation computed in a cochlear model for TDoA estimation, the amount of information shared between the nodes by transmission is reduced while the accuracy is increased. The proposed approach is able to calibrate an acoustic sensor network online during a meeting in a reverberant conference room.
The challenge of blindly resynchronizing the data acquisition processes in a wireless acoustic sensor network (WASN) is addressed in this paper. The sampling rate offset (SRO) is precisely modeled as a time scaling. The applicability of a wideband correlation processor for estimating the SRO, even in a reverberant and multiple source environment, is presented. An explicit expression for the ambiguity function, which in our case involves time scaling of the received signals, is derived by applying truncated band-limited interpolation. We then propose the recursive band-limited interpolation (RBI) algorithm for recursive SRO estimation. A complete resynchronization scheme utilizing the RBI algorithm, in parallel with the SRO compensation module, is presented. The resulting resynchronization method operates in the time domain in a sequential manner and is thus capable of tracking a potentially time-varying SRO. We compared the performance of the proposed RBI algorithm to other available methods in a simulation study. The importance of resynchronization in a beamforming application is demonstrated by both a simulation study and experiments with a real WASN. Finally, we present an experimental study evaluating the expected SRO level between typical data acquisition devices.
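The compensation step can be illustrated with a simple time-scaling resampler. In the sketch below the SRO is assumed known and plain linear interpolation is used, whereas the paper estimates the SRO recursively and resamples with truncated band-limited interpolation; the signal model in the docstring is an assumption of this toy example.

```python
# Sketch of sampling-rate-offset (SRO) compensation by time scaling.
import numpy as np

def compensate_sro(y, sro_ppm):
    """Resynchronize y, assuming the model y[n] = x(n * (1 + eps) / fs):
    interpolate y at fractional indices n / (1 + eps) to obtain samples of x
    on the reference clock grid (linear interpolation, for illustration only)."""
    eps = sro_ppm * 1e-6
    n = np.arange(len(y))
    return np.interp(n / (1.0 + eps), n, y)

# Toy usage: a 440 Hz tone recorded with a clock skewed by 100 ppm.
fs, sro_ppm = 16000, 100.0
n = np.arange(fs)
ref = np.sin(2 * np.pi * 440 * n / fs)
recorded = np.sin(2 * np.pi * 440 * n * (1 + sro_ppm * 1e-6) / fs)
aligned = compensate_sro(recorded, sro_ppm)
print("misalignment before:", np.max(np.abs(recorded - ref)))
print("misalignment after: ", np.max(np.abs(aligned - ref)))
```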
The problem of source separation using an array of microphones in reverberant and noisy conditions is addressed. We consider applying the well-known linearly constrained minimum variance (LCMV) beamformer (BF) for extracting individual speakers. Constraints are defined using relative transfer functions (RTFs) of the sources, i.e., the ratios between the acoustic transfer functions (ATFs) relating each source to an arbitrary microphone and to a reference microphone. RTFs are usually estimated by methods which rely on single-talk time segments, where only a single source is active, and on reliable knowledge of the source activity. Two novel algorithms for estimating the RTFs using the TRINICON (Triple-N ICA for convolutive mixtures) framework are proposed, which do not resort to the usually unavailable source activity pattern. The first algorithm estimates the RTFs of the sources by applying multiple two-channel geometrically constrained (GC) TRINICON units, where approximate direction of arrival (DOA) information for the sources is utilized to ensure convergence to the desired solution. The GC-TRINICON is applied to all microphone pairs using a common reference microphone. In the second algorithm, we propose to estimate the RTFs iteratively using GC-TRINICON, where instead of using a fixed reference microphone as before, the output signals of LCMV-BFs from the previous iteration serve as spatially processed references (SPRs) with improved signal-to-interference-and-noise ratio (SINR). For both algorithms, a simple detection of noise-only time segments is required for estimating the covariance matrix of the noise and interference. We conduct an experimental study in which the performance of the proposed methods is confirmed and compared to corresponding supervised methods.
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes: the first is experimentally shown to provide separation results comparable with contemporary HMM-based speech separation approaches, while the second allows a reduced computational burden.
In this paper, we present a single-microphone speech enhancement algorithm. A hybrid approach is proposed, merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm is executed in two phases: the training phase, which does not recur, and the test phase. First, the noise-free speech log-power spectral density is modeled as an MoG, representing the phoneme-based diversity of the speech signal. A DNN is then trained on a phoneme-labeled database of clean speech signals for phoneme classification, with mel-frequency cepstral coefficients as the input features. In the test phase, a noisy utterance of speech unseen during training is processed. Given the phoneme classification results of the noisy speech utterance, a speech presence probability (SPP) is obtained using both the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while, simultaneously, the noise estimate is updated. The discriminative DNN maintains the continuity of the speech, and the generative phoneme-based MoG preserves the speech spectral structure. An extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms and show a significant improvement over previous methods in terms of speech quality measures. Finally, we analyze the contribution of all components of the proposed algorithm, indicating their combined importance.
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse responses of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an interframe spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions.
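A stripped-down relative of the second step, ignoring noise and the convolutive transfer function model, is the classical cross-to-auto PSD ratio. The sketch below estimates a relative transfer function this way on a noiseless toy signal; the filter, frame parameters, and the helper name estimate_rtf are all illustrative assumptions.

```python
# Simplified sketch: estimating a relative transfer function as the ratio of
# frame-averaged cross- and auto-power spectral densities (no noise handling).
import numpy as np
from scipy.signal import fftconvolve

def estimate_rtf(x_ref, x_mic, nfft=512, hop=256):
    """Estimate H(f) = S_{ref,mic}(f) / S_{ref,ref}(f) by Welch-style averaging."""
    win = np.hanning(nfft)
    Sxy = np.zeros(nfft // 2 + 1, dtype=complex)
    Sxx = np.zeros(nfft // 2 + 1)
    for start in range(0, len(x_ref) - nfft + 1, hop):
        X = np.fft.rfft(win * x_ref[start:start + nfft])
        Y = np.fft.rfft(win * x_mic[start:start + nfft])
        Sxy += np.conj(X) * Y
        Sxx += np.abs(X) ** 2
    return Sxy / Sxx

# Toy usage: the second channel is the reference filtered by a short filter.
rng = np.random.default_rng(2)
s = rng.normal(size=64000)
h = np.array([0.0, 1.0, 0.0, 0.5, 0.25])        # toy relative impulse response
x_ref, x_mic = s, fftconvolve(s, h)[:len(s)]
rtf = estimate_rtf(x_ref, x_mic)
rtf_true = np.fft.rfft(h, 512)
print("max abs error:", np.max(np.abs(rtf - rtf_true)))
```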
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (leading to more effective noise suppression). Since changes in the array's position correspond to the desired speaker's movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movements and rotations of the user's head. Consequently, the algorithm must be capable of rapid adaptation to such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that consists of the direct path and some early reflections, and a late reverberant component that consists of all late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation, and ambient noise is presented. The expectation-maximization (EM) algorithm is used to estimate the signals and spatial parameters of the early speech component and the late reverberant component. As a result, a spatially filtered version of the early speech component is estimated in the E-step. The power spectral density (PSD) of the anechoic speech, the relative early transfer functions, and the PSD matrix of the late reverberation are estimated in the M-step of the EM algorithm. The algorithm is evaluated using real room impulse responses recorded in our acoustic lab, with reverberation times set to 0.36 s and 0.61 s and several signal-to-noise ratio levels. It is shown that significant improvement is obtained and that the proposed algorithm outperforms baseline single-channel and multichannel dereverberation algorithms, as well as a state-of-the-art multichannel dereverberation algorithm.
This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix and jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed from the estimates provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a blockwise version of a state-of-the-art baseline method.
Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as high reverberation or a low signal-to-noise ratio (SNR). In some scenarios, e.g., in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm based on two-microphone measurements, which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, which imposes smoothness constraints on possible solutions with respect to the manifold. The proposed algorithm, termed manifold regularization for localization, is adapted as new unlabelled measurements (from unknown source locations) are accumulated during runtime. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation algorithm as a baseline. The algorithm achieves 2° accuracy in typical noisy and reverberant environments (reverberation time between 200 and 800 ms and SNR between 5 and 20 dB).
The recently proposed binaural linearly constrained minimum variance (BLCMV) beamformer is an extension of the well-known binaural minimum variance distortionless response (MVDR) beamformer, imposing constraints on both the desired and the interfering sources. Besides its capability to reduce interference and noise, it also enables the preservation of the binaural cues of both the desired and interfering sources, making it particularly suitable for binaural hearing aid applications. In this paper, a theoretical analysis of the BLCMV beamformer is presented. In order to gain insights into the performance of the BLCMV beamformer, several decompositions are introduced that reveal its capabilities in terms of interference and noise reduction, while controlling the binaural cues of the desired and the interfering sources. When setting the parameters of the BLCMV beamformer, various considerations need to be taken into account, e.g., the desired amount of interference and noise reduction and the presence of estimation errors in the required relative transfer functions (RTFs). Analytical expressions for the performance of the BLCMV beamformer in terms of noise reduction, interference reduction, and cue preservation are derived. Comprehensive simulation experiments, using measured acoustic transfer functions as well as real recordings on binaural hearing aids, demonstrate the capabilities of the BLCMV beamformer in various noise environments.
Statistically optimal spatial processors (also referred to as data-dependent beamformers) are widely used spatial focusing techniques for desired source extraction. The Kalman filter-based beamformer (KFB) [1] is a recursive Bayesian method for implementing the beamformer. This letter provides new insights into the KFB. Specifically, we adapt the KFB framework to the task of speech extraction. We formalize the KFB with a set of linear constraints and present its equivalence to the linearly constrained minimum power (LCMP) beamformer. We further show that the optimal output power, required for implementing the KFB, merely controls the white noise gain (WNG) of the beamformer. We also show that, in static scenarios, the adaptation rule of the KFB reduces to the simpler affine projection algorithm (APA). The analytically derived results are verified and exemplified by a simulation study.
In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques, to reveal the state-of-the-art techniques, and to obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on real single-channel and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.
The objective of binaural noise reduction algorithms is not only to selectively extract the desired speaker and to suppress interfering sources (e.g., competing speakers) and ambient background noise, but also to preserve the auditory impression of the complete acoustic scene. For directional sources this can be achieved by preserving the relative transfer function (RTF) which is defined as the ratio of the acoustical transfer functions relating the source and the two ears and corresponds to the binaural cues. In this paper, we theoretically analyze the performance of three algorithms that are based on the binaural minimum variance distortionless response (BMVDR) beamformer, and hence, process the desired source without distortion. The BMVDR beamformer preserves the binaural cues of the desired source but distorts the binaural cues of the interfering source. By adding an interference reduction (IR) constraint, the recently proposed BMVDR-IR beamformer is able to preserve the binaural cues of both the desired source and the interfering source. We further propose a novel algorithm for preserving the binaural cues of both the desired source and the interfering source by adding a constraint preserving the RTF of the interfering source, which will be referred to as the BMVDR-RTF beamformer. We analytically evaluate the performance in terms of binaural signal-to-interference-and-noise ratio (SINR), signal-to-interference ratio (SIR), and signal-to-noise ratio (SNR) of the three considered beamformers. It can be shown that the BMVDR-RTF beamformer outperforms the BMVDR-IR beamformer in terms of SINR and outperforms the BMVDR beamformer in terms of SIR. Among all beamformers which are distortionless with respect to the desired source and preserve the binaural cues of the interfering source, the newly proposed BMVDR-RTF beamformer is optimal in terms of SINR. Simulations using acoustic transfer functions measured on a binaural hearing aid validate our theoretical results.
Besides noise reduction, an important objective of binaural speech enhancement algorithms is the preservation of the binaural cues of all sound sources. For the desired speech source and the interfering sources, e.g., competing speakers, this can be achieved by preserving their relative transfer functions (RTFs). It has been shown that the binaural multi-channel Wiener filter (MWF) preserves the RTF of the desired speech source, but typically distorts the RTF of the interfering sources. To this end, in this paper we propose two extensions of the binaural MWF, i.e., the binaural MWF with RTF preservation (MWF-RTF) aiming to preserve the RTF of the interfering source and the binaural MWF with interference rejection (MWF-IR) aiming to completely suppress the interfering source. Analytical expressions for the performance of the binaural MWF, MWF-RTF and MWF-IR in terms of noise reduction, speech distortion and binaural cue preservation are derived, showing that the proposed extensions yield a better performance in terms of the signal-to-interference ratio and preservation of the binaural cues of the directional interference, while the overall noise reduction performance is degraded compared to the binaural MWF. Simulation results using binaural behind-the-ear impulse responses measured in a reverberant environment validate the derived analytical expressions for the theoretically achievable performance of the binaural MWF, MWF-RTF, and MWF-IR, showing that the performance highly depends on the position of the interfering source and the number of microphones. Furthermore, the simulation results show that the MWF-RTF yields a very similar overall noise reduction performance as the binaural MWF, while preserving the binaural cues of both the speech and the interfering source.
The directivity factor (DF) of a beamformer describes its spatial selectivity and ability to suppress diffuse noise which arrives from all directions. For a given array configuration, it is possible to design beamforming weights which maximize the DF for a particular look-direction, while enforcing nulls for a set of undesired directions. In general, the resulting DF is dependent upon the specific look- and null directions. Using the same array, one may apply a different set of weights designed for any other feasible set of look- and null directions. In this contribution we show that when the optimal DF is averaged over all look directions the result equals the number of sensors minus the number of null constraints. This result holds, regardless of the positions and spatial responses of the individual sensors, and of the null directions. The result generalizes to more complex wave-propagation domains (e.g., reverberation).
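The stated result lends itself to a quick numerical sanity check. The sketch below performs a Monte-Carlo average of the constraint-optimal DF over look directions drawn uniformly on the sphere, for an arbitrary random array of omnidirectional sensors in a spherically isotropic field with two fixed null directions. The geometry, frequency, and null directions are illustrative choices, and the closed-form DF expression used here follows from standard LCMV algebra; this is a sanity check, not the paper's proof.

```python
# Monte-Carlo check: for omnidirectional sensors in a 3-D isotropic field,
# the maximum DF, averaged uniformly over look directions with K fixed null
# constraints, should be close to N - K.
import numpy as np

rng = np.random.default_rng(3)
N, K = 8, 2                                  # sensors, null constraints
c, f = 343.0, 1000.0
k = 2 * np.pi * f / c
pos = rng.uniform(-0.1, 0.1, size=(N, 3))    # sensor positions [m]

# Spatial coherence of a spherically isotropic field: sin(kd)/(kd).
dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
Gamma = np.sinc(k * dist / np.pi)            # np.sinc(x) = sin(pi x)/(pi x)

def steering(u):
    """Plane-wave steering vector for unit direction u."""
    return np.exp(-1j * k * pos @ u)

def random_directions(n):
    """Directions drawn uniformly on the unit sphere."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Constraint-optimal DF in closed form (standard LCMV algebra):
# DF(u0) = d0^H [G^-1 - G^-1 C (C^H G^-1 C)^-1 C^H G^-1] d0.
C = np.stack([steering(u) for u in random_directions(K)], axis=1)
Gi = np.linalg.inv(Gamma)
P = Gi - Gi @ C @ np.linalg.inv(C.conj().T @ Gi @ C) @ C.conj().T @ Gi

dirs = random_directions(20000)
df = np.array([np.real(steering(u).conj() @ P @ steering(u)) for u in dirs])
print(f"average DF: {df.mean():.2f} (theory: N - K = {N - K})")
```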
The problem of distributed localization for ad hoc wireless acoustic sensor networks (WASNs) is addressed in this paper. WASNs are characterized by low computational resources in each node and by limited connectivity between the nodes. Novel bi-directional tree-based distributed expectation-maximization (DEM) algorithms are proposed to circumvent these inherent limitations. We show that the proposed algorithms are capable of localizing static acoustic sources in reverberant enclosures without a priori information on the number of sources. Unlike serial estimation procedures (like ring-based algorithms), the new algorithms enable simultaneous computations in the nodes and exhibit greater robustness to communication failures. Specifically, the recursive distributed EM (RDEM) variant is better suited to online applications due to its recursive nature. Furthermore, the RDEM outperforms the other proposed variants in terms of convergence speed and simplicity. Performance is demonstrated by an extensive experimental study consisting of both simulated and actual environments.
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper, we apply a novel strategy based on ideas of compressed sensing. Relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained through finding its sparsest representation in the time-domain: that is, through computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method such as the nonstationarity-based estimator by Gannot or through blind source separation. Second, frequencies are determined for which the RTF estimate appears to be accurate. Third, the RTF is reconstructed through solving a weighted l1 convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted. It has been shown that the proposed method is capable of improving many conventional estimators used as the first step in most situations.
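As a toy illustration of the reconstruction step, the sketch below recovers a sparse relative impulse response from RTF values assumed known only at a subset of frequency bins, using plain (unweighted) iterative soft thresholding. The paper instead solves a weighted l1 program with a SpaRSA variant and selects the reliable frequencies from data, so both the solver and the absence of weighting here are simplifications; all sizes and parameters are illustrative.

```python
# Sparse recovery of a relative impulse response from partial RTF measurements
# via iterative soft thresholding (ISTA) on a noiseless toy problem.
import numpy as np

rng = np.random.default_rng(4)
L, n_obs, lam, n_iter = 256, 80, 0.3, 3000

# Ground-truth sparse relative impulse response and its full DFT.
h_true = np.zeros(L)
h_true[rng.choice(L, 8, replace=False)] = rng.normal(size=8)
H_full = np.fft.fft(h_true)

# Frequency bins where the RTF is assumed reliable (here: noiseless).
idx = np.sort(rng.choice(L, n_obs, replace=False))
b = H_full[idx]
F = np.fft.fft(np.eye(L))                    # full DFT matrix
A = F[idx, :]                                # measurement operator: selected rows

# ISTA: h <- soft(h - t * Re{A^H (A h - b)}, t * lam), step t = 1 / ||A||^2.
t = 1.0 / L                                  # rows of the DFT matrix are orthogonal
h = np.zeros(L)
for _ in range(n_iter):
    grad = np.real(A.conj().T @ (A @ h - b))
    z = h - t * grad
    h = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)

print("relative reconstruction error:",
      np.linalg.norm(h - h_true) / np.linalg.norm(h_true))
```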
In speech communication systems, the microphone signals are degraded by reverberation and ambient noise. The reverberant speech can be separated into two components, namely, an early speech component that includes the direct path and some early reflections, and a late reverberant component that includes all the late reflections. In this paper, a novel algorithm to simultaneously suppress early reflections, late reverberation and ambient noise is presented. A multi-microphone minimum mean square error estimator is used to obtain a spatially filtered version of the early speech component. The estimator is constructed as a minimum variance distortionless response (MVDR) beamformer (BF) followed by a postfilter (PF). Three unique design features characterize the proposed method. First, the MVDR BF is implemented in a special structure, named the nonorthogonal generalized sidelobe canceller (NO-GSC). Compared with the more conventional orthogonal GSC structure, the new structure allows for a simpler implementation of the GSC blocks for various MVDR constraints. Second, in contrast to earlier works, relative early transfer functions (RETFs) are used in the MVDR criterion rather than either the entire RTFs or only the direct path of the desired speech signal. An estimator of the RETFs is proposed as well. Third, the late reverberation and noise are processed by both the beamforming stage and the PF stage. Since the relative power of the noise and the late reverberation varies with the frame index, a computationally efficient method for the required matrix inversion is proposed to circumvent the cumbersome mathematical operation. The algorithm was evaluated and compared with two alternative multichannel algorithms and one single-channel algorithm using simulated data and data recorded in a room with a reverberation time of 0.5 s, for various source-microphone array distances (1-4 m) and several signal-to-noise ratio levels. The processed signals were tested using two commonly used objective measures, namely perceptual …
Speech signals recorded in a room are commonly degraded by reverberation. In most cases, both the speech signal and the acoustic system of the room are unknown and time-varying. In this paper, a scenario with a single desired sound source and slowly time-varying and spatially-white noise is considered, and a multi-microphone algorithm that simultaneously estimates the clean speech signal and the time-varying acoustic system is proposed. The recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. In the expectation step, the Kalman filter is applied to extract a new sample of the clean signal, and in the maximization step, the system estimate is updated according to the output of the Kalman filter. Experimental results show that the proposed method is able to significantly reduce reverberation and increase the speech quality. Moreover, the tracking ability of the algorithm was validated in practical scenarios using human speakers moving in a natural manner.
In multiple speaker scenarios, the linearly constrained minimum variance (LCMV) beamformer is a popular microphone array-based speech enhancement technique, as it allows minimizing the noise power while maintaining a set of desired responses towards different speakers. Here, we address the algorithmic challenges arising when applying the LCMV beamformer in wireless acoustic sensor networks (WASNs), which are a next-generation technology for audio acquisition and processing. We review three optimal distributed LCMV-based algorithms, which compute a network-wide LCMV beamformer output at each node without centralizing the microphone signals. Optimality here refers to equivalence to a centralized realization where a single processor has access to all signals. We derive and motivate the algorithms in an accessible top-down framework that reveals their underlying relations. We explain how their differences result from their different design criteria (node-specific versus common constraint sets) and their different priorities for communication bandwidth, computational power, and adaptivity. Furthermore, although originally proposed for a fully connected WASN, we also explain how to extend the reviewed algorithms to the case of a partially connected WASN, which is assumed to be pruned to a tree topology. Finally, we discuss the advantages and disadvantages of the various algorithms.
The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington’s scheme, is a Newton-based recursion. In this work we also extend Titterington’s method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappé and Moulines’ scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.
The beampattern of an array consisting of N elements is determined by the beampatterns of the individual elements, their placement, and the weights assigned to them. For each look direction, it is possible to design weights that maximize the array directivity factor (DF). For the case of an array of omnidirectional elements using optimal weights, it has been shown that the average DF over all look directions equals the number of elements. The validity of this theorem is not dependent on array geometry. We generalize this theorem by means of an alternative proof. The chief contributions of this letter are: (a) a compact and direct proof, (b) generalization to arrays containing directional elements (such as cardioids and dipoles), and (c) generalization to arbitrary wave propagation models. A discussion of the theorem’s ramifications on array processing is provided.
Beamforming with wireless acoustic sensor networks (WASNs) has recently drawn the attention of the research community. As the number of microphones grows, it is difficult, and in some applications impossible, to determine their layout beforehand. A common practice in analyzing the expected performance is to utilize statistical considerations. In the current contribution, we consider applying the speech distortion weighted multi-channel Wiener filter (SDW-MWF) to enhance a desired source propagating in a reverberant enclosure where the microphones are randomly located with a uniform distribution. Two noise fields are considered, namely, multiple coherent interference signals and a diffuse sound field. Utilizing the statistics of the acoustic transfer function (ATF), we derive a statistical model for two important criteria of the beamformer (BF): the signal-to-interference ratio (SIR) and the white noise gain. Moreover, we propose reliability functions, which determine the probability that the SIR and the white noise gain exceed a predefined level. We verify the proposed model with an extensive simulation study.
Signal processing methods have significantly changed over the last several decades. Traditional methods were usually based on parametric statistical inference and linear filters. These frameworks have helped to develop efficient algorithms that have often been suitable for implementation on digital signal processing (DSP) systems. Over the years, DSP systems have advanced rapidly, and their computational capabilities have been substantially increased. This development has enabled contemporary signal processing algorithms to incorporate more computations. Consequently, we have recently experienced a growing interaction between signal processing and machine-learning approaches, e.g., Bayesian networks, graphical models, and kernel-based methods, whose computational burden is usually high.
This paper proposes a distributed multiple constraints generalized sidelobe canceler (GSC) for speech enhancement in an N-node fully connected wireless acoustic sensor network (WASN) comprising M microphones. Our algorithm is designed to operate in reverberant environments with constrained speakers (including both desired and competing speakers). Rather than broadcasting M microphone signals, a significant communication bandwidth reduction is obtained by performing local beamforming at the nodes and utilizing only N + P transmission channels. Each node processes its own microphone signals together with the N + P transmitted signals. The GSC-form implementation, by separating the constraints and the minimization, enables the adaptation of the beamformer (BF) during speech-absent time segments, and relaxes the requirement of other distributed LCMV-based algorithms to re-estimate the sources' relative transfer functions (RTFs) after each iteration. We provide a full convergence proof of the proposed structure to the centralized GSC-BF. An extensive experimental study of both narrowband and (wideband) speech signals verifies the theoretical analysis.
A transient is an abrupt or impulsive sound followed by decaying oscillations, e.g., keyboard typing and door knocking. Such sounds often arise as interference in everyday applications, e.g., hearing aids, hands-free accessories, mobile phones, and conference-room devices. In this paper, we present an algorithm for single-channel transient interference suppression. The main component of the proposed algorithm is the estimation of the spectral variance of the interference. We propose a statistical model of the transient interference and combine it with non-local filtering. We exploit the unique spectral structure of the transients along with their impulsive temporal nature to distinguish them from speech. Particular attention is given to handling both short- and long-duration transients. Experimental results show that the proposed algorithm enables significant transient suppression for a variety of transient types.
In this paper, we present a supervised graph-based framework for sequential processing and apply it to the problem of transient interference suppression. Transients typically consist of an initial peak followed by decaying short-duration oscillations. Such sounds, e.g., keyboard typing and door knocking, often arise as interference in everyday applications: hearing aids, hands-free accessories, mobile phones, and conference-room devices. We describe a graph construction using a noisy speech signal and training recordings of typical transients. The main idea is to capture the transient interference structure, which may emerge from the construction of the graph. The graph parametrization is then viewed as a data-driven model of the transients and utilized to define a filter that extracts the transients from noisy speech measurements. Unlike previous transient interference suppression studies, in this work the graph is constructed in advance from training recordings. Then, the graph is extended to newly acquired measurements, providing a sequential filtering framework for noisy speech.
We address the application of the linearly constrained minimum variance (LCMV) beamformer in sensor networks. In signal processing applications, it is common to have a redundancy in the number of nodes, fully covering the area of interest. Here we consider suboptimal LCMV beamformers utilizing only a subset of the available sensors for signal enhancement applications. Multiple desired and interfering sources scenarios in multipath environments are considered. We assume that an oracle entity determines the group of sensors participating in the spatial filtering, denoted as the active sensors. The oracle is also responsible for updating the constraints set according to either sensors or sources activity or dynamics. Any update of the active sensors or of the constraints set necessitates recalculation of the beamformer and increases the power consumption. As power consumption is a most valuable resource in sensor networks, it is important to derive efficient update schemes. In this paper, we derive procedures for adding or removing either an active sensor or a constraint from an existing LCMV beamformer. Closed-form, as well as generalized sidelobe canceller (GSC)-form implementations, are derived. These procedures use the previous beamformer to save calculations in the updating process. We analyze the computational burden of the proposed procedures and show that it is much lower than the computational burden of the straightforward calculation of their corresponding beamformers.
Modeling natural and artificial systems has played a key role in various applications and has long been a task that has drawn enormous effort. In this work, instead of exploring predefined models, we aim to identify the system's degrees of freedom implicitly. This approach circumvents the dependency on a specific predefined model for a specific task or system and enables a generic data-driven method to characterize a system based solely on its output observations. We claim that each system can be viewed as a black box controlled by several independent parameters. Moreover, we assume that the perceptual characterization of the system output is determined by these independent parameters. Consequently, by recovering the independent controlling parameters, we in fact find a generic model for the system. In this work, we propose a supervised algorithm to recover the controlling parameters of natural and artificial linear systems. The proposed algorithm relies on nonlinear independent component analysis using diffusion kernels and spectral analysis. Applying the proposed algorithm to both synthetic and practical examples has shown accurate recovery of the controlling parameters.
A vector-sensor consisting of a monopole sensor collocated with orthogonally oriented dipole sensors is used for direction of arrival (DOA) estimation in the presence of an isotropic noise-field or internal device noise. A maximum likelihood (ML) DOA estimator is derived and subsequently shown to be a special case of DOA estimation by means of a search for the direction of maximum steered response power (SRP). The problem of SRP maximization with respect to a vector-sensor can be solved with a computationally inexpensive algorithm. The ML estimator achieves asymptotic efficiency and thus outperforms existing estimators with respect to the mean square angular error (MSAE) measure. The beampattern associated with the ML estimator is shown to be identical to that used by the minimum power distortionless response beamformer for the purpose of signal enhancement.
The Kalman filter is one of the most widely applied tools in the statistical signal processing field, especially in the context of causal online applications [1]. This article presents an introduction to the Kalman filter; the desired signal and its corresponding measurements are modeled, the Kalman filter is formulated and presented with an intuitive explanation of the involved equations, applications of the filter are given in the context of speech processing, and examples of two popular applications in speech enhancement and speaker tracking are provided.
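A generic linear-Gaussian predict/update recursion of the kind reviewed in the article can be sketched as follows; the state-space model (constant-velocity state, noisy position measurements) and all parameter values are illustrative and not taken from the article's speech-processing examples.

```python
# A minimal Kalman filter sketch for the linear-Gaussian state-space model
#   x_k = F x_{k-1} + w_k,  w_k ~ N(0, Q)
#   y_k = H x_k + v_k,      v_k ~ N(0, R)
import numpy as np

def kalman_step(x, P, y, F, H, Q, R):
    """One predict/update cycle; returns the posterior mean and covariance."""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_post = x_pred + K @ (y - H @ x_pred)
    P_post = (np.eye(len(x)) - K @ H) @ P_pred
    return x_post, P_post

# Toy usage: tracking position/velocity from noisy position measurements.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 1e-3 * np.eye(2)
R = np.array([[0.05]])
rng = np.random.default_rng(5)
x, P = np.zeros(2), np.eye(2)
true = np.zeros(2)
for _ in range(100):
    true = F @ true + rng.multivariate_normal(np.zeros(2), Q)
    y = H @ true + rng.normal(scale=np.sqrt(R[0, 0]), size=1)
    x, P = kalman_step(x, P, y, F, H, Q, R)
print("final estimate:", x, "truth:", true)
```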
Particle filtering has been shown to be an effective approach to solving the problem of acoustic source localization in reverberant environments. In a reverberant environment, the direct arrival of the single source is accompanied by multiple spurious arrivals. A multiple-hypothesis model associated with these arrivals can be used to alleviate the unreliability often attributed to the acoustic source localization problem. Until recently, this multiple-hypothesis approach was only applied to bootstrap-based particle filter schemes. Recently, the extended Kalman particle filter (EPF) scheme, which allows for an improved tracking capability, was proposed for the localization problem. The EPF scheme utilizes a global extended Kalman filter (EKF) which strongly depends on prior knowledge of the correct hypotheses. Consequently, extending the multiple-hypothesis model to this scheme is not trivial. In this paper, the EPF scheme is adapted to the multiple-hypothesis model to track a single acoustic source in reverberant environments. Our work is supported by an extensive experimental study using both simulated data and data recorded in our acoustic lab. Various algorithms and array constellations were evaluated. The results demonstrate the superiority of the proposed algorithm in both tracking and switching scenarios. It is further shown that splitting the array into several sub-arrays improves the robustness of the estimated source location.
Enhancement of speech signals for hands-free communication systems has attracted significant research effort in the last few decades. Still, many aspects and applications remain open and require further research. One of the important open problems is single-channel transient noise reduction. In this paper, we present a novel approach for transient noise reduction that relies on non-local (NL) neighborhood filters. In particular, we propose an algorithm for the enhancement of a speech signal contaminated by repeating transient noise events. We assume that the time duration of each reoccurring transient event is relatively short compared to speech phonemes and model the speech source as an auto-regressive (AR) process. The proposed algorithm consists of two stages. In the first stage, we estimate the power spectral density (PSD) of the transient noise by employing an NL neighborhood filter. In the second stage, we utilize the optimally modified log spectral amplitude (OM-LSA) estimator to denoise the speech using the noise PSD estimate from the first stage. Based on a statistical model for the measurements and a diffusion interpretation of NL filtering, we obtain further insight into the algorithm's behavior. In particular, for a given transient noise, we determine whether estimation of the noise PSD is feasible using our approach, how to properly set the algorithm parameters, and what performance can be expected. An experimental study shows good results in enhancing speech signals contaminated by transient noise, such as typical household noises, construction sounds, keyboard typing, and metronome clacks.
An acoustic vector sensor provides measurements of both the pressure and the particle velocity of the sound field in which it is placed. These measurements are vectorial in nature and can be used for the purpose of source localization. A straightforward approach towards determining the direction of arrival (DOA) utilizes the acoustic intensity vector, which is the product of pressure and particle velocity. The accuracy of an intensity vector based DOA estimator in the presence of noise has been analyzed previously. In this paper, the effects of reverberation upon the accuracy of such a DOA estimator are examined. It is shown that particular realizations of reverberation differ from an ideal isotropically diffuse field and induce an estimation bias which is dependent upon the room impulse responses (RIRs). The limited knowledge available pertaining to the RIRs is expressed statistically by employing the diffuse qualities of reverberation to extend Polack's statistical RIR model. Expressions for evaluating the typical bias magnitude as well as its probability distribution are derived.
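The straightforward intensity-based estimator mentioned above can be sketched in a few lines. The toy below forms the time-averaged intensity vector from free-field, noiseless pressure and particle-velocity signals and normalizes it to obtain a DOA estimate; the sign and scaling conventions for the velocity are assumptions of this illustration, and the reverberation-induced bias analyzed in the paper is not modeled.

```python
# Sketch of intensity-based DOA estimation with an acoustic vector sensor:
# take the direction of the time-averaged intensity (pressure times velocity).
import numpy as np

rng = np.random.default_rng(6)
n = 16000
doa_true = np.array([0.6, 0.8, 0.0])     # unit vector towards the source (toy)
p = rng.normal(size=n)                   # pressure at the vector sensor
v = np.outer(doa_true, p)                # particle velocity, v ∝ u * p
                                         # (sign/scale conventions assumed here)

intensity = (v * p).mean(axis=1)         # time-averaged intensity vector
doa_est = intensity / np.linalg.norm(intensity)
print("estimated DOA:", np.round(doa_est, 3))
```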
We consider a bidirectional time division duplex (TDD) multiple-input multiple-output (MIMO) communication system with a time-varying channel and additive white Gaussian noise (AWGN). A blind bidirectional channel tracking algorithm, based on the projection approximation subspace tracking (PAST) algorithm, is applied at both terminals. The resulting singular value decomposition (SVD) of the channel matrix is then used to approximately diagonalize the channel. The proposed method is applied to an orthogonal frequency-division multiplexing (OFDM) MIMO setting with a typical indoor time-domain reflection model. The computational cost of the proposed algorithm, compared with other state-of-the-art algorithms, is relatively small. The Kalman filter is utilized for establishing a benchmark for the obtainable performance of the proposed tracking algorithm. The performance degradation relative to full channel state information (CSI), due to the application of the tracking algorithm, is evaluated in terms of the average effective rate and the outage probability and is compared with alternative tracking algorithms. The obtained results are also compared with a benchmark obtained by the Kalman filter with known input signal and channel characteristics. It is shown that the expected degradation in performance of frequency-domain algorithms (which do not exploit the smooth frequency response of the channel) is only minor compared with time-domain algorithms in a range of reasonable signal-to-noise ratio (SNR) levels. The bidirectional frequency-domain tracking algorithm proposed in this paper is shown to attain communication rates close to the benchmark and to outperform a competing algorithm. The paper is concluded by evaluating the proposed blind tracking method in terms of the outage probability and the symbol error rate (SER) versus SNR for binary phase shift keying (BPSK) and 4-quadrature amplitude modulation (QAM) constellations.
The minimum variance distortionless response (MVDR) beamformer, also known as Capon's beamformer, is widely studied in the area of speech enhancement. The MVDR beamformer can be used for both speech dereverberation and noise reduction. This paper provides new insights into the MVDR beamformer. Specifically, the local and global behavior of the MVDR beamformer is analyzed, and novel forms of the MVDR filter are derived and discussed. In earlier works it was observed that there is a tradeoff between the amount of speech dereverberation and noise reduction when the MVDR beamformer is used. Here, this tradeoff is analyzed thoroughly. The local and global behavior, as well as the tradeoff, is analyzed for different noise fields, such as a mixture of coherent and non-coherent noise fields, entirely non-coherent noise fields, and diffuse noise fields. It is shown that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only. The amount of noise reduction that is sacrificed when complete dereverberation is required depends on the direct-to-reverberation ratio of the acoustic impulse response between the source and the reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction. When both speech dereverberation and noise reduction are desired, the results also demonstrate that the amount of noise reduction that is sacrificed decreases as the number of microphones increases.
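For reference, the narrowband MVDR weight computation underlying this discussion reduces to a one-liner per frequency bin; the sketch below uses toy statistics, and the choice of steering vector (full ATF versus direct-path steering only) is exactly where the dereverberation/noise-reduction tradeoff enters.

```python
# Narrowband MVDR weights at a single frequency: w = (Phi^-1 d) / (d^H Phi^-1 d),
# with Phi the noise covariance matrix and d the chosen steering vector.
import numpy as np

def mvdr_weights(d, Phi):
    """Distortionless minimum-variance weights for steering vector d."""
    Phi_inv_d = np.linalg.solve(Phi, d)
    return Phi_inv_d / (d.conj() @ Phi_inv_d)

# Toy usage with M = 4 microphones.
rng = np.random.default_rng(7)
M = 4
d = np.exp(1j * rng.uniform(0, 2 * np.pi, M))      # toy steering vector
V = rng.normal(size=(M, 400)) + 1j * rng.normal(size=(M, 400))
Phi = V @ V.conj().T / 400                         # noise covariance estimate
w = mvdr_weights(d, Phi)
print("distortionless response:", np.round(w.conj() @ d, 3))   # ~ 1 + 0j
```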
In speech communication systems the received microphone signals are degraded by room reverberation and ambient noise that decrease the fidelity and intelligibility of the desired speaker. Reverberant speech can be separated into two components, viz. early speech and late reverberant speech. Recently, various algorithms have been developed to suppress late reverberant speech. One of the main challenges is to develop an estimator for the so-called late reverberant spectral variance (LRSV) which is required by most of these algorithms. In this letter a statistical reverberation model is proposed that takes the energy contribution of the direct-path into account. This model is then used to derive a more general LRSV estimator, which in a particular case reduces to an existing LRSV estimator. Experimental results show that the developed estimator is advantageous in case the source-microphone distance is smaller than the critical distance.
In this paper, we propose a convolutive transfer function generalized sidelobe canceler (CTF-GSC), which is an adaptive beamformer designed for multichannel speech enhancement in reverberant environments. Using a complete system representation in the short-time Fourier transform (STFT) domain, we formulate a constrained minimization problem of the total output noise power, subject to the constraint that the signal component of the output is the desired signal, up to some prespecified filter. Then, we employ the generalized sidelobe canceler (GSC) structure to transform the problem into an equivalent unconstrained form by decoupling the constraint and the minimization. The CTF-GSC is obtained by applying the convolutive transfer function (CTF) approximation to the GSC scheme, which is more accurate and less restrictive than the multiplicative transfer function (MTF) approximation. Experimental results demonstrate that the proposed beamformer outperforms the transfer function GSC (TF-GSC) in reverberant environments and achieves both improved noise reduction and reduced speech distortion.
In many practical environments we wish to extract several desired speech signals, which are contaminated by nonstationary and stationary interfering signals. The desired signals may also be subject to distortion imposed by the acoustic room impulse responses (RIRs). In this paper, a linearly constrained minimum variance (LCMV) beamformer is designed for extracting the desired signals from multimicrophone measurements. The beamformer satisfies two sets of linear constraints. One set is dedicated to maintaining the desired signals, while the other set is chosen to mitigate both the stationary and nonstationary interferences. Unlike classical beamformers, which approximate the RIRs as delay-only filters, we take into account the entire RIR [or its respective acoustic transfer function (ATF)]. The LCMV beamformer is then reformulated in a generalized sidelobe canceler (GSC) structure, consisting of a fixed beamformer (FBF), blocking matrix (BM), and adaptive noise canceler (ANC). It is shown that for a spatially white noise field, the beamformer reduces to an FBF satisfying the constraint sets, without power minimization. We further show that the adaptive ANC contributes to interference reduction, but only when the constraint sets are not completely satisfied. We show that relative transfer functions (RTFs), which relate the desired speech sources and the microphones, together with a basis for the interference subspace, suffice for constructing the beamformer. The RTFs are estimated by applying the generalized eigenvalue decomposition (GEVD) procedure to the power spectral density (PSD) matrices of the received signals and the stationary noise. A basis for the interference subspace is estimated by collecting eigenvectors, calculated in segments where nonstationary interfering sources are active and the desired sources are inactive. The rank of the basis is then reduced by applying the orthogonal triangular decomposition (QRD). This procedure relaxes the common requirement for nonoverlapping activity periods of the interference sources. A comprehensive experimental study in both simulated and real environments demonstrates the performance of the proposed beamformer.
In this paper, we present a relative transfer function (RTF) identification method for speech sources in reverberant environments. The proposed method is based on the convolutive transfer function (CTF) approximation, which enables the representation of a linear convolution in the time domain as a linear convolution in the short-time Fourier transform (STFT) domain. Unlike the restrictive and commonly used multiplicative transfer function (MTF) approximation, which becomes accurate only when the length of a time frame is large relative to the length of the impulse response, the CTF approximation enables the representation of long impulse responses using short time frames. We develop an unbiased RTF estimator that exploits the nonstationarity and presence probability of the speech signal, and we derive an analytic expression for the estimator variance. Experimental results show that the proposed method is advantageous compared to common RTF identification methods in various acoustic environments, especially when identifying long RTFs typical of real rooms.
Noise fields encountered in real-life scenarios can often be approximated as spherical or cylindrical noise fields. The characteristics of the noise field can be described by a spatial coherence function. For simulation purposes, researchers in the signal processing community often require sensor signals that exhibit a specific spatial coherence function. In addition, they often require a specific type of noise such as temporally correlated noise, babble speech that comprises a mixture of mutually independent speech fragments, or factory noise. Existing algorithms are unable to generate sensor signals such as babble speech and factory noise observed in an arbitrary noise field. In this paper an efficient algorithm is developed that generates multisensor signals under a predefined spatial coherence constraint. The benefit of the developed algorithm is twofold. Firstly, there are no restrictions on the spatial coherence function. Secondly, to generate M sensor signals the algorithm requires only M mutually independent noise signals. The performance evaluation shows that the developed algorithm is able to generate a more accurate spatial coherence between the generated sensor signals compared to the so-called image method that is frequently used in the signal processing community.
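The core idea can be sketched by mixing, at each frequency, M mutually independent noise spectra with a square root of the target coherence matrix. The sketch below uses a Cholesky factor and a sinc (3-D diffuse) target for a toy line array; the published algorithm includes further refinements (choice of decomposition, handling of arbitrary noise types such as babble or factory noise) that are not reproduced, and all geometry and parameters are illustrative.

```python
# Minimal sketch: generate M sensor spectra with a prescribed spatial coherence
# by mixing M independent noise spectra with a Cholesky factor per frequency.
import numpy as np

def generate_coherent_noise(coh, n_frames):
    """coh: F x M x M target coherence matrices (Hermitian, PSD).
    Returns F x M x n_frames mixed STFT coefficients."""
    F, M, _ = coh.shape
    rng = np.random.default_rng(8)
    out = np.empty((F, M, n_frames), dtype=complex)
    for f in range(F):
        # Independent unit-variance noise spectra for this frequency bin.
        n = (rng.normal(size=(M, n_frames))
             + 1j * rng.normal(size=(M, n_frames))) / np.sqrt(2)
        C = np.linalg.cholesky(coh[f] + 1e-9 * np.eye(M))   # mixing matrix
        out[f] = C @ n
    return out

# Toy usage: 3-D diffuse (sinc) coherence for a 3-microphone line array.
c, d = 343.0, 0.08                                   # speed of sound, spacing [m]
freqs = np.linspace(1.0, 8000.0, 257)
pos = np.arange(3) * d
dist = np.abs(pos[:, None] - pos[None, :])
coh = np.array([np.sinc(2 * f * dist / c) for f in freqs])   # sin(kd)/(kd)
X = generate_coherent_noise(coh, 2000)
est = X[128] @ X[128].conj().T / 2000                # sample covariance, one bin
est /= np.sqrt(np.outer(np.diag(est), np.diag(est)).real)
print(np.round(est.real, 2))                         # ~ target coherence at that bin
```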
Hands-free devices are often used in a noisy and reverberant environment. Therefore, the received microphone signal does not only contain the desired near-end speech signal but also interferences such as room reverberation that is caused by the near-end source, background noise and a far-end echo signal that results from the acoustic coupling between the loudspeaker and the microphone. These interferences degrade the fidelity and intelligibility of near-end speech. In the last two decades, post filters have been developed that can be used in conjunction with a single microphone acoustic echo canceller to enhance the near-end speech. In previous works, spectral enhancement techniques have been used to suppress residual echo and background noise for single microphone acoustic echo cancellers. However, dereverberation of the near-end speech was not addressed in this context. Recently, practically feasible spectral enhancement techniques to suppress reverberation have emerged. In this paper, we derive a novel spectral variance estimator for the late reverberation of the near-end speech. Residual echo will be present at the output of the acoustic echo canceller when the acoustic echo path cannot be completely modeled by the adaptive filter. A spectral variance estimator for the so-called late residual echo that results from the deficient length of the adaptive filter is derived. Both estimators are based on a statistical reverberation model. The model parameters depend on the reverberation time of the room, which can be obtained using the estimated acoustic echo path. A novel postfilter is developed which suppresses late reverberation of the near-end speech, residual echo and background noise, and maintains a constant residual background noise level. Experimental results demonstrate the beneficial use of the developed system for reducing reverberation, residual echo, and background noise.
A full-duplex hands-free man/machine interface often suffers from directional nonstationary interference, such as a competing speaker, as well as from stationary interferences that may comprise both directional and nondirectional signals. The transfer-function generalized sidelobe canceller (TF-GSC) exploits the nonstationarity of the speech signal to enhance it when the undesired interfering signals are stationary. Unfortunately, the assumptions leading to the derivation of the TF-GSC are violated when a nonstationary interference is present. In this paper, we propose an adaptive beamformer, based on the TF-GSC, that is suitable for cancelling nonstationary interferences in noisy reverberant environments. We modify two of the TF-GSC components to enable suppression of the nonstationary undesired signal. A modified fixed beamformer (FBF) is designed to block the nonstationary interfering signal while maintaining the desired speech signal. A modified blocking matrix (BM) is designed to block both the desired signal and the nonstationary interference. We introduce a novel method for updating the blocking matrix in double-talk scenarios, which exploits the nonstationarity of both the desired and interfering speech signals. Experimental results demonstrate the performance of the proposed algorithm in noisy and reverberant environments and show its superiority over the original TF-GSC.
In this work, we evaluate the performance of a recently proposed adaptive beamformer, namely the dual-source transfer-function generalized sidelobe canceller (DTF-GSC). The DTF-GSC is useful for enhancing a speech signal received by an array of microphones in a noisy and reverberant environment. We demonstrate the applicability of the DTF-GSC in some representative reverberant and non-reverberant environments under various noise field conditions. The performance is evaluated based on the power spectral density (PSD) deviation imposed on the desired signal at the beamformer output, the achievable noise reduction, and the interference reduction. We show that the resulting expressions for the PSD deviation and noise reduction depend on the actual acoustical environment, the noise field, and the estimation accuracy of the relative transfer functions (RTFs), defined as the ratio between each acoustical transfer function (ATF) and a reference ATF. The achievable interference reduction is generally independent of the noise field. Experimental results demonstrate the sensitivity of the system’s performance to array misalignments.
Man-machine interaction requires an acoustic interface for providing full-duplex hands-free communication. The transfer-function generalized sidelobe canceller (TF-GSC) is an adaptive beamformer suitable for enhancing a speech signal received by an array of microphones in a noisy and reverberant environment. When an echo signal is also present in the microphone output signals, cascade schemes of acoustic echo cancellation and TF-GSC can be employed for suppressing both interferences. However, the performance obtainable by cascade schemes is generally insufficient. An acoustic echo canceller (AEC) that precedes the adaptive beamformer suffers from the noise component at its input. Acoustic echo cancellation following the adaptive beamformer lacks robustness due to time variations in the echo path affecting beamformer adaptation. In this paper, we introduce an echo transfer-function generalized sidelobe canceller (ETF-GSC), which combines the TF-GSC with an acoustic echo canceller. The proposed scheme consists of a primary TF-GSC for dealing with the noise interferences, and a secondary modified TF-GSC for dealing with the echo cancellation. The secondary TF-GSC includes an echo canceller embedded within a replica of the primary TF-GSC components. We show that using this structure, the problems encountered in the cascade schemes can be appropriately avoided. Experimental results demonstrate improved performance of the ETF-GSC compared to cascade schemes in noisy and reverberant environments.
The advantages of optics, which include processing speed, information throughput, modularity, and versatility, can be brought to bear on Viterbi decoding, one of the most interesting and widely applied topics in digital communication. We aim to accelerate the processing rate and capabilities of Viterbi decoders applied to convolutional codes, speech recognition, and intersymbol interference (ISI) mitigation problems. The suggested configuration for realizing the decoder is based upon fast optical switches. The configuration is highly modular and can easily be scaled to Viterbi decoders based upon state machines with a larger number of states and a greater trellis depth.
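For reference, the trellis search that such an optical configuration would realize is the standard add-compare-select recursion; the sketch below shows it for a small rate-1/2 convolutional code (constraint length 3, generators 7 and 5 octal) with hard decisions, as a software illustration only, not the optical architecture of the paper.

```python
# Minimal sketch of a hard-decision Viterbi decoder for a rate-1/2
# convolutional code with constraint length 3 (generators 7 and 5 in octal).
import numpy as np

G = [0b111, 0b101]            # generator polynomials
K = 3                         # constraint length
N_STATES = 1 << (K - 1)

def encode(bits):
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state
        out += [bin(reg & g).count("1") & 1 for g in G]   # parity per generator
        state = reg >> 1
    return out

def viterbi_decode(received):
    n_steps = len(received) // len(G)
    path_metric = np.full(N_STATES, np.inf)
    path_metric[0] = 0.0                      # encoder starts in the all-zero state
    paths = [[] for _ in range(N_STATES)]
    for t in range(n_steps):
        r = received[t * len(G):(t + 1) * len(G)]
        new_metric = np.full(N_STATES, np.inf)
        new_paths = [None] * N_STATES
        for state in range(N_STATES):
            if not np.isfinite(path_metric[state]):
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = [bin(reg & g).count("1") & 1 for g in G]
                branch = sum(e != x for e, x in zip(expected, r))  # Hamming distance
                nxt = reg >> 1
                metric = path_metric[state] + branch
                if metric < new_metric[nxt]:  # compare-select
                    new_metric[nxt] = metric
                    new_paths[nxt] = paths[state] + [b]
        path_metric, paths = new_metric, new_paths
    return paths[int(np.argmin(path_metric))]

bits = [1, 0, 1, 1, 0, 0, 1]
assert viterbi_decode(encode(bits)) == bits   # error-free channel round trip
```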
A dual-step approach for speaker localization based on a microphone array is addressed in this paper. In the first stage, which is not the main concern of this paper, the time difference between arrivals of the speech signal at each pair of microphones is estimated. These readings are combined in the second stage to obtain the source location. In this paper, we focus on the second stage of the localization task and propose to exploit the speaker’s smooth trajectory for improving the current position estimate. Three localization schemes, which use the temporal information, are presented. The first is a recursive form of the Gauss method. The other two are extensions of the Kalman filter to the nonlinear problem at hand, namely, the extended Kalman filter and the unscented Kalman filter. These methods are compared with other algorithms, which do not make use of the temporal information. An extensive experimental study demonstrates the advantage of using the spatial-temporal methods. To gain some insight on the obtainable performance of the localization algorithm, an approximate analytical evaluation, verified by an experimental study, is conducted. This study shows that in common TDOA-based localization scenarios, where the microphone array has a small interelement spread relative to the source position, the elevation and azimuth angles can be accurately estimated, whereas the Cartesian coordinates as well as the range are poorly estimated.
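The nonlinear least-squares step that underlies this second stage can be sketched as a few Gauss-Newton iterations on the TDOA measurement model. The abstract's methods are recursive (a recursive Gauss method, an EKF, and a UKF) and exploit the trajectory across time; the sketch below only shows a simplified batch refinement for a single set of readings, with illustrative names.

```python
# Minimal sketch: refine a source position from TDOAs (relative to the first
# microphone) with Gauss-Newton iterations on the range-difference model.
import numpy as np

C = 343.0  # speed of sound [m/s]

def predicted_tdoa(p, mics):
    """TDOAs of source position p relative to the first microphone."""
    d = np.linalg.norm(mics - p, axis=1)
    return (d[1:] - d[0]) / C

def gauss_newton_locate(tdoa_meas, mics, p0, n_iter=10):
    p = p0.astype(float).copy()
    for _ in range(n_iter):
        d = np.linalg.norm(mics - p, axis=1)
        unit = (p - mics) / d[:, None]        # unit vectors mic -> source
        J = (unit[1:] - unit[0]) / C          # Jacobian of predicted TDOAs w.r.t. p
        r = tdoa_meas - predicted_tdoa(p, mics)
        p += np.linalg.lstsq(J, r, rcond=None)[0]
    return p
```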
Determining the spatial position of a speaker is of growing interest in video conference scenarios where automated camera steering and tracking are required. Speaker localization can be achieved with a dual-step approach. In the preliminary stage, a microphone array is used to extract the time difference of arrival (TDOA) of the speech signal. These readings are then used by the second stage for the actual localization. In this work, we present novel frequency-domain approaches for TDOA calculation in a reverberant and noisy environment. Our methods are based on the quasi-stationarity of speech, the stationarity of the noise, and the fact that the speech and the noise are uncorrelated. The mathematical derivations in this work are followed by an extensive experimental study which involves static and tracking scenarios.
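As a point of reference for such frequency-domain TDOA estimators, the sketch below shows the classical generalized cross-correlation with phase transform (GCC-PHAT) between two microphones. It is not the paper's estimator, which additionally exploits speech quasi-stationarity and noise stationarity; names and defaults are illustrative.

```python
# Minimal sketch of a GCC-PHAT time-difference-of-arrival estimate.
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_delay=None):
    """Return the estimated delay of x2 relative to x1, in seconds
    (positive when x2 arrives later)."""
    n = 2 * max(len(x1), len(x2))             # zero-padding avoids wrap-around
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)   # PHAT weighting
    max_shift = n // 2 if max_delay is None else int(max_delay * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```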
In speech enhancement applications, microphone-array postfiltering allows additional reduction of noise components at a beamformer output. Among microphone array structures, the recently proposed general transfer-function generalized sidelobe canceller (TF-GSC) has shown impressive noise reduction abilities in a directional noise field, while still maintaining low speech distortion. However, in a diffuse noise field, less significant noise reduction is obtainable. The performance is further degraded when the noise signal is nonstationary. In this contribution, we propose three postfiltering methods for improving the performance of microphone arrays. Two of these are based on single-channel speech enhancers and make use of recently proposed algorithms concatenated to the beamformer output. The third is a multichannel speech enhancer which exploits noise-only components constructed within the TF-GSC structure. This work concentrates on the assessment of the proposed postfiltering structures. An extensive experimental study, which consists of both objective and subjective evaluation in various noise fields, demonstrates the advantage of the multichannel postfiltering compared to the single-channel techniques.
In recent work, we considered a microphone array located in a reverberant room, where general transfer functions (TFs) relate the source signal and the microphones, for enhancing a speech signal contaminated by interference. It was shown that it is sufficient to use the ratio between the different TFs rather than the TFs themselves in order to implement the suggested algorithm. An unbiased estimate of the TF ratios was obtained by exploiting the nonstationarity of the speech signal. In this correspondence, we present an analysis of a distortion indicator, namely the power spectral density (PSD) deviation, imposed on the desired signal by our newly suggested transfer-function generalized sidelobe canceller (TF-GSC) algorithm. It is well known that for speech signals, the PSD deviation between the reconstructed signal and the original one is the main contributor to speech quality degradation. As we are mainly dealing with speech signals, we analyze the PSD deviation rather than the regular waveform distortion. The resulting expression depends on the TFs involved, the noise field, and the quality of estimation of the TF ratios. For the latter dependency, we provide an approximate analysis of the estimation procedure, which is based on the signal’s nonstationarity, and explore its dependency on the actual speech signal and on the signal-to-noise ratio (SNR) level. The theoretical expression is then used to establish an empirical evaluation of the PSD deviation for several TFs of interest, various noise fields, and a wide range of SNR levels. It is shown that only a minor amount of PSD deviation is imposed on the beamformer output. The analysis presented in this correspondence is in good agreement with the actual performance reported in the former TF-GSC paper.
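An empirical version of such a PSD-deviation indicator can be computed directly from the clean reference and the desired-signal component at the beamformer output, as in the sketch below. The exact definition analyzed in the correspondence may differ; the Welch-based average in dB and the parameter names are assumptions for illustration.

```python
# Minimal sketch of an empirical PSD-deviation indicator between a clean
# reference and the processed desired-signal component, averaged over frequency.
import numpy as np
from scipy.signal import welch

def psd_deviation_db(reference, processed, fs, nperseg=512):
    _, P_ref = welch(reference, fs, nperseg=nperseg)
    _, P_out = welch(processed, fs, nperseg=nperseg)
    return float(np.mean(np.abs(10.0 * np.log10(P_out / P_ref))))
```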
A novel approach for multimicrophone speech dereverberation is presented. The method is based on the construction of the null subspace of the data matrix in the presence of colored noise, using the generalized singular-value decomposition (GSVD) technique, or the generalized eigenvalue decomposition (GEVD) of the respective correlation matrices. The special Sylvester structure of the filtering matrix related to this subspace is exploited for deriving a total least squares (TLS) estimate of the acoustical transfer functions (ATFs). Other less robust but computationally more efficient methods are derived based on the same structure and on the QR decomposition (QRD). A preliminary study of the incorporation of the subspace method into a subband framework proves to be efficient, although some problems remain open. Speech reconstruction is achieved by means of the matched-filter beamformer (MFBF). An experimental study supports the potential of the proposed methods.
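The first step, extracting a null subspace from a data correlation matrix against a noise correlation matrix, can be sketched with a generalized eigenvalue decomposition as below. The subsequent exploitation of the Sylvester structure to obtain the TLS estimate of the ATFs is not shown; names and dimensions are illustrative.

```python
# Minimal sketch of extracting a (noise-whitened) null subspace via a
# generalized eigenvalue decomposition of (R_data, R_noise).
import numpy as np
from scipy.linalg import eigh

def null_subspace(R_data, R_noise, null_dim):
    """Return the generalized eigenvectors associated with the null_dim
    smallest generalized eigenvalues of (R_data, R_noise)."""
    eigvals, eigvecs = eigh(R_data, R_noise)   # eigenvalues in ascending order
    return eigvecs[:, :null_dim]
```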
We present a novel approach for real-time multichannel speech enhancement in environments of nonstationary noise and time-varying acoustical transfer functions (ATFs). The proposed system integrates adaptive beamforming, ATF identification, soft signal detection, and multichannel postfiltering. The noise canceller branch of the beamformer and the ATF identification are adaptively updated online, based on hypothesis test results. The noise canceller is updated only during stationary noise frames, and the ATF identification is carried out only when desired source components have been detected. The hypothesis testing is based on the nonstationarity of the signals and the transient power ratio between the beamformer primary output and its reference noise signals. Following the beamforming and the hypothesis testing, estimates for the signal presence probability and for the noise power spectral density are derived. Subsequently, an optimal spectral gain function that minimizes the mean square error of the log-spectral amplitude (LSA) is applied. Experimental results demonstrate the usefulness of the proposed system in nonstationary noise environments.
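The spectral gain referred to here is the classical log-spectral amplitude (LSA) estimator, which the system combines with the estimated signal presence probability and noise PSD. A minimal sketch of the LSA gain itself is given below, assuming the a priori SNR xi and a posteriori SNR gamma are already available (e.g., from a decision-directed recursion); the presence-probability weighting of the full system is not shown.

```python
# Minimal sketch of the Ephraim-Malah log-spectral amplitude (LSA) gain.
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, gamma):
    """xi: a priori SNR, gamma: a posteriori SNR (arrays or scalars)."""
    v = xi * gamma / (1.0 + xi)
    # G = xi/(1+xi) * exp( 0.5 * E1(v) ), with E1 the exponential integral
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
```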
We address the problem of cancelling a stationary noise component from its static mixtures with a nonstationary signal of interest. Two different approaches, both based on second-order statistics, are considered. The first is the blind source separation (BSS) approach, which aims at estimating the mixing parameters via approximate joint diagonalization of estimated correlation matrices. Proper exploitation of the nonstationary nature of the desired signal, in contrast to the stationarity of the noise, allows parameterization of the joint diagonalization problem in terms of a nonlinear weighted least squares (WLS) problem. The second approach is a denoising approach, which translates into direct estimation of just one of the mixing coefficients via the solution of a linear WLS problem, followed by the use of this coefficient to create a noise-only signal to be properly eliminated from the mixture. Under certain assumptions, the BSS approach is asymptotically optimal, yet computationally more intensive, since it involves an iterative nonlinear WLS solution, whereas the second approach only requires a closed-form linear WLS solution. We analyze and compare the performance of the two approaches and provide simulation results which confirm our analysis. A comparison to other methods is also provided.
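The denoising idea can be sketched as follows: because the noise is stationary, differences of frame-wise second-order statistics cancel its contribution, and one mixing coefficient follows from a simple least-squares slope fit. The sketch uses ordinary (unweighted) least squares for brevity, whereas the paper derives a weighted LS solution; the mixture model x1 = s + a*v, x2 = b*s + v and all names are illustrative.

```python
# Minimal sketch: estimate the coefficient b with which the nonstationary
# signal s leaks into the second mixture, from frame-wise power estimates.
import numpy as np

def estimate_b(x1, x2, frame_len):
    n_frames = len(x1) // frame_len
    r11, r12 = [], []
    for i in range(n_frames):
        f1 = x1[i * frame_len:(i + 1) * frame_len]
        f2 = x2[i * frame_len:(i + 1) * frame_len]
        r11.append(np.mean(f1 * f1))     # frame-wise auto power of x1
        r12.append(np.mean(f1 * f2))     # frame-wise cross power of x1, x2
    d11 = np.array(r11) - np.mean(r11)   # differences remove the stationary (noise) part
    d12 = np.array(r12) - np.mean(r12)
    return float(np.dot(d11, d12) / np.dot(d11, d11))   # LS slope = b
```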
We present a spectral-domain speech enhancement algorithm. The new algorithm is based on a mixture model for the short-time spectrum of the clean speech signal, and on a maximum assumption in the production of the noisy speech spectrum. In the past, this model was used in the context of noise-robust speech recognition. In this paper, we show that this model is also effective for improving the quality of speech signals corrupted by additive noise. The computational requirements of the algorithm can be significantly reduced, essentially without paying performance penalties, by incorporating a dual codebook scheme with tied variances. Experiments using recorded speech signals and actual noise sources show that, despite its low computational requirements, the algorithm provides improved performance compared to alternative speech enhancement algorithms.
We consider a sensor array located in an enclosure, where arbitrary transfer functions (TFs) relate the source signal and the sensors. The array is used for enhancing a signal contaminated by interference. Constrained minimum power adaptive beamforming, which has been suggested by Frost (1972) and, in particular, the generalized sidelobe canceler (GSC) version, which has been developed by Griffiths and Jim (1982), are the most widely used beamforming techniques. These methods rely on the assumption that the received signals are simple delayed versions of the source signal. The good interference suppression attained under this assumption is severely impaired in complicated acoustic environments, where arbitrary TFs may be encountered. In this paper, we consider the arbitrary TF case. We propose a GSC solution, which is adapted to the general TF case. We derive a suboptimal algorithm that can be implemented by estimating the TF ratios, instead of estimating the TFs themselves. The TF ratios are estimated by exploiting the nonstationarity characteristics of the desired signal. The algorithm is applied to the problem of speech enhancement in a reverberating room. The discussion is supported by an experimental study using speech and noise signals recorded in an actual room acoustics environment.
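For orientation, the classical Griffiths-Jim GSC structure that this work generalizes is sketched below: a fixed beamformer, a blocking matrix producing noise references, and an adaptive (NLMS) noise canceler. The signals are assumed already time-aligned toward the desired source, i.e., the simple-delay model; replacing that model with estimated TF ratios, which is the contribution here, is not reproduced. Names, filter length, and step size are illustrative.

```python
# Minimal sketch of a delay-and-sum generalized sidelobe canceler with an
# adjacent-difference blocking matrix and an NLMS adaptive noise canceler.
import numpy as np

def gsc(x, filt_len=32, mu=0.1, eps=1e-6):
    """x: time-aligned microphone signals, shape (M, n_samples).  Returns output."""
    M, n = x.shape
    fbf = x.mean(axis=0)                      # fixed (delay-and-sum) beamformer
    refs = x[1:] - x[:-1]                     # blocking matrix: adjacent differences
    W = np.zeros((M - 1, filt_len))           # adaptive noise-canceler filters
    y = np.zeros(n)
    for t in range(filt_len, n):
        U = refs[:, t - filt_len:t][:, ::-1]  # most recent reference samples first
        y[t] = fbf[t] - np.sum(W * U)         # subtract the adaptive noise estimate
        # NLMS update that drives the output power down
        W += mu * y[t] * U / (np.sum(U * U) + eps)
    return y
```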
Speech quality and intelligibility might significantly deteriorate in the presence of background noise, especially when the speech signal is subject to subsequent processing. In particular, speech coders and automatic speech recognition (ASR) systems that were designed or trained to act on clean speech signals might be rendered useless in the presence of background noise. Speech enhancement algorithms have therefore attracted a great deal of interest. In this paper, we present a class of Kalman filter-based algorithms with some extensions, modifications, and improvements of previous work. The first algorithm employs the estimate-maximize (EM) method to iteratively estimate the spectral parameters of the speech and noise signals. The enhanced speech signal is obtained as a byproduct of the parameter estimation algorithm. The second algorithm is a sequential, computationally efficient, gradient descent algorithm. We discuss various topics concerning the practical implementation of these algorithms. An extensive experimental study using real speech and noise signals is provided to compare these algorithms with alternative speech enhancement algorithms, and to compare the performance of the iterative and sequential algorithms.
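The filtering step that underlies this class of algorithms can be sketched as below: the clean speech is modeled as an AR(p) process observed in additive white noise, and a Kalman filter estimates it sample by sample. The AR coefficients and variances are assumed known here, whereas the paper's EM and gradient-descent algorithms estimate them from the noisy data; all names are illustrative.

```python
# Minimal sketch of Kalman-filter speech enhancement with a known AR(p) model.
import numpy as np

def kalman_enhance(y, a, q, r):
    """y: noisy signal; a: AR coefficients [a1..ap] of the speech model;
    q: speech excitation variance; r: observation (noise) variance."""
    p = len(a)
    F = np.zeros((p, p)); F[0, :] = a; F[1:, :-1] = np.eye(p - 1)  # companion form
    H = np.zeros(p); H[0] = 1.0                                    # observe current sample
    x = np.zeros(p); P = np.eye(p)
    out = np.zeros(len(y))
    for t, yt in enumerate(y):
        # predict
        x = F @ x
        P = F @ P @ F.T
        P[0, 0] += q                       # process noise enters the newest sample only
        # update
        k = P @ H / (H @ P @ H + r)        # Kalman gain
        x = x + k * (yt - H @ x)
        P = P - np.outer(k, H @ P)
        out[t] = x[0]                      # filtered estimate of the clean sample
    return out
```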
For IEEE papers:
© 19xx, 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.