I am involved in four exciting research projects. I am looking for talented and committed Ph.D. students and/or Post-Doctoral trainees with a strong background in statistical signal processing and machine learning. Experience in audio signal processing and strong programming skills will be considered an advantage. If interested, please contact me.

Please see our sponsorship.

Socially Pertinent Robots in Gerontological Healthcare – SPRING

  • H2020 Consortium
  • Project duration: 2020-2023
  • Details

Scientific Objective

Develop a novel paradigm and concept of socially aware robots, and conceive innovative methods and algorithms for computer vision, audio processing, sensor-based control, and spoken dialog systems, based on modern statistical and deep learning, to ground the required social robot skills.

Technological Objective

Create and launch a brand new generation of robots that are flexible enough to adapt to the needs of the users, and not the other way around.

Experimental Objective

Validate the technology through human-robot interaction (HRI) experiments in a gerontology hospital, and assess its acceptability by patients and medical staff.


Combined Neural Interface and Deep Learning Methods for Multi-Microphone Assisted Listening and Selective Attention Devices


  • Ministry of Science
  • Project duration: 2020-2022
  • This is a joint project with Elana Zion-Golumbic and Jacob Goldberger




Modern-day environments are laden with rich stimuli, all competing for our attention, a reality that poses substantial challenges for the perceptual system. Focusing attention exclusively on one important speaker and avoiding distraction is a major feat, in particular for individuals with hearing or attentional impairment.

In this project, we propose a unique combination of methodologies from signal processing, machine learning, and brain research to jointly develop novel algorithms and evaluation procedures capable of extracting and enhancing a desired speaker in adverse acoustic scenarios. Specifically, we harness the power of deep neural networks (DNNs) for audio-processing classification tasks, develop approaches for training a DNN with multi-microphone speech recordings, and address the inherently dynamic nature of natural speech data. Moreover, the choice of which speaker to enhance will be guided automatically by the user’s momentary internal preferences, using a real-time EEG-based neural interface. This neuro-engineering approach is critical for developing stable technological solutions that are meant for human use and must adhere to real-life behavioural and environmental constraints. It can be applied more broadly in the design of new-generation “attentive” hearing devices, “hearables”, and teleconferencing systems.


Audio Processing Algorithms in Adverse Conditions

  • “Magneton” – Israel Innovation Authority, jointly with CEVA Ltd.
  • Project duration: 2020-2022




This project deals with developing single- and multi-microphone algorithms for speech enhancement in adverse conditions. Special attention will be given to hardware constraints and real-time requirements.


Audio-visual Referring Expressions

  • Bar-Ilan University, Data Science Institute, jointly with Gal Chechik
  • Project duration: 2020-2021




As rich datasets for visual and auditory modalities accumulate, the field of machine perception progresses towards the challenge of understanding rich scenes that involve multiple co-interacting entities. Going beyond low-level perception, semantic understanding of rich scenes is viewed as the next frontier of machine perception.

To evaluate understanding of rich scenes, one often aims at high-level decisions and reasoning about the scene. Specifically, research on language grounding aims to connect high-level representations in plain language with the perceived objects and situations they refer to. One important concrete example is the task of referring expressions, in which a model is trained to detect an object of interest based on a description of its visual properties. For instance, a deep network can find a “boy wearing a red hat” in an image of several boys.

We propose to advance the state of the art in this domain by generating perception signals that are more natural in two ways. First, they refer to dynamic scenes (videos) where objects can move around. Second, objects are also referred to by the sounds they emit, forming a unified audio-visual reasoning dataset.

Two main challenges are addressed in this project:

  1. Develop methods for generating a dataset of videos with a matching “3D” soundtrack, together with a high-level semantic representation that captures the dynamic audio-visual properties of objects.
  2. Develop algorithms to detect objects based jointly on the sounds they make and their visual appearance, as referred to in natural language.