Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

Paul G. Allen School of Computer Science & Engineering, University of Washington, USA
Microsoft, One Microsoft Way, Redmond, WA, USA


We present the first neural network model to achieve real-time, streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions to cover large receptive fields in a computationally efficient manner while retaining the performance benefits of transformer-based architectures. Our evaluations show a 2.2–3.3 dB improvement in SI-SNRi over prior models for this task, with a 1.2–4x smaller model size and a 1.5–2x lower runtime.
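
To illustrate why a stack of dilated causal convolutions covers a large receptive field cheaply, here is a minimal pure-Python sketch. It is not the Waveformer implementation: the kernel size of 3, the power-of-two dilation schedule, and the function names `receptive_field`, `causal_dilated_conv`, and `conv_stack` are assumptions made for this example only.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution over a list of floats.

    weights[0] multiplies the current sample and weights[k] multiplies the
    sample k * dilation steps in the past. Samples before the start of the
    signal are treated as zeros, so no output depends on future input.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def conv_stack(x, weights, dilations):
    """Apply one causal layer per dilation, reusing the same weights (illustrative)."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


# Assumed schedule: 10 layers with dilations 1, 2, 4, ..., 512.
dilated = [2 ** i for i in range(10)]
undilated = [1] * 10

print(receptive_field(3, dilated))    # 2047 samples
print(receptive_field(3, undilated))  # 21 samples
```

With the same number of layers and weights, the assumed dilated schedule sees 2047 past samples while the undilated stack sees only 21. Because each layer reads only current and past samples, the stack can run sample by sample on a live stream, which is the property a streaming model needs.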

[Paper] [Code] [Web App]