Abstract

Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multiresolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multiresolution features for separating the singing voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.

BibTeX


@article{Grais_2018,
  author = {Grais, E. M. and Ward, D. and Plumbley, M. D.},
  title = {Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders},
  journal = {ArXiv e-prints},
  archiveprefix = {arXiv},
  eprint = {1803.00702},
  primaryclass = {cs.SD},
  keywords = {"maruss"},
  year = {2018},
  month = mar,
  url = {http://adsabs.harvard.edu/abs/2018arXiv180300702G},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}


Evaluation Results

We extracted the singing voice (vocals) from the 46 rock and pop songs that were used for the evaluation stage of the music separation challenge of SiSEC 2016. The performance of this system was assessed using the BSS Eval toolkit. For comparison, we also applied BSS Eval to the audio files submitted by participants of the SiSEC 2016 challenge. Note that these values differ from those found on the results page of the SiSEC MUS site, because we computed the measures over each entire signal rather than using the framewise method. The vocal separation performance measures can be downloaded as a CSV file, in which our proposed method is named “MRCAE”.

Audio Examples

The following examples are provided to show some of the best and worst of the estimated vocals and estimated accompaniment from the SiSEC 2016 dataset. The accompaniment was estimated by subtracting the extracted vocal from the original mixture. Source-to-distortion ratios (SDR; in decibels) refer to those measured on the singing voice over the entire audio signal, with the original vocals and accompaniment used as ground truth. Our average SDR across the 46 songs was 4.25 dB, with a standard deviation of 1.5 dB.

  • Song 42 (SDR: 7.1 dB)

  • Song 15 (SDR: 6.3 dB)

  • Song 20 (SDR: 3.1 dB)

  • Song 1 (SDR: 0.6 dB)
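
The accompaniment estimate and the whole-signal SDR described above can be illustrated with a minimal sketch. It is not the BSS Eval implementation used for the reported numbers: the function names `estimate_accompaniment` and `simple_sdr` are hypothetical, the signals are synthetic, and the SDR here is a simplified energy ratio that omits the allowed-distortion filtering BSS Eval applies, so its values will generally differ from those listed.

```python
import numpy as np


def estimate_accompaniment(mixture, vocals_est):
    """Residual accompaniment: the mixture minus the estimated vocals.

    Both arrays are assumed to have shape (samples, channels).
    """
    return mixture - vocals_est


def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB, computed over the entire signal.

    Assumption: a plain energy ratio, not the BSS Eval definition.
    """
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))


# Synthetic stereo signals standing in for real audio (1 s at 44.1 kHz).
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100.0
tone = np.sin(2 * np.pi * 220.0 * t)
vocals = np.stack([tone, tone], axis=1)            # "true" vocals
accomp = 0.5 * rng.standard_normal(vocals.shape)   # "true" accompaniment
mixture = vocals + accomp

# Pretend a separator returned slightly noisy vocals.
vocals_est = vocals + 0.1 * rng.standard_normal(vocals.shape)
accomp_est = estimate_accompaniment(mixture, vocals_est)

sdr_vocals = simple_sdr(vocals, vocals_est)
sdr_accomp = simple_sdr(accomp, accomp_est)
print(f"vocals SDR: {sdr_vocals:.2f} dB, accompaniment SDR: {sdr_accomp:.2f} dB")
```

Because the accompaniment is obtained by subtraction, any error in the estimated vocals appears with the opposite sign in the estimated accompaniment, and the two estimates always sum back to the original mixture.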