This page is an online demo of our recent research on using phase recovery to reduce interference in monaural musical sound source separation. The full presentation of the method and results is in our paper entitled:
"Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation"
submitted to INTERSPEECH 2018.
Our work improves the separation of a musical source, the singing voice, from a musical mixture by using novel phase recovery techniques. Typical musical sound source separation methods retrieve only the magnitude, and reuse the original mixture's phase to estimate complex-valued time-frequency representations of the sources. We propose instead to recover the phase of the separated sources with more recent phase retrieval algorithms.
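To make this baseline concrete, here is a minimal sketch (the function name is ours, not from the paper) of the common "magnitude plus mixture phase" reconstruction that the typical methods described above rely on:

```python
import numpy as np

def mixture_phase_reconstruction(v_hat, mixture_stft):
    """Combine an estimated magnitude with the mixture's phase.

    This is the common baseline: the complex source estimate keeps the
    estimated magnitude `v_hat` but borrows the phase of the mixture
    STFT, which leaves interference from the other sources' phases.
    """
    return v_hat * np.exp(1j * np.angle(mixture_stft))
```

The phase recovery techniques below aim to replace `np.angle(mixture_stft)` with a better phase estimate.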
For evaluating our method, we focus on singing voice separation. We use a state-of-the-art method, MaD TwinNet, to estimate the magnitude spectrogram of the singing voice. On top of that, we show that improved phase recovery algorithms reduce the interference between the estimated sources (here: singing voice and musical background).
MaD TwinNet offers current state-of-the-art results on monaural singing voice separation. For more details, visit MaD TwinNet's site!
In a nutshell, MaD TwinNet uses the Masker to retrieve a first estimate of the magnitude spectrogram of the targeted source. Then, it enhances this spectrogram with the Denoiser, producing the final estimate, $\hat{\mathbf{V}}_{1}$, of the targeted musical source. We use this final estimate as an input for phase recovery techniques.
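The two-stage pipeline above can be sketched as follows; `masker` and `denoiser` are hypothetical callables standing in for the trained networks of MaD TwinNet, and this is only an illustration of the data flow, not the actual implementation:

```python
import numpy as np

def mad_pipeline(mix_mag, masker, denoiser):
    """Sketch of the Masker/Denoiser magnitude estimation flow.

    `masker` predicts a time-frequency mask from the mixture magnitude;
    `denoiser` refines the masked spectrogram into the final estimate,
    which is then handed to a phase recovery technique.
    """
    mask = masker(mix_mag)
    first_estimate = mask * mix_mag   # Masker output
    v_hat = denoiser(first_estimate)  # final magnitude estimate
    return v_hat
```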
In short, phase recovery is achieved by exploiting phase constraints that originate from two properties:

- the sinusoidal nature of audio signals, which induces a linear phase evolution over time that can be estimated by phase unwrapping;
- the consistency of the spectrogram, i.e. the fact that a complex-valued time-frequency matrix should correspond to the STFT of an actual time-domain signal.
These phase constraints can be incorporated into a source separation framework, in order to account for both prior phase models and the mixture's phase. Indeed, we want to promote certain phase properties, but provided that the estimated sources still add up to the mixture (in other words, we want to preserve the overall energy of the audio signal).
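A minimal sketch of the "sources add up to the mixture" constraint is to redistribute the reconstruction residual among the source estimates. The equal split below is our own simplification for illustration; the paper's algorithms use more refined update rules:

```python
import numpy as np

def enforce_mixture_constraint(estimates, mixture_stft):
    """Project complex source estimates so they sum to the mixture.

    The residual (mixture minus the sum of the estimates) is spread
    equally over the sources, so the output estimates exactly add up
    to the mixture STFT.
    """
    estimates = np.asarray(estimates)
    residual = mixture_stft - estimates.sum(axis=0)
    return estimates + residual / estimates.shape[0]
```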
The first algorithm we propose is the recently introduced consistent anisotropic Wiener filter. It is an extension of the classical Wiener filter that has been designed to account for the two phase properties presented above. More info on this filter here!
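For reference, here is a sketch of the classical (phase-unaware) Wiener filter that this algorithm extends. The function below is the standard soft-masking baseline only, not the consistent anisotropic variant:

```python
import numpy as np

def wiener_filter(magnitudes, mixture_stft, eps=1e-12):
    """Classical Wiener filtering of a mixture STFT.

    Each source receives a soft mask proportional to its estimated
    power. The consistent anisotropic Wiener filter extends this by
    additionally modeling the phase (anisotropy) and enforcing STFT
    consistency.
    """
    power = np.asarray(magnitudes) ** 2
    masks = power / (power.sum(axis=0) + eps)  # soft masks summing to ~1
    return masks * mixture_stft
```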
The second algorithm is an iterative procedure that uses phase unwrapping as an initialization scheme. Unlike Wiener filters, it does not modify the target magnitude over iterations. The iterative process is illustrated below, and more details can be found on the companion website of the corresponding journal paper.
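The fixed-magnitude iteration can be sketched with a Griffin-Lim-style consistency loop, shown below. This is our own minimal illustration of the principle (alternate between resynthesis and snapping the magnitude back to the target), with the initial phase, e.g. from phase unwrapping, passed in as `init_phase`; it is not the paper's exact procedure:

```python
import numpy as np
from scipy.signal import stft, istft

def iterative_phase_recovery(target_mag, init_phase, n_iter=50, nperseg=1024):
    """Iterative phase recovery with a fixed target magnitude.

    Each iteration resynthesizes a time-domain signal, retakes its
    STFT (a consistency projection), then restores the magnitude to
    `target_mag`. Unlike Wiener filtering, the magnitude is never
    altered; only the phase evolves.
    """
    spec = target_mag * np.exp(1j * init_phase)
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg)
        _, _, spec = stft(x, nperseg=nperseg)
        spec = spec[:, :target_mag.shape[1]]          # trim padding frames
        spec = target_mag * np.exp(1j * np.angle(spec))
    return spec
```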
Below you can actually listen to the performance of our method! For each song in our set, we offer the original mixture (i.e. the song), the original voice, and the voice as reconstructed by simply using the mixture's phase and by our algorithms.
To let you compare the results of our algorithms, we used exactly the same songs and excerpts as in the results presented for MaD TwinNet.
It must be mentioned that we did not apply any kind of extra post-processing to the files. You will hear the actual, unprocessed output of our algorithms.
Artist | Title | Genre |
---|---|---|
Signe Jakobsen | What Have You Done To Me | Rock Singer-Songwriter |
Fergessen | Back From The Start | Melodic Indie Rock |
Sambasevam Shanmugam | Kaathaadi | Bollywood |
James Elder & Mark M Thompson | The English Actor | Indie Pop |
Leaf | Come around | Atmospheric Indie Pop |
In other words, what data is our method trained on, what data is it tested on, and how well does it perform from an objective perspective?
In order to train our method, we used the development subset of the Demixing Secret Dataset (DSD), which consists of 50 mixtures with their corresponding sources, plus music stems from MedleyDB.
For testing our method, we used the testing subset of the DSD, consisting of 50 mixtures and their corresponding sources.
We objectively evaluated our method using the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). The results for the singing voice can be seen in the table below.
Method | SDR (dB) | SIR (dB) | SAR (dB) |
---|---|---|---|
Mixture's phase | 4.57 | 8.17 | 5.97 |
PU Iter | 4.52 | 8.87 | 5.52 |
CAW | 4.46 | 10.32 | 4.97 |
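As a rough illustration of what these metrics measure, here is a minimal energy-ratio SDR in decibels. The paper uses the full BSS Eval toolkit, which additionally decomposes the error into interference (SIR) and artifacts (SAR); this simplified version only captures overall distortion against the reference:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Simple signal-to-distortion ratio in dB.

    Ratio of the reference signal's energy to the energy of the
    estimation error; higher is better.
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)
```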
We also present a graphical illustration of the separation performance of the CAW filtering. This filter depends on a parameter $\kappa$ that promotes the "sinusoidality" of the signal (through the phase unwrapping technique) and a parameter $\delta$ that promotes the consistency constraint.
We would like to kindly acknowledge all those who supported and helped us with this work.