Two Recent Results in Transfer Learning for Music and Speech

In this post, I want to highlight two recent complementary results on transfer learning applied to audio — one related to music, another related to speech.

The first paper, on the music side, is by Keunwoo Choi and friends, and it happened to win the best paper award at ISMIR 2017. But that’s not the only reason you should read it. You should read it because it is extremely well written, and its experiments ask and answer many interesting questions about transfer learning for music.

In transfer learning, you transfer representations learned on a source domain/task to a target domain/task. The setup in the Choi et al. paper is very straightforward: the source task is fixed to music tagging, with the label space limited to the top 50 tags. They demonstrate that the learned representations perform well on a range of related music tasks, and also on an urban sound detection task.
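If you haven’t seen this kind of setup before, the recipe is simple: freeze the source-task network, use it as a feature extractor, and fit only a shallow classifier on the target task. Here is a minimal sketch (my own illustration, not code from the paper), with random arrays standing in for the extracted ConvNet features and an SVM as the shallow classifier:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Pretend these arrays came from the frozen music-tagging ConvNet run over the target-task clips.
X_train, y_train = rng.normal(size=(200, 160)), rng.integers(0, 10, size=200)
X_test, y_test = rng.normal(size=(50, 160)), rng.integers(0, 10, size=50)

# Only this shallow classifier is fit on target labels; the representation is reused as-is.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print("target-task accuracy:", clf.score(X_test, y_test))
```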

What’s really interesting about the Choi et al. paper is that, unlike the traditional approach of taking only the penultimate layer of the network, they consider all possible combinations of the intermediate layers, including a concatenation of all of them. This sounds crazy! And it is, because for n layers the number of non-empty combinations is 2^n - 1. I did not think this was even feasible, but Choi & friends cleverly chose a ConvNet small enough to make the sweep tractable.

The results are quite interesting. Concatenating all the layers almost always wins over the other combinations, and over the penultimate layer alone. I think it’s pretty neat.
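To make the layer-combination idea concrete, here is a minimal PyTorch sketch (the framework, the toy architecture, and the shapes are my assumptions, not the authors’ implementation): pool each intermediate layer’s feature map into a fixed-size vector, then concatenate whichever subset of layers you want to evaluate.

```python
import torch
import torch.nn as nn

class ToyTaggingConvNet(nn.Module):
    """Toy stand-in for a small source-task ConvNet (not the authors' exact architecture)."""
    def __init__(self, n_layers=5, n_channels=32):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(n_layers):
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, n_channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ))
            in_ch = n_channels
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            # Global average pooling turns each layer's feature map into a fixed-size vector.
            feats.append(x.mean(dim=(2, 3)))
        return feats

def concat_layer_features(model, mel_batch, layer_subset):
    """Concatenate pooled activations from a chosen subset of layers."""
    with torch.no_grad():
        feats = model(mel_batch)
    return torch.cat([feats[i] for i in layer_subset], dim=1)

# A batch of 8 mel-spectrogram-like inputs; shapes are illustrative only.
model = ToyTaggingConvNet()
x = torch.randn(8, 1, 96, 1366)
all_layers = concat_layer_features(model, x, layer_subset=[0, 1, 2, 3, 4])
print(all_layers.shape)  # torch.Size([8, 160]) -- 5 layers x 32 channels each
```

Evaluating a different combination is just a matter of changing `layer_subset`, which is what makes the exhaustive sweep cheap for a network this small.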

The second paper is on speech, and it is from us (shameless plug). We ask a complementary question: can we effectively transfer representations across audio domains? If ConvNets learn hierarchical features of sounds in the audible spectrum, can we use any audio data to learn representations that make sense for speech? To test this, we transferred representations from an environmental sounds dataset (UrbanSounds) to the newly released Google Speech Commands dataset. In addition, we use a much deeper DenseNet-121 model and a new multiscale representation built with dilated convolutions: we use four incrementally larger dilation factors for four convolutional kernels and stack them in a single layer (roughly as sketched below). Our observation is that this multiscale representation works really well with the pre-trained representation.
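Here is a rough sketch of what such a multiscale dilated block can look like. The specific dilation factors (1, 2, 4, 8), kernel size, and channel counts are illustrative stand-ins rather than the exact values from our paper:

```python
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    """Four parallel 3x3 convolutions with increasing dilation, concatenated channel-wise.
    Illustrative values only; a sketch of the multiscale idea, not the exact layer we used."""
    def __init__(self, in_channels, channels_per_branch=16, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(
                in_channels,
                channels_per_branch,
                kernel_size=3,
                dilation=d,
                padding=d,  # keeps all branches spatially aligned ("same" padding)
            )
            for d in dilations
        ])

    def forward(self, x):
        # Each branch sees a different effective receptive field; concatenating them in one
        # layer gives the network a multiscale view of the input spectrogram.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# One-channel spectrogram-like input in, 4 x 16 = 64 channels out, spatial size unchanged.
block = MultiScaleDilatedBlock(in_channels=1)
x = torch.randn(2, 1, 64, 128)
print(block(x).shape)  # torch.Size([2, 64, 64, 128])
```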

Now a natural question emerges: if representations learnt from environmental sounds do well on speech, does the reverse direction work? Although we don’t report it in our paper, the reverse direction does not transfer well, despite the Google Speech Commands dataset being an order of magnitude larger than UrbanSounds. We don’t have a solid understanding of why this happens, but our intuition is that the gamut of environmental sounds effectively captures the manifold of acoustic events (speech being one of them).

I think these two papers barely scratch the surface of transfer learning for speech and audio. For example, does the Choi et al. style of concatenating all the intermediate layers work well with deeper architectures like the ResNets and DenseNets we explored in our paper? What kinds of audio data transfer well to speech tasks, and why? Do MFCC features still matter (our paper doesn’t bother to use them, but Choi et al. do)?

There is so much to be discovered here!

  • Keunwoo Choi

    Thanks for introducing our work! And good point about MFCCs. Not in the paper, but as in this slide https://www.slideshare.net/KeunwooChoi/transfer-learning-for-music-classification-and-regression-tasks/23, I argued in the presentation that the musical information MFCCs provide seems mostly covered by our convnet feature, although they still add useful, non-redundant information for audio event detection. This is simply based on the experiment results — for 5 musical tasks, concat(convnet_feature, MFCC) was worse than convnet_feature, while concat > convnet for audio event detection.

    I think interpretability will be even more important in transfer learning. Transferring features would probably be more tempting for less deep-learning-oriented people who still want to be sure about this off-the-shelf feature. After some credibility accumulates, people might start to use it without really worrying.. like we MIR people do with MFCCs :)

    • Delip Rao

      Actually, you are right about people not understanding why/where MFCCs matter. My intuition is that with modern deep models some of that information is redundantly encoded, and the cases where adding MFCCs wins are probably due to higher-order interaction effects created in the upstream layers. Back in graduate school, I sat through a graduate-level speech recognition course and I still don’t get them :)

      For example, in your case, why MFCC should matter for audio event detection — your guess is as good as mine!

      • Matt Wescott

        Nice work, Delip. My handwavy rationalization of MFCCs for non-speech tasks is that they heuristically capture the shape of the spectrum with reduced sensitivity to scaling the physical size of the audio source. If that is partially true, replacing the mel scale with a log scale and replacing the final DCT with a convolutional layer may work as well for augmentation, and would feel less magical.

        Congrats on R7, sounds interesting. Let’s catch up soon!
