Tailed U-Net: Multi-Scale Music Representation Learning

Open Access
Authors
Publication date 2022
Host editors
  • P. Rao
  • H. Murthy
  • A. Srinivasamurthy
  • R. Bittner
  • R. Caro Repetto
  • M. Goto
  • X. Serra
  • M. Miron
Book title Proceedings of the 23rd International Society for Music Information Retrieval Conference
Book subtitle Bengaluru, India, December 04-08, 2022
ISBN (electronic)
  • 9781732729926
Event 23rd International Society for Music Information Retrieval Conference
Pages (from-to) 67-75
Number of pages 9
Publisher ISMIR
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Self-supervised learning has steadily gained traction in recent years. In music information retrieval (MIR), one promising recent application of self-supervised learning is the CLMR framework (contrastive learning of musical representations). CLMR performs well, achieving results on par with state-of-the-art end-to-end classification models, but it is strictly an encoding framework. It suffers from the characteristic limitation of any encoder: it cannot explicitly combine multi-timescale information, whereas a characteristic feature of human audio perception is that we perceive all frequencies simultaneously. To this end, we propose a generalization of CLMR that learns to extract and explicitly combine representations across different frequency resolutions, which we coin the tailed U-Net (TUNe). TUNe architectures combine multi-timescale information during a decoding phase, similar to the U-Net architectures used in computer vision and source separation, but add a tail that reduces the sample-level output to a smaller, pre-defined number of representation dimensions. The size of the decoding phase is a hyperparameter; with a zero-layer decoding phase, TUNe reduces to CLMR. The best TUNe architectures, however, require less training time to match CLMR performance, transfer better to downstream tasks, and remain competitive with state-of-the-art models even at dramatically reduced dimensionalities.
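To make the abstract's high-level description concrete, the sketch below traces the TUNe data flow in NumPy: an encoder that repeatedly downsamples, a decoder that upsamples and fuses each level with its skip connection (the U-Net part), and a tail that collapses the sample-level output to a fixed representation size. All function names are hypothetical, and the pooling/repeat operations are crude stand-ins for the learned convolutions of the actual model; this is only an illustration of the multi-scale wiring, not the paper's implementation.

```python
import numpy as np

def encoder_block(x, stride=4):
    # Downsample by average pooling (a stand-in for a learned strided conv).
    T = (len(x) // stride) * stride
    return x[:T].reshape(-1, stride).mean(axis=1)

def decoder_block(low, skip):
    # Upsample the low-resolution features (nearest-neighbour repeat)
    # and fuse them with the higher-resolution skip connection.
    up = np.repeat(low, len(skip) // len(low) + 1)[: len(skip)]
    return up + skip

def tail(x, dims=64):
    # The "tail": reduce the sample-level signal to a fixed number
    # of representation dimensions.
    T = (len(x) // dims) * dims
    return x[:T].reshape(dims, -1).mean(axis=1)

def tune_forward(x, depth=2):
    # `depth` plays the role of the decoding-phase size hyperparameter:
    # with depth=0 no decoding happens and the model collapses to a
    # plain encoder, mirroring how TUNe reduces to CLMR.
    skips = []
    for _ in range(depth):
        skips.append(x)
        x = encoder_block(x)
    for skip in reversed(skips):
        x = decoder_block(x, skip)
    return tail(x)
```

Regardless of the decoding depth, the tail yields a representation of the same pre-defined dimensionality, which is what lets the zero-depth and multi-depth variants be compared head to head.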
Document type Conference contribution
Language English
Published at https://doi.org/10.5281/zenodo.7316596
Other links
  • https://ismir2022program.ismir.net/poster_109.html
  • https://www.ismir.net/conferences/ismir2022.html