Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Riou, Alain; Lattner, Stefan; Hadjeres, Gaëtan; Peeters, Geoffroy

(Submitted on 14 May 2024)

Abstract: This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task. The approach consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments, evaluating our models on various audio classification benchmarks that include environmental sound, speech, and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that this choice significantly impacts the model's quality. In particular, we notice that some design choices that are effective in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
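
The training objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`split_patches`, `jepa_loss`), the random patch-level context/target split, the linear stand-ins for the encoders and predictor, the positional embeddings, and the mean-squared-error loss are all assumptions made for illustration. How the real encoders are built and updated is not covered here, and no training step is performed.

```python
import numpy as np

def split_patches(mel, patch=4):
    """Cut a (n_mels, n_frames) mel-spectrogram into flattened patch x patch tiles."""
    n_mels, n_frames = mel.shape
    rows, cols = n_mels // patch, n_frames // patch
    tiles = mel[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)

def jepa_loss(mel, target_ratio=0.5, dim=8, patch=4, seed=0):
    """Return (loss, context_idx, target_idx) for one untrained forward pass."""
    rng = np.random.default_rng(seed)
    patches = split_patches(mel, patch)
    n = patches.shape[0]

    # Assumption: a uniformly random split of patches into context and target.
    # The abstract reports that this choice strongly affects model quality.
    perm = rng.permutation(n)
    n_target = int(n * target_ratio)
    target_idx, context_idx = perm[:n_target], perm[n_target:]

    # Hypothetical linear stand-ins for the two encoders and the predictor.
    w_context = rng.normal(size=(patch * patch, dim))
    w_target = rng.normal(size=(patch * patch, dim))
    w_pred = rng.normal(size=(dim, dim))
    pos_emb = rng.normal(size=(n, dim))  # tells the predictor which patch to predict

    context_repr = patches[context_idx] @ w_context
    target_repr = patches[target_idx] @ w_target

    # Predict each target representation from the pooled context representation
    # plus the positional embedding of the target patch.
    pred = (context_repr.mean(axis=0) + pos_emb[target_idx]) @ w_pred

    # The loss is computed in representation space, not on the spectrogram itself.
    loss = float(np.mean((pred - target_repr) ** 2))
    return loss, context_idx, target_idx

# Usage on a random array standing in for a 16-band, 32-frame mel-spectrogram.
mel = np.random.default_rng(1).random((16, 32))
loss, context_idx, target_idx = jepa_loss(mel)
```

Replacing the random permutation with another rule for choosing `context_idx` and `target_idx` (for example, contiguous blocks in time or frequency) is the kind of design choice the paper sets out to compare.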

Comments:	Self-supervision in Audio, Speech and Beyond workshop, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2024
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.08679 [cs.SD]
	(or arXiv:2405.08679v1 [cs.SD] for this version)

Submission history

From: Alain Riou
[v1] Tue, 14 May 2024 15:00:09 GMT (167kb,D)
