Task-wise dataset distribution. Bi-coloured cells denote collections of paired image-audio samples assembled from public datasets following our data curation strategy, whereas single-coloured cells denote datasets adapted directly. Datasets with dashed outlines are used only for model training, while those marked with ★ are reserved for zero-shot evaluation. All other datasets have a defined train/test split. The number in the bottom right of each task gives the total number of samples for that task.