Overview
pip install mirdata
mirdata is a library that aims to standardize how audio datasets are accessed in Python, removing the need to write custom loaders in every project and improving reproducibility.
Working with datasets usually requires the often cumbersome steps of downloading data and writing load functions that load related files (for example, audio and annotations) into a standard format for experimenting or evaluating. mirdata does all of this for you:
import mirdata
print(mirdata.list_datasets())
tinysol = mirdata.initialize('tinysol')
tinysol.download()
# get annotations and audio for a random track
example_track = tinysol.choice_track()
instrument = example_track.instrument_full
pitch = example_track.pitch
y, sr = example_track.audio
mirdata loaders contain methods to:
- download(): download (or give instructions to download) a dataset
- load_*(): load a dataset’s files (audio, metadata, annotations, etc.) into standard formats that are compatible with mir_eval and jams, so you don’t have to write the loaders yourself
- validate(): validate that a dataset is complete and correct
- cite(): quickly print a dataset’s relevant citation
- access track and multitrack objects for grouping multiple annotations for a particular track/multitrack
- and more
See the Tutorial for a detailed explanation of how to get started using this library.
mirdata design principles
Ease of use and contribution
We designed mirdata to be easy to use and easy to contribute to. mirdata simplifies the research pipeline considerably, facilitating research in a wider diversity of tasks and musical datasets. We provide detailed examples of how to interact with the library in the Tutorial, as well as a detailed explanation of how to contribute in Contributing. Additionally, we have a repository of Jupyter notebooks with usage examples for the different datasets.
Reproducibility
We aim for mirdata to increase research reproducibility by providing a common framework for MIR researchers to compare and validate their data. If mistakes are found in annotations or audio versions change, the community can use mirdata to fix the mistakes while still being able to compare methods moving forward.
canonical versions
The dataset loaders in mirdata are written for what we call the canonical version of a dataset. Whenever possible, this should be the official release of the dataset as published by the dataset creator(s). When this is not possible (e.g. for data that is no longer available), the procedure we follow is to find as many copies of the data as possible from different researchers (at least 4), and use the most common one. To make this process transparent, when there are doubts about the data consistency we open an issue and leave it to the community to discuss what to use.
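The "most common copy" step described above can be sketched with checksums. This is only an illustration under stated assumptions: the source names and byte contents are made up, and `most_common_copy` is a hypothetical helper, not part of mirdata.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def most_common_copy(copies: Dict[str, bytes]) -> Tuple[str, List[str]]:
    """Return the most frequent md5 checksum and the sources that hold it.

    `copies` maps a (hypothetical) source name to the raw bytes of its copy.
    """
    checksums = {
        source: hashlib.md5(data).hexdigest() for source, data in copies.items()
    }
    # Count how many sources share each checksum and keep the most frequent one.
    winner, _count = Counter(checksums.values()).most_common(1)[0]
    holders = sorted(s for s, c in checksums.items() if c == winner)
    return winner, holders


# Four made-up copies of the same file, one of which differs.
copies = {
    "researcher_a": b"audio-v1",
    "researcher_b": b"audio-v1",
    "researcher_c": b"audio-v2",
    "researcher_d": b"audio-v1",
}
canonical_md5, holders = most_common_copy(copies)
```

In practice a tie or a narrow majority would be exactly the kind of doubt that is better resolved by community discussion than by code.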
Standardization
Different datasets have different annotations, metadata, etc. We try to respect the idiosyncrasies of each dataset as much as we can. For this reason, tracks in each Dataset in mirdata have different attributes, e.g. some may have artist information and some may not.
However, there are some elements that are common to most datasets, and in these cases we standardize them to increase the usability of the library. Some examples of this are the annotations in mirdata, e.g. BeatData.
indexes
Indexes in mirdata are manifests of the files in a dataset and their corresponding md5 checksums.
Specifically, an index is a json file with the mandatory top-level key version and at least one of the optional top-level keys metadata, tracks, multitracks or records. An index might look like:
Example Index
{
    "version": "1.0.0",
    "metadata": {
        "metadata_file_1": [
            // the relative path for metadata_file_1
            "path_to_metadata/metadata_file_1.csv",
            // metadata_file_1 md5 checksum
            "bb8b0ca866fc2423edde01325d6e34f7"
        ],
        "metadata_file_2": [
            // the relative path for metadata_file_2
            "path_to_metadata/metadata_file_2.csv",
            // metadata_file_2 md5 checksum
            "6cce186ce77a06541cdb9f0a671afb46"
        ]
    },
    "tracks": {
        "track1": {
            "audio": ["audio_files/track1.wav", "6c77777ce77a06541cdb9f0a671afb46"],
            "beats": ["annotations/track1.beats.csv", "ab8b0ca866fc2423edde01325d6e34f7"],
            "sections": ["annotations/track1.sections.txt", "05abeca866fc2423edde01325d6e34f7"]
        },
        "track2": {
            "audio": ["audio_files/track2.wav", "6c77777ce77a06542cdb9f0a672afb46"],
            "beats": ["annotations/track2.beats.csv", "ab8b0ca866fc2423edde02325d6e34f7"],
            "sections": ["annotations/track2.sections.txt", "05abeca866fc2423edde02325d6e34f7"]
        },
        ...
    }
}
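To make the structure concrete, here is a minimal sketch of how a tracks index of this shape could be generated for a toy dataset. It is not mirdata's own index script: `file_md5`, `build_track_index`, and the folder layout (`audio_files/`, `annotations/`) are assumptions chosen to mirror the example above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_track_index(data_home, version="1.0.0"):
    """Build an index dict with one entry per .wav file (hypothetical layout)."""
    data_home = Path(data_home)
    tracks = {}
    for audio_path in sorted((data_home / "audio_files").glob("*.wav")):
        track_id = audio_path.stem
        beats_path = data_home / "annotations" / f"{track_id}.beats.csv"
        # Each entry is [path relative to the dataset root, md5 checksum].
        tracks[track_id] = {
            "audio": [audio_path.relative_to(data_home).as_posix(), file_md5(audio_path)],
            "beats": [beats_path.relative_to(data_home).as_posix(), file_md5(beats_path)],
        }
    return {"version": version, "tracks": tracks}


# Build an index for a throwaway dataset containing a single fake track.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "audio_files").mkdir()
    (root / "annotations").mkdir()
    (root / "audio_files" / "track1.wav").write_bytes(b"fake audio")
    (root / "annotations" / "track1.beats.csv").write_bytes(b"0.5,1\n")
    index = build_track_index(root)
    index_json = json.dumps(index, indent=2)
```

Storing paths relative to the dataset root keeps the index valid regardless of where a user downloads the data.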
The optional top-level keys (tracks, multitracks and records) relate to different organizations of music datasets. tracks are used when a dataset is organized as a collection of individual tracks, namely mono or multi-channel audio, spectrograms only, and their respective annotations. multitracks are used when a dataset comprises multitracks, i.e. different groups of tracks which are directly related to each other. Finally, records are used when a dataset consists of groups of tables (e.g. relational databases), as many recommendation datasets do.
See the "1. Create an index" section of the contributing docs for more information about mirdata indexes.
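Because an index pairs every file with an md5 checksum, a local copy of a dataset can be checked against it. The sketch below shows the idea with the standard library only. `check_index` is a hypothetical helper and the toy files are made up. It is not the implementation behind mirdata's validate().

```python
import hashlib
import tempfile
from pathlib import Path


def check_index(index, data_home):
    """Return (missing, invalid) lists of (track_id, file_key) pairs."""
    data_home = Path(data_home)
    missing, invalid = [], []
    for track_id, files in index.get("tracks", {}).items():
        for key, (rel_path, expected_md5) in files.items():
            path = data_home / rel_path
            if not path.exists():
                missing.append((track_id, key))
            elif hashlib.md5(path.read_bytes()).hexdigest() != expected_md5:
                invalid.append((track_id, key))
    return missing, invalid


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


# Toy index: the audio file is intact, the beats file was edited,
# and the sections file is absent from disk.
index = {
    "version": "1.0.0",
    "tracks": {
        "track1": {
            "audio": ["audio_files/track1.wav", md5_hex(b"fake audio")],
            "beats": ["annotations/track1.beats.csv", md5_hex(b"original")],
            "sections": ["annotations/track1.sections.txt", md5_hex(b"sections")],
        }
    },
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "audio_files").mkdir()
    (root / "annotations").mkdir()
    (root / "audio_files" / "track1.wav").write_bytes(b"fake audio")
    (root / "annotations" / "track1.beats.csv").write_bytes(b"edited")
    missing, invalid = check_index(index, root)
```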
annotations
mirdata provides Annotation classes of various kinds which provide a standard interface to different annotation formats. These classes are compatible with the mir_eval library’s expected format, as well as with the jams format. Support for other formats can be added on request.
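The idea of a standard interface over different annotation formats can be sketched as follows. `BeatAnnotation` and both parsers are hypothetical and written only for this illustration. They are not mirdata's BeatData class, and the two file layouts are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BeatAnnotation:
    """Hypothetical container: beat times in seconds and positions in the bar."""

    times: List[float]
    positions: List[int]

    def __post_init__(self):
        # Basic consistency checks shared by every source format.
        if len(self.times) != len(self.positions):
            raise ValueError("times and positions must have the same length")
        if any(b <= a for a, b in zip(self.times, self.times[1:])):
            raise ValueError("times must be strictly increasing")


def from_csv(text):
    """Parse lines of the (assumed) form 'time,position'."""
    rows = [line.split(",") for line in text.strip().splitlines()]
    return BeatAnnotation([float(t) for t, _ in rows], [int(p) for _, p in rows])


def from_tab_separated(text):
    """Parse lines of the (assumed) form 'position<TAB>time'."""
    rows = [line.split("\t") for line in text.strip().splitlines()]
    return BeatAnnotation([float(t) for _, t in rows], [int(p) for p, _ in rows])


# Two differently formatted files describing the same beats.
beats_a = from_csv("0.5,1\n1.0,2\n1.5,3\n")
beats_b = from_tab_separated("1\t0.5\n2\t1.0\n3\t1.5\n")
```

Once both sources are loaded into the same class, downstream evaluation code no longer needs to know which dataset an annotation came from.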
metadata
When available, we provide extensive and easy-to-access metadata to facilitate track metadata-specific analysis. metadata is available as attributes at the track level, e.g. track.artist.
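Since the available attributes differ between datasets, code that works across several datasets may want to read metadata defensively. The sketch below uses two hypothetical track classes (not real mirdata loaders) to show one way of doing this with getattr.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackWithArtist:
    """Hypothetical track from a dataset that provides artist metadata."""

    track_id: str
    artist: str


@dataclass
class TrackWithoutArtist:
    """Hypothetical track from a dataset without artist metadata."""

    track_id: str


def artist_or_none(track) -> Optional[str]:
    # Fall back to None when the dataset does not define the attribute.
    return getattr(track, "artist", None)


tracks = [TrackWithArtist("track1", "Some Artist"), TrackWithoutArtist("track2")]
artists = {t.track_id: artist_or_none(t) for t in tracks}
```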