Tutorial¶

Installation¶

To install mirdata:

pip install mirdata

Usage¶

mirdata is easily imported into your Python code by:

import mirdata

Initializing a dataset¶

Print a list of all available dataset loaders by calling:

import mirdata
print(mirdata.list_datasets())

To use a loader, (for example, ‘orchset’) you need to initialize it by calling:

import mirdata
orchset = mirdata.initialize('orchset')

Now orchset is a Dataset object containing common methods, described below.

Downloading a dataset¶

All dataset loaders in mirdata have a download() function that allows the user to download the canonical version of the dataset (when available). When initializing a dataset, by default, mirdata will download/read data to/from a default location (“~/mir_datasets”). This can be customized by specifying data_home in mirdata.initialize.

Downloading a dataset into the default folder:

In this first example, data_home is not specified. Thus, ORCHSET will be downloaded and retrieved from mir_datasets folder created at user root folder:

import mirdata
orchset = mirdata.initialize('orchset')
orchset.download()  # Dataset is downloaded to ~/mir_datasets/orchset

Downloading a dataset into a specified folder:

Now data_home is specified and so orchset will be read from / written to this custom location:

orchset = mirdata.initialize('orchset', data_home='Users/leslieknope/Desktop/orchset123')
orchset.download()  # Dataset is downloaded to the folder "orchset123" Leslie Knope's desktop

Partially downloading a dataset¶

The download() functions allows partial downloads of a dataset. In other words, if applicable, the user can select which elements of the dataset they want to download. Each dataset has a REMOTES dictionary were all the available elements are listed.

cante100 has different elements as seen in the REMOTES dictionary. Thus, we can specify which of these elements are downloaded, by passing to the download() function the list of keys in REMOTES that we are interested in. This list is passed to the download() function through the partial_download variable.

Example REMOTES

REMOTES = {
    "spectrogram": download_utils.RemoteFileMetadata(
        filename="cante100_spectrum.zip",
        url="https://zenodo.org/record/1322542/files/cante100_spectrum.zip?download=1",
        checksum="0b81fe0fd7ab2c1adc1ad789edb12981",  # the md5 checksum
        destination_dir="cante100_spectrum",  # relative path for where to unzip the data, or None
    ),
    "melody": download_utils.RemoteFileMetadata(
        filename="cante100midi_f0.zip",
        url="https://zenodo.org/record/1322542/files/cante100midi_f0.zip?download=1",
        checksum="cce543b5125eda5a984347b55fdcd5e8",  # the md5 checksum
        destination_dir="cante100midi_f0",  # relative path for where to unzip the data, or None
    ),
    "notes": download_utils.RemoteFileMetadata(
        filename="cante100_automaticTranscription.zip",
        url="https://zenodo.org/record/1322542/files/cante100_automaticTranscription.zip?download=1",
        checksum="47fea64c744f9fe678ae5642a8f0ee8e",  # the md5 checksum
        destination_dir="cante100_automaticTranscription",  # relative path for where to unzip the data, or None
    ),
    "metadata": download_utils.RemoteFileMetadata(
        filename="cante100Meta.xml",
        url="https://zenodo.org/record/1322542/files/cante100Meta.xml?download=1",
        checksum="6cce186ce77a06541cdb9f0a671afb46",  # the md5 checksum
    ),
    "README": download_utils.RemoteFileMetadata(
        filename="cante100_README.txt",
        url="https://zenodo.org/record/1322542/files/cante100_README.txt?download=1",
        checksum="184209b7e7d816fa603f0c7f481c0aae",  # the md5 checksum
    ),
}

A partial download example for cante100 dataset could be:

cante100.download(partial_download=['spectrogram', 'melody', 'metadata'])

Validating a dataset¶

Using the method validate() we can check if the files in the local version are the same than the available canical version, and the files were downloaded correctly (none of them are corrupted).

For big datasets: In future mirdata versions, a random validation will be included. This improvement will reduce validation time for very big datasets.

Accessing annotations¶

We can choose a random track from a dataset with the choice_track() method.

Loading annotations

random_track = orchset.choice_track()
print(random_track)
>>> Track(
       alternating_melody=True,
       audio_path_mono="user/mir_datasets/orchset/audio/mono/Beethoven-S3-I-ex1.wav",
       audio_path_stereo="user/mir_datasets/orchset/audio/stereo/Beethoven-S3-I-ex1.wav",
       composer="Beethoven",
       contains_brass=False,
       contains_strings=True,
       contains_winds=True,
       excerpt="1",
       melody_path="user/mir_datasets/orchset/GT/Beethoven-S3-I-ex1.mel",
       only_brass=False,
       only_strings=False,
       only_winds=False,
       predominant_melodic_instruments=['strings', 'winds'],
       track_id="Beethoven-S3-I-ex1",
       work="S3-I",
       audio_mono: (np.ndarray, float),
       audio_stereo: (np.ndarray, float),
       melody: F0Data,
    )

We can also access specific tracks by id. The available track ids can be acessed via the .track_ids attribute. In the next example we take the first track id, and then we retrieve the melody annotation.

orchset_ids = orchset.track_ids  # the list of orchset's track ids
orchset_data = orchset.load_tracks()  # Load all tracks in the dataset
example_track = orchset_data[orchset_ids[0]]  # Get the first track

# Accessing the track's melody annotation
example_melody = example_track.melody

Alternatively, we don’t need to load the whole dataset to get a single track.

orchset_ids = orchset.track_ids  # the list of orchset's track ids
example_track = orchset.track(orchset_ids[0])  # load this particular track
example_melody = example_track.melody  # Get the melody from first track

Accessing data on non-local filesystems¶

mirdata uses the smart_open library, which supports non-local filesystems such as GCS and AWS. If your data lives, e.g. on Google Cloud Storage (GCS), simply set the data_home variable accordingly when initializing a dataset. For example:

Accessing annotations remotely

import mirdata

orchset = mirdata.initialize("orchset", data_home="gs://my-bucket/my-subfolder/orchset")

# everything should work the same as if the data were local
orchset.validate()

example_track = orchset.choice_track()
melody = example_track.melody
y, fs = example_track.audio_mono

Note that the data on the remote file system must have identical folder structure to what is specified by dataset.download(), and we do not support downloading (i.e. writing) to remote filesystems, only reading from them. To prepare a new dataset to use with mirdata, we recommend running dataset.download() on a local filesystem, and then manually transfering the folder contents to the remote filesystem.

mp3 data

For a variety of reasons, mirdata doesn’t support remote reading of mp3 files, so some datasets with mp3 audio may have tracks unavailable attributes.

Annotation classes¶

mirdata defines annotation-specific data classes. These data classes are meant to standarize the format for all loaders, and are compatibly with jams and mir_eval.

The list and descriptions of available annotation classes can be found in Annotations.

Note

These classes may be extended in the case that a loader requires it.

Iterating over datasets and annotations¶

In general, most datasets are a collection of tracks, and in most cases each track has an audio file along with annotations.

With the load_tracks() method, all tracks are loaded as a dictionary with the ids as keys and track objects (which include their respective audio and annotations, which are lazy-loaded on access) as values.

orchset = mirdata.initialize('orchset')
for key, track in orchset.load_tracks().items():
    print(key, track.audio_path)

Alternatively, we can loop over the track_ids list to directly access each track in the dataset.

orchset = mirdata.initialize('orchset')
for track_id in orchset.track_ids:

    print(track_id, orchset.track(track_id).audio_path)

Basic example: including mirdata in your pipeline¶

If we wanted to use orchset to evaluate the performance of a melody extraction algorithm (in our case, very_bad_melody_extractor), and then split the scores based on the metadata, we could do the following:

mirdata usage example

import mir_eval
import mirdata
import numpy as np
import sox

def very_bad_melody_extractor(audio_path):
    duration = sox.file_info.duration(audio_path)
    time_stamps = np.arange(0, duration, 0.01)
    melody_f0 = np.random.uniform(low=80.0, high=800.0, size=time_stamps.shape)
    return time_stamps, melody_f0

# Evaluate on the full dataset
orchset = mirdata.initialize("orchset")
orchset_scores = {}
orchset_data = orchset.load_tracks()
for track_id, track_data in orchset_data.items():
    est_times, est_freqs = very_bad_melody_extractor(track_data.audio_path_mono)

    ref_melody_data = track_data.melody
    ref_times = ref_melody_data.times
    ref_freqs = ref_melody_data.frequencies

    score = mir_eval.melody.evaluate(ref_times, ref_freqs, est_times, est_freqs)
    orchset_scores[track_id] = score

# Split the results by composer and by instrumentation
composer_scores = {}
strings_no_strings_scores = {True: {}, False: {}}
for track_id, track_data in orchset_data.items():
    if track_data.composer not in composer_scores.keys():
        composer_scores[track_data.composer] = {}

    composer_scores[track_data.composer][track_id] = orchset_scores[track_id]
    strings_no_strings_scores[track_data.contains_strings][track_id] = \
        orchset_scores[track_id]

This is the result of the example above.

Example result

print(strings_no_strings_scores)
>>> {True: {
        'Beethoven-S3-I-ex1':OrderedDict([
               ('Voicing Recall', 1.0),
               ('Voicing False Alarm', 1.0),
               ('Raw Pitch Accuracy', 0.029798422436459245),
               ('Raw Chroma Accuracy', 0.08063102541630149),
               ('Overall Accuracy', 0.0272654370489174)
               ]),
        'Beethoven-S3-I-ex2': OrderedDict([
               ('Voicing Recall', 1.0),
               ('Voicing False Alarm', 1.0),
               ('Raw Pitch Accuracy', 0.009221311475409836),
               ('Raw Chroma Accuracy', 0.07377049180327869),
               ('Overall Accuracy', 0.008754863813229572)]),
        ...

        'Wagner-Tannhauser-Act2-ex2': OrderedDict([
               ('Voicing Recall', 1.0),
               ('Voicing False Alarm', 1.0),
               ('Raw Pitch Accuracy', 0.03685636856368564),
               ('Raw Chroma Accuracy', 0.08997289972899729),
               ('Overall Accuracy', 0.036657681940700806)])
        }}

You can see that very_bad_melody_extractor performs very badly!

Using mirdata with tensorflow¶

The following is a simple example of a generator that can be used to create a tensorflow Dataset.

mirdata with tf.data.Dataset example

import mirdata
import numpy as np
import tensorflow as tf

def orchset_generator():
    # using the default data_home
    orchset = mirdata.initialize("orchset")
    track_ids = orchset.track_ids
    for track_id in track_ids:
        track = orchset.track(track_id)
        audio_signal, sample_rate = track.audio_mono
        yield {
            "audio": audio_signal.astype(np.float32),
            "sample_rate": sample_rate,
            "annotation": {
                "times": track.melody.times.astype(np.float32),
                "freqs": track.melody.frequencies.astype(np.float32),
            },
            "metadata": {"track_id": track.track_id}
        }

dataset = tf.data.Dataset.from_generator(
    orchset_generator,
    {
        "audio": tf.float32,
        "sample_rate": tf.float32,
        "annotation": {"times": tf.float32, "freqs": tf.float32},
        "metadata": {'track_id': tf.string}
    }
)

In future mirdata versions, generators for Tensorflow and Pytorch will be included.