Releases · ufal/olimpic-icdar24

OLiMPiC dataset

The OpenScore Lieder Linearized MusicXML Piano Corpus (OLiMPiC) is a dataset of MusicXML - LMX - PNG triplets for piano systems in the OpenScore Lieder corpus. It contains synthetic images from MuseScore and scanned images from IMSLP. The synthetic dataset contains all the splits (train, dev, test). The scanned dataset contains only the dev and test splits. These splits are aligned across both variants. Both variants are sliced into systems (piano staves), which form the training samples. The systems are grouped into folders by score (song):

samples/
    123456/               ... one folder per score (song)
        p1-s1.png         ... one image and two annotations for each system `p{page}-s{system}`
        p1-s1.lmx         ... Linearized MusicXML
        p1-s1.musicxml    ... non-compressed MusicXML file
samples.test.txt          ... list of samples for a partition, contains lines: `samples/123456/p1-s1`
statistics.test.yaml
vocabulary.txt            ... list of all vocabulary tokens for LMX annotations
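
The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the release: the function name `load_partition`, the `Sample` tuple, and the `dataset_root` argument are hypothetical, and it assumes only what the listing shows, namely that each line of `samples.{partition}.txt` is a path such as `samples/123456/p1-s1` relative to the dataset root.

```python
from pathlib import Path
from typing import NamedTuple


class Sample(NamedTuple):
    image: Path      # p{page}-s{system}.png
    lmx: Path        # Linearized MusicXML annotation
    musicxml: Path   # non-compressed MusicXML annotation


def load_partition(dataset_root: Path, partition: str) -> list[Sample]:
    """Resolve every line of samples.{partition}.txt to its three files."""
    list_file = dataset_root / f"samples.{partition}.txt"
    samples = []
    for line in list_file.read_text(encoding="utf-8").splitlines():
        stem = line.strip()
        if not stem:
            continue  # tolerate blank lines in the list file
        base = dataset_root / stem  # e.g. <root>/samples/123456/p1-s1
        samples.append(Sample(
            image=base.with_name(base.name + ".png"),
            lmx=base.with_name(base.name + ".lmx"),
            musicxml=base.with_name(base.name + ".musicxml"),
        ))
    return samples
```

Calling `load_partition(Path("path/to/extracted/dataset"), "test")` would then return one `Sample` per system. The function does not check that the files exist, so a missing file only surfaces when it is opened.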

A permanent handle to the dataset: http://hdl.handle.net/11234/1-5419

You can preview the scanned dataset test partition here:
https://ufallab.ms.mff.cuni.cz/~mayer/icdar2024/scanned/

And the synthetic dataset train partition here:
https://ufallab.ms.mff.cuni.cz/~mayer/icdar2024/synthetic/

Partition	Systems (samples)	Scores (songs)	In synthetic dataset	In scanned dataset
test	1 493	100	✔️	✔️
dev	1 438	100	✔️	✔️
train	15 014	1 095	✔️	❌

To get the source IMSLP PDFs and manually annotated system bounding boxes for the scanned dataset, download the attached olimpic-1.0-sources-for-scanned.2024-02-12.tar.gz file.

GrandStaff LMX dataset

We've also added LMX and MusicXML annotations to the GrandStaff dataset. For each .krn file we added a .lmx file and a .musicxml file in the same format as in the datasets described above. These additional files are attached as grandstaff-lmx.2024-02-12.tar.gz to this release.
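
After extracting the archive, it can be useful to confirm that every .krn file has both of its new annotations. The sketch below is an illustration under an assumption: it expects the .lmx and .musicxml files to sit next to each .krn file with the same base name, which is our reading of the description above and not something the release states explicitly. The function name `find_missing_annotations` is hypothetical.

```python
from pathlib import Path


def find_missing_annotations(grandstaff_root: Path) -> list[Path]:
    """Return the expected .lmx / .musicxml paths that do not exist."""
    missing = []
    for krn in sorted(grandstaff_root.rglob("*.krn")):
        for suffix in (".lmx", ".musicxml"):
            # Assumption: same folder and base name as the .krn file.
            candidate = krn.with_suffix(suffix)
            if not candidate.is_file():
                missing.append(candidate)
    return missing
```

An empty result means every .krn file found under the root has both annotations in place.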

A permanent handle to the dataset: http://hdl.handle.net/11234/1-5423

The dataset can be previewed here:
https://ufallab.ms.mff.cuni.cz/~mayer/icdar2024/grandstaff/

Be careful when harmonizing it with the other two datasets. There are several issues to be aware of:

The semantic content is less diverse with respect to special symbols and inter-staff interactions. The GrandStaff dataset does not contain slurs, arpeggios appear in the images but not in the LMX, and grace notes have missing stems (probably a JPEG compression artifact or a Verovio bug).
LMX token sequences can be much longer, partly because some systems regularly have 6 measures, whereas the previous datasets typically cap at 4, and partly because some scores are very dense and contain large numbers of notes.
The GrandStaff dataset is much larger: ~50K samples, compared to the previous ~15K samples. Only a subset of it should therefore be used when training on both.
The Humdrum kern format seems not to support mid-voice staff changes. Even if it does support them, the converter we used (music21) seems unable to encode them in its internal representation format. Looking at 100 random GrandStaff images, we were not able to find a single instance where a voice crosses between staves. There are places that are almost begging to be represented that way; see measures 4 and 5, middle ascending voice:

beethoven/piano-sonatas/sonata29-4/maj2_up_m-181-186.jpg:

In the OpenScore Lieder corpus, mid-voice staff changes are relatively common. We were able to easily find 4 examples in 100 random images from the scanned dataset. See the last measures:

samples/6377942/p4-s1.png:

samples/5026306/p5-s1.png:

We think that the OpenScore Lieder corpus is more interesting and complex in terms of music notation than the KernScores corpus from which the GrandStaff dataset was made. We believe the choice to use MuseScore as the representation format for OS Lieder was a good one, especially for this area of OMR research.
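
Two of the issues listed earlier, the longer LMX sequences and the ~50K vs ~15K size imbalance, can be handled when selecting GrandStaff samples for joint training. The following is a rough sketch, not the procedure we used: the function names, `max_tokens`, `subset_size`, and `seed` are all hypothetical, and it assumes that LMX tokens are separated by whitespace.

```python
import random
from pathlib import Path


def token_count(lmx_path: Path) -> int:
    # Assumption: LMX tokens are separated by whitespace.
    return len(lmx_path.read_text(encoding="utf-8").split())


def select_subset(lmx_paths, max_tokens: int, subset_size: int, seed: int = 0):
    """Drop overly long sequences, then draw a reproducible random subset."""
    # Sort first so the result depends only on the seed, not on input order.
    kept = [p for p in sorted(lmx_paths) if token_count(p) <= max_tokens]
    rng = random.Random(seed)
    return sorted(rng.sample(kept, min(subset_size, len(kept))))
```

Fixing the seed keeps the subset stable across runs. Suitable values for `max_tokens` and `subset_size` depend on the model and on the training budget, so none are suggested here.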


The Zeus models for (Camera) GrandStaff LMX

The zeus-olimpic-1.0-2024-02-12.model
