Skip to content

The public data and evaluation scripts for the MLSP 2024 Shared Task

Notifications You must be signed in to change notification settings

MLSP2024/MLSP_Data

Repository files navigation

MLSP_Data

Notice 1 - the MLSP 2024 Shared Task is now concluded. This repository is maintained for data distribution. **

Notice 2 - Cleaned Gold data is available at: https://huggingface.co/datasets/MLSP2024/MLSP2024 This data should be used in conjunction with Data/Test_Clean, not with Data/Test/ **

Notice 3 - The trial and test data is named as 'multilex', however the final dataset (as in the test_clean and gold data) is called 'MultiLS'. Please use this name to refer to the dataset.

The public data and evaluation scripts for the MLSP 2024 Shared Task

The Trial Data is found in Data/Trial/

The All/ folder contains the aggregated data for all languages. There are also subfolders for each language with their own data.

Each folder contains 3 files:

  • multilex_SPLIT_LANG_combined - containing Lexical Complexity Prediction and Lexical Simplification data on identical targets and contexts
  • multilex_SPLIT_LANG_lcp - containing Lexical Complexity Prediction data only on targets and contexts taken from the combined data
  • multilex_SPLIT_LANG_ls - containing Lexical Simplification data only on targets and contexts taken from the combined data

The scripts folder contains 2 scripts for evaluation

All Trial Data has now been released. Please register your interest in the task to receive updates (registration form at the task website here: https://sites.google.com/view/mlsp-sharedtask-2024/home?authuser=0)

All Test Data for the shared task is now released in full at Data/Test/ Participant submissions should be made via the MLSP_Participants Repository.

We have also included cleaned Test Data at Data/Test_Clean/ which removes some problematic instances (repetitions, missing targets, etc.). The labels for the cleaned test data are available via the MLSP 2024 HuggingFace Dataset

The baselines are available in separate repositories:

Metadata for the annotators of each subset is included in the Table below:

language age years in education nr. of L2-languages hours reading/week number of native annotators number of non-native annotators L1-languages
Catalan (trial) 56.9 (11.0) 20.9 (4.3) 2.7 (1.6) 20.1 (31.9) 5 5 Catalan (5), Spanish (4), German (1)
Catalan (test) 34.8 (15.2) 17.2 (3.6) 2.1 (0.9) 7.7 (7.1) 21 53 Spanish (56), Basque (1), English (1), German (1)
English 22.6 (3.9) 16.1 (2.5) 2.2 (0.8) 16.0 (14.2) 10 11 English (10), Chinese (3), Urdu (3), Persian (2), Assamese(1), Italian(1), Kurdish (1)
Filipino 26.4 (5.9) 15.6 (0.8) 2.2 (0.4) 17.3 (12.9) 10 0 Filipino (10)
French (LCP) 26.7 (3.7) 21.4 (2.9) 3.2 (1.3) 13.1 (8.5) 0 10 Arabic (2), Mandarin (2), German (1), Hindi (1), Italian (1), Japanese (1), Spanish (1), Turkish (1)
French (LS) 22.7 (1) 19.8 (1.1) 2.4 (0.5) 5.6 (2.1) 10 0 French (10)
German 29.3 (7.1) 18.5 (2.3) 3.3 (2.1) 10.2 (8.0) 10 0 German (10)
Italian 35.7 (7.4) 19.25 (2.1) 2.1 (0.7) 5.8 (4.1) 20 0 Italian (20)
Japanese (LCP) 40.8 (9.1) 18.4 (3.7) 1.8 (0.6) 5.7 (7.6) 0 10 English (5), Portuguese & English (1), French (1), French & English (1), Basque & Spanish (1), Swedish (1)
Japanese (LS) 54.1 (5.5) 16.8 (2.8) 1.4 (0.8) 15.9 (14.5) 10 0 Japanese (10)
Portuguese - - - - 10 0 Brazilian Portuguese (10)
Sinhala 28.5 - - - 10 - Sinhala (10)
Spanish (trial) 43.4 (14.2) 21.8 (3.3) 3.0 (1.2) 23.6 (33.83) 7 3 Spanish (7), Polish (1), Italian (1), German (1)
Spanish (test) 18.0 (1.4) 12.3 (1.5) 0.9 (0.5) 2.7 (2.8) 60 0 Spanish (60)

Further metadata on the texts in each subset is below. The summary statistics are based solely on the Trial data and are not guaranteed to be the same for the Test data.

language Target Group Text Genre Mean Complexity Mean Context Length Mean Unique Subs
Catalan Varied News 0.487 (0.125) 239.5 (70.1) 14.2 (3.4)
English University Students Wikibooks 0.200 (0.201) 111.0 (37.0) 6.2 (1.9)
Filipino University Staff Educational Books 0.171 (0.126) 64.1 (22.1) 4.0 (1.1)
French Language Learners Varied 0.371 (0.229) 129.1 (45.6) 10.1 (3.5)
German High-School Students Wiki / Literary 0.413 (0.191) 195.7 (59.6) 8.1 (2.8)
Italian Native Speakers Wikibooks/Wikiquote 0.248 (0.168) 168.4 (67.6) 7.8 (3.0)
Japanese Language Learners Varied 0.259 (0.173) 37.8 (7.3) 15.8 (4.6)
Portuguese MTurk Workers Varied 0.273 (0.165) 165.9 (74.1) 5.4 (1.2)
Sinhala University Staff News / Religious 0.243 (0.214) 163.4 (52.6) 4.3 (0.6)
Spanish Varied Educational Books 0.449 (0.233) 178.7 (48.1) 10.9 (3.8)

About

The public data and evaluation scripts for the MLSP 2024 Shared Task

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published