Predicting complexity in context for English by using distributional models, behavioural norms, and lexical resources

We provide an implementation of the models with which we participated (as team Andi) in the LCP (Lexical Complexity Prediction) task of SemEval 2021. The task involved predicting subjective complexity ratings for single words and multi-word expressions presented in context. Our approach, which ranked 4th in the single-word sub-task and 6th in the multi-word expression sub-task, relies on a combination of context-dependent and context-independent distributional models, together with behavioural norms and lexical resources.

If you want to test our models, you can run the two Jupyter notebooks (one for each sub-task). Please feel free to experiment with your own combinations of stimuli, norms, and models, once you make sure that they are in the proper format (see the information provided below). If you get interesting results, please let us know! 🙂

Before you start

Before you can run the demos, complete the following steps:

Create a dedicated Python environment (highly recommended) and install the necessary libraries. Start by installing PyTorch. Next, run the following command:

pip install notebook pandas scipy scikit-learn transformers
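To confirm that the environment is complete before opening the notebooks, a quick check along the following lines may help. This is a minimal sketch rather than part of the repository: the function name `missing_modules` is hypothetical, and the module list simply mirrors the libraries named above, using their import names (`sklearn` for scikit-learn, `torch` for PyTorch).

```python
from importlib.util import find_spec

# Import names differ from the pip names for two packages:
# scikit-learn is imported as "sklearn" and PyTorch as "torch".
REQUIRED_MODULES = ["torch", "notebook", "pandas", "scipy", "sklearn", "transformers"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If the script reports any missing modules, install them in the same environment before launching Jupyter.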

Place the necessary files in their corresponding directories, as follows:

Put the files 'lcp_single_train.tsv', 'lcp_single_test.tsv', 'lcp_multi_train.tsv', and 'lcp_multi_test.tsv' in the 'stimuli' folder. The four files can be obtained from the dedicated GitHub repository. Please note that, within that repository, the files in the 'test' folder contain only the stimuli, while the files in the 'test-labels' folder contain both the stimuli and their associated complexity ratings.
(Optional) Put the behavioural norms in the 'behavioural-norms' folder. Each file must be in CSV format and have a header with the variable names (e.g., Word,Frequency,SemanticDiversity,...). The first column ('Word') must contain the normed words, while the other columns must contain the behavioural data. For copyright reasons, we cannot upload the norms we used for our submission, but you can download them yourself by using the following links (just remember to convert them to the right format, keeping only the columns of interest):
(Optional) Put the context-independent embeddings (i.e., models) in the 'context-independent-models' folder. Each file must be in CSV format, but with no header. The first column must contain the words, while the other columns must contain the word vectors. The models we used for our submission can be downloaded from the dedicated OSF project.
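The two file layouts described above can be checked before running the notebooks. The sketch below is an illustration only, not code from the repository: the function names are hypothetical, and it assumes comma-separated fields, rows of equal length, and numeric vector components.

```python
import csv


def check_norms_file(path):
    """Check a behavioural-norms file: a header row whose first column
    is 'Word', followed by one row per normed word."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        raise ValueError("the file is empty")
    header, data = rows[0], rows[1:]
    if header[0] != "Word":
        raise ValueError(f"first header cell must be 'Word', found {header[0]!r}")
    # Assumption: every data row has exactly one cell per header column.
    if any(len(row) != len(header) for row in data):
        raise ValueError("every row must have as many cells as the header")
    return header[1:]  # names of the behavioural variables


def check_embeddings_file(path):
    """Check a context-independent model file: no header, a word in the
    first column, and the vector components in the remaining columns."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        raise ValueError("the file is empty")
    dims = {len(row) - 1 for row in rows}
    if len(dims) != 1:
        raise ValueError("all word vectors must have the same length")
    for row in rows:
        for value in row[1:]:
            float(value)  # raises ValueError on a non-numeric component
    return dims.pop()  # dimensionality of the word vectors
```

For example, `check_norms_file` returns the list of variable names found in the header, and `check_embeddings_file` returns the vector dimensionality, so either result can be compared against what you expect from the downloaded resource.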

(Optional) Make sure you have enough disk space for the context-dependent models (i.e., Hugging Face transformers). If you wish to change the location where the models are stored, uncomment the first two lines in the demo code and replace <new_cache_folder_path> with your chosen location. If you decide to use such models, keep in mind that the download might take some time, given that most models are around 500 MB in size.
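For reference, one common way to redirect the Hugging Face cache from Python is to set the `HF_HOME` environment variable before `transformers` is imported. The sketch below shows that general technique under stated assumptions; it is not necessarily what the two commented lines in the notebooks do, and the helper name `set_hf_cache` is hypothetical.

```python
import os


def set_hf_cache(cache_dir):
    """Create cache_dir if needed and point the Hugging Face cache at it.

    Assumption: this runs before `transformers` is imported, since the
    cache location is read from the HF_HOME environment variable.
    """
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["HF_HOME"] = cache_dir
    return os.environ["HF_HOME"]


# Usage (replace the placeholder with your chosen location):
# set_hf_cache("<new_cache_folder_path>")
# import transformers  # import only after the cache location is set
```

Alternatively, the `from_pretrained` methods in `transformers` accept a `cache_dir` argument, which changes the location for a single model without touching the environment.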
