Pretrained Language Models (PLMs) have revolutionized NLP, but their linguistic underpinnings still raise several questions. This thesis tries to shed some light on these questions by investigating PLMs' ability to encode morphosyntactic information, focusing on tense and subject-verb agreement.
A novel probing method that leverages neural probes is developed to test the representations generated by three PLM architectures: BERT, RoBERTa, and Sentence Transformer. These PLMs are tested across three morphologically diverse languages: English, Italian, and German.
This repository hosts the code, data and results of my Master's thesis. For a more in-depth explanation of the research questions, data, methodology and findings, please refer to the thesis report.
This project uses Poetry, a dependency management and packaging tool for Python. To install Poetry, follow the steps described at https://python-poetry.org/docs/#installation. Depending on your GPU, you may also need to adjust the following line in pyproject.toml to get the appropriate torch version for your setup:
torch = {file = "./torch-2.0.0+rocm5.4.2-cp310-cp310-linux_x86_64.whl"}
After installing Poetry and updating pyproject.toml, you can install the required dependencies and create a dedicated environment by running:
poetry install
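Once the install finishes, it can help to confirm that the torch build you selected is the one that ends up in the environment. The following is a minimal sketch, not part of the repository: the function name torch_status is hypothetical, and the check only reports what it finds rather than enforcing a particular version.

```python
import importlib.util


def torch_status():
    """Describe the torch installation visible to the current interpreter."""
    # find_spec avoids an ImportError when torch is missing from the environment.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    # torch.cuda.is_available() is also the check used by ROCm builds of PyTorch.
    return f"torch {torch.__version__}, GPU available: {torch.cuda.is_available()}"


print(torch_status())
```

Run it with the interpreter of the Poetry environment so that it inspects the packages installed there, not a system-wide Python.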
The details for the tense and agreement experiments in each language are outlined in the respective JSON files:
it_experiments.json for Italian
en_experiments.json for English
de_experiments.json for German
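To get a quick overview of a configuration file before running anything, you can inspect its top-level structure. The sketch below makes no assumption about the schema of these files, which is described in the thesis report and not here; the helper summarize_config is hypothetical and only reports whether the file holds an object, an array, or a single value.

```python
import json
from pathlib import Path


def summarize_config(path):
    """Return a short description of the top-level structure of a JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        config = json.load(handle)
    if isinstance(config, dict):
        return f"object with keys: {sorted(config)}"
    if isinstance(config, list):
        return f"array with {len(config)} items"
    return f"single value of type {type(config).__name__}"


# Look for the three configuration files in the current directory.
for code in ("it", "en", "de"):
    path = Path(f"{code}_experiments.json")
    if path.exists():
        print(path.name, "->", summarize_config(path))
    else:
        print(path.name, "not found in the current directory")
```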
To run the experiments for a specific language, execute the following command:
python main.py [language_code]_experiments.json
Replace [language_code] with the desired language code (it, en, or de). This runs main.py with the specified JSON file, which contains the experiment's configuration and data.
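If you want to run all three languages in sequence, the command above can be wrapped in a small script. This is a sketch under stated assumptions and not part of the repository: the names LANGUAGE_CODES, build_command and run_all are hypothetical, and the script assumes it is started from the directory that contains main.py and the JSON files.

```python
import subprocess
import sys

# Language codes taken from the configuration file names listed above.
LANGUAGE_CODES = ("it", "en", "de")


def build_command(language_code):
    """Build the command line for one language, using the current interpreter."""
    return [sys.executable, "main.py", f"{language_code}_experiments.json"]


def run_all(dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    commands = [build_command(code) for code in LANGUAGE_CODES]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop if an experiment exits with an error.
            subprocess.run(command, check=True)
    return commands


# Dry run by default: only shows what would be executed.
run_all(dry_run=True)
```

Calling run_all(dry_run=False) executes the experiments. Because the commands reuse the current interpreter, start the script from inside the Poetry environment so that the installed dependencies are available.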