Code Switched MLM

Code for the ACL 2023 paper 'Improving Pretraining Techniques for Code-Switched NLP'. Authors: Richeek Das, Sahasra Ranjan, Shreya Pathak, Preethi Jyothi.

This project builds on GlueCoS. The official GlueCoS repo provides more details about the setup and the datasets used for the fine-tuning tasks.

Overview of the repo

  1. Code:

    • experiments: Contains scripts for all the additional experiments, e.g. including amb tokens, inverting LIDs, maskable-OTHER, etc.
    • taggedData: Contains the final input data files. These files contain tokenized text data marked as maskable or not maskable, to be used during pretraining.
      • utils: tokenizeData.sh is the key script here; it takes LID-tagged data as input and generates the tokenized text files mentioned above.
    • utils: Contains the base Python scripts for pretraining, fine-tuning and probing.
    • The Code directory itself contains the bash scripts for pretraining, fine-tuning and probing. These are the files in which you set the training arguments and from which you run the code.
  2. Data:

    • LIDtagged: Default path to store the LID-tagged data.
    • MLM: Default path to store the original code-switched text data.
      • scripts: Standard data processing scripts.
  3. FreqMLM: All the scripts and experiments related to Frequency MLM

    • data: Contains additional data required for freqMLM
    • scripts: Contains scripts for Frequency MLM
    • vocab: Contains the vocabulary required for Frequency MLM. We cite the dataset used to build this vocabulary in the paper.

Setup

Get the dataset and set up the environment

  1. If you intend to use the Aksharantar, download this Google no swear dataset to the freqMLM/data directory.
  2. We provide small debug data (~20 sentences) in the repo to give an idea of the text format. We share the full pretraining data here: Dataset-CS-MLM. Furthermore, we used the GlueCoS finetune dataset for QA and SA, which we cite in the paper.
  3. Set up the conda environment: conda env create --name envname --file=environments.yml. Alternatively, you can use requirements.txt with Python 3.6.9 as the base Python version to set up the environment (see the sketch below).
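For reference, a minimal environment setup could look like the following (the environment name is arbitrary; the pip route assumes a Python 3.6.9 interpreter is already active):

```bash
# Option A: conda environment from the provided environments.yml
conda env create --name cs-mlm --file=environments.yml
conda activate cs-mlm

# Option B: plain pip install into an existing Python 3.6.9 environment
pip install -r requirements.txt
```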

Generate tokenized input files:

Standard MLM

  1. We provide a debug data file (without LID tags) here. We provide the full dataset here.
  2. Use gen_single_tagged_data.py to generate fake LID-tagged data.
  3. Run tokenizeData.sh with the above file as the source to generate the tokenized input for pretraining. Make sure the mask-type is set to all-tokens (see the sketch below).
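A rough sketch of this pipeline is shown below; the file paths and argument names are placeholders, so check the scripts themselves for the exact flags they expect:

```bash
# Placeholder paths/arguments -- adapt to the actual script interfaces.
# 1. Attach fake LID tags to plain code-switched text.
python Code/utils/gen_single_tagged_data.py \
    --input Data/MLM/debug_data.txt \
    --output Data/LIDtagged/debug_data_lid.txt

# 2. Tokenize and mark every token as maskable (mask-type: all-tokens).
bash Code/taggedData/utils/tokenizeData.sh Data/LIDtagged/debug_data_lid.txt all-tokens
```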

Switch MLM

  1. We provide a debug LID-tagged data file here. We provide the full dataset here.
  2. Run tokenizeData.sh with the above file as the source to generate the tokenized input for pretraining. Make sure the mask-type is set to around-switch (see the sketch below).
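As a sketch (the path and the way the mask type is passed are placeholders; check tokenizeData.sh for its actual interface):

```bash
# Tokenize LID-tagged data, marking tokens around switch points as maskable.
bash Code/taggedData/utils/tokenizeData.sh Data/LIDtagged/debug_data_lid.txt around-switch
```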

Frequency MLM

  1. Take the debug data file without LID tags here as the source file. Use gen_freqmlm_tags.py to identify the LID tags using the x-hit or nll approach, as described in the paper.
  2. As in Switch MLM, use the tokenizeData.sh script to tokenize the text data (see the sketch below).
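A sketch of the Frequency MLM pipeline; the flag names and paths below are placeholders, and the tagging mode refers to the x-hit or nll approach from the paper:

```bash
# Placeholder arguments -- check gen_freqmlm_tags.py for its actual options.
# 1. Assign LID tags using the x-hit or nll approach described in the paper.
python freqMLM/scripts/gen_freqmlm_tags.py \
    --input Data/MLM/debug_data.txt \
    --output Data/LIDtagged/debug_data_freq_lid.txt \
    --mode x-hit   # or: nll

# 2. Tokenize the frequency-tagged data, as in Switch MLM.
bash Code/taggedData/utils/tokenizeData.sh Data/LIDtagged/debug_data_freq_lid.txt around-switch
```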

Pretraining

  1. Use split_train_eval.py to split the tokenized input file into train and eval sets. Make sure the input file has a .txt extension.
  2. Use the pretraining script with the correct training arguments and the train/eval files.
    • pretrain.sh: Pretraining without auxloss. Run as: ./Code/pretrain.sh
    • pretrain_auxloss.sh: Pretraining with auxloss. Run as: ./Code/pretrain_auxloss.sh
  3. After pretraining, the model is saved at the location you specify in the training script; we then use it for fine-tuning and evaluation on the GlueCoS benchmark (see the sketch below).
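For example (the split script's path and arguments are placeholders; the pretraining commands are as listed above, with training arguments set inside the scripts):

```bash
# Placeholder invocation -- check split_train_eval.py for its actual options.
python Code/utils/split_train_eval.py tokenized_input.txt

# Pretrain without the auxiliary loss, or with it (uncomment the second line).
./Code/pretrain.sh
# ./Code/pretrain_auxloss.sh
```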

Fine tuning

  1. You need to set up the dataset provided by GlueCoS to run the fine-tuning scripts: add the Processed_Data directory generated by the GlueCoS setup to the Data directory.
  2. Give the correct data directory path to the train_qa.sh and train_sa.sh scripts.
  3. Question Answering task: run the train_qa script with appropriate hyperparameters. First set the correct pretrained model and language in the script, then run: bash ./Code/train_qa.sh
  4. Sentiment Analysis task: run the train_sa script with appropriate hyperparameters and the correct pretrained model. Run as: bash ./Code/train_sa.sh $LANG (see the sketch below).
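Putting it together, a typical fine-tuning run might look like the following (the pretrained model, data paths and language are configured in the scripts; the language code shown is only an example, so check train_sa.sh for the values it accepts):

```bash
# Question Answering: pretrained model and language are set inside the script.
bash ./Code/train_qa.sh

# Sentiment Analysis: pass the language as an argument.
bash ./Code/train_sa.sh $LANG   # e.g. a code such as en_hi (placeholder)
```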