Code for the ACL 2023 paper 'Improving Pretraining Techniques for Code-Switched NLP'.
Authors: Richeek Das, Sahasra Ranjan, Shreya Pathak, Preethi Jyothi
This project builds on GLUECoS. The official GLUECoS repo provides more details about the setup and the datasets used for the finetuning tasks.
- `Code`
  - `experiments`: Scripts for all the additional experiments, viz. including ambiguous tokens, inverting LIDs, maskable-OTHER, etc.
  - `taggedData`: The final input data files. These files contain tokenized text marked as maskable or not maskable, to be used during pretraining.
  - `utils`: Base Python scripts for pretraining, finetuning and probing. `tokenizeData.sh` is the most important script here: it takes LID-tagged data as input and generates the tokenized text data files mentioned above.
  - The bash scripts for pretraining, finetuning and probing sit directly under `Code`. These are the files you need to edit to set training arguments and run the code.
- `Data`
  - `LIDtagged`: Default path to store the LID-tagged data.
  - `MLM`: Default path to store the original code-switched text data.
  - `scripts`: Standard data processing scripts.
- `FreqMLM`: All the scripts and experiments related to Frequency MLM.
  - `data`: Additional data required for FreqMLM.
  - `scripts`: Scripts for Frequency MLM.
  - `vocab`: Vocabulary required for FreqMLM. The dataset we used to build the vocabulary is cited in the paper.
- If you intend to use Aksharantar, download the Google no-swear dataset to the `FreqMLM/data` directory.
- We provide a small debug dataset with ~20 sentences in the repo to give an idea of the text format. We share the full pretraining data here: Dataset-CS-MLM. Furthermore, we used the GLUECoS finetuning datasets for QA and SA, which we cite in the paper.
- Set up the conda environment:
  `conda env create --name envname --file=environments.yml`
  Alternatively, you can use `requirements.txt` with Python 3.6.9 as the base Python version to set up the environment (a minimal sketch follows).
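A minimal sketch of the pip-based alternative, assuming a Python 3.6.9 interpreter is available on your machine; only `requirements.txt` comes from the repo, the environment name is arbitrary:

```bash
# Pip-based setup sketch (assumes python3.6 resolves to a 3.6.9 interpreter).
python3.6 -m venv cs-mlm-env        # "cs-mlm-env" is an arbitrary, illustrative name
source cs-mlm-env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt     # requirements file shipped with this repo
```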
- Standard MLM: We provide a debug data file here with no LID tags. We provide the full dataset here.
- Use `gen_single_tagged_data.py` to generate fake LID-tagged data.
- Run `tokenizeData.sh` with the above file as the source to generate tokenized input for pretraining. Make sure `mask-type` is set to `all-tokens` (see the sketch below).
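A hypothetical end-to-end invocation of these two steps; the script locations, CLI arguments and file names below are assumptions, so check `gen_single_tagged_data.py` and `Code/utils/tokenizeData.sh` for the actual interface:

```bash
# Standard-MLM data prep sketch. Paths and argument names are illustrative only.
SRC=Data/MLM/debug_cs_text.txt                      # hypothetical path to the untagged debug file

# 1) Attach fake LID tags so the tokenizer script gets the format it expects.
python Code/utils/gen_single_tagged_data.py "$SRC"  # actual CLI may differ

# 2) Tokenize; for standard MLM every token is maskable.
bash Code/utils/tokenizeData.sh                     # set the source file and mask-type=all-tokens for the script
```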
- Switch MLM: We provide a debug LID-tagged data file here. We provide the full dataset here.
- Run `tokenizeData.sh` with the above file as the source to generate tokenized input for pretraining. Make sure `mask-type` is set to `around-switch` (see the sketch below).
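The invocation mirrors the standard-MLM sketch above; only the mask type changes. As before, the script path and how `mask-type` is supplied are assumptions:

```bash
# Switch-MLM tokenization sketch: mask only tokens around language-switch points.
bash Code/utils/tokenizeData.sh    # source = LID-tagged file, mask-type=around-switch
```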
- Freq MLM: Take the debug data file without LID tags here as the source file. Use `gen_freqmlm_tags.py` to identify the LID tags using the x-hit or NLL approach, as described in the paper.
- As with Switch MLM, use the `tokenizeData.sh` script to tokenize the text data (see the sketch below).
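A hypothetical Freq-MLM prep sketch; the flag used to choose between the x-hit and NLL tagging strategies is an assumption, so check `gen_freqmlm_tags.py` under `FreqMLM/scripts` for the real options:

```bash
# Freq-MLM data prep sketch. All argument names are illustrative.
python FreqMLM/scripts/gen_freqmlm_tags.py --approach x-hit   # or: --approach nll (hypothetical flag)
bash Code/utils/tokenizeData.sh                               # same tokenization step as for Switch MLM
```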
- Use `split_train_eval.py` to split the tokenized input file into a train and an eval set. Make sure the input file has a `.txt` extension.
- Use the pretraining script with the correct training arguments and the train/eval files (a combined sketch follows this list):
  - `pretrain.sh`: Pretraining without the auxiliary loss. Run as: `./Code/pretrain.sh`
  - `pretrain_auxloss.sh`: Pretraining with the auxiliary loss. Run as: `./Code/pretrain_auxloss.sh`
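A combined sketch of the split-and-pretrain step. The two run commands come straight from this README; the argument to `split_train_eval.py` and the file path are assumptions:

```bash
# Split the tokenized file (must end in .txt) into train/eval, then pretrain.
python split_train_eval.py Code/taggedData/tokenized_input.txt   # hypothetical path/CLI

# Pretraining without the auxiliary loss:
./Code/pretrain.sh
# ...or with the auxiliary loss:
./Code/pretrain_auxloss.sh
```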
- After pretraining, the model is saved to the location you specify in the training script; we then use it to finetune and evaluate our method on the GLUECoS benchmark.
- You need to set up the dataset provided by GLUECoS to run the finetuning scripts. Add the `Processed_Data` directory generated by the GLUECoS setup to the `Data` directory (see the sketch below).
- Point `train_qa.sh` and `train_sa.sh` to the correct data directory.
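For example, assuming you have already run the GLUECoS data preparation in a separate checkout (all paths below are illustrative):

```bash
# Copy the processed GLUECoS data into this repo's Data directory.
cp -r /path/to/GLUECoS/Data/Processed_Data ./Data/
```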
- Question Answering: Run the `train_qa.sh` script with appropriate hyperparameters. First set the correct pretrained model and language in the script, then run: `bash ./Code/train_qa.sh`
- Sentiment Analysis: Run the `train_sa.sh` script with appropriate hyperparameters and the correct pretrained model. Run as: `bash ./Code/train_sa.sh $LANG` (see the sketch below).
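Putting both tasks together, a hypothetical finetuning run; the pretrained-model path and the value of `$LANG` are placeholders that must match your pretraining output and the language pairs expected by the scripts:

```bash
# Finetune the pretrained checkpoint on the GLUECoS tasks.
# Set the pretrained-model path (and language, for QA) inside the scripts as described above.
bash ./Code/train_qa.sh            # Question Answering
bash ./Code/train_sa.sh "$LANG"    # Sentiment Analysis; $LANG = language pair expected by train_sa.sh
```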