Code Switched MLM

Code for the ACL 2023 paper 'Improving Pretraining Techniques for Code-Switched NLP'. Authors: Richeek Das, Sahasra Ranjan, Shreya Pathak, Preethi Jyothi.

This project builds on GlueCoS. The official GlueCoS repo provides more details about the setup and the datasets used for the fine-tuning tasks.

Overview of the repo

  1. Code:

    • experiments: Contains scripts for all the additional experiments, e.g. including amb tokens, inverting LIDs, maskable-OTHER, etc.
    • taggedData: Contains the final input data files. These files contain tokenized text data marked as maskable or not maskable, to be used during pretraining.
      • utils: tokenizeData.sh is the key script here; it takes LID-tagged data as input and generates the tokenized text files mentioned above.
    • utils: Contains the base Python scripts for pretraining, fine-tuning and probing.
    • The Code directory itself contains the bash scripts for pretraining, fine-tuning and probing. These are the files in which you set the training arguments and from which you run the code.
  2. Data:

    • LIDtagged: Default path to store the LID-tagged data.
    • MLM: Default path to store the original code-switched text data.
      • scripts: Standard data processing scripts.
  3. FreqMLM: All the scripts and experiments related to Frequency MLM

    • data: Contains additional data required for freqMLM
    • scripts: Contains scripts for Frequency MLM
    • vocab: Contains the vocabulary required for Frequency MLM. We cite the dataset used to build this vocabulary in the paper.

Setup

Get the dataset and set up the environment

  1. If you intend to use the Aksharantar, download this Google no swear dataset to the freqMLM/data directory.
  2. We provide small debug data (~20 sentences) in the repo to give an idea of the text format. We share the full pretraining data here: Dataset-CS-MLM. Furthermore, we used the GlueCoS finetune dataset for QA and SA, which we cite in the paper.
  3. Set up the conda environment: conda env create --name envname --file=environments.yml. Alternatively, you can use requirements.txt with Python 3.6.9 as the base Python version to set up the environment (see the sketch below).
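For reference, a minimal environment setup could look like the following (the environment name is arbitrary; the pip route assumes a Python 3.6.9 interpreter is already active):

```bash
# Option A: conda environment from the provided environments.yml
conda env create --name cs-mlm --file=environments.yml
conda activate cs-mlm

# Option B: plain pip install into an existing Python 3.6.9 environment
pip install -r requirements.txt
```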

Generate tokenized input files:

Standard MLM

  1. We provide a debug data file (without LID tags) here. We provide the full dataset here.
  2. Use gen_single_tagged_data.py to generate fake LID-tagged data.
  3. Run tokenizeData.sh with the above file as the source to generate the tokenized input for pretraining. Make sure the mask-type is set to all-tokens (see the sketch below).
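A rough sketch of this pipeline is shown below; the file paths and argument names are placeholders, so check the scripts themselves for the exact flags they expect:

```bash
# Placeholder paths/arguments -- adapt to the actual script interfaces.
# 1. Attach fake LID tags to plain code-switched text.
python Code/utils/gen_single_tagged_data.py \
    --input Data/MLM/debug_data.txt \
    --output Data/LIDtagged/debug_data_lid.txt

# 2. Tokenize and mark every token as maskable (mask-type: all-tokens).
bash Code/taggedData/utils/tokenizeData.sh Data/LIDtagged/debug_data_lid.txt all-tokens
```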

Switch MLM

  1. We provide a debug LID-tagged data file here. We provide the full dataset here.
  2. Run tokenizeData.sh with the above file as the source to generate the tokenized input for pretraining. Make sure the mask-type is set to around-switch (see the sketch below).
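As a sketch (the path and the way the mask type is passed are placeholders; check tokenizeData.sh for its actual interface):

```bash
# Tokenize LID-tagged data, marking tokens around switch points as maskable.
bash Code/taggedData/utils/tokenizeData.sh Data/LIDtagged/debug_data_lid.txt around-switch
```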

Frequency MLM

  1. Take the debug data file without LID tags here as the source file. Use gen_freqmlm_tags.py to identify the LID tags using the x-hit or nll approach, as described in the paper.
  2. As in Switch MLM, use the tokenizeData.sh script to tokenize the text data (see the sketch below).
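A sketch of the Frequency MLM pipeline; the flag names and paths below are placeholders, and the tagging mode refers to the x-hit or nll approach from the paper:

```bash
# Placeholder arguments -- check gen_freqmlm_tags.py for its actual options.
# 1. Assign LID tags using the x-hit or nll approach described in the paper.
python freqMLM/scripts/gen_freqmlm_tags.py \
    --input Data/MLM/debug_data.txt \
    --output Data/LIDtagged/debug_data_freq_lid.txt \
    --mode x-hit   # or: nll

# 2. Tokenize the frequency-tagged data, as in Switch MLM.
bash Code/taggedData/utils/tokenizeData.sh Data/LIDtagged/debug_data_freq_lid.txt around-switch
```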

Pretraining

  1. Use split_train_eval.py to split the tokenized input file into train and eval sets. Make sure the input file has a .txt extension.
  2. Use the pretraining script with the correct training arguments and the train/eval files.
    • pretrain.sh: Pretraining without auxloss. Run as: ./Code/pretrain.sh
    • pretrain_auxloss.sh: Pretraining with auxloss. Run as: ./Code/pretrain_auxloss.sh
  3. After pretraining, the model is saved at the location you specify in the training script; we then use it for fine-tuning and evaluation on the GlueCoS benchmark (see the sketch below).
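For example (the split script's path and arguments are placeholders; the pretraining commands are as listed above, with training arguments set inside the scripts):

```bash
# Placeholder invocation -- check split_train_eval.py for its actual options.
python Code/utils/split_train_eval.py tokenized_input.txt

# Pretrain without the auxiliary loss, or with it (uncomment the second line).
./Code/pretrain.sh
# ./Code/pretrain_auxloss.sh
```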

Fine tuning

  1. You need to set up the dataset provided by GlueCoS to run the fine-tuning scripts: add the Processed_Data directory generated by the GlueCoS setup to the Data directory.
  2. Give the correct data directory path to the train_qa.sh and train_sa.sh scripts.
  3. Question Answering task: run the train_qa script with appropriate hyperparameters. First set the correct pretrained model and language in the script, then run: bash ./Code/train_qa.sh
  4. Sentiment Analysis task: run the train_sa script with appropriate hyperparameters and the correct pretrained model. Run as: bash ./Code/train_sa.sh $LANG (see the sketch below).
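Putting it together, a typical fine-tuning run might look like the following (the pretrained model, data paths and language are configured in the scripts; the language code shown is only an example, so check train_sa.sh for the values it accepts):

```bash
# Question Answering: pretrained model and language are set inside the script.
bash ./Code/train_qa.sh

# Sentiment Analysis: pass the language as an argument.
bash ./Code/train_sa.sh $LANG   # e.g. a code such as en_hi (placeholder)
```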