117 changes: 117 additions & 0 deletions examples/commonsense_qa/README.md
@@ -0,0 +1,117 @@
# commonsense_qa

Author: Jingcheng Hu, hujc22@mails.tsinghua.edu.cn, https://reign12.github.io/

Student ID: 2022312848

## Task Description
### Dataset Statistics
We follow the train-validation-test split from [Huggingface commonsense_qa](https://huggingface.co/datasets/commonsense_qa).
Note, however, that answers are not provided for the test set, so we do not use it in our experiments. There are 9,741 samples in the training set and 1,221 samples in the validation set.
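The split sizes can be double-checked with a short snippet (a minimal sketch using the `datasets` library; it only loads the dataset and prints the split lengths):
```python
from datasets import load_dataset

# Load the Hugging Face version of CommonsenseQA.
dataset = load_dataset("commonsense_qa")

# The test split ships without answer keys, so only train/validation are used here.
for split in ("train", "validation"):
    print(split, len(dataset[split]))  # expected: train 9741, validation 1221
```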

### Task Introduction
For the task prompt, we use prompt templates from [promptsource](https://github.com/bigscience-workshop/promptsource).
commonsense_qa is a multiple-choice question-answering task that requires rich world knowledge.
Here is an example:
```python
{
    'id': '075e483d21c29a511267ef62bedc0461',
    'question': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?',
    'question_concept': 'punishing',
    'choices': {
        'label': ['A', 'B', 'C', 'D', 'E'],
        'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']
    },
    'answerKey': 'A'
}
```
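Applying a template to a sample looks roughly like the sketch below; it mirrors `build_prompter` in `data.py` (only templates with `original_task=True` are kept, and `data.prompt_id` indexes into that filtered list):
```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

sample = load_dataset("commonsense_qa", split="validation")[0]

all_prompts = DatasetTemplates("commonsense_qa")
# Keep only templates that preserve the original task, as data.py does.
prompt_key = [name for name in all_prompts.all_template_names
              if all_prompts[name].metadata.original_task]
prompter = all_prompts[prompt_key[2]]  # prompt_id="2" in the training command below

prompt, answer = prompter.apply(sample)
choices = prompter.get_answer_choices_list(sample)
print(prompt, choices, answer, sep="\n")
```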

## How to Train and Eval
### Dependency
Activate your own conda environment and run:
```bash
bash env_setup.sh cuda # if you are running on NVIDIA GPUs

bash env_setup.sh rocm # if you are running on AMD GPUs
```
### Training and Evaluation
You can run `python main.py --help` or look directly at `./config/basic.yaml` to see all supported configuration options.

To launch distributed training (evaluation runs along the way; per-step loss, per-epoch loss, and per-epoch accuracy are recorded):
```bash
torchrun --nproc_per_node <YOUR_GPU_NUM> main.py \
    task="pc" \
    data.dataset="commonsense_qa" \
    model.name="BAAI/glm-roberta-large" \
    data.prompt_id="2" \
    jobname=<ANY_NAME_YOU_LIKE> \
    debug=False \
    optimizer.lr="1e-5" \
    trainer.batch="32" \
    trainer.accumulate_steps="2" \
    trainer.epochs="10" trainer.warmup_epochs="1"
# task="pc": this is a Prompted Choice task.
# model.name: we also support bert-large-uncased and roberta-large.
# data.prompt_id: index into the original_task=True prompt templates from promptsource; for the
#   name of each prompt, see the training log printed when the job starts, e.g.
#   "train dataset prompt_key ['answer_given_question_without_options', 'most_suitable_answer', 'question_answering', 'question_to_answer_index']".
# debug: set debug=True to disable wandb; wandb variables can be set as environment variables
#   or entered when the program asks for them; see logger.py for details.
# optimizer.lr: no lr scaling is done; this lr is the final lr.
# trainer.batch: total batch size summed over all cards.
# trainer.accumulate_steps: gradient accumulation steps give a larger effective batch size.
# trainer.epochs / trainer.warmup_epochs: we use linear warmup and cosine decay.
# More options can be changed; refer to ./config/basic.yaml and follow the pattern above.
```
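As a sanity check on the batch-size semantics above (`trainer.batch` is summed over all cards, and gradient accumulation multiplies the effective batch per optimizer update), the arithmetic is simply the following; the variable names are illustrative:
```python
world_size = 4           # <YOUR_GPU_NUM>
total_batch = 32         # trainer.batch, summed over all cards
accumulate_steps = 2     # trainer.accumulate_steps

per_gpu_batch = total_batch // world_size          # 8 samples per card per forward pass
effective_batch = total_batch * accumulate_steps   # 64 samples per optimizer update
print(per_gpu_batch, effective_batch)
```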

## Results

The final-epoch accuracy on the validation set using the commands above is 64.22%.
For RoBERTa-Large and BERT-Large we tuned the learning rate; the best accuracies are 74.59% and 62.90%, respectively.

| Model | Accuracy (%) |
|:---:|:---:|
| glm-roberta-large | 64.22 |
| roberta-large | 74.59 |
| bert-large-uncased | 62.90 |

## Reference
### Dataset
The CommonsenseQA dataset paper:
```latex
@inproceedings{Talmor2019,
title = {{{CommonsenseQA}}: {{A Question Answering Challenge Targeting Commonsense Knowledge}}},
shorttitle = {{{CommonsenseQA}}},
booktitle = {Proceedings of the 2019 {{Conference}} of the {{North American Chapter}} of the {{Association}} for {{Computational Linguistics}}: {{Human Language Technologies}}, {{Volume}} 1 ({{Long}} and {{Short Papers}})},
author = {Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan},
date = {2019-06},
pages = {4149--4158},
publisher = {{Association for Computational Linguistics}},
location = {{Minneapolis, Minnesota}},
doi = {10.18653/v1/N19-1421},
url = {https://aclanthology.org/N19-1421},
urldate = {2023-01-07},
eventtitle = {{{NAACL-HLT}} 2019}
}
```
The Hugging Face link for commonsense_qa is given above.

For RoBERTa and BERT we directly use the Hugging Face implementations. The original papers are:
```latex
@misc{Liu2019,
title = {{{RoBERTa}}: {{A Robustly Optimized BERT Pretraining Approach}}},
shorttitle = {{{RoBERTa}}},
author = {Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
date = {2019-07-26},
number = {arXiv:1907.11692},
eprint = {1907.11692},
eprinttype = {arxiv},
primaryclass = {cs},
publisher = {{arXiv}},
url = {http://arxiv.org/abs/1907.11692},
urldate = {2023-01-07},
archiveprefix = {arXiv},
version = {1}
}

@unpublished{devlinBERTPretrainingDeep2019,
title = {{{BERT}}: {{Pre-training}} of {{Deep Bidirectional Transformers}} for {{Language Understanding}}},
shorttitle = {{{BERT}}},
author = {Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
date = {2019-05-24},
eprint = {1810.04805},
eprinttype = {arxiv},
primaryclass = {cs},
url = {http://arxiv.org/abs/1810.04805},
urldate = {2022-04-12},
archiveprefix = {arXiv}
}
```
46 changes: 46 additions & 0 deletions examples/commonsense_qa/config/basic.yaml
@@ -0,0 +1,46 @@
defaults:
- _self_
- override hydra/hydra_logging: disabled
- override hydra/job_logging: disabled

hydra:
output_subdir: null
run:
dir: .

debug: False
jobname: test
task: "pc" # pg:generation, pc:choice
model:
# for pg: BAAI/glm-roberta-large,t5-large
# for pc: BAAI/glm-roberta-large, bert-large-uncased, roberta-large
name: "BAAI/glm-roberta-large"
max_length: 384 # dataset related
max_gen_length: 128 # dataset related
data:
dataset: "multi_news" # currently support commonsense_qa, multi_news
# dataset: "commonsense_qa" # currently support commonsense_qa, multi_news
tokenizer: ${model.name}
max_length: ${model.max_length}
max_gen_length: ${model.max_gen_length}
prompt_id: 0 # id in promptsource original_task=True name_l
answer_prompt: "Answer:"
optimizer:
lr: 1e-5
beta1: 0.9
beta2: 0.999
wd: 0.01
trainer:
batch: 64 # batch in total
accumulate_steps: 1
epochs: 10
lrscheduler: cosine
warmup_start: 1e-7
warmup_epochs: 1
num_workers: 1 # num_workers in total
pin_memory: True
log_interval: 100
qualitative_num: 5
checkpoint_dir: "./checkpoints" # path for ckps
distributed:
backend: "nccl"
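# Note: every field above can be overridden from the command line via Hydra;
# the override below is illustrative and mirrors the README example:
#   torchrun --nproc_per_node <YOUR_GPU_NUM> main.py data.dataset=commonsense_qa optimizer.lr=1e-5 trainer.batch=32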
218 changes: 218 additions & 0 deletions examples/commonsense_qa/data.py
@@ -0,0 +1,218 @@
from typing import Dict,Tuple,List
from torch import Tensor

import torch
from torch.utils.data import Dataset
from omegaconf import DictConfig

# Support GLM, BART, T5
from transformers import AutoTokenizer
# Support commonsense_qa, multi_news
from datasets import load_dataset

# prompt support
from promptsource.templates import DatasetTemplates

from einops import rearrange

class PCDataCollator:
def __init__(self,datacollator_config: DictConfig):
self.datacollator_config = datacollator_config
self.tokenizer = AutoTokenizer.from_pretrained(self.datacollator_config.tokenizer,trust_remote_code=True)
self.collator = self.build_collator()
def __call__(self, batch:List[Tuple[str,List[str],int]]) -> Dict[str,Tensor]:
return self.collator(batch)
def build_collator(self):
if "glm" in self.datacollator_config.tokenizer:
return self.glm_collator
elif "roberta" in self.datacollator_config.tokenizer:
return self.roberta_collator
elif 'bert' in self.datacollator_config.tokenizer:
return self.roberta_collator
else:
raise NotImplementedError("Not implemented yet")
def roberta_collator(self,batch:List[Tuple[str,List[str],int]]) -> Dict[str,Tensor]:
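        # Flatten every (prompt, choices, label) item into one prompt/choice pair per
        # choice, tokenize the pairs jointly, then reshape back to
        # (batch, num_choices, seq_len) so a multiple-choice head can score each choice.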
prompt_l = []
choice_l = []
choice_ids = []
for item in batch:
for choice in item[1]:
prompt_l.append(item[0])
choice_l.append(choice)
choice_ids.append(item[2])
res = self.tokenizer(prompt_l,choice_l,
return_tensors="pt",
padding=True,truncation=True,max_length=self.datacollator_config.max_length)
for key in res:
res[key] = rearrange(res[key],'(b c) l -> b c l',b=len(batch),c=len(item[1]))

labels = torch.tensor(choice_ids)
res['labels'] = labels
return res

def glm_collator(self,batch:List[Tuple[str,List[str],int]]) -> Dict[str,Tensor]:
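        # Tokenize the prompts only; the GLM tokenizer's build_inputs_for_multiple_choice
        # then expands them with the candidate choices.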
prompts,choices_l,choice_ids = zip(*batch)
prompts = self.tokenizer(prompts,return_tensors="pt",
padding=True,truncation=True,max_length=self.datacollator_config.max_length
)
res = self.tokenizer.build_inputs_for_multiple_choice(prompts,choices_l)
labels = torch.tensor(choice_ids)
res['labels'] = labels
return res

class PCDataset(Dataset):
def __init__(self,dataset_config: DictConfig,split:str):
self.dataset_config = dataset_config
self.dataset = load_dataset(*dataset_config.dataset.split("/"),split=split)
self.prompt_key,self.prompter = self.build_prompter()
self.adapter = self.build_adapter()

def __len__(self):
return len(self.dataset)
def __getitem__(self, index: int) -> Tuple[str,List[str],int]:
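        # Render the raw sample with the promptsource template, recover the gold choice
        # index, append the answer prompt (e.g. "Answer:"), and let the model-specific
        # adapter add any mask token it needs.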
data = self.dataset[index]
prompt,choice = self.prompter.apply(data)
choices_l = self.prompter.get_answer_choices_list(data)
choice_id = choices_l.index(choice)
prompt = prompt + "\n\n" + self.dataset_config.answer_prompt
res = self.adapter(prompt,choices_l,choice_id)
return res
def build_adapter(self):
if "glm" in self.dataset_config.tokenizer:
return self.glm_adapter
elif "roberta" in self.dataset_config.tokenizer:
return self.roberta_adapter
elif 'bert' in self.dataset_config.tokenizer:
return self.roberta_adapter
else:
raise NotImplementedError("Not implemented yet")
def roberta_adapter(self,prompt:str,choices_l:List[str],choice_id:int) -> Tuple[str,List[str],int]:
return prompt,choices_l,choice_id
def glm_adapter(self,prompt:str,choices_l:List[str],choice_id:int) -> Tuple[str,List[str],int]:
prompt += "[MASK]"
return prompt,choices_l,choice_id
def build_prompter(self):
all_prompts = DatasetTemplates(self.dataset_config.dataset)
# filter out those not original_task
prompt_key = [name for name in all_prompts.all_template_names if all_prompts[name].metadata.original_task ]
prompter = all_prompts[prompt_key[self.dataset_config.prompt_id]]
return prompt_key,prompter

class PGDataCollator:
def __init__(self,datacollator_config: DictConfig,split:str):
self.datacollator_config = datacollator_config
self.split = split
self.tokenizer = AutoTokenizer.from_pretrained(self.datacollator_config.tokenizer,trust_remote_code=True)
self.collator = self.build_collator()
def build_collator(self):
if "glm" in self.datacollator_config.tokenizer:
if self.split == "train":
return self.glm_train_collator
else:
return self.glm_test_collator
elif "t5" in self.datacollator_config.tokenizer:
if self.split == "train":
return self.t5_train_collator
else:
return self.t5_test_collator
else:
raise NotImplementedError("Not implemented yet")
def t5_train_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
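        # Truncate from the left so the tail of the prompt (which ends with the answer
        # cue) is kept; targets are tokenized separately and used as labels.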
prompts,answers = [list(item) for item in zip(*batch)]
self.tokenizer.truncation_side = 'left'
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res['labels'] = self.tokenizer(answers,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")['input_ids']
return res
def t5_test_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
prompts,answers = [list(item) for item in zip(*batch)]
self.tokenizer.truncation_side = 'left'
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res['labels'] = [answer[len("<extra_id_0> "):] for answer in answers] # rm the prepended <extra_id_0>
res['prompts'] = prompts
return res
def glm_train_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
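        # GLM packs prompt and target together via build_inputs_for_generation so the
        # model learns to fill the generation span during training.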
prompts,answers = [list(item) for item in zip(*batch)]
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res = self.tokenizer.build_inputs_for_generation(res,targets=answers,max_gen_length=self.datacollator_config.max_gen_length)
return res
def glm_test_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
prompts,answers = [list(item) for item in zip(*batch)]
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res = self.tokenizer.build_inputs_for_generation(res,max_gen_length=self.datacollator_config.max_gen_length)
res['labels'] = answers
res['prompts'] = prompts
return res

def __call__(self, batch: List[Tuple[str,str]]) -> Dict[str, Tensor]:
return self.collator(batch)

class PGDataset(Dataset):
def __init__(self,dataset_config:DictConfig,split:str):
"""
split = "train" or "validation" or "test"
"""
self.dataset_config = dataset_config
self.max_length = dataset_config.max_length
self.max_gen_length = dataset_config.max_gen_length

self.dataset = load_dataset(*dataset_config.dataset.split("/"),split=split)
self.prompt_key,self.prompter = self.build_prompter()
self.answer_prompt = dataset_config.answer_prompt
self.adapter = self.build_adapter()


def build_adapter(self):
adapter_name = self.dataset_config.tokenizer
if "glm" in adapter_name:
adapter = self.glm_adapter
elif "t5" in adapter_name:
adapter = self.t5_adapter
elif "bart" in adapter_name:
adapter = self.bart_adapter
else:
raise NotImplementedError(f"Adapter {adapter_name} is not supported")
return adapter

def glm_adapter(self,prompted_data:Tuple[str,str])->Tuple[str,str]:
prompt,answer = prompted_data
# add mask token
prompt += "[MASK]"
res = prompt,answer
return res

def t5_adapter(self,prompted_data):
prompt,answer = prompted_data
# add sentinel token for prompt and answer
prompt = f'{prompt} <extra_id_0>'
answer = f'<extra_id_0> {answer}'
return prompt,answer

def build_prompter(self):
all_prompts = DatasetTemplates(self.dataset_config.dataset)
# filter out those not original_task
prompt_key = [name for name in all_prompts.all_template_names if all_prompts[name].metadata.original_task ]
prompter = all_prompts[prompt_key[self.dataset_config.prompt_id]]
return prompt_key,prompter

def __len__(self)->int:
return len(self.dataset)
def __getitem__(self, index:int)->Tuple[str,str]:
        # Format the data using the prompt template, add a mask/sentinel token based on
        # the model, then hand the strings to the collator's tokenizer (padding and
        # truncation use max_length there).
data = self.dataset[index]
prompted_data = self.prompter.apply(data)
prompted_data[0] = prompted_data[0] + "\n\n" + self.answer_prompt
res = self.adapter(prompted_data)
return res



if __name__ == "__main__":
dataset = load_dataset("commonsense_qa")
print(dataset.keys())
dataset = load_dataset("multi_news")
print(dataset.keys())

# multi_news_prompts = DatasetTemplates("multi_news")
# print(multi_news_prompts.all_template_names)

13 changes: 13 additions & 0 deletions examples/commonsense_qa/env_setup.sh
@@ -0,0 +1,13 @@
# Supports CUDA and ROCm environments.
# For CUDA we tested on V100 and A100.
# For ROCm we tested on MI100, using an image built with pytorch-1.11.0-rocm5.1.3.
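# Usage: bash env_setup.sh [cuda|rocm]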

GPU_ENV="$1"
if [ "$GPU_ENV" = "cuda" ] || [ "$GPU_ENV" = "rocm" ]; then
echo "Installing $GPU_ENV environment"
pip install -r "requirements_torch_${GPU_ENV}.txt"
pip install -r requirements.txt
pip install numpy tqdm -U
else
echo "Unsupported environment $GPU_ENV"
fi