117 changes: 117 additions & 0 deletions examples/commonsense_qa/README.md
@@ -0,0 +1,117 @@
# commonsense_qa

Author: Jingcheng Hu, hujc22@mails.tsinghua.edu.cn, https://reign12.github.io/

Student ID: 2022312848

## Task Description
### Dataset Statistics
We follow the train-validation-test split from [Huggingface commonsense_qa](https://huggingface.co/datasets/commonsense_qa).
Note, however, that answers are not provided for the test set, so we do not use it in our experiments. There are 9,741 samples in the training set and 1,221 samples in the validation set.
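The split sizes can be double-checked with a short snippet (a minimal sketch using the `datasets` library; it only loads the dataset and prints the split lengths):
```python
from datasets import load_dataset

# Load the Hugging Face version of CommonsenseQA.
dataset = load_dataset("commonsense_qa")

# The test split ships without answer keys, so only train/validation are used here.
for split in ("train", "validation"):
    print(split, len(dataset[split]))  # expected: train 9741, validation 1221
```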

### Task Introduction
For the task prompt, we use prompt templates from [promptsource](https://github.com/bigscience-workshop/promptsource).
commonsense_qa is a multiple-choice question-answering task that requires rich world knowledge.
Here is an example:
```python
{
    'id': '075e483d21c29a511267ef62bedc0461',
    'question': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?',
    'question_concept': 'punishing',
    'choices': {
        'label': ['A', 'B', 'C', 'D', 'E'],
        'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']
    },
    'answerKey': 'A'
}
```
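Applying a template to a sample looks roughly like the sketch below; it mirrors `build_prompter` in `data.py` (only templates with `original_task=True` are kept, and `data.prompt_id` indexes into that filtered list):
```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

sample = load_dataset("commonsense_qa", split="validation")[0]

all_prompts = DatasetTemplates("commonsense_qa")
# Keep only templates that preserve the original task, as data.py does.
prompt_key = [name for name in all_prompts.all_template_names
              if all_prompts[name].metadata.original_task]
prompter = all_prompts[prompt_key[2]]  # prompt_id="2" in the training command below

prompt, answer = prompter.apply(sample)
choices = prompter.get_answer_choices_list(sample)
print(prompt, choices, answer, sep="\n")
```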

## How to Train and Eval
### Dependency
Activate your own conda environment and run:
```bash
bash env_setup.sh cuda # if you are running on NVIDIA GPUs

bash env_setup.sh rocm # if you are running on AMD GPUs
```
### Training and Evaluation
You can run `python main.py --help` or look directly at `./config/basic.yaml` to see all supported configuration options.

To launch distributed training (evaluation runs along the way; per-step loss, per-epoch loss, and per-epoch accuracy are recorded):
```bash
torchrun --nproc_per_node <YOUR_GPU_NUM> main.py \
    task="pc" \
    data.dataset="commonsense_qa" \
    model.name="BAAI/glm-roberta-large" \
    data.prompt_id="2" \
    jobname=<ANY_NAME_YOU_LIKE> \
    debug=False \
    optimizer.lr="1e-5" \
    trainer.batch="32" \
    trainer.accumulate_steps="2" \
    trainer.epochs="10" trainer.warmup_epochs="1"
# task="pc": this is a Prompted Choice task.
# model.name: we also support bert-large-uncased and roberta-large.
# data.prompt_id: index into the original_task=True prompt templates from promptsource; for the
#   name of each prompt, see the training log printed when the job starts, e.g.
#   "train dataset prompt_key ['answer_given_question_without_options', 'most_suitable_answer', 'question_answering', 'question_to_answer_index']".
# debug: set debug=True to disable wandb; wandb variables can be set as environment variables
#   or entered when the program asks for them; see logger.py for details.
# optimizer.lr: no lr scaling is done; this lr is the final lr.
# trainer.batch: total batch size summed over all cards.
# trainer.accumulate_steps: gradient accumulation steps give a larger effective batch size.
# trainer.epochs / trainer.warmup_epochs: we use linear warmup and cosine decay.
# More options can be changed; refer to ./config/basic.yaml and follow the pattern above.
```
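As a sanity check on the batch-size semantics above (`trainer.batch` is summed over all cards, and gradient accumulation multiplies the effective batch per optimizer update), the arithmetic is simply the following; the variable names are illustrative:
```python
world_size = 4           # <YOUR_GPU_NUM>
total_batch = 32         # trainer.batch, summed over all cards
accumulate_steps = 2     # trainer.accumulate_steps

per_gpu_batch = total_batch // world_size          # 8 samples per card per forward pass
effective_batch = total_batch * accumulate_steps   # 64 samples per optimizer update
print(per_gpu_batch, effective_batch)
```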

## Results

The final-epoch accuracy on the validation set using the commands above is 64.22%.
For RoBERTa-Large and BERT-Large we tuned the learning rate; the best accuracies are 74.59% and 62.90%, respectively.

| Model | Accuracy (%) |
|:---:|:---:|
| glm-roberta-large | 64.22 |
| roberta-large | 74.59 |
| bert-large-uncased | 62.90 |

## Reference
### Dataset
The CommonsenseQA dataset paper:
```latex
@inproceedings{Talmor2019,
title = {{{CommonsenseQA}}: {{A Question Answering Challenge Targeting Commonsense Knowledge}}},
shorttitle = {{{CommonsenseQA}}},
booktitle = {Proceedings of the 2019 {{Conference}} of the {{North American Chapter}} of the {{Association}} for {{Computational Linguistics}}: {{Human Language Technologies}}, {{Volume}} 1 ({{Long}} and {{Short Papers}})},
author = {Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan},
date = {2019-06},
pages = {4149--4158},
publisher = {{Association for Computational Linguistics}},
location = {{Minneapolis, Minnesota}},
doi = {10.18653/v1/N19-1421},
url = {https://aclanthology.org/N19-1421},
urldate = {2023-01-07},
eventtitle = {{{NAACL-HLT}} 2019}
}
```
The Hugging Face link for commonsense_qa is given above.

For RoBERTa and BERT we directly use the Hugging Face implementations. The original papers are:
```latex
@misc{Liu2019,
title = {{{RoBERTa}}: {{A Robustly Optimized BERT Pretraining Approach}}},
shorttitle = {{{RoBERTa}}},
author = {Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
date = {2019-07-26},
number = {arXiv:1907.11692},
eprint = {1907.11692},
eprinttype = {arxiv},
primaryclass = {cs},
publisher = {{arXiv}},
url = {http://arxiv.org/abs/1907.11692},
urldate = {2023-01-07},
archiveprefix = {arXiv},
version = {1}
}

@unpublished{devlinBERTPretrainingDeep2019,
title = {{{BERT}}: {{Pre-training}} of {{Deep Bidirectional Transformers}} for {{Language Understanding}}},
shorttitle = {{{BERT}}},
author = {Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
date = {2019-05-24},
eprint = {1810.04805},
eprinttype = {arxiv},
primaryclass = {cs},
url = {http://arxiv.org/abs/1810.04805},
urldate = {2022-04-12},
archiveprefix = {arXiv}
}
```
46 changes: 46 additions & 0 deletions examples/commonsense_qa/config/basic.yaml
@@ -0,0 +1,46 @@
defaults:
- _self_
- override hydra/hydra_logging: disabled
- override hydra/job_logging: disabled

hydra:
output_subdir: null
run:
dir: .

debug: False
jobname: test
task: "pc" # pg:generation, pc:choice
model:
# for pg: BAAI/glm-roberta-large,t5-large
# for pc: BAAI/glm-roberta-large, bert-large-uncased, roberta-large
name: "BAAI/glm-roberta-large"
max_length: 384 # dataset related
max_gen_length: 128 # dataset related
data:
dataset: "multi_news" # currently support commonsense_qa, multi_news
# dataset: "commonsense_qa" # currently support commonsense_qa, multi_news
tokenizer: ${model.name}
max_length: ${model.max_length}
max_gen_length: ${model.max_gen_length}
prompt_id: 0 # id in promptsource original_task=True name_l
answer_prompt: "Answer:"
optimizer:
lr: 1e-5
beta1: 0.9
beta2: 0.999
wd: 0.01
trainer:
batch: 64 # batch in total
accumulate_steps: 1
epochs: 10
lrscheduler: cosine
warmup_start: 1e-7
warmup_epochs: 1
num_workers: 1 # num_workers in total
pin_memory: True
log_interval: 100
qualitative_num: 5
checkpoint_dir: "./checkpoints" # path for ckps
distributed:
backend: "nccl"
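# Note: every field above can be overridden from the command line via Hydra;
# the override below is illustrative and mirrors the README example:
#   torchrun --nproc_per_node <YOUR_GPU_NUM> main.py data.dataset=commonsense_qa optimizer.lr=1e-5 trainer.batch=32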
218 changes: 218 additions & 0 deletions examples/commonsense_qa/data.py
@@ -0,0 +1,218 @@
from typing import Dict,Tuple,List
from torch import Tensor

import torch
from torch.utils.data import Dataset
from omegaconf import DictConfig

# Support GLM, BART, T5
from transformers import AutoTokenizer
# Support commonsense_qa, multi_news
from datasets import load_dataset

# prompt support
from promptsource.templates import DatasetTemplates

from einops import rearrange

class PCDataCollator:
def __init__(self,datacollator_config: DictConfig):
self.datacollator_config = datacollator_config
self.tokenizer = AutoTokenizer.from_pretrained(self.datacollator_config.tokenizer,trust_remote_code=True)
self.collator = self.build_collator()
def __call__(self, batch:List[Tuple[str,List[str],int]]) -> Dict[str,Tensor]:
return self.collator(batch)
def build_collator(self):
if "glm" in self.datacollator_config.tokenizer:
return self.glm_collator
elif "roberta" in self.datacollator_config.tokenizer:
return self.roberta_collator
elif 'bert' in self.datacollator_config.tokenizer:
return self.roberta_collator
else:
raise NotImplementedError("Not implemented yet")
def roberta_collator(self,batch:List[Tuple[str,List[str],int]]) -> Dict[str,Tensor]:
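        # Flatten every (prompt, choices, label) item into one prompt/choice pair per
        # choice, tokenize the pairs jointly, then reshape back to
        # (batch, num_choices, seq_len) so a multiple-choice head can score each choice.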
prompt_l = []
choice_l = []
choice_ids = []
for item in batch:
for choice in item[1]:
prompt_l.append(item[0])
choice_l.append(choice)
choice_ids.append(item[2])
res = self.tokenizer(prompt_l,choice_l,
return_tensors="pt",
padding=True,truncation=True,max_length=self.datacollator_config.max_length)
for key in res:
res[key] = rearrange(res[key],'(b c) l -> b c l',b=len(batch),c=len(item[1]))

labels = torch.tensor(choice_ids)
res['labels'] = labels
return res

def glm_collator(self,batch:List[Tuple[str,List[str],int]]) -> Dict[str,Tensor]:
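        # Tokenize the prompts only; the GLM tokenizer's build_inputs_for_multiple_choice
        # then expands them with the candidate choices.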
prompts,choices_l,choice_ids = zip(*batch)
prompts = self.tokenizer(prompts,return_tensors="pt",
padding=True,truncation=True,max_length=self.datacollator_config.max_length
)
res = self.tokenizer.build_inputs_for_multiple_choice(prompts,choices_l)
labels = torch.tensor(choice_ids)
res['labels'] = labels
return res

class PCDataset(Dataset):
def __init__(self,dataset_config: DictConfig,split:str):
self.dataset_config = dataset_config
self.dataset = load_dataset(*dataset_config.dataset.split("/"),split=split)
self.prompt_key,self.prompter = self.build_prompter()
self.adapter = self.build_adapter()

def __len__(self):
return len(self.dataset)
def __getitem__(self, index: int) -> Tuple[str,List[str],int]:
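        # Render the raw sample with the promptsource template, recover the gold choice
        # index, append the answer prompt (e.g. "Answer:"), and let the model-specific
        # adapter add any mask token it needs.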
data = self.dataset[index]
prompt,choice = self.prompter.apply(data)
choices_l = self.prompter.get_answer_choices_list(data)
choice_id = choices_l.index(choice)
prompt = prompt + "\n\n" + self.dataset_config.answer_prompt
res = self.adapter(prompt,choices_l,choice_id)
return res
def build_adapter(self):
if "glm" in self.dataset_config.tokenizer:
return self.glm_adapter
elif "roberta" in self.dataset_config.tokenizer:
return self.roberta_adapter
elif 'bert' in self.dataset_config.tokenizer:
return self.roberta_adapter
else:
raise NotImplementedError("Not implemented yet")
def roberta_adapter(self,prompt:str,choices_l:List[str],choice_id:int) -> Tuple[str,List[str],int]:
return prompt,choices_l,choice_id
def glm_adapter(self,prompt:str,choices_l:List[str],choice_id:int) -> Tuple[str,List[str],int]:
prompt += "[MASK]"
return prompt,choices_l,choice_id
def build_prompter(self):
all_prompts = DatasetTemplates(self.dataset_config.dataset)
# filter out those not original_task
prompt_key = [name for name in all_prompts.all_template_names if all_prompts[name].metadata.original_task ]
prompter = all_prompts[prompt_key[self.dataset_config.prompt_id]]
return prompt_key,prompter

class PGDataCollator:
def __init__(self,datacollator_config: DictConfig,split:str):
self.datacollator_config = datacollator_config
self.split = split
self.tokenizer = AutoTokenizer.from_pretrained(self.datacollator_config.tokenizer,trust_remote_code=True)
self.collator = self.build_collator()
def build_collator(self):
if "glm" in self.datacollator_config.tokenizer:
if self.split == "train":
return self.glm_train_collator
else:
return self.glm_test_collator
elif "t5" in self.datacollator_config.tokenizer:
if self.split == "train":
return self.t5_train_collator
else:
return self.t5_test_collator
else:
raise NotImplementedError("Not implemented yet")
def t5_train_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
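        # Truncate from the left so the tail of the prompt (which ends with the answer
        # cue) is kept; targets are tokenized separately and used as labels.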
prompts,answers = [list(item) for item in zip(*batch)]
self.tokenizer.truncation_side = 'left'
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res['labels'] = self.tokenizer(answers,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")['input_ids']
return res
def t5_test_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
prompts,answers = [list(item) for item in zip(*batch)]
self.tokenizer.truncation_side = 'left'
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res['labels'] = [answer[len("<extra_id_0> "):] for answer in answers] # rm the prepended <extra_id_0>
res['prompts'] = prompts
return res
def glm_train_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
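        # GLM packs prompt and target together via build_inputs_for_generation so the
        # model learns to fill the generation span during training.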
prompts,answers = [list(item) for item in zip(*batch)]
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res = self.tokenizer.build_inputs_for_generation(res,targets=answers,max_gen_length=self.datacollator_config.max_gen_length)
return res
def glm_test_collator(self,batch: List[Tuple[str,str]]) -> Dict[str,Tensor]:
prompts,answers = [list(item) for item in zip(*batch)]
res = self.tokenizer(prompts,padding=True,truncation=True,max_length=self.datacollator_config.max_length,return_tensors="pt")
res = self.tokenizer.build_inputs_for_generation(res,max_gen_length=self.datacollator_config.max_gen_length)
res['labels'] = answers
res['prompts'] = prompts
return res

def __call__(self, batch: List[Tuple[str,str]]) -> Dict[str, Tensor]:
return self.collator(batch)

class PGDataset(Dataset):
def __init__(self,dataset_config:DictConfig,split:str):
"""
split = "train" or "validation" or "test"
"""
self.dataset_config = dataset_config
self.max_length = dataset_config.max_length
self.max_gen_length = dataset_config.max_gen_length

self.dataset = load_dataset(*dataset_config.dataset.split("/"),split=split)
self.prompt_key,self.prompter = self.build_prompter()
self.answer_prompt = dataset_config.answer_prompt
self.adapter = self.build_adapter()


def build_adapter(self):
adapter_name = self.dataset_config.tokenizer
if "glm" in adapter_name:
adapter = self.glm_adapter
elif "t5" in adapter_name:
adapter = self.t5_adapter
elif "bart" in adapter_name:
adapter = self.bart_adapter
else:
raise NotImplementedError(f"Adapter {adapter_name} is not supported")
return adapter

def glm_adapter(self,prompted_data:Tuple[str,str])->Tuple[str,str]:
prompt,answer = prompted_data
# add mask token
prompt += "[MASK]"
res = prompt,answer
return res

def t5_adapter(self,prompted_data):
prompt,answer = prompted_data
# add sentinel token for prompt and answer
prompt = f'{prompt} <extra_id_0>'
answer = f'<extra_id_0> {answer}'
return prompt,answer

def build_prompter(self):
all_prompts = DatasetTemplates(self.dataset_config.dataset)
# filter out those not original_task
prompt_key = [name for name in all_prompts.all_template_names if all_prompts[name].metadata.original_task ]
prompter = all_prompts[prompt_key[self.dataset_config.prompt_id]]
return prompt_key,prompter

def __len__(self)->int:
return len(self.dataset)
def __getitem__(self, index:int)->Tuple[str,str]:
        # Format the data using the prompt template, add a mask/sentinel token based on
        # the model, then hand the strings to the collator's tokenizer (padding and
        # truncation use max_length there).
data = self.dataset[index]
prompted_data = self.prompter.apply(data)
prompted_data[0] = prompted_data[0] + "\n\n" + self.answer_prompt
res = self.adapter(prompted_data)
return res



if __name__ == "__main__":
dataset = load_dataset("commonsense_qa")
print(dataset.keys())
dataset = load_dataset("multi_news")
print(dataset.keys())

# multi_news_prompts = DatasetTemplates("multi_news")
# print(multi_news_prompts.all_template_names)

13 changes: 13 additions & 0 deletions examples/commonsense_qa/env_setup.sh
@@ -0,0 +1,13 @@
# Supports CUDA and ROCm environments.
# For CUDA we tested on V100 and A100.
# For ROCm we tested on MI100, using an image built with pytorch-1.11.0-rocm5.1.3.
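# Usage: bash env_setup.sh [cuda|rocm]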

GPU_ENV="$1"
if [ "$GPU_ENV" = "cuda" ] || [ "$GPU_ENV" = "rocm" ]; then
echo "Installing $GPU_ENV environment"
pip install -r "requirements_torch_${GPU_ENV}.txt"
pip install -r requirements.txt
pip install numpy tqdm -U
else
echo "Unsupported environment $GPU_ENV"
fi