How to preprocess the SQuAD dataset for various NLP tasks.
SquadGuru is an NLP expert that converts the original SQuAD dataset into the input format required by a given NLP task. Currently, the input formats for the GEN_QA, GEN_QG, EXT_QA and CORPUS tasks are available. Use the SquadGuru class to extract the end-to-end feature X and the ground-truth Y from the SQuAD dataset.
- Constructor Signature
      SquadGuru(
          parser: SquadParser,       # parser that implements SquadParser
          tokenizer=None,            # tokenizer that implements .tokenize(text: str)
          tags=SQUAD_TAGS,           # iterable of str
          versions=SQUAD_VERSIONS    # iterable of float
      )
  - Inject a parser, which the guru will use to extract X and Y data from the original SQuAD dataset.
  - Inject a tokenizer that will create tokenized X and Y. If None, no tokenization is performed.
  - Inject a str iterable of tags that describes the tags of the dataset to load.
  - Inject a float iterable of versions that describes the versions of the dataset to load.
- .gather(only_first_answer=False, verbose=False)
  - SquadGuru gathers feature X and Y from the dataset.
  - Set only_first_answer to extract only the first answer in each question-answers set.
  - Set verbose to print some logs.
- .to_dataframe()
  - Returns a pandas.DataFrame object.
  - X is mapped into the 'Input' series.
  - Y is mapped into the 'Target' series.
- .to_numpy()
  - Returns a numpy array shaped (N, 2), where N is the number of data items.
  - Column 0 is X, column 1 is Y.
- .to_file(x_outfile, y_outfile)
  - Writes text files saving X and Y.
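
To make the effect of only_first_answer and the separate X/Y output files more concrete, here is a minimal standalone sketch in plain Python. It does not use SquadGuru itself: `gather_pairs` and `write_pairs` are hypothetical helpers, and both the pairing (question as X, answer text as Y) and the one-example-per-line file layout are assumptions, not the library's documented behavior. Only the nested SQuAD JSON layout (data, paragraphs, qas, answers) comes from the dataset itself.

```python
import os
import tempfile

# A tiny in-memory sample that follows the SQuAD JSON layout.
SAMPLE = {
    "version": "1.1",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "SQuAD was released by Stanford in 2016.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Who released SQuAD?",
                            "answers": [
                                {"text": "Stanford", "answer_start": 22},
                                {"text": "by Stanford", "answer_start": 19},
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def gather_pairs(dataset, only_first_answer=False):
    """Hypothetical helper: collect (X, Y) pairs with X = question, Y = answer text."""
    pairs = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa["answers"]
                if only_first_answer:
                    # Keep at most one answer per question.
                    answers = answers[:1]
                for answer in answers:
                    pairs.append((qa["question"], answer["text"]))
    return pairs


def write_pairs(pairs, x_outfile, y_outfile):
    """Hypothetical helper: write X and Y to parallel files, one example per line (assumed layout)."""
    with open(x_outfile, "w", encoding="utf-8") as fx, \
            open(y_outfile, "w", encoding="utf-8") as fy:
        for x, y in pairs:
            fx.write(x + "\n")
            fy.write(y + "\n")


all_pairs = gather_pairs(SAMPLE)
first_only = gather_pairs(SAMPLE, only_first_answer=True)

out_dir = tempfile.mkdtemp()
x_path = os.path.join(out_dir, "x.txt")
y_path = os.path.join(out_dir, "y.txt")
write_pairs(first_only, x_path, y_path)
```

With only_first_answer left at False, the sample question contributes one pair per listed answer; with it set to True, only the first answer is kept.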
SquadParser implements methods to parse the original SQuAD dataset into the task-specific X, Y format.
- .from_nlp_task(task: str)
  - Currently available tasks are "GEN_QA", "GEN_QG", "EXT_QA" and "CORPUS".
- .parse(context: str, question: str, answers: str iterable)
  - Parses the given question-answers pair in the given context (paragraph) from the SQuAD dataset.
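
As an illustration of the parse interface, the sketch below defines a hypothetical parser with the same method signature. The class name `QuestionToAnswerParser`, the returned list of (X, Y) tuples, and the choice of joining question and context with a `[SEP]` string are all assumptions made for this example. The actual SquadParser implementations may format X and Y differently for each task.

```python
from typing import Iterable, List, Tuple


class QuestionToAnswerParser:
    """Hypothetical parser for illustration; not one of the library's own classes."""

    def parse(
        self, context: str, question: str, answers: Iterable[str]
    ) -> List[Tuple[str, str]]:
        # Assumed format: X pairs the question with its paragraph,
        # Y is a single answer text. One (X, Y) pair per answer.
        x = f"{question} [SEP] {context}"
        return [(x, answer) for answer in answers]


parser = QuestionToAnswerParser()
pairs = parser.parse(
    context="SQuAD was released by Stanford in 2016.",
    question="Who released SQuAD?",
    answers=["Stanford"],
)
```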
Any tokenizer that implements .tokenize(text: str). In examples, transformers/BertTokenizer is used.
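
Since the only stated requirement is a .tokenize(text: str) method, any object exposing that method should fit. The sketch below shows a hypothetical `WhitespaceTokenizer` built with the standard library only. The class name and the lowercasing step are assumptions for illustration, and returning a list of strings mirrors what BertTokenizer.tokenize returns.

```python
from typing import List


class WhitespaceTokenizer:
    """Hypothetical minimal tokenizer exposing .tokenize(text: str)."""

    def tokenize(self, text: str) -> List[str]:
        # Lowercase, then split on any run of whitespace.
        return text.lower().split()


tokens = WhitespaceTokenizer().tokenize("Who released SQuAD?")
```

An instance like this could presumably be passed as the tokenizer argument in place of a transformers tokenizer, for example when a quick test run should avoid downloading a vocabulary.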