FastNLP Tutorial
            loader                    preprocessor        Batch
raw dataset ------> 2-D list of strings -------> DataSet -------> data_iterator ------> batch_x, batch_y
pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
train_data = pos_loader.load_lines()
"""
[
[["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
...
]
"""
p = SeqLabelPreprocess()
data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
# type(data_train) == DataSet
# type(data_dev) == DataSet
DataSet
[
Instance(Field_1, Field_2, Field_3, ...),
Instance(Field_1, Field_2, Field_3, ...),
...
]
data_iterator = Batch(data_train, batch_size=16, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    x = batch_x["word_seq"]
    y = network(x)
    get_loss(y, batch_y["label_seq"])
from fastNLP.fastnlp import FastNLP
PATH_TO_CWS_PICKLE_FILES = "/home/zyfeng/fastNLP/reproduction/chinese_word_segment/save/"
nlp = FastNLP(model_dir=PATH_TO_CWS_PICKLE_FILES)
nlp.load("cws_basic_model", config_file="cws.cfg", section_name="POS_test")
text = ["这是最好的基于深度学习的中文分词系统。",
"大王叫我来巡山。",
"我党多年来致力于改善人民生活水平。"]
results = nlp.run(text)
# [[('这', 'S'), ('是', 'S'), ('最', 'S'), ('好', 'S'), ('的', 'S'), ('基', 'B'), ('于', 'E'), ('深', 'B'), ('度', 'E'), ('学', 'B'), ('习', 'E'), ('的', 'S'), ('中', 'B'), ('文', 'E'), ('分', 'B'), ('词', 'E'), ('系', 'B'), ('统', 'E'), ('。', 'S')], [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'), ('巡', 'B'), ('山', 'E'), ('。', 'S')], [('我', 'B'), ('党', 'E'), ('多', 'S'), ('年', 'S'), ('来', 'S'), ('致', 'B'), ('力', 'E'), ('于', 'S'), ('改', 'B'), ('善', 'E'), ('人', 'B'), ('民', 'E'), ('生', 'B'), ('活', 'E'), ('水', 'B'), ('平', 'E'), ('。', 'S')]]
import os


def train_and_test():
    # pickle_path, model_name and config_dir are assumed to be defined at module level
    # Load config section from config file
    trainer_args = ConfigSection()
    model_args = ConfigSection()
    ConfigLoader().load_config("./data/config", {
        "test_seq_label_trainer": trainer_args, "test_seq_label_model": model_args})

    # Load data with data loader
    pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
    train_data = pos_loader.load_lines()

    # Preprocessor: 2-D list of strings ----> DataSet
    preprocess = SeqLabelPreprocess()
    data_train, data_dev = preprocess.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
    model_args["vocab_size"] = preprocess.vocab_size
    model_args["num_classes"] = preprocess.num_classes

    # Define trainer
    trainer = Trainer(
        epochs=trainer_args["epochs"],
        batch_size=trainer_args["batch_size"],
        validate=trainer_args["validate"],
        use_cuda=trainer_args["use_cuda"],
        pickle_path=pickle_path,
        save_best_dev=trainer_args["save_best_dev"],
        model_name=model_name,
        optimizer=Optimizer("SGD", lr=0.01, momentum=0.9),
    )

    # Define a model
    model = SeqLabeling(model_args)

    # Start training
    trainer.train(model, data_train, data_dev)
    print("Training finished!")

    # Define Saver and save a model
    saver = ModelSaver(os.path.join(pickle_path, model_name))
    saver.save_pytorch(model)
    print("Model saved!")

    del model, trainer, pos_loader

    # Define the same model
    model = SeqLabeling(model_args)

    # Load trained weights into the model
    ModelLoader.load_pytorch(model, os.path.join(pickle_path, model_name))
    print("model loaded!")

    # Load test configuration
    tester_args = ConfigSection()
    ConfigLoader("config.cfg").load_config(config_dir, {"test_seq_label_tester": tester_args})

    # Define a tester
    tester = Tester(save_output=False,
                    save_loss=False,
                    save_best_dev=False,
                    batch_size=4,
                    use_cuda=False,
                    pickle_path=pickle_path,
                    model_name="seq_label_in_test.pkl",
                    print_every_step=1
                    )

    # Start testing
    tester.test(model, data_dev)
    print(tester.show_metrics())
dataset.py defines DataSet, which is a list of Instances.
instance.py defines Instance, which is a single example and contains multiple Fields.
field.py defines Field, which is the elementary data type or representation. TextField defines a list of strings. LabelField defines a single integer or string. You can add extra fields to support more complex data.
Each field
- has a field name
- has an is_target boolean argument that specifies whether the field is a target (Y) or an input (X) in training
- has a to_tensor method that defines how the field's data is transformed into tensors
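As a rough illustration of how Field, Instance, and DataSet fit together, here is a minimal standalone sketch. It is not the fastNLP implementation: the constructor signatures, the vocab arguments, and Instance.to_tensor are assumptions made for this example, and plain Python lists stand in for tensors.

```python
class Field:
    """Elementary data unit; is_target marks it as Y (True) or X (False)."""

    def __init__(self, is_target=False):
        self.is_target = is_target

    def to_tensor(self):
        raise NotImplementedError


class TextField(Field):
    """A list of strings. The vocab argument is an assumption of this sketch."""

    def __init__(self, words, vocab, is_target=False):
        super().__init__(is_target)
        self.words = list(words)
        self.vocab = vocab

    def to_tensor(self):
        # A real implementation would return a tensor; a list of indices stands in here.
        return [self.vocab[w] for w in self.words]


class LabelField(Field):
    """A single integer or string label."""

    def __init__(self, label, vocab=None, is_target=True):
        super().__init__(is_target)
        self.label = label
        self.vocab = vocab

    def to_tensor(self):
        # Integer labels are used as-is; string labels are looked up in the vocab.
        if isinstance(self.label, int):
            return self.label
        return self.vocab[self.label]


class Instance:
    """A single example made of named fields."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def to_tensor(self):
        # Split fields into inputs (x) and targets (y) according to is_target.
        x, y = {}, {}
        for name, fld in self.fields.items():
            (y if fld.is_target else x)[name] = fld.to_tensor()
        return x, y


class DataSet(list):
    """A list of Instance objects."""


word_vocab = {"This": 0, "is": 1, "fast": 2, "NLP": 3}
label_vocab = {"positive": 1, "negative": 0}
data_set = DataSet([
    Instance(word_seq=TextField(["This", "is", "fast", "NLP"], word_vocab),
             label=LabelField("positive", label_vocab)),
])
x, y = data_set[0].to_tensor()
```

Keeping the X/Y split inside each field means the code that builds batches does not need to know anything about the task.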
dataset.py also defines a function that makes a DataSet from a list.
def create_dataset_from_lists(str_lists: list, word_vocab: dict, has_target: bool = False, label_vocab: dict = None) -> DataSet:
Examples:
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_tester.py#L15
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_trainer.py#L14
batch.py defines Batch, an iterable wrapper of DataSet.
Sampling and padding are applied inside it.
Iterating over a Batch object yields two dicts, batch_x and batch_y.
The keys of each dict are field names. The values are the corresponding batch tensors.
data_iterator = Batch(data_set, batch_size=8, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    batch_x["word_seq"]  # torch.LongTensor
    batch_y["label"]  # torch.LongTensor
Batch keeps a record of the original length of each field before padding.
It returns the original lengths under a string key created by appending "_origin_len" to the field name, for example batch_x["word_seq_origin_len"] (a torch.LongTensor).
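The behaviour described here (sampling, padding, and recording lengths under an "_origin_len" key) can be sketched in plain Python. This is only an illustration under assumptions, not the code in batch.py: iterate_batches, _collate, the pad_value and seed parameters, and the use of lists instead of torch.LongTensor are all invented for the example.

```python
import random


def _collate(dicts, pad_value):
    """Merge per-example dicts into one batch dict, padding sequence fields."""
    batch = {}
    for name in dicts[0]:
        values = [d[name] for d in dicts]
        if isinstance(values[0], list):
            # Sequence field: remember the lengths, then pad to the longest one.
            max_len = max(len(v) for v in values)
            batch[name + "_origin_len"] = [len(v) for v in values]
            batch[name] = [v + [pad_value] * (max_len - len(v)) for v in values]
        else:
            # Scalar field (for example a single label): no padding needed.
            batch[name] = values
    return batch


def iterate_batches(examples, batch_size, shuffle=True, pad_value=0, seed=None):
    """Yield (batch_x, batch_y) dicts from a list of (x_dict, y_dict) examples."""
    order = list(range(len(examples)))
    if shuffle:
        # Random sampling: visit the examples in a shuffled order.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = [examples[i] for i in order[start:start + batch_size]]
        batch_x = _collate([x for x, _ in chunk], pad_value)
        batch_y = _collate([y for _, y in chunk], pad_value)
        yield batch_x, batch_y


examples = [
    ({"word_seq": [1, 2, 3]}, {"label": 0}),
    ({"word_seq": [4]}, {"label": 1}),
]
batches = list(iterate_batches(examples, batch_size=2, shuffle=False))
batch_x, batch_y = batches[0]
```

With shuffle=False the two examples land in one batch, the shorter word_seq is padded with zeros, and batch_x carries the pre-padding lengths under "word_seq_origin_len".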
Why is the original length a tensor rather than a list of ints?
Because forward() of the sequence labeling model takes an extra argument, seq_len, that represents the original lengths (the creation of sequence masks was moved into the model, which therefore needs seq_len). In addition, tensorboardX requires every argument passed to forward() to be a tensor.
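To make the role of seq_len concrete, the sketch below shows one way a sequence mask could be derived from the original lengths. The function name make_mask and its max_len parameter are hypothetical, and nested lists stand in for tensors; the model's actual mask code may differ.

```python
def make_mask(seq_len, max_len=None):
    """Return a 0/1 mask with one row per sequence: 1 for real tokens, 0 for padding."""
    if max_len is None:
        # Default to the longest sequence in the batch (0 for an empty batch).
        max_len = max(seq_len, default=0)
    return [[1 if pos < length else 0 for pos in range(max_len)]
            for length in seq_len]


mask = make_mask([3, 1])
```

Positions at or beyond a sequence's original length are padding, so a model can use such a mask to ignore them when computing the loss or decoding.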
In the previous design, different trainers were responsible for different tasks.
After the introduction of Fields and DataSet, different tasks are represented by different DataSet structures, that is, by the way their Fields are organized.
Therefore, all methods in SeqLabelTrainer and ClassificationTrainer have been removed. These classes remain only as empty, deprecated subclasses that issue a warning when used.
The same applies to the task-specific Testers and Predictors.
However, trainers still need task information to know which fields in batch_x are network inputs. This is because
- we don't know which task will be performed when preprocessing the data and making the DataSet.
- not all fields in batch_x are needed as network input. Some may be unused, such as seq_len in text classification.
- in the tester, different tasks require different evaluation methods and metrics.
- in the predictor, different tasks require different pre-processing and post-processing.
Trainer and Tester have a required argument, self._task, that specifies which task to perform. It has NO default value, and an error is raised if it is not provided.
if self._task == "seq_label":
    y = network(x["word_seq"], x["word_seq_origin_len"])
elif self._task == "text_classify":
    y = network(x["word_seq"])
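The same dispatch could also be written as a lookup table from task name to input field names. The sketch below is only one possible arrangement: TASK_INPUT_FIELDS and select_inputs are hypothetical names and are not part of fastNLP.

```python
# Hypothetical mapping from task name to the batch_x fields the network consumes.
TASK_INPUT_FIELDS = {
    "seq_label": ("word_seq", "word_seq_origin_len"),
    "text_classify": ("word_seq",),
}


def select_inputs(task, batch_x):
    """Pick the network inputs for a task from batch_x, in argument order."""
    if task not in TASK_INPUT_FIELDS:
        # Mirrors the "required, no default" behaviour: unknown tasks are an error.
        raise ValueError("unknown task: {!r}".format(task))
    return [batch_x[name] for name in TASK_INPUT_FIELDS[task]]


batch_x = {"word_seq": [[1, 2, 0]], "word_seq_origin_len": [2]}
seq_label_inputs = select_inputs("seq_label", batch_x)
classify_inputs = select_inputs("text_classify", batch_x)
```

A trainer could then call y = network(*select_inputs(task, batch_x)), so supporting a new task would mean adding one table entry rather than another elif branch.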
- design a PyTorch model with a forward method.
- choose fields or create new fields to describe your data set.
- modify the preprocessor's build_dict method to build a dictionary over your data, and use the dictionary to transform the multi-level list of strings into a multi-level list of indices. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L182
- modify the preprocessor's convert_to_dataset method to transform the multi-level list of indices into a DataSet object. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L244
- specify which fields you want to use as network inputs in Trainer, Tester, and Predictor. Modifications are needed wherever self._task appears.
- run and debug.
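The dictionary-building and index-conversion steps can be sketched as below for sequence labeling data in the 2-D list format shown earlier. This is a simplified stand-in, not the code in preprocess.py: the function signatures and the "<pad>" and "<unk>" tokens with ids 0 and 1 are assumptions of this example.

```python
def build_dict(data, pad="<pad>", unk="<unk>"):
    """Build word and label dictionaries from [[words, labels], ...]."""
    word2index = {pad: 0, unk: 1}  # reserved ids are an assumption of this sketch
    label2index = {}
    for words, labels in data:
        for word in words:
            # Assign the next free id the first time a word is seen.
            word2index.setdefault(word, len(word2index))
        for label in labels:
            label2index.setdefault(label, len(label2index))
    return word2index, label2index


def to_index(data, word2index, label2index, unk="<unk>"):
    """Turn the multi-level list of strings into a multi-level list of indices."""
    unk_id = word2index[unk]
    return [
        ([word2index.get(word, unk_id) for word in words],
         [label2index[label] for label in labels])
        for words, labels in data
    ]


train_data = [
    [["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
]
word2index, label2index = build_dict(train_data)
indexed = to_index(train_data, word2index, label2index)
```

The indexed lists are what a convert_to_dataset step would then wrap into Instances and a DataSet. Words not seen during dictionary building fall back to the "<unk>" id.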
- optimize Preprocessor: make it a callable object that takes a customized processing function as an argument
- more unit tests on core/
- eliminate self._task?
- merge kezhen's code about build_dict
Any questions are welcome!