FastNLP Tutorial
This page is out of date!!!
The FastNLP trainer logs the following values during training:
- loss
- mean and standard deviation of every tensor parameter of the model
- sum of the gradient of every tensor parameter of the model
To run TensorBoard on a remote machine and view the results locally, see this answer.
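To make the logged quantities concrete, here is a minimal pure-Python sketch that computes them for parameters given as flat lists of floats. The function name `summarize_parameters`, the dictionary layout, and the use of the population standard deviation are all assumptions for illustration; the trainer itself works on model tensors and may use a different estimator.

```python
import math


def summarize_parameters(params, grads):
    """Return mean, std and gradient sum for every named parameter.

    `params` and `grads` map a parameter name to a flat list of floats.
    """
    summary = {}
    for name, values in params.items():
        mean = sum(values) / len(values)
        # Population standard deviation; the choice of estimator is an assumption.
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        summary[name] = {
            "mean": mean,
            "std": std,
            "grad_sum": sum(grads[name]),
        }
    return summary


stats = summarize_parameters(
    {"linear.weight": [1.0, 2.0, 3.0, 4.0]},
    {"linear.weight": [0.5, -0.25, 0.25, 0.0]},
)
```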
from fastNLP.fastnlp import FastNLP
PATH_TO_CWS_PICKLE_FILES = "/home/zyfeng/fastNLP/reproduction/chinese_word_segment/save/"
nlp = FastNLP(model_dir=PATH_TO_CWS_PICKLE_FILES)
nlp.load("cws_basic_model", config_file="cws.cfg", section_name="POS_test")
text = ["这是最好的基于深度学习的中文分词系统。",
"大王叫我来巡山。",
"我党多年来致力于改善人民生活水平。"]
results = nlp.run(text)
# [[('这', 'S'), ('是', 'S'), ('最', 'S'), ('好', 'S'), ('的', 'S'), ('基', 'B'), ('于', 'E'), ('深', 'B'), ('度', 'E'), ('学', 'B'), ('习', 'E'), ('的', 'S'), ('中', 'B'), ('文', 'E'), ('分', 'B'), ('词', 'E'), ('系', 'B'), ('统', 'E'), ('。', 'S')], [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'), ('巡', 'B'), ('山', 'E'), ('。', 'S')], [('我', 'B'), ('党', 'E'), ('多', 'S'), ('年', 'S'), ('来', 'S'), ('致', 'B'), ('力', 'E'), ('于', 'S'), ('改', 'B'), ('善', 'E'), ('人', 'B'), ('民', 'E'), ('生', 'B'), ('活', 'E'), ('水', 'B'), ('平', 'E'), ('。', 'S')]]
import os

# config_dir, data_path, pickle_path and model_name are expected to be defined by the caller.
def train():
    # Load configuration with a ConfigLoader
    trainer_args = ConfigSection()
    model_args = ConfigSection()
    ConfigLoader("_").load_config(config_dir, {
        "test_seq_label_trainer": trainer_args, "test_seq_label_model": model_args})

    # Load data with a DataLoader
    pos_loader = POSDatasetLoader(data_path)
    train_data = pos_loader.load_lines()

    # Pre-processing: generate DataSet objects
    p = Preprocess()
    data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
    model_args["vocab_size"] = p.vocab_size
    model_args["num_classes"] = p.num_classes

    # Define a trainer
    trainer = Trainer(
        task="seq_label",
        epochs=trainer_args["epochs"],
        batch_size=trainer_args["batch_size"],
        validate=False,
        use_cuda=False,
        pickle_path=pickle_path,
        save_best_dev=trainer_args["save_best_dev"],
        model_name=model_name,
        optimizer=Optimizer("SGD", lr=0.01, momentum=0.9),
    )

    # Define a model
    model = SeqLabeling(model_args)

    # Start training
    trainer.train(model, data_train, data_dev)

    # Save model with a ModelSaver
    saver = ModelSaver(os.path.join(pickle_path, model_name))
    saver.save_pytorch(model)
Before you start a new task, make sure you have the corresponding datasets in hand. Implement a dataset loader as a subclass of DatasetLoader in dataset_loader.py.
Your dataset loader is responsible for transforming raw data into three-level Python lists. For example,
[
[[token_1, token_2, token_3, ...], [label_1, label_2, label_3, ...]],
...
]
The first dimension of your Python lists must be the number of examples. You are free to design the remaining dimensions, because you are responsible for parsing them in the next section.
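As an illustration of the three-level structure, the sketch below parses a hypothetical file layout (one `token<TAB>label` pair per line, with a blank line between examples) into `[[tokens, labels], ...]`. The function name `load_token_label_lines` and the file layout are assumptions; a real loader would be a subclass of DatasetLoader and would read whatever format your data uses.

```python
def load_token_label_lines(lines):
    """Parse 'token<TAB>label' lines into [[tokens, labels], ...].

    A blank line ends the current example. This file layout is an assumption.
    """
    examples, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:
                examples.append([tokens, labels])
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last example when the input has no trailing blank line
        examples.append([tokens, labels])
    return examples


raw = ["This\tDT", "is\tVBZ", "", "fast\tJJ", "NLP\tNNP"]
data = load_token_label_lines(raw)
```

Here `len(data)` is the number of examples, which matches the requirement on the first dimension.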
The preprocessor transforms the three-level lists mentioned above into DataSet object(s) by constructing Fields in the convert_to_dataset method. Currently, different structures of the three-level lists lead to different field constructions. You are free to implement your own construction method there.
for example in data:
    words, label = example[0], example[1]
    instance = Instance()

    if isinstance(words, list):
        x = TextField(words, is_target=False)
        instance.add_field("word_seq", x)
        use_word_seq = True
    else:
        raise NotImplementedError("words is a {}".format(type(words)))

    if isinstance(label, list):
        y = TextField(label, is_target=True)
        instance.add_field("label_seq", y)
        use_label_seq = True
    elif isinstance(label, str):
        y = LabelField(label, is_target=True)
        instance.add_field("label", y)
        use_label_str = True
    else:
        raise NotImplementedError("label is a {}".format(type(label)))
    data_set.append(instance)
FastNLP uses a config file to store 1) model hyper-parameters; 2) trainer settings.
The config file is a text file. It contains any number of config sections. A section contains any number of configurations. A configuration is a key-value pair linked by =.
For example,
# test.cfg
[model]
vocab_size = 100
num_hidden_layers = 2
use_drop_out = false
pickle_path = "./save/"
[train]
learning_rate = 0.0001
pickle_path = "./save/"
validate = true
save_dev_output = false
Load config sections with ConfigLoader.
trainer_args = ConfigSection()
model_args = ConfigSection()
ConfigLoader().load_config("./test.cfg", {"train": trainer_args, "model": model_args})
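The section and key-value layout is close to the INI format, so the same example can be explored with Python's standard `configparser` module. This is only a stand-in to show how sections and typed values come out, not FastNLP's ConfigLoader; the names `CFG_TEXT`, `model_cfg` and `train_cfg` are made up for the sketch.

```python
import configparser

CFG_TEXT = """
# test.cfg
[model]
vocab_size = 100
num_hidden_layers = 2
use_drop_out = false
pickle_path = "./save/"

[train]
learning_rate = 0.0001
pickle_path = "./save/"
validate = true
save_dev_output = false
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

model_cfg = {
    "vocab_size": parser.getint("model", "vocab_size"),
    "use_drop_out": parser.getboolean("model", "use_drop_out"),
    # configparser keeps the surrounding quotes, so strip them by hand.
    "pickle_path": parser.get("model", "pickle_path").strip('"'),
}
train_cfg = {
    "learning_rate": parser.getfloat("train", "learning_rate"),
    "validate": parser.getboolean("train", "validate"),
}
```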
Currently, the trainer supports only a few tasks. You can add more in data_forward. The same applies to the tester.
def data_forward(self, network, x):
    if self._task == "seq_label":
        y = network(x["word_seq"], x["word_seq_origin_len"])
    elif self._task == "text_classify":
        y = network(x["word_seq"])
    else:
        raise NotImplementedError("Unknown task type {}.".format(self._task))
trainer = Trainer(trainer_args)
model = SeqLabeling(model_args)
trainer.train(model, data_train, data_dev)
            loader                    preprocessor        Batch
raw dataset ------> 2-D list of strings -------> DataSet -------> data_iterator ------> batch_x
                                                                                        batch_y
pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
train_data = pos_loader.load_lines()
"""
[
[["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
...
]
"""
p = SeqLabelPreprocess()
data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
# type(data_train) == DataSet
# type(data_dev) == DataSet
DataSet
[
Instance(Field_1, Field_2, Field_3, ...),
Instance(Field_1, Field_2, Field_3, ...),
...
]
data_iterator = Batch(data_train, batch_size=16, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    x = batch_x["word_seq"]
    y = network(x)
    get_loss(y, batch_y["label_seq"])
dataset.py defines DataSet, which is a list of Instances.
instance.py defines Instance, which is a single example and contains multiple Fields.
field.py defines Field, which is the elementary data type or representation.
TextField defines a list of strings. LabelField defines a single integer or string.
You can add extra fields to support more complex data.
Each field
- has a field name
- has an is_target boolean argument to specify whether it is a target (Y) or an input (X) in training
- has a to_tensor method to define how this field data is transformed into tensors
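The three properties can be sketched with two small stand-in classes. `IndexSeqField` and `SketchInstance` are hypothetical names, the `padding_length` parameter of `to_tensor` is an assumption, and `to_tensor` returns a plain list instead of a tensor so that the sketch has no dependencies.

```python
class IndexSeqField:
    """Hypothetical field holding a list of integer indices."""

    def __init__(self, indices, is_target=False):
        self.indices = list(indices)
        self.is_target = is_target

    def to_tensor(self, padding_length):
        # Right-pad with zeros; a real field would return a tensor here.
        pads = [0] * (padding_length - len(self.indices))
        return self.indices + pads


class SketchInstance:
    """Hypothetical container mapping field names to fields."""

    def __init__(self):
        self.fields = {}

    def add_field(self, name, field):
        self.fields[name] = field


instance = SketchInstance()
instance.add_field("word_seq", IndexSeqField([4, 7, 9], is_target=False))
instance.add_field("label_seq", IndexSeqField([1, 0, 2], is_target=True))

# Split the fields into network inputs (X) and targets (Y) by their is_target flag.
inputs = {n: f.to_tensor(5) for n, f in instance.fields.items() if not f.is_target}
targets = {n: f.to_tensor(5) for n, f in instance.fields.items() if f.is_target}
```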
dataset.py defines a function to make a DataSet from a list.
def create_dataset_from_lists(str_lists: list, word_vocab: dict, has_target: bool = False, label_vocab: dict = None) -> DataSet:
Examples:
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_tester.py#L15
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_trainer.py#L14
batch.py defines Batch, an iterable wrapper of DataSet.
Sampling and padding are applied inside it.
Iterating over a Batch object yields two dicts, batch_x and batch_y.
The keys of each dict are field names. The values are the corresponding batch tensors.
data_iterator = Batch(data_set, batch_size=8, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    batch_x["word_seq"]  # torch.LongTensor
    batch_y["label"]  # torch.LongTensor
Batch keeps a record of the original length of each field before padding.
It returns the original lengths under a string key created by appending "_origin_len" to the field name.
For example, batch_x["word_seq_origin_len"] # torch.LongTensor.
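The padding and length bookkeeping can be sketched in plain Python as follows. The helper name `pad_batch` and the padding value 0 are assumptions, and nested lists stand in for the LongTensors that Batch produces.

```python
def pad_batch(sequences, field_name, pad_value=0):
    """Pad index sequences to the longest one and record the original lengths."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return {
        field_name: padded,
        # Same key convention as described above: field name + "_origin_len".
        field_name + "_origin_len": [len(seq) for seq in sequences],
    }


batch_x = pad_batch([[3, 5, 2], [8], [6, 1]], "word_seq")
```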
Why is the original length a tensor rather than a list of ints?
Because the sequence labeling model's forward() takes an extra argument, seq_len, to represent the original lengths (the creation of sequence masks has moved into the model, which needs seq_len). In addition, tensorboardX requires every argument passed to forward() to be a tensor.
In the previous design, different trainers were responsible for different tasks.
After the introduction of Fields and DataSet, different tasks are represented by different DataSet structures, that is, by the way Fields are organized.
Therefore, all methods in SeqLabelTrainer and ClassificationTrainer have been removed. They remain only as empty, deprecated subclasses and emit a warning when used.
The same goes for the Testers and the Predictor.
However, trainers still need task information to know which fields in batch_x are network inputs. This is because
- we don't know which task will be performed when preprocessing and making the DataSet.
- not all fields in batch_x are needed as network input. Some may be unused, such as seq_len in text classification.
- in the tester, different tasks require different evaluation methods and metrics.
- in the predictor, different tasks require different pre-processing and post-processing.
Trainer and Tester have a required argument, stored as self._task, that specifies which task to perform. It has no default value, and an error is raised if it is not provided.
if self._task == "seq_label":
    y = network(x["word_seq"], x["word_seq_origin_len"])
elif self._task == "text_classify":
    y = network(x["word_seq"])
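One possible way to keep this branching manageable as tasks are added is a lookup table from task name to the batch_x keys the network expects. This is only a sketch of that idea, not how the trainer is implemented; `TASK_INPUT_FIELDS` and `echo_network` are hypothetical names, and `data_forward` is written as a free function rather than a method.

```python
# Hypothetical table: task name -> batch_x keys passed to the network, in order.
TASK_INPUT_FIELDS = {
    "seq_label": ["word_seq", "word_seq_origin_len"],
    "text_classify": ["word_seq"],
}


def data_forward(task, network, x):
    """Call `network` with the fields registered for `task`."""
    if task not in TASK_INPUT_FIELDS:
        raise NotImplementedError("Unknown task type {}.".format(task))
    return network(*(x[name] for name in TASK_INPUT_FIELDS[task]))


def echo_network(*inputs):
    # Stand-in for a model: just report what it was given.
    return inputs


batch_x = {"word_seq": [[3, 5, 2]], "word_seq_origin_len": [3]}
seq_out = data_forward("seq_label", echo_network, batch_x)
cls_out = data_forward("text_classify", echo_network, batch_x)
```

With such a table, supporting a new task would mean adding one entry rather than another branch.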
- design a PyTorch model with a forward method.
- choose fields or create new fields to describe your data set.
- modify the preprocessor's build_dict method to build a dictionary over your data and use it to transform the multi-level list of strings into a multi-level list of indices. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L182
- modify the preprocessor's convert_to_dataset method to transform the multi-level list of indices into a DataSet object. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L244
- specify which fields you want to use as network inputs in Trainer, Tester, and Predictor. Modifications are needed wherever self._task appears.
- run and debug.
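The dictionary-building step can be sketched as follows for the `[[tokens, labels], ...]` layout used earlier. This is an illustration of the idea, not the preprocessor's actual build_dict; the `<pad>` and `<unk>` entries, their indices, and the helper name `to_index` are assumptions.

```python
def build_dict(data):
    """Build word and label vocabularies from [[tokens, labels], ...]."""
    # Index 0 for padding and 1 for unknown words are assumptions.
    word2id = {"<pad>": 0, "<unk>": 1}
    label2id = {}
    for tokens, labels in data:
        for token in tokens:
            word2id.setdefault(token, len(word2id))
        for label in labels:
            label2id.setdefault(label, len(label2id))
    return word2id, label2id


def to_index(data, word2id, label2id):
    """Replace every string with its index, keeping the nesting."""
    return [
        [[word2id.get(token, word2id["<unk>"]) for token in tokens],
         [label2id[label] for label in labels]]
        for tokens, labels in data
    ]


data = [[["This", "is", "fast", "NLP"], ["l1", "l3", "l2", "l1"]]]
word2id, label2id = build_dict(data)
indexed = to_index(data, word2id, label2id)
```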
- optimize Preprocessor: make it a callable object that takes a customized processing function as an argument
- more unit tests on core/
- eliminate self._task?
- merge kezhen's code about build_dict
Any questions are welcome!