FastNLP Tutorial

Coet edited this page Nov 10, 2018 · 8 revisions

This page is out of date!!!

tensorboard

The FastNLP trainer logs the following values during training:

  • loss
  • mean and standard deviation of every tensor parameter of the model
  • sum of the gradient of every tensor parameter of the model
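As a rough illustration of the three logged quantities, here is a minimal pure-Python sketch. The function name `summarize_parameters`, the tag names, and the use of flat lists of floats in place of tensors are all assumptions made for this example. The actual trainer operates on PyTorch tensors and may compute or name these values differently.

```python
import statistics


def summarize_parameters(loss, params, grads):
    """Collect the scalars listed above into a {tag: value} dict.

    params and grads map a parameter name to a flat list of floats,
    which stands in for a tensor in this sketch.
    """
    scalars = {"loss": loss}
    for name, values in params.items():
        scalars[name + "/mean"] = statistics.mean(values)
        # Population standard deviation; this page does not say whether
        # the trainer uses the population or the sample estimate.
        scalars[name + "/std"] = statistics.pstdev(values)
    for name, values in grads.items():
        scalars[name + "/grad_sum"] = sum(values)
    return scalars


scalars = summarize_parameters(
    loss=0.5,
    params={"fc.weight": [1.0, 2.0, 3.0]},
    grads={"fc.weight": [0.5, -0.25, 0.25]},
)
```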

To run tensorboard on a remote machine and view the results locally, see this answer.

User Alpha: call a model

from fastNLP.fastnlp import FastNLP

PATH_TO_CWS_PICKLE_FILES = "/home/zyfeng/fastNLP/reproduction/chinese_word_segment/save/"

nlp = FastNLP(model_dir=PATH_TO_CWS_PICKLE_FILES)

nlp.load("cws_basic_model", config_file="cws.cfg", section_name="POS_test")

text = ["这是最好的基于深度学习的中文分词系统。",
            "大王叫我来巡山。",
            "我党多年来致力于改善人民生活水平。"]
results = nlp.run(text)

# [[('这', 'S'), ('是', 'S'), ('最', 'S'), ('好', 'S'), ('的', 'S'), ('基', 'B'), ('于', 'E'), ('深', 'B'), ('度', 'E'), ('学', 'B'), ('习', 'E'), ('的', 'S'), ('中', 'B'), ('文', 'E'), ('分', 'B'), ('词', 'E'), ('系', 'B'), ('统', 'E'), ('。', 'S')], [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'), ('巡', 'B'), ('山', 'E'), ('。', 'S')], [('我', 'B'), ('党', 'E'), ('多', 'S'), ('年', 'S'), ('来', 'S'), ('致', 'B'), ('力', 'E'), ('于', 'S'), ('改', 'B'), ('善', 'E'), ('人', 'B'), ('民', 'E'), ('生', 'B'), ('活', 'E'), ('水', 'B'), ('平', 'E'), ('。', 'S')]]
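The result above is a list of (character, tag) pairs per sentence. To turn it into segmented words, the tags can be joined as in the sketch below. It assumes the usual reading of the tags (B = begin, E = end, S = single-character word, plus M = middle, which does not appear in the output above). The helper name `tags_to_words` is hypothetical and not part of FastNLP.

```python
def tags_to_words(tagged_chars):
    """Join (character, tag) pairs into words.

    Assumes B = begin, M = middle, E = end, S = single-character word.
    """
    words, current = [], ""
    for char, tag in tagged_chars:
        current += char
        if tag in ("E", "S"):
            # The word is complete at an E or S tag.
            words.append(current)
            current = ""
    if current:  # unterminated word at the end of the sentence
        words.append(current)
    return words


sentence = [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'),
            ('巡', 'B'), ('山', 'E'), ('。', 'S')]
words = tags_to_words(sentence)
```

Applied to every sentence in `results`, this gives one list of words per input string.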

User Beta: train a defined model

def train():
    # Load configuration with a ConfigLoader
    trainer_args = ConfigSection()
    model_args = ConfigSection()
    ConfigLoader("_").load_config(config_dir, {
        "test_seq_label_trainer": trainer_args, "test_seq_label_model": model_args})

    # Load data with a DataLoader
    pos_loader = POSDatasetLoader(data_path)
    train_data = pos_loader.load_lines()

    # Pre-processing: generate DataSet objects
    p = Preprocess()
    data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
    model_args["vocab_size"] = p.vocab_size
    model_args["num_classes"] = p.num_classes
    
    # Define a trainer
    trainer = Trainer(
        task="seq_label",
        epochs=trainer_args["epochs"],
        batch_size=trainer_args["batch_size"],
        validate=False,
        use_cuda=False,
        pickle_path=pickle_path,
        save_best_dev=trainer_args["save_best_dev"],
        model_name=model_name,
        optimizer=Optimizer("SGD", lr=0.01, momentum=0.9),
    )

    # Define a model
    model = SeqLabeling(model_args)

    # Start training
    trainer.train(model, data_train, data_dev)

    # Save model with a ModelSaver
    saver = ModelSaver(os.path.join(pickle_path, model_name))
    saver.save_pytorch(model)

User Gamma: train a model from scratch

One: prepare dataset loader

Before you start a new task, you should first have the corresponding datasets in hand. Implement a dataset loader that is a sub-class of DatasetLoader in dataset_loader.py. Your dataset loader is responsible for transforming raw data into three-level Python lists. For example,

[
    [[token_1, token_2, token_3, ...], [label_1, label_2, label_3, ...]],
    ...
]

The first dimension of your Python lists must be the number of examples. You are free to design the remaining dimensions, because you are responsible for parsing them in the next section.
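A loader producing this structure might look like the sketch below. The `DatasetLoader` class here is only a stand-in for the real base class in dataset_loader.py, whose constructor may differ. `TabSeparatedLoader`, its `parse` method, and the file format (one token and label per line separated by a tab, with a blank line between examples) are assumptions made for illustration.

```python
class DatasetLoader:
    """Stand-in for the base class in dataset_loader.py (hypothetical here)."""

    def __init__(self, data_path):
        self.data_path = data_path


class TabSeparatedLoader(DatasetLoader):
    """Hypothetical loader: one 'token<TAB>label' pair per line,
    with a blank line between examples."""

    def parse(self, lines):
        data, tokens, labels = [], [], []
        for line in lines:
            line = line.strip()
            if not line:
                # A blank line closes the current example.
                if tokens:
                    data.append([tokens, labels])
                    tokens, labels = [], []
                continue
            token, label = line.split("\t")
            tokens.append(token)
            labels.append(label)
        if tokens:  # last example without a trailing blank line
            data.append([tokens, labels])
        return data

    def load_lines(self):
        with open(self.data_path, encoding="utf-8") as f:
            return self.parse(f)


raw = ["This\tDT", "is\tVBZ", "fast\tJJ", "", "NLP\tNNP"]
examples = TabSeparatedLoader("unused.txt").parse(raw)
```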

Two: prepare a pre-processing method

The Preprocessor transforms the three-level lists mentioned above into DataSet object(s) by constructing Fields in the convert_to_dataset method. Currently, different structures of the three-level lists lead to different field constructions. You are free to implement your own construction method there.

for example in data:
    words, label = example[0], example[1]
    instance = Instance()

    if isinstance(words, list):
        x = TextField(words, is_target=False)
        instance.add_field("word_seq", x)
        use_word_seq = True
    else:
        raise NotImplementedError("words is a {}".format(type(words)))

    if isinstance(label, list):
        y = TextField(label, is_target=True)
        instance.add_field("label_seq", y)
        use_label_seq = True
    elif isinstance(label, str):
        y = LabelField(label, is_target=True)
        instance.add_field("label", y)
        use_label_str = True
    else:
        raise NotImplementedError("label is a {}".format(type(label)))
    data_set.append(instance)

Three: define your config file

FastNLP uses a config file to store 1) model hyper-parameters and 2) trainer settings. The config file is a text file containing any number of config sections. A section contains any number of configurations. A configuration is a key-value pair joined by =. For example,

# test.cfg
[model]
vocab_size = 100
num_hidden_layers = 2
use_drop_out = false
pickle_path = "./save/"

[train]
learning_rate = 0.0001
pickle_path = "./save/"
validate = true
save_dev_output = false

Load config sections with ConfigLoader.

trainer_args = ConfigSection()
model_args = ConfigSection()
ConfigLoader().load_config("./test.cfg", {"train": trainer_args, "model": model_args})
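To show how typed values can be recovered from such a file, the sketch below reads the same kind of sections with Python's standard `configparser`. This is not FastNLP's ConfigLoader, and the conversion rules in the hypothetical helper `parse_value` (booleans, then integers, then floats, then quoted strings) are an assumption about what a loader would reasonably do.

```python
import configparser

CONFIG_TEXT = """
[model]
vocab_size = 100
num_hidden_layers = 2
use_drop_out = false
pickle_path = "./save/"

[train]
learning_rate = 0.0001
validate = true
"""


def parse_value(raw):
    """Convert a raw config string to bool, int, float, or str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    # Fall back to a string, dropping surrounding double quotes.
    return raw.strip('"')


def load_sections(text):
    """Return {section name: {key: typed value}} for every section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: {k: parse_value(v) for k, v in parser[name].items()}
            for name in parser.sections()}


sections = load_sections(CONFIG_TEXT)
```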

Four: modify trainer

Currently, the trainer supports only a few tasks. You can add more in data_forward. The same applies to the tester.

def data_forward(self, network, x):
    if self._task == "seq_label":
        y = network(x["word_seq"], x["word_seq_origin_len"])
    elif self._task == "text_classify":
        y = network(x["word_seq"])
    else:
        raise NotImplementedError("Unknown task type {}.".format(self._task))
    return y

Five: call trainer to train

trainer = Trainer(trainer_args)
model = SeqLabeling(model_args)
trainer.train(model, data_train, data_dev)

Data Flow

            loader                    preprocessor         Batch
raw dataset ------> 2-D list of strings ------->  DataSet -------> data_iterator ------> batch_x 
                                                                                         batch_y

step one

pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
train_data = pos_loader.load_lines()
"""
[
    [["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
    ...
]
"""

step two

p = SeqLabelPreprocess()
data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
# type(data_train) == DataSet
# type(data_dev) == DataSet
DataSet 
[
    Instance(Field_1, Field_2, Field_3, ...),
    Instance(Field_1, Field_2, Field_3, ...),
    ...
]

step three & four

data_iterator = Batch(data_train, batch_size=16, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    x = batch_x["word_seq"]
    y = network(x)
    get_loss(y, batch_y["label_seq"])

Implementation Details

Field, Instance & DataSet

dataset.py defines DataSet, which is a list of Instances. instance.py defines Instance, which is a single example and contains multiple Fields.

field.py defines Field, which is the elementary data type or representation. TextField defines a list of strings. LabelField defines a single integer or string. You can add extra fields to support more complex data.

Each field

  • has a field name
  • has an is_target boolean argument to specify whether it is a target (Y) or an input (X) in training.
  • has a to_tensor method to define how this field data is transformed into tensors
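The three properties above can be pictured with the simplified classes below. They are hypothetical stand-ins, not the real classes in field.py and instance.py. In particular, passing a `vocab` dict to `to_tensor` and returning plain Python values instead of tensors are assumptions made to keep the sketch self-contained.

```python
class Field:
    """Base class: records whether the field is a target (Y) or an input (X)."""

    def __init__(self, is_target):
        self.is_target = is_target

    def to_tensor(self, vocab):
        raise NotImplementedError


class TextField(Field):
    """A list of strings, e.g. the tokens of a sentence."""

    def __init__(self, text, is_target):
        super().__init__(is_target)
        self.text = text

    def to_tensor(self, vocab):
        # Returns a list of indices; a real field would build a tensor.
        return [vocab[token] for token in self.text]


class LabelField(Field):
    """A single string or integer label."""

    def __init__(self, label, is_target=True):
        super().__init__(is_target)
        self.label = label

    def to_tensor(self, vocab):
        # Integer labels are used as-is; string labels are looked up.
        return self.label if isinstance(self.label, int) else vocab[self.label]


class Instance:
    """A single example: a mapping from field name to Field."""

    def __init__(self):
        self.fields = {}

    def add_field(self, name, field):
        self.fields[name] = field


instance = Instance()
instance.add_field("word_seq", TextField(["fast", "NLP"], is_target=False))
instance.add_field("label", LabelField("pos", is_target=True))
word_ids = instance.fields["word_seq"].to_tensor({"fast": 0, "NLP": 1})
label_id = instance.fields["label"].to_tensor({"pos": 1, "neg": 0})
```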

dataset.py defines a function to make DataSet from a list.

def create_dataset_from_lists(str_lists: list, word_vocab: dict, has_target: bool = False, label_vocab: dict = None) -> DataSet:

Example: https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_tester.py#L15 https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_trainer.py#L14

Batch

batch.py defines Batch, an iterable wrapper around DataSet. Sampling and padding are applied internally. Iterating over a Batch object yields two dicts, batch_x and batch_y. Each key is a field name, and each value is the corresponding batch tensor.

data_iterator = Batch(data_set, batch_size=8, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    batch_x["word_seq"]  # torch.LongTensor
    batch_y["label"]  # torch.LongTensor

How to get the pre-padded (origin) length of the sequence

Batch keeps a record of the origin length of each field before padding. It returns the origin lengths under a string key formed by appending "_origin_len" to the field name. For example, batch_x["word_seq_origin_len"] # torch.LongTensor.
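The relationship between a padded field and its origin-length key can be sketched in plain Python as follows. The helper `pad_batch` and the padding value 0 are assumptions for illustration. The real Batch returns torch.LongTensor objects rather than lists.

```python
def pad_batch(sequences, field_name, pad_value=0):
    """Pad index sequences to the longest one and record original lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # The lengths are stored next to the padded field under "<name>_origin_len".
    return {field_name: padded, field_name + "_origin_len": lengths}


batch_x = pad_batch([[4, 7, 2], [5], [9, 3]], "word_seq")
```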

Why is the origin length a tensor rather than a list of ints? Because the sequence labeling model's forward() takes an extra argument seq_len that represents the origin lengths (the creation of sequence masks was moved into the model, which needs seq_len), and tensorboardX requires the arguments passed to forward() to be tensors only.
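For readers unfamiliar with sequence masks, the sketch below shows how a mask can be derived from seq_len. It uses nested lists for clarity. The function name `sequence_mask` is hypothetical, and the actual model presumably builds the mask with tensor operations instead.

```python
def sequence_mask(seq_len, max_len=None):
    """Return a 0/1 mask: 1 for real positions, 0 for padding."""
    if max_len is None:
        max_len = max(seq_len)
    return [[1 if i < n else 0 for i in range(max_len)] for n in seq_len]


mask = sequence_mask([3, 1, 2])
```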

One Trainer, Many Tasks

In the previous design, different trainers were responsible for different tasks. After the introduction of Fields and DataSet, different tasks are represented by different DataSet structures, that is, by the way Fields are organized. Therefore, all methods in SeqLabelTrainer and ClassificationTrainer have been removed. They remain only as empty, deprecated sub-classes and raise a warning when used. The same applies to the corresponding Tester and Predictor sub-classes.

self._task for trainers, testers, and predictors

However, trainers still need task information to know which fields are network inputs among batch_x. This is because

  • we don't know which task will be performed when preprocessing and making the DataSet.
  • not all fields in batch_x are needed as network input. Some may be unused, such as seq_len in text classification.
  • in tester, different tasks require different evaluation methods and metrics.
  • in predictor, different tasks require different pre-process and post-process.

Trainer and Tester take a required argument (an error is raised if it is not provided; there is NO default value), stored as self._task, that specifies which task to perform.

if self._task == "seq_label": 
    y = network(x["word_seq"], x["word_seq_origin_len"]) 
elif self._task == "text_classify": 
    y = network(x["word_seq"]) 
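The same selection of input fields can also be written as a lookup table, which may be easier to extend when adding tasks. This is only a sketch of an alternative layout, not how the Trainer is implemented. `TASK_INPUT_FIELDS` and `select_inputs` are hypothetical names.

```python
TASK_INPUT_FIELDS = {
    "seq_label": ("word_seq", "word_seq_origin_len"),
    "text_classify": ("word_seq",),
}


def select_inputs(task, batch_x):
    """Pick the network inputs for a task, in positional order."""
    if task not in TASK_INPUT_FIELDS:
        raise NotImplementedError("Unknown task type {}.".format(task))
    return [batch_x[name] for name in TASK_INPUT_FIELDS[task]]


batch_x = {"word_seq": [[4, 7], [5, 0]], "word_seq_origin_len": [2, 1]}
inputs = select_inputs("text_classify", batch_x)
```

The network would then be called as `network(*inputs)`.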

How to add your task?

  1. design a pytorch model, with forward method.
  2. choose fields or create new fields to describe your data set.
  3. modify preprocessor's build_dict method: to build dictionary over your data, and use the dictionary to transform multi-level list of strings into multi-level list of indices. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L182
  4. modify preprocessor's convert_to_dataset method: to transform multi-level list of indices into a DataSet object. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L244
  5. specify which fields you want to use as network inputs in Trainer, Tester, and Predictor. Wherever self._task appears, a modification is needed.
  6. run and debug.
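Step 3 above, building a dictionary and mapping strings to indices, can be sketched as follows. The function names and the reserved tokens `<pad>` and `<unk>` are assumptions for illustration. The real build_dict method in preprocess.py may use different names and reserve different entries.

```python
def build_vocab(examples, reserved=("<pad>", "<unk>")):
    """Map every token seen in the examples to an integer index."""
    vocab = {token: i for i, token in enumerate(reserved)}
    for tokens, _labels in examples:
        for token in tokens:
            # Assign the next free index only to unseen tokens.
            vocab.setdefault(token, len(vocab))
    return vocab


def to_indices(tokens, vocab, unk="<unk>"):
    """Replace each token by its index, falling back to the unknown token."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


examples = [[["This", "is", "fast", "NLP"], ["l1", "l3", "l2", "l1"]],
            [["This", "is", "it"], ["l1", "l3", "l2"]]]
vocab = build_vocab(examples)
ids = to_indices(["This", "is", "new"], vocab)
```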

Future Works

  1. optimize Preprocessor: make it a callable object, customized processing function as argument
  2. more unit tests on core/
  3. eliminate self._task ?
  4. merge kezhen's code about build_dict

Any questions are welcome!