
FastNLP Tutorial


Data Flow

            loader                    preprocessor         Batch
raw dataset ------> 2-D list of strings ------->  DataSet -------> data_iterator ------> batch_x 
                                                                                         batch_y

step one

# load the raw file into a 2-D list of strings
data_loader = POSDatasetLoader("./data/pos_tag_data.txt")
train_data = data_loader.load_lines()
"""
[
    [["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
    ...
]
"""

step two

p = SeqLabelPreprocess()
# builds vocabularies, splits train/dev, and saves intermediate pickles under pickle_path
data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
# type(data_train) == DataSet
# type(data_dev) == DataSet
DataSet 
[
    Instance(Field_1, Field_2, Field_3, ...),
    Instance(Field_1, Field_2, Field_3, ...),
    ...
]

step three & four

data_iterator = Batch(data_train, batch_size=16, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    x = batch_x["word_seq"]            # input field tensor
    y = network(x)                     # forward pass
    get_loss(y, batch_y["label_seq"])  # compare prediction with the target field

User Alpha: call a model

from fastNLP.fastnlp import FastNLP

PATH_TO_CWS_PICKLE_FILES = "/home/zyfeng/fastNLP/reproduction/chinese_word_segment/save/"

nlp = FastNLP(model_dir=PATH_TO_CWS_PICKLE_FILES)

nlp.load("cws_basic_model", config_file="cws.cfg", section_name="POS_test")

text = ["这是最好的基于深度学习的中文分词系统。",
        "大王叫我来巡山。",
        "我党多年来致力于改善人民生活水平。"]
results = nlp.run(text)

# [[('这', 'S'), ('是', 'S'), ('最', 'S'), ('好', 'S'), ('的', 'S'), ('基', 'B'), ('于', 'E'), ('深', 'B'), ('度', 'E'), ('学', 'B'), ('习', 'E'), ('的', 'S'), ('中', 'B'), ('文', 'E'), ('分', 'B'), ('词', 'E'), ('系', 'B'), ('统', 'E'), ('。', 'S')], [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'), ('巡', 'B'), ('山', 'E'), ('。', 'S')], [('我', 'B'), ('党', 'E'), ('多', 'S'), ('年', 'S'), ('来', 'S'), ('致', 'B'), ('力', 'E'), ('于', 'S'), ('改', 'B'), ('善', 'E'), ('人', 'B'), ('民', 'E'), ('生', 'B'), ('活', 'E'), ('水', 'B'), ('平', 'E'), ('。', 'S')]]

User Beta: train a defined model

def train_and_test():
    # Load config section from config file
    trainer_args = ConfigSection()
    model_args = ConfigSection()
    ConfigLoader().load_config("./data/config", {
        "test_seq_label_trainer": trainer_args, "test_seq_label_model": model_args})

    # Load data with data loader
    data_loader = POSDatasetLoader("./data/pos_tag_data.txt")
    train_data = data_loader.load_lines()

    # Preprocessor: 2-D list of strings ----> DataSet
    preprocess = SeqLabelPreprocess()
    data_train, data_dev = preprocess.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
    model_args["vocab_size"] = preprocess.vocab_size
    model_args["num_classes"] = preprocess.num_classes

    # Define trainer
    trainer = Trainer(
        epochs=trainer_args["epochs"],
        batch_size=trainer_args["batch_size"],
        validate=trainer_args["validate"],
        use_cuda=trainer_args["use_cuda"],
        pickle_path=pickle_path,
        save_best_dev=trainer_args["save_best_dev"],
        model_name=model_name,
        optimizer=Optimizer("SGD", lr=0.01, momentum=0.9),
    )

    # Define a model
    model = SeqLabeling(model_args)

    # Start training
    trainer.train(model, data_train, data_dev)
    print("Training finished!")

    # Define Saver and save a model
    saver = ModelSaver(os.path.join(pickle_path, model_name))
    saver.save_pytorch(model)
    print("Model saved!")

    del model, trainer, data_loader

    # Define the same model
    model = SeqLabeling(model_args)

    # Load trained weights into the model
    ModelLoader.load_pytorch(model, os.path.join(pickle_path, model_name))
    print("model loaded!")

    # Load test configuration
    tester_args = ConfigSection()
    ConfigLoader().load_config("./data/config", {"test_seq_label_tester": tester_args})

    # Define a tester
    tester = Tester(save_output=False,
                    save_loss=False,
                    save_best_dev=False,
                    batch_size=4,
                    use_cuda=False,
                    pickle_path=pickle_path,
                    model_name="seq_label_in_test.pkl",
                    print_every_step=1)
    # Start testing
    tester.test(model, data_dev)

    print(tester.show_metrics())

User Gamma: train a new model

Implementation Details

Field, Instance & DataSet

dataset.py defines DataSet, which is a list of Instances. instance.py defines Instance, which is a single example and contains multiple Fields.

field.py defines Field, the elementary data type or representation. A TextField holds a list of strings. A LabelField holds a single integer or string. You can add extra fields to support more complex data.

Each field

  • has a field name
  • has an is_target boolean argument to specify whether it is a target (Y) or an input (X) in training.
  • has a to_tensor method that defines how the field's data is transformed into a tensor.
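
For illustration only, a custom field following this contract might look like the sketch below. The class name and constructor are made up, and the exact Field base-class interface is an assumption; check field.py for the real signatures.

import torch

class ScoreField:
    """Hypothetical field holding a single float score (not actual fastNLP code)."""

    def __init__(self, score, is_target=False):
        self.score = score
        self.is_target = is_target  # True: this field is a target (Y); False: an input (X)

    def to_tensor(self, padding_length):
        # a scalar needs no padding; a sequence field would pad itself up to padding_length
        return torch.tensor([self.score], dtype=torch.float)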

dataset.py also defines a function to build a DataSet from lists of strings.

def create_dataset_from_lists(str_lists: list, word_vocab: dict, has_target: bool = False, label_vocab: dict = None) -> DataSet:
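
A hypothetical call, with made-up vocabularies and the loader output layout from step one, might look like this:

# vocabularies made up for illustration
word_vocab = {"<pad>": 0, "<unk>": 1, "This": 2, "is": 3, "fast": 4, "NLP": 5}
label_vocab = {"label_1": 0, "label_2": 1, "label_3": 2}

# layout assumed to match the loader output shown in step one
str_lists = [
    [["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
]

data_set = create_dataset_from_lists(str_lists, word_vocab, has_target=True, label_vocab=label_vocab)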

Examples:

  • https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_tester.py#L15
  • https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_trainer.py#L14

Batch

batch.py defines Batch, an iterable wrapper around DataSet that applies sampling and padding internally. Iterating over a Batch object yields two dicts, batch_x and batch_y. The keys of each dict are field names; the values are the corresponding batch tensors.

data_iterator = Batch(data_set, batch_size=8, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    batch_x["word_seq"]  # torch.LongTensor
    batch_y["label"]  # torch.LongTensor

how to get the pre-padding (original) length of a sequence

Batch keeps a record of each field's original length before padding, and returns the original lengths under a string key built by appending "_origin_len" to the field name. For example, batch_x["word_seq_origin_len"] # torch.LongTensor.

Why is the original length a tensor rather than a list of ints? Because the sequence labeling model's forward() now takes an extra argument seq_len representing the original lengths (the creation of sequence masks has moved into the model, which needs seq_len), and tensorboardX requires that everything passed to forward() be a tensor.
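
As a sketch of what the model does with seq_len, a sequence mask can be built from the length tensor roughly like this (simplified, assumed code, not the exact fastNLP implementation):

import torch

def make_seq_mask(seq_len, max_len):
    """seq_len: torch.LongTensor of shape (batch_size,) holding original lengths."""
    positions = torch.arange(max_len, dtype=torch.long).unsqueeze(0)  # shape (1, max_len)
    # mask[i, j] is 1 where j < seq_len[i], i.e. inside the original sequence
    return positions < seq_len.unsqueeze(1)

# make_seq_mask(torch.LongTensor([3, 5]), 5) gives
# [[1, 1, 1, 0, 0],
#  [1, 1, 1, 1, 1]]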

One Trainer, Many Tasks

In the previous design, different trainers were responsible for different tasks. After introducing Fields & DataSet, different tasks are represented by different DataSet structures, i.e. by the way Fields are organized. Therefore, all methods in SeqLabelTrainer and ClassificationTrainer have been removed; they are now empty, deprecated sub-classes that issue a warning when used. The same holds for the corresponding Testers and the Predictor.
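
The deprecation pattern, sketched (assumed shape, not the exact fastNLP code):

import warnings

class SeqLabelTrainer(Trainer):
    """Deprecated: use Trainer directly; tasks now differ only in DataSet structure."""

    def __init__(self, **kwargs):
        warnings.warn("SeqLabelTrainer is deprecated, use Trainer instead", DeprecationWarning)
        super(SeqLabelTrainer, self).__init__(**kwargs)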

self._task for trainers, testers, and predictors

However, trainers still need task information to know which fields in batch_x are network inputs. This is because:

  • we don't know which task will be performed when preprocessing and building the DataSet.
  • not all fields in batch_x are needed as network input; some may be unused, such as seq_len in text classification.
  • in the tester, different tasks require different evaluation methods and metrics.
  • in the predictor, different tasks require different pre-processing and post-processing.

Trainer & Tester have a required argument self._task (an error is raised if it is not provided; there is NO default value) that specifies which task to perform.

if self._task == "seq_label":
    # sequence labeling also needs the original lengths (e.g. to build masks)
    y = network(x["word_seq"], x["word_seq_origin_len"])
elif self._task == "text_classify":
    y = network(x["word_seq"])

How to add your task?

  1. design a PyTorch model with a forward method.
  2. choose existing fields or create new fields to describe your data set.
  3. modify the preprocessor's build_dict method to build dictionaries over your data, and use them to transform multi-level lists of strings into multi-level lists of indices. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L182
  4. modify the preprocessor's convert_to_dataset method to transform multi-level lists of indices into a DataSet object. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L244
  5. specify which fields you want to use as network inputs in Trainer, Tester, and Predictor: wherever self._task appears, a modification is needed (see the sketch after this list).
  6. run and debug.
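
For step 5, extending the dispatch shown earlier might look like the following sketch, where "my_task" and my_extra_field are hypothetical names standing in for your own:

if self._task == "seq_label":
    y = network(x["word_seq"], x["word_seq_origin_len"])
elif self._task == "text_classify":
    y = network(x["word_seq"])
elif self._task == "my_task":  # hypothetical new task
    # pass exactly the batch_x fields your model's forward() expects
    y = network(x["word_seq"], x["my_extra_field"])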

Future Works

  1. optimize Preprocessor: make it a callable object that takes a customized processing function as an argument
  2. more unit tests on core/
  3. eliminate self._task ?
  4. merge kezhen's code about build_dict

Questions are welcome!
