🪐 spaCy Project: Categorization of emotions in Reddit posts (Text Classification)

This project uses spaCy to train a text classifier on the GoEmotions dataset, with configs for a pipeline with or without transformer weights. To use the BERT-based config, change the config variable in project.yml.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| init-vectors | Download vectors and convert to model |
| preprocess | Convert the corpus to spaCy's format |
| train | Train a spaCy pipeline using the specified corpus and config |
| evaluate | Evaluate on the test data and save the metrics |
| package | Package the trained model so it can be installed |
| visualize | Visualize the model's output interactively using Streamlit |

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | preprocess → train → evaluate → package |
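
The "only re-run if inputs have changed" behavior can be pictured as a hash comparison against a record of the last run. The sketch below is a simplified illustration of that idea, not spaCy's actual implementation: the function names, the use of SHA-256, and the JSON lock-file layout are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def current_state(deps):
    """Map each input path to the hash of its current contents."""
    return {str(path): file_hash(path) for path in deps}


def needs_rerun(command, deps, lock_path):
    """True if the command has no recorded run or any input hash differs."""
    lock_path = Path(lock_path)
    recorded = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    return recorded.get(command) != current_state(deps)


def record_run(command, deps, lock_path):
    """Store the input hashes seen by a successful run of the command."""
    lock_path = Path(lock_path)
    recorded = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    recorded[command] = current_state(deps)
    lock_path.write_text(json.dumps(recorded, indent=2))
```

In this sketch, a runner would call needs_rerun before each step of a workflow and record_run after the step succeeds, so an unchanged input lets the step be skipped.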

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/categories.txt | URL | The categories to train |
| assets/train.tsv | URL | The training data |
| assets/dev.tsv | URL | The development data |
| assets/test.tsv | URL | The test data |
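
To get a feel for the data before preprocessing, the following sketch turns TSV rows into the label-to-score dictionaries that spaCy text classifiers use for doc.cats. It assumes each row holds the text, a comma-separated list of category indices, and an ID, and that categories.txt lists one label per line in index order. Verify both assumptions against the downloaded files. The function name and the sample rows are made up for illustration, and this is not the project's own preprocessing script.

```python
import csv
import io


def read_examples(tsv_file, categories):
    """Yield (text, cats) pairs; cats maps every label to 0.0 or 1.0."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text, label_ids = row[0], row[1]
        # Assumed layout: column 2 holds comma-separated indices into categories
        active = {categories[int(i)] for i in label_ids.split(",")}
        yield text, {label: float(label in active) for label in categories}


# In-memory stand-ins for assets/categories.txt and assets/train.tsv (made-up rows)
categories = ["admiration", "amusement", "anger"]
sample = io.StringIO(
    "That was hilarious\t1\tid001\n"
    "Great work, but I am furious\t0,2\tid002\n"
)
examples = list(read_examples(sample, categories))
```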

Usage

If you want to use the BERT-based config (bert.cfg), make sure you have spacy-transformers installed:

pip install spacy-transformers

You can choose your GPU by setting the gpu_id variable in the project.yml.

Tuning a hyperparameter in the config

To change hyperparameters, you can edit the config (or create a new custom config). For instance, you could edit the components.textcat.model.tok2vec.encode.width value, changing it to 32:

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 32
depth = 4
window_size = 1
maxout_pieces = 3

Now you can retrain and reevaluate, and commit the updated config and metrics:

spacy project run train
spacy project run evaluate
git commit configs/my_new_config.cfg metrics/my_new_config.cfg -m "Scores TODO%"

You can also run experiments in a more lightweight way by running spacy train directly and overwriting hyperparameters on the command line:

spacy train \
    configs/my_new_config.cfg \
    --components.textcat.model.tok2vec.encode.width 32
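
The override flag is the config section name plus the key, joined by dots. The sketch below illustrates that mapping with Python's standard configparser. It is not how spaCy loads configs: the apply_override helper and the starting width of 64 are hypothetical.

```python
import configparser
import json

# Same block as in the config, with a hypothetical starting width of 64
CONFIG_TEXT = """
[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 64
depth = 4
window_size = 1
maxout_pieces = 3
"""


def apply_override(parser, dotted_key, value):
    """Set a value addressed as '<section>.<key>', mirroring the CLI flag."""
    section, _, key = dotted_key.rpartition(".")
    if not parser.has_section(section):
        raise KeyError(f"no section [{section}] in config")
    parser.set(section, key, json.dumps(value))


parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
parser.read_string(CONFIG_TEXT)
apply_override(parser, "components.textcat.model.tok2vec.encode.width", 32)
width = json.loads(parser["components.textcat.model.tok2vec.encode"]["width"])
```

Only the addressed key changes. The other values in the block keep what the config file specifies.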

Adding components from another model

Let's say you want to take tagger and NER components from the en_core_web_sm model, and add a new textcat model that you'll train, while keeping the existing models from the tagger and NER. This requires three changes to the config.

  1. Add the components to the nlp.pipeline.

    [nlp]
    pipeline = ["tagger", "ner", "textcat"]

  2. Add the "sourced" components in the [components] block. This tells the config to build the NER and tagger components from the en_core_web_sm config and to load their models from disk.

    [components]
    tagger = {"source": "en_core_web_sm"}
    ner = {"source": "en_core_web_sm"}

  3. Specify that the tagger and NER are "frozen". This stops the weights of these models from being reset, and stops the components from being updated.

    [training]
    frozen_components = ["tagger", "ner"]
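
Since the three changes live in different blocks, it is easy for them to drift apart. The sketch below is a small consistency check for this scenario only, written with the standard library. It is not spaCy's config validation, and the function name is hypothetical.

```python
import configparser
import json

# The three edits from the steps above, combined into one fragment
CONFIG_TEXT = """
[nlp]
pipeline = ["tagger", "ner", "textcat"]

[components]
tagger = {"source": "en_core_web_sm"}
ner = {"source": "en_core_web_sm"}

[training]
frozen_components = ["tagger", "ner"]
"""


def check_sourced_and_frozen(config_text):
    """Return a list of mismatches; an empty list means the edits agree."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(config_text)
    pipeline = json.loads(parser["nlp"]["pipeline"])
    frozen = json.loads(parser["training"]["frozen_components"])
    sourced = set()
    for name, raw in parser["components"].items():
        value = json.loads(raw)
        if isinstance(value, dict) and "source" in value:
            sourced.add(name)
    problems = []
    for name in frozen:
        if name not in pipeline:
            problems.append(f"{name} is frozen but missing from nlp.pipeline")
        if name not in sourced:
            problems.append(f"{name} is frozen but not sourced")
    return problems


problems = check_sourced_and_frozen(CONFIG_TEXT)
```

For the fragment above, problems is an empty list. Dropping "ner" from the pipeline would surface as one reported mismatch.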

Using embeddings from a spaCy package

spacy train \
    configs/cnn.cfg \
    --training.vectors "en_vectors_web_lg" \
    --components.textcat.model.tok2vec.embed.also_use_static_vectors true

Making and using new embeddings

Uncomment the asset in your project.yml:

assets:
  - dest: 'assets/vectors.zip'
    url: 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'

Then download the asset and run the init-vectors command:

spacy project assets
spacy project run init-vectors
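
For orientation, fastText .vec files are plain text: a header line with the number of rows and the vector dimension, then one token per line followed by its values. The sketch below reads the header and the first vector from such text so you can sanity-check a download. The function name and the three-dimensional sample are made up, and this is not what the init-vectors command runs.

```python
import io


def read_vec_preview(lines):
    """Parse the '<rows> <dims>' header and the first vector of .vec text."""
    n_rows, n_dims = (int(part) for part in next(lines).split())
    token, *values = next(lines).rstrip().split(" ")
    vector = [float(v) for v in values]
    if len(vector) != n_dims:
        raise ValueError(f"expected {n_dims} values, found {len(vector)}")
    return n_rows, n_dims, token, vector


# Made-up miniature file; the real asset has 300-dimensional vectors
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
preview = read_vec_preview(sample)
```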

Use the vectors:

spacy train \
    configs/cnn.cfg \
    --training.vectors "assets/en_fasttext_vectors" \
    --components.textcat.model.tok2vec.embed.also_use_static_vectors true