🪐 spaCy Project: Categorization of emotions in Reddit posts (Text Classification)

This project uses spaCy to train a text classifier on the GoEmotions dataset with options for a pipeline with and without transformer weights. To use the BERT-based config, change the config variable in the project.yml.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

Command	Description
`init-vectors`	Download vectors and convert to model
`preprocess`	Convert the corpus to spaCy's format
`train`	Train a spaCy pipeline using the specified corpus and config
`evaluate`	Evaluate on the test data and save the metrics
`package`	Package the trained model so it can be installed
`visualize`	Visualize the model's output interactively using Streamlit

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

Workflow	Steps
`all`	`preprocess` → `train` → `evaluate` → `package`

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

File	Source	Description
`assets/categories.txt`	URL	The categories to train
`assets/train.tsv`	URL	The training data
`assets/dev.tsv`	URL	The development data
`assets/test.tsv`	URL	The test data

Usage

If you want to use the BERT-based config (bert.cfg), make sure you have spacy-transformers installed:

pip install spacy-transformers

You can choose your GPU by setting the gpu_id variable in the project.yml.

Tuning a hyper-parameter in the config

To change hyperparameters, you can edit the config (or create a new custom config). For instance, you could edit the components.textcat.model.tok2vec.encode.width value, changing it to 32:

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 32
depth = 4
window_size = 1
maxout_pieces = 3

Now you can retrain and reevaluate, and commit the updated config and metrics:

spacy project run train
spacy project run evaluate
git commit configs/my_new_config.cfg metrics/my_new_config.cfg -m "Scores TODO%"

You can also run experiments in a more lightweight way by running spacy train directly and overwriting hyperparameters on the command line:

spacy train \
    configs/my_new_config.cfg \
    --components.textcat.model.tok2vec.encode.width 32

Adding components from another model

Let's say you want to take tagger and NER components from the en_core_web_sm model, and add a new textcat model that you'll train, while keeping the existing models from the tagger and NER. This requires three changes to the config.

Add the components to the nlp.pipeline.

[nlp]
pipeline = ["tagger", "ner", "textcat"]

Add the "sourced" components in the [components] block. This tells the config to build the NER and tagger components from the en_core_web_sm config and to load their models from disk.
```
[components]
tagger = {"source": "en_core_web_sm"}
ner = {"source": "en_core_web_sm"}
```
Specify that the tagger and NER are "frozen". This stops the weights of these models from being reset, and stops the components from being updated.
```
[training]
frozen_components = ["tagger", "ner"]
```

Using embeddings from a spaCy package

spacy train \
    configs/cnn.cfg \
    --training.vectors "en_vectors_web_lg" \
    --components.textcat.model.tok2vec.embed.also_use_static_vectors true

Making and using new embeddings

Uncomment the asset in your project.yml:

assets:
  - dest: 'assets/vectors.zip'
    url: 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'

Then download the asset and run the init-vectors command:

spacy project assets
spacy project run init-vectors

Use the vectors:

spacy train \
    configs/cnn.cfg \
    --training.vectors "assets/en_fasttext_vectors" \
    --components.textcat.model.tok2vec.embed.also_use_static_vectors true

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

🪐 spaCy Project: Categorization of emotions in Reddit posts (Text Classification)

📋 project.yml

⏯ Commands

⏭ Workflows

🗂 Assets

Usage

Tuning a hyper-parameter in the config

Adding components from another model

Using embeddings from a spaCy package

Making and using new embeddings

Files

README.md

Latest commit

History

README.md

File metadata and controls

🪐 spaCy Project: Categorization of emotions in Reddit posts (Text Classification)

📋 project.yml

⏯ Commands

⏭ Workflows

🗂 Assets

Usage

Tuning a hyper-parameter in the config

Adding components from another model

Using embeddings from a spaCy package

Making and using new embeddings