Label Studio User Guide - Text Classification

1. Installation

Dependencies used in the following annotation examples:

  • Python 3.8+
  • label-studio == 1.6.0

Use pip to install label-studio in the terminal:

pip install label-studio==1.6.0

Once the installation is complete, run the following command:

label-studio start

Open http://localhost:8080/ in a browser, enter your user name and password to log in, and start labeling with label-studio.

2. Text Classification Task Annotation

2.1 Project Creation

Click Create to start a new project, then complete the following steps:

  • Fill in the project name and description.
  • Upload a local txt file, select List of tasks, and then import it into this project.
  • Select Text Classification in Labeling Setup and define the labels.

2.2 Data Upload

You can continue to import local txt files after the project has been created. See Project Creation for details.

2.3 Label Construction

After the project has been created, you can add or delete labels in Setting/Labeling Interface, just as in Project Creation.

2.4 Task annotation

2.5 Data Export

Select the IDs of the annotated texts, choose JSON as the export file type, and export the data.
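Before converting the export, it can help to check what it contains. The following is a minimal sketch, not part of label_studio.py, that pulls each text and its chosen labels out of an exported JSON file. It assumes the usual Label Studio layout for a Text Classification project: a list of tasks, each with the text under data.text and the selected labels under annotations[].result[].value.choices. The helper names load_export and extract_text_labels and the sample tasks are hypothetical.

```python
import json


def load_export(path):
    """Read a Label Studio JSON export (assumed to be a list of tasks)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def extract_text_labels(tasks):
    """Return (text, labels) pairs, assuming the layout described above."""
    pairs = []
    for task in tasks:
        # "text" is assumed to be the data key used by the labeling template.
        text = task.get("data", {}).get("text", "")
        labels = []
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Only classification results carry a "choices" list.
                if result.get("type") == "choices":
                    labels.extend(result.get("value", {}).get("choices", []))
        pairs.append((text, labels))
    return pairs


# Hypothetical sample mirroring the assumed export structure.
sample_tasks = [
    {
        "id": 1,
        "data": {"text": "The battery lasts all day."},
        "annotations": [
            {"result": [{"type": "choices", "value": {"choices": ["positive"]}}]}
        ],
    },
    {
        "id": 2,
        "data": {"text": "The screen cracked within a week."},
        "annotations": [
            {"result": [{"type": "choices", "value": {"choices": ["negative"]}}]}
        ],
    },
]

if __name__ == "__main__":
    for text, labels in extract_text_labels(sample_tasks):
        print(labels, text)
```

For a real export, extract_text_labels(load_export("./data/label_studio.json")) would give the same kind of list, provided the file follows the assumed layout.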

2.6 Data conversion

First, create a label file in the ./data directory with one label candidate per line. Alternatively, you can pass the list of label candidates directly through options. Rename the exported file to label_studio.json and put it in the ./data directory. The label_studio.py script then converts it to the UTC data format:

python label_studio.py \
    --label_studio_file ./data/label_studio.json \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --options ./data/label.txt
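The label file passed to --options is plain text with one label candidate per line. As a minimal sketch, it could be generated as follows; the helper name write_label_file and the example labels are hypothetical and not part of label_studio.py.

```python
from pathlib import Path


def write_label_file(labels, path):
    """Write one label candidate per line, skipping blanks and duplicates."""
    unique = []
    for label in labels:
        label = label.strip()
        if label and label not in unique:
            unique.append(label)
    path = Path(path)
    # Create the target directory (for example ./data) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


if __name__ == "__main__":
    # Example labels are placeholders; use the labels defined in your project.
    write_label_file(["positive", "negative"], "./data/label.txt")
```

Keeping the labels identical to those defined in the Label Studio project avoids mismatches between the annotations and the candidate set.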

2.7 More Configuration

  • label_studio_file: Data labeling file exported from label studio.
  • save_dir: The directory where the training data is saved. Defaults to the data directory.
  • splits: The proportions of the training, validation, and test sets when splitting the dataset. The default is [0.8, 0.1, 0.1], which splits the data into training, validation, and test sets at a ratio of 8:1:1.
  • options: The set of label candidates. If a filename is given, the file must contain one label per line. If a list is given, it must contain more than one label.
  • is_shuffle: Whether to randomly shuffle the dataset. The default is True.
  • seed: The random seed. The default is 1000.
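To make the interaction of splits, is_shuffle, and seed concrete, here is a minimal sketch of a seeded shuffle followed by a proportional split. It illustrates the idea only and is not taken from label_studio.py, whose rounding and shuffling details may differ. The function name split_dataset is hypothetical.

```python
import random


def split_dataset(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    """Split examples into train/dev/test lists according to `splits`."""
    if len(splits) != 3 or abs(sum(splits) - 1.0) > 1e-8:
        raise ValueError("splits must be three proportions that sum to 1")
    items = list(examples)
    if is_shuffle:
        # A dedicated Random instance makes the shuffle reproducible per seed.
        random.Random(seed).shuffle(items)
    n_train = int(len(items) * splits[0])
    n_dev = int(len(items) * splits[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_dataset(range(100))
    print(len(train), len(dev), len(test))
```

Fixing the seed means that rerunning the conversion on the same export produces the same partition, which keeps evaluation results comparable across runs.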

Note:

  • By default, the label_studio.py script splits the data proportionally into train/dev/test datasets.
  • Each time the label_studio.py script is executed, existing data files with the same name are overwritten.
  • For files exported from label_studio, every example in the file is assumed by default to have been correctly labeled by hand.
