This repository contains the source code and seed word sets for the paper Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation, to appear in NAACL-HLT SRW 2021.
Please see keyword_expansion.ipynb. It takes an initial seed word for each category and an unlabeled corpus, and expands the seed word set for each category. Since there are many experiments, I used YAML config files (under the configs folder) to manage them. You can ignore the other fields and update only the following parameters in the config file:
categories = config['categories'] # the list of categories
seed_words = config['seed_words'] # the initial seed words (usually the category name)
output_file = config['kw_file'] # the output keyword file after keyword expansion
corpus_path = config['train_corpus_path'] # the input corpus
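For reference, here is a minimal sketch of what the relevant fields might look like in a config file. The field names follow the notebook; the file name and values are purely illustrative:

```yaml
# configs/agnews-business-sports.yaml (hypothetical example)
categories:
  - business
  - sports
seed_words:          # initial seed words, usually the category names
  - business
  - sports
kw_file: seedwords/AGNews-business-sports-all   # expanded keywords are written here
train_corpus_path: data/agnews/train.csv        # unlabeled input corpus
```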
In the notebook I added scripts to load each dataset from its original form (individual files under a folder, or a CSV file). If you experiment on your own dataset, you can modify an existing example or write your own loader, as sketched below.
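As a rough illustration, a custom loader only needs to return the raw documents, e.g. as a list of strings. The sketch below assumes one document per file in a directory; the function name and layout are hypothetical, not part of the repo:

```python
import os

def load_corpus(corpus_dir):
    # Hypothetical loader: read one document per plain-text file in a directory.
    docs = []
    for fname in sorted(os.listdir(corpus_dir)):
        path = os.path.join(corpus_dir, fname)
        with open(path, encoding='utf-8') as f:
            docs.append(f.read().strip())
    return docs
```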
We use the Generalized Expectation (GE) Java implementation provided in the MALLET library. The code is relatively straightforward to set up and run, but we do not provide the Java code to train the interim classifiers. You can either follow the instructions in the MALLET library to use GE, or replace it with another weakly-supervised text classification model.
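Any replacement works as long as it outputs binary predictions that can be stacked into the labeling matrix used for error estimation below. Purely as a hedged illustration (this is not the GE model from the paper), a trivial keyword-voting classifier could look like:

```python
import numpy as np

def keyword_vote(docs, words_a, words_b):
    # Toy weakly-supervised classifier: predict 1 if a document
    # contains more category-B seed words than category-A seed words.
    preds = []
    for doc in docs:
        tokens = set(doc.lower().split())
        preds.append(int(len(tokens & set(words_b)) > len(tokens & set(words_a))))
    return np.array(preds)

docs = ["the stock market rallied today", "the home team won the game"]
print(keyword_vote(docs, ["stock", "market"], ["team", "game"]))  # [0 1]
```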
Note: I implemented the Bayesian error estimation (BEE) algorithm in NumPy following the paper Estimating Accuracy from Unlabeled Data: A Bayesian Approach. I tried to follow the paper as closely as possible; however, due to my limited knowledge of Bayesian statistics, I cannot guarantee that the code is error-free. I later realized that the original author of the BEE paper published his source code here. You can consider using his repo instead if you're familiar with Java.
The main class for BEE is defined in bee.py:
def __init__(self, labeling_matrix, num_iters=50, init_method='sampling',
             filter_estimators=False, filter_by_std=False):
    """Initialize the BEE model.

    :param labeling_matrix: the prediction matrix [num_samples, num_estimators]; each entry is either 0 or 1
    :param num_iters: the number of Gibbs sampling iterations
    :param init_method: initialization method, either 'maj' (majority vote) or 'sampling'
    """
The only parameter you need to pass is the labeling matrix, which contains the binary predictions from the interim classifiers; all other parameters can be left at their defaults. The model performs inference upon initialization, and you can access the following attributes once it's done:
- bee.true_labels: the inferred latent label for each example.
- bee.error_rates: the estimated error rate of each interim classifier (i.e., one minus its accuracy).
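For example, assuming the class in bee.py is named BEE (adjust the import to match the actual class name), a typical call looks like:

```python
import numpy as np
from bee import BEE  # assumed class name; check bee.py

# Binary predictions from 3 interim classifiers on 6 examples
# (rows = examples, columns = estimators); values are illustrative.
labeling_matrix = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
])

bee = BEE(labeling_matrix)  # inference runs during initialization
print(bee.true_labels)      # inferred latent label per example
print(bee.error_rates)      # estimated error rate per classifier
```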
We provide the full list of seed words for all datasets under the seedwords folder. The seed word files are named according to the following convention:
[dataset name]-[category A]-[category B]-[seed word set]
For example, AGNews-business-sports-all comes from the AGNews dataset, for the Business-vs-Sports classification task, and uses all keywords after keyword expansion. Note that the naming of the seed word sets is slightly different from the paper. Please refer to the mapping below:
| Seed word set in this repo | Seed word set in the paper (Tables 3 & 4) |
|---|---|
| seed | cate |
| all | - |
| ours | ours |
| manual | seed |
All datasets used in this work are public datasets published in previous work.
| Dataset | Link |
|---|---|
| AGNews | Link |
| NYT | Link |
| Yelp | Link |
| IMDB | Link |
Please cite the following paper if you find the code helpful for your research.
@inproceedings{jin2021seed,
  title={Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation},
  author={Jin, Yiping and Bhatia, Akshay and Wanvarie, Dittaya},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop},
  year={2021}
}