This repository is the companion to the paper "Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification."
We provide a collection of seven diverse datasets for hierarchical text classification, spanning the legal, scientific, medical, and patent domains. Each dataset comes with a gold-standard taxonomy, which makes the collection well suited to developing and evaluating hierarchical text classification methods.

| Dataset | Domain | Documents | Labels | Hierarchy Depth | Avg Length |
|---|---|---|---|---|---|
| EurLex-3985 | Legal | 19,306 | 3,985 | 2 | 2,635 |
| EurLex-DC-410 | Legal | 19,340 | 410 | 2 | 2,635 |
| WOS-141 | Scientific | 46,985 | 141 | 2 | 200 |
| SciHTC-83 | Scientific | 186,160 | 83 | 6 | 145 |
| SciHTC-800 | Scientific | 186,160 | 800 | 6 | 145 |
| MIMIC3-3681 | Medical | 52,712 | 3,681 | 3* | 1,514 |
| USPTO2M-632 | Patent | 1,998,408 | 632 | 2* | 117 |

\* Expanded hierarchy for certain methods (see the paper for details).
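
To make the "Hierarchy Depth" column concrete, here is a minimal sketch of how depth could be computed from a taxonomy stored as a child-to-parent map. The label names, the `PARENT` mapping, and the helper functions are hypothetical and are not taken from the datasets or from the code in `src/`. The sketch also assumes that depth means the number of levels on the longest root-to-leaf path; the paper may define it differently.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child label -> parent label.
# None marks a top-level label. These names are illustrative only.
PARENT: Dict[str, Optional[str]] = {
    "law": None,
    "law/trade": "law",
    "law/trade/tariffs": "law/trade",
    "science": None,
    "science/physics": "science",
}


def with_ancestors(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the top-level ancestor down to `label`."""
    path = [label]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def hierarchy_depth(parent: Dict[str, Optional[str]]) -> int:
    """Number of levels on the longest path (assumed definition of depth)."""
    return max(len(with_ancestors(label, parent)) for label in parent)


if __name__ == "__main__":
    print(hierarchy_depth(PARENT))
    print(with_ancestors("law/trade/tariffs", PARENT))
```

Expanding a label into its ancestors, as `with_ancestors` does, is one common way to turn a flat label set into a hierarchy-aware one. Whether a given method in the paper uses it is not stated here.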
Please fill out the Consent Form to get access to the datasets.
Please see src/README.md for more details.
If you find this repository useful, please cite our paper:
@misc{li2024stateoftheartcomedomaincrossdomain,
  title={Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification},
  author={Nan Li and Bo Kang and Tijl De Bie},
  year={2024},
  eprint={2412.12744},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.12744},
}