Cross-Domain Resources for Text Classification with Hierarchical Labels

This repository is the companion for the paper "Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification."

📊 Datasets

We provide a collection of seven datasets for hierarchical text classification, spanning the legal, scientific, medical, and patent domains. Each dataset comes with a gold-standard taxonomy, which makes the collection well suited to developing and evaluating hierarchical text classification methods.

Dataset	Domain	Documents	Labels	Hierarchy Depth	Avg Length
EurLex-3985	Legal	19,306	3,985	2	2,635
EurLex-DC-410	Legal	19,340	410	2	2,635
WOS-141	Scientific	46,985	141	2	200
SciHTC-83	Scientific	186,160	83	6	145
SciHTC-800	Scientific	186,160	800	6	145
MIMIC3-3681	Medical	52,712	3,681	3*	1,514
USPTO2M-632	Patent	1,998,408	632	2*	117

* Expanded hierarchy for certain methods (see paper for details)
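
For readers new to the setting, here is a minimal sketch of what a label taxonomy and its depth mean in hierarchical text classification. Everything in it is an assumption made for illustration: the toy labels, the `PARENT` mapping, and the helper names `ancestors`, `taxonomy_depth`, and `with_ancestors` are hypothetical and do not reflect the file format or code of this repository (see src/README.md for the actual code).

```python
# Hypothetical toy taxonomy: child label -> parent label.
# None marks a top-level label. Not taken from any dataset in this repository.
PARENT = {
    "science": None,
    "physics": "science",
    "optics": "physics",
    "law": None,
    "tax-law": "law",
}


def ancestors(label):
    """Return the path from the top-level label down to `label`, inclusive."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path[::-1]


def taxonomy_depth(parent_map):
    """Depth of the taxonomy: the number of labels on the longest root-to-label path."""
    return max(len(ancestors(label)) for label in parent_map)


def with_ancestors(labels):
    """Expand a document's label set so it also contains every ancestor label."""
    expanded = set()
    for label in labels:
        expanded.update(ancestors(label))
    return expanded


if __name__ == "__main__":
    print(ancestors("optics"))
    print(taxonomy_depth(PARENT))
    print(sorted(with_ancestors({"optics", "tax-law"})))
```

Storing only the child-to-parent links keeps the taxonomy compact, and any root-to-label path can be recovered by following those links upward.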

🚀 Getting Started

Please fill out the Consent Form to get access to the datasets.

💻 Code

Please see src/README.md for more details.

📚 Citation

If you find this repository useful, please cite our paper:

@misc{li2024stateoftheartcomedomaincrossdomain,
      title={Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification}, 
      author={Nan Li and Bo Kang and Tijl De Bie},
      year={2024},
      eprint={2412.12744},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.12744}, 
}
