aida-ugent/cross-domain-HTC

Cross-Domain Resources for Text Classification with Hierarchical Labels

This repository is the companion for the paper "Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification."

📊 Datasets

We provide a collection of seven datasets for hierarchical text classification, spanning the legal, scientific, medical, and patent domains. Each dataset comes with a gold-standard taxonomy, so it can be used to develop and evaluate hierarchical text classification methods.

| Dataset | Domain | Documents | Labels | Hierarchy Depth | Avg Length |
|---|---|---|---|---|---|
| EurLex-3985 | Legal | 19,306 | 3,985 | 2 | 2,635 |
| EurLex-DC-410 | Legal | 19,340 | 410 | 2 | 2,635 |
| WOS-141 | Scientific | 46,985 | 141 | 2 | 200 |
| SciHTC-83 | Scientific | 186,160 | 83 | 6 | 145 |
| SciHTC-800 | Scientific | 186,160 | 800 | 6 | 145 |
| MIMIC3-3681 | Medical | 52,712 | 3,681 | 3* | 1,514 |
| USPTO2M-632 | Patent | 1,998,408 | 632 | 2* | 117 |

* Expanded hierarchy for certain methods (see paper for details)
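
As a rough illustration of what a label taxonomy adds to a classification dataset, the sketch below walks a toy hierarchy to find a label's ancestors, closes a document's label set under the hierarchy, and counts the number of levels. Everything in it is hypothetical: the `PARENT` mapping, the label names, and the helper functions are not taken from this repository, and the on-disk format of the released datasets may be entirely different. The depth computed here counts levels from root to deepest label, which may not match the definition behind the "Hierarchy Depth" column.

```python
# Hypothetical toy taxonomy: child label -> parent label (None marks a root).
# These labels are made up for illustration and do not come from any dataset above.
PARENT = {
    "law": None,
    "trade_law": "law",
    "tariffs": "trade_law",
    "science": None,
    "physics": "science",
}


def ancestors(label, parent):
    """Return the path from the root down to `label`, inclusive."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path[::-1]


def expand_labels(labels, parent):
    """Add every ancestor of each assigned label to the label set."""
    expanded = set()
    for label in labels:
        expanded.update(ancestors(label, parent))
    return expanded


def taxonomy_depth(parent):
    """Number of levels on the longest root-to-label path."""
    return max(len(ancestors(label, parent)) for label in parent)


if __name__ == "__main__":
    print(ancestors("tariffs", PARENT))
    print(sorted(expand_labels({"tariffs", "physics"}, PARENT)))
    print(taxonomy_depth(PARENT))
```

Storing the hierarchy as a child-to-parent mapping keeps ancestor lookup simple when each label has a single parent; a taxonomy in which labels can have several parents would need a different representation.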

🚀 Getting Started

Please fill out the Consent Form to get access to the datasets.

💻 Code

Please see src/README.md for more details.

📚 Citation

If you find this repository useful, please cite our paper:

@misc{li2024stateoftheartcomedomaincrossdomain,
      title={Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification}, 
      author={Nan Li and Bo Kang and Tijl De Bie},
      year={2024},
      eprint={2412.12744},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.12744}, 
}
