sciknoworg/subject-indexing-dataset

📚 Welcome to the Subject Indexing Dataset Repository!

💡 About

This dataset empowers the research community 🤝 to build advanced LLM-based semantic solutions for automated domain classification and subject indexing 📑 of library records from a national library in Germany. The records are mainly in German or English, but other languages also occur. For the subject taxonomy, we rely on the GND (Gemeinsame Normdatei / Integrated Authority File), an international authority file widely used in German-speaking library systems to catalog and link information on people, organizations, topics, and works.

📂 Repositories Included

To support system development, we release four key components:

  • 28_domains_list.csv: This file lists 28 domains representing the coarse-grained classification scheme applied to the library records. A record can be assigned more than one domain.

  • GND: Resources related to the GND, including a human-readable version of the GND subject taxonomy. The taxonomy comprises over 200,000 subject headings and serves as the controlled vocabulary for fine-grained subject indexing of the library’s bibliographic records.

  • library-records-dataset: Open-access annotated library records with pre-defined train/dev/test splits. The dataset includes records in German and English, annotated with domain labels and GND subjects. It covers five representative record types: article, book, conference, report, and thesis.

    Both the GND taxonomy and the open-access records have been reorganized and reformatted with human-readable tags for seamless machine learning use. Since standardized library taxonomies often rely on complex legacy codes ⏳, we consulted subject specialists to preprocess and simplify the data. This allows researchers to focus on developing ML models rather than decoding intricate data formats.

  • evaluation: Evaluation scripts providing quantitative metrics—precision@k, recall@k, f1@k, recall_precision@k, and ndcg@k—for assessing system predictions against the released gold-standard annotations.
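To make the ranking metrics named above concrete, here is a minimal Python sketch of precision@k, recall@k, f1@k, and ndcg@k for a single record. It is an illustration only, not the code in the evaluation folder: the function names, the binary-relevance form of nDCG, and the choice to divide precision by k are all assumptions, and recall_precision@k is left out because its definition is not given here. Predictions are assumed to be a ranked list of unique subject labels.

```python
import math
from typing import Sequence, Set


def precision_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the top-k predictions that are gold subjects (divides by k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for label in predicted[:k] if label in gold)
    return hits / k


def recall_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold subjects that appear in the top-k predictions."""
    if k <= 0 or not gold:
        return 0.0
    hits = sum(1 for label in predicted[:k] if label in gold)
    return hits / len(gold)


def f1_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(predicted, gold, k)
    r = recall_at_k(predicted, gold, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def ndcg_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """nDCG@k with binary relevance: a prediction is relevant iff it is in gold."""
    if k <= 0 or not gold:
        return 0.0
    # Discounted gain of the actual ranking (rank i is 0-based, hence i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, label in enumerate(predicted[:k])
        if label in gold
    )
    # Ideal ranking places all gold subjects first, up to k positions.
    ideal_hits = min(k, len(gold))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg


# Hypothetical labels for illustration; real records use GND subjects.
predicted = ["a", "b", "c", "d"]
gold = {"a", "c", "e"}
scores = {
    "precision@3": precision_at_k(predicted, gold, 3),
    "recall@3": recall_at_k(predicted, gold, 3),
    "f1@3": f1_at_k(predicted, gold, 3),
    "ndcg@3": ndcg_at_k(predicted, gold, 3),
}
```

Dataset-level scores would typically be obtained by averaging these per-record values over all records in a split; whether the released scripts average this way is not stated here.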

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
