📚 Welcome to the Subject Indexing Dataset Repository!

💡 About

This dataset empowers the research community 🤝 to build advanced LLM-based semantic solutions for automated domain classification and subject indexing 📑 of library records from a national library in Germany. The records are mainly in German or English, but are not limited to these languages. For the subject taxonomy, we rely on the GND (Gemeinsame Normdatei / Integrated Authority File), an international authority file widely used in German-speaking library systems to catalog and link information on people, organizations, topics, and works.

📂 Repositories Included

To support system development, we release four key components:

28_domains_list.csv: This file lists 28 domains representing the coarse-grained classification scheme applied to the library records. A record can be assigned more than one domain.
GND: Resources related to the GND, including a human-readable version of the GND subject taxonomy. The taxonomy comprises over 200,000 subject headings and serves as the controlled vocabulary for fine-grained subject indexing of the library’s bibliographic records.
library-records-dataset: Open-access annotated library records with pre-defined train/dev/test splits. The dataset includes records in German and English, annotated with domain labels and GND subjects. It covers five representative record types: article, book, conference, report, and thesis.
evaluation: Evaluation scripts providing quantitative metrics (precision@k, recall@k, f1@k, recall_precision@k, and ndcg@k) for assessing system predictions against the released gold-standard annotations.

Both the GND taxonomy and the open-access records have been reorganized and reformatted with human-readable tags for seamless machine learning use. Since standardized library taxonomies often rely on complex legacy codes ⏳, we consulted subject specialists to preprocess and simplify the data. This allows researchers to focus on developing ML models rather than decoding intricate data formats.
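
To make the ranking metrics named for the evaluation component concrete, here is a minimal Python sketch of precision@k, recall@k, f1@k, and ndcg@k for a single record. It is an illustration only and is not the code in the evaluation folder: the function names, the binary-relevance form of nDCG, the choice to divide precision by k, and the placeholder subject labels are all assumptions. recall_precision@k is left out because its exact definition is not given here.

```python
import math


def precision_at_k(predicted, gold, k):
    """Fraction of the top-k predicted subjects that are in the gold set.

    Assumption: the denominator is k, even if fewer than k predictions exist.
    """
    if k <= 0:
        return 0.0
    hits = sum(1 for subject in predicted[:k] if subject in gold)
    return hits / k


def recall_at_k(predicted, gold, k):
    """Fraction of the gold subjects that appear in the top-k predictions."""
    if not gold:
        return 0.0
    hits = sum(1 for subject in predicted[:k] if subject in gold)
    return hits / len(gold)


def f1_at_k(predicted, gold, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(predicted, gold, k)
    r = recall_at_k(predicted, gold, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def ndcg_at_k(predicted, gold, k):
    """nDCG@k with binary relevance (1 if the subject is in gold, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, subject in enumerate(predicted[:k])
        if subject in gold
    )
    # Ideal ranking places every gold subject at the top of the list.
    ideal_hits = min(k, len(gold))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return 0.0 if idcg == 0 else dcg / idcg


# Hypothetical ranked predictions and gold annotations for one record.
# The labels are placeholders, not real GND identifiers.
predicted_subjects = ["subject_a", "subject_b", "subject_c", "subject_d"]
gold_subjects = {"subject_a", "subject_c", "subject_e"}

scores = {
    "precision@3": precision_at_k(predicted_subjects, gold_subjects, 3),
    "recall@3": recall_at_k(predicted_subjects, gold_subjects, 3),
    "f1@3": f1_at_k(predicted_subjects, gold_subjects, 3),
    "ndcg@3": ndcg_at_k(predicted_subjects, gold_subjects, 3),
}
```

The sketch assumes the prediction list is ranked best-first and contains no duplicates. Scores over a whole split would typically be obtained by averaging the per-record values.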

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

sciknoworg/subject-indexing-dataset

Folders and files

Latest commit

History

Repository files navigation

📚 Welcome to the Subject Indexing Dataset Repository!

💡 About

📂 Repositories Included

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages