Evaluating Embedding Models on Danish Historical Newspapers

This repository contains the code for creating embeddings, along with the plots and results, for our paper:

Lassche, Alie, Pascale Feldkamp, Yuri Bizzoni, Katrine Baunvig, Kristoffer Nielbo, and Johan Heinsen. ‘Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource’. Proceedings of the International Conference on Language Resources and Evaluation (LREC), May 2026.

Useful directions 📌

Where to find things in this repository:

info.md contains a description of the source code
/src/benchmark contains scripts for creating embeddings for the benchmark tasks
/src/full_corpus contains scripts for creating embeddings and predicting categories for the full corpus
/data/test_task contains the gold sample used in benchmark task I
/notebooks/ contains the notebooks used for the analysis
/results/ contains the results of benchmark task I
/figs/ contains the figures generated by the notebooks

Data & paper 📝

The dataset and embeddings generated for this paper are available on Hugging Face. The release is an enriched version of this dataset.

Please cite our forthcoming paper if you use the code, dataset or embeddings.

Project Organization 🏗️

├── LICENSE                            <- Open-source license if one is chosen.
│
├── README.md                          <- The top-level README for developers using this project.
│
├── info.md                            <- Description of the source code.
│
├── src/benchmark                                          
│       │
│       ├── process_articles.py        <- Code to get embeddings from newspaper article chunks.
│       ├── mean_pooling.py            <- Code to get average embeddings from newspaper articles.
│       ├── merge_text_embs.py         <- Merge texts and embeddings.
│       ├── classify.py                <- Code for benchmark task II.
│       └── clustering_task.py         <- Code for benchmark task III.       
│   
├── src/full_corpus                                          
│       │
│       ├── process_articles_all.py    <- Code to prepare articles for create_embs_all.py
│       ├── create_embs_all.py         <- Code to get embeddings from newspaper articles.
│       ├── mean_pooling_all.py        <- Code to get average embeddings from newspaper articles.
│       └── predict_cats_all.py        <- Predict categories of all newspaper articles.
│
├── data/                              <- Data used for the analysis in notebooks.
│
├── notebooks/                         <- Jupyter notebooks.
│   │
│   ├── create_gold_cats.ipynb         <- Notebook to create gold sample.
│   ├── newsp_stats.ipynb              <- Notebook to get descriptive statistics.
│   ├── train_classifier.ipynb         <- Notebook for benchmark task I. 
│   ├── clustering_article_cats.ipynb  <- Notebook for an additional clustering task on article categories (not included in the paper).
│   ├── visualization_categories.ipynb <- Notebook to create images in paper.
│   └── visz.ipynb                     <- Notebook to create images in paper.
│
└── figs/                              <- Generated graphics and figures used in the paper.
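
The tree above lists scripts that embed article chunks and then average them into one embedding per article. As a rough illustration of that averaging step, here is a minimal sketch in plain Python. It is not the code in mean_pooling.py: the function name `mean_pool`, the `(article_id, vector)` input layout, and the toy data are all assumptions made for this example, and the actual scripts may weight chunks differently or rely on NumPy or PyTorch.

```python
from typing import Dict, Iterable, List, Tuple


def mean_pool(chunks: Iterable[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """Average chunk embeddings per article (unweighted mean).

    `chunks` yields (article_id, embedding) pairs. This layout is a
    hypothetical one chosen for the example, not the repository's format.
    """
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for article_id, vector in chunks:
        if article_id not in sums:
            # First chunk of this article: start from a zero vector.
            sums[article_id] = [0.0] * len(vector)
            counts[article_id] = 0
        elif len(sums[article_id]) != len(vector):
            raise ValueError(f"Inconsistent embedding size for {article_id}")
        # Accumulate the element-wise sum and the number of chunks seen.
        sums[article_id] = [s + v for s, v in zip(sums[article_id], vector)]
        counts[article_id] += 1
    # Divide each summed vector by its chunk count to get the mean.
    return {
        article_id: [s / counts[article_id] for s in total]
        for article_id, total in sums.items()
    }


# Toy two-dimensional "embeddings", made up for illustration only.
chunk_embeddings = [
    ("article_1", [1.0, 2.0]),
    ("article_1", [3.0, 4.0]),
    ("article_2", [0.5, 0.5]),
]
article_embeddings = mean_pool(chunk_embeddings)
print(article_embeddings)
```

An unweighted mean treats every chunk equally regardless of its length. Weighting by chunk length is a common alternative, but whether the paper uses it is not stated here.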
