
Commit

Merge pull request #287 from centre-for-humanities-computing/dbc-datasets

datasheets for the four datasets from DBC D1G1TAL
KennethEnevoldsen authored Jun 10, 2024
2 parents 4c18d82 + a98c3e4 commit e23aa32
Showing 4 changed files with 604 additions and 0 deletions.
153 changes: 153 additions & 0 deletions docs/datasheets/dbc-abstracts.md
@@ -0,0 +1,153 @@
---
pretty_name: dbc-abstracts
language:
- da
- en
- "no"
- sv
- de
license: other
license_name: agreement (public models, private data)
size_categories:
- 10-100m
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
---
# dbc-abstracts: all abstracts from DBC D1G1TAL

*Version*: 1.0.0

*Homepage*: https://github.com/centre-for-humanities-computing/danish-foundation-models

*License*: Not publicly available.

---

dbc-abstracts consists of more than 11.6 million abstracts of books and other materials collected and created by DBC D1G1TAL (formerly Dansk Bibliotekscenter).

The dataset contains millions of abstracts in Danish, supplemented by abstracts in English, Norwegian, Swedish, German, and other languages.

## Datasheet

Following the recommendation and framework of [1], we add the following datasheet.

### Motivation:

**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?**

The dataset was collected and created by DBC D1G1TAL A/S as one of the backbones for their catalogue of books and other materials.

## Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?**

Instances that comprise this dataset represent short abstracts of books or other materials.

**How many instances are there in total (of each type, if appropriate)?**

There are 11,663,988 abstracts in this dataset.

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?**

The dataset contains all the instances underlying DBC D1G1TAL's database.

**If the dataset is a sample from a larger set, what was the sampling strategy?**

There was no sampling involved.

**Who was involved in the data collection process?**

The data was collected by DBC D1G1TAL.

**Over what timeframe was the data collected?**

The dataset includes abstracts created between 1991 and 2024.

**Were any ethical review processes conducted?**

No ethical review processes were conducted.

## Preprocessing/cleaning/labeling

**Was any preprocessing/cleaning/labeling of the data done?**

The only pre-processing applied is a lossless transformation of the JSONL format to the one preferred by the Danish Foundation Models project, including the addition of timestamps.

**Is the software used to preprocess/clean/label the instances available?**

The following Python script was used to convert this and the other DBC datasets:
```python
import datetime
import json
import subprocess

import tqdm

# Map each source JSONL file to the field holding its main text;
# None means the text has to be assembled from the structured "text" dict.
FILE_NAMES = {
    "abstracts.jsonl": "abstract",
    "faktalink.jsonl": None,
    "forfatterweb.jsonl": None,
    "reviews.jsonl": "content",
}

for file_name, text in tqdm.tqdm(FILE_NAMES.items()):
    # Count the lines up front so the inner progress bar has a total.
    length = int(subprocess.check_output(["wc", "-l", file_name]).split()[0])
    with open(f"dfm-{file_name}", "wt") as f:
        for line in tqdm.tqdm(open(file_name, "rt"), total=length, desc=file_name):
            d = json.loads(line)
            if text is None:
                # Assemble the text from the headline and the (heading, paragraphs)
                # pairs, skipping placeholder "empty_" headings.
                meta = d["metadata"]["@graph"][0]
                lines = [meta["headline"]]
                for key, items in d.pop("text").items():
                    if not key.startswith("empty_"):
                        lines.append(key)
                    lines.extend(items)
                i, t = meta["mainEntityOfPage"].split("//", maxsplit=1)[1], "\n".join(lines)
            else:
                i, t = d.pop("id"), d.pop(text)
            e = {
                "id": i,
                "text": t,
                "source": "dbc",
                "added": datetime.datetime.now().strftime("%Y-%m-%d"),
                "created": "1991-04-18, 2024-04-18",
                "metadata": d,
            }
            print(json.dumps(e), file=f)
```
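
For reference, the sketch below reads back one of the converted files and checks that each record follows the expected schema. It is only a minimal sketch, not part of the original pipeline: the file name `dfm-abstracts.jsonl` follows from the output naming in the script above, and the expected field set is taken from the record dictionary it writes.

```python
import json

# Minimal sanity check (not part of the original pipeline): verify that every
# record in the converted file carries exactly the expected top-level fields.
EXPECTED_KEYS = {"id", "text", "source", "added", "created", "metadata"}

with open("dfm-abstracts.jsonl", "rt") as f:
    for line in f:
        record = json.loads(line)
        assert set(record) == EXPECTED_KEYS, f"unexpected keys: {set(record)}"
        assert record["source"] == "dbc"
```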

## Uses

**Has the dataset been used for any tasks already?**

Yes, the dataset has been used to pre-train a Danish encoder-decoder model using a T5 architecture.

**Is there a repository that links to any or all papers or systems that use the dataset?**

No. However, as of 2024-06-05, no one else has used the dataset.

**What (other) tasks could the dataset be used for?**

The dataset contains high-quality texts, many of which are written in Danish. Thus, the dataset could be used for pre-training Danish decoder-only and encoder-only models.

**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?**

No.

## Distribution

**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**

The data will only be available within the entity for the duration of the project. Requests regarding access to the dataset should be directed to the data owner, DBC D1G1TAL.

### Citation
If you wish to cite this work, please see our GitHub page for an up-to-date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models


### References:

- [1] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

149 changes: 149 additions & 0 deletions docs/datasheets/dbc-faktalink.md
@@ -0,0 +1,149 @@
---
pretty_name: dbc-faktalink
language:
- da
license: other
license_name: agreement (public models, private data)
size_categories:
- 0.1-1k
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
---
# dbc-faktalink: all faktalink articles from DBC D1G1TAL

*Version*: 1.0.0

*Homepage*: https://github.com/centre-for-humanities-computing/danish-foundation-models

*License*: Not publicly available.

---

dbc-faktalink consists of more than 500 articles created by DBC D1G1TAL (formerly Dansk Bibliotekscenter).

All articles are written in Danish.

## Datasheet

Following the recommendation and framework of [1], we add the following datasheet.

### Motivation:

**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?**

The dataset was collected and created by DBC D1G1TAL A/S for their faktalink.dk website.

## Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?**

Instances that comprise this dataset represent articles on a variety of topics relevant to Danish society.

**How many instances are there in total (of each type, if appropriate)?**

There are 523 articles in this dataset.

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?**

The dataset contains all the instances underlying DBC D1G1TAL's faktalink.dk website.

**If the dataset is a sample from a larger set, what was the sampling strategy?**

There was no sampling involved.

**Who was involved in the data collection process?**

The data was collected by DBC D1G1TAL.

**Over what timeframe was the data collected?**

The dataset includes articles created between 1991 and 2024.

**Were any ethical review processes conducted?**

No ethical review processes were conducted.

## Preprocessing/cleaning/labeling

**Was any preprocessing/cleaning/labeling of the data done?**

The only pre-processing applied is a mostly lossless transformation of the JSONL format to the one preferred by the Danish Foundation Models project, including the addition of timestamps.

**Is the software used to preprocess/clean/label the instances available?**

The following Python script was used to convert this and the other DBC datasets:
```python
import datetime
import json
import subprocess

import tqdm

# Map each source JSONL file to the field holding its main text;
# None means the text has to be assembled from the structured "text" dict.
FILE_NAMES = {
    "abstracts.jsonl": "abstract",
    "faktalink.jsonl": None,
    "forfatterweb.jsonl": None,
    "reviews.jsonl": "content",
}

for file_name, text in tqdm.tqdm(FILE_NAMES.items()):
    # Count the lines up front so the inner progress bar has a total.
    length = int(subprocess.check_output(["wc", "-l", file_name]).split()[0])
    with open(f"dfm-{file_name}", "wt") as f:
        for line in tqdm.tqdm(open(file_name, "rt"), total=length, desc=file_name):
            d = json.loads(line)
            if text is None:
                # Assemble the text from the headline and the (heading, paragraphs)
                # pairs, skipping placeholder "empty_" headings.
                meta = d["metadata"]["@graph"][0]
                lines = [meta["headline"]]
                for key, items in d.pop("text").items():
                    if not key.startswith("empty_"):
                        lines.append(key)
                    lines.extend(items)
                i, t = meta["mainEntityOfPage"].split("//", maxsplit=1)[1], "\n".join(lines)
            else:
                i, t = d.pop("id"), d.pop(text)
            e = {
                "id": i,
                "text": t,
                "source": "dbc",
                "added": datetime.datetime.now().strftime("%Y-%m-%d"),
                "created": "1991-04-18, 2024-04-18",
                "metadata": d,
            }
            print(json.dumps(e), file=f)
```

## Uses

**Has the dataset been used for any tasks already?**

Yes, the dataset has been used to pre-train a Danish encoder-decoder model using a T5 architecture.

**Is there a repository that links to any or all papers or systems that use the dataset?**

No. However, as of 2024-06-05, no one else has used the dataset.

**What (other) tasks could the dataset be used for?**

The dataset contains high-quality texts, all of which are written in Danish. Thus, the dataset could be used for pre-training Danish decoder-only and encoder-only models.

**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?**

The title, headings, and paragraphs have been concatenated into one text separated by new lines, so the semantic information about what constitutes a heading or a paragraph is lost. This should be easy enough to recover in most cases, but may be hard where a heading ends in a punctuation mark; the sketch below illustrates the flattening.
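
To make the ambiguity concrete, the sketch below flattens a hypothetical record the same way the conversion script above does; the headline, heading, and paragraph strings are invented for illustration and are not taken from the actual data.

```python
# Hypothetical faktalink-style record; all strings are invented for illustration.
headline = "Example article"
text_sections = {
    "Background": ["A paragraph describing the topic's background."],
    "empty_1": ["A paragraph that has no heading."],
}

lines = [headline]
for key, items in text_sections.items():
    if not key.startswith("empty_"):
        lines.append(key)  # headings become plain lines
    lines.extend(items)    # paragraphs become plain lines as well

# Once joined, nothing marks which lines were headings and which were paragraphs.
flat_text = "\n".join(lines)
print(flat_text)
```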

## Distribution

**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**

The data will only be available within the entity for the duration of the project. Requests regarding access to the dataset should be directed to the data owner, DBC D1G1TAL.

### Citation
If you wish to cite this work, please see our GitHub page for an up-to-date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models


### References:

- [1] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
