Merge pull request #287 from centre-for-humanities-computing/dbc-datasets

Datasheets for the four datasets from DBC D1G1TAL.

Showing 4 changed files with 604 additions and 0 deletions.
---
pretty_name: dbc-abstracts
language:
- da
- en
- "no"
- sv
- de
license: other
license_name: agreement (public models, private data)
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
---
# dbc-abstracts: all abstracts from DBC D1G1TAL

*Version*: 1.0.0

*Homepage*: https://github.com/centre-for-humanities-computing/danish-foundation-models

*License*: Not publicly available.

---

dbc-abstracts consists of more than 11.6 million abstracts of books and other materials collected and created by DBC D1G1TAL (formerly Dansk Bibliotekscenter).

The dataset contains millions of abstracts in Danish, supplemented by abstracts in English, Norwegian, Swedish, German, and other languages.
## Datasheet

Following the recommendation and framework of [1], we add the following datasheet.

## Motivation

**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?**

The dataset was collected and created by DBC D1G1TAL A/S as one of the backbones for their catalogue of books and other materials.

## Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?**

Instances that comprise this dataset represent short abstracts of books or other materials.

**How many instances are there in total (of each type, if appropriate)?**

There are 11,663,988 abstracts in this dataset.

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?**

The dataset contains all the instances underlying DBC D1G1TAL's database.

**If the dataset is a sample from a larger set, what was the sampling strategy?**

There was no sampling involved.

**Who was involved in the data collection process?**

The data was collected by DBC D1G1TAL.

**Over what timeframe was the data collected?**

The dataset includes abstracts created between 1991 and 2024.

**Were any ethical review processes conducted?**

No ethical review processes were conducted.
## Preprocessing/cleaning/labeling

**Was any preprocessing/cleaning/labeling of the data done?**

The only preprocessing applied is a lossless transformation from the original JSONL format to the one preferred by the Danish Foundation Models project, including the addition of timestamps.

**Is the software used to preprocess/clean/label the instances available?**

The following Python script was used to convert this and the other DBC datasets:
```python
import datetime
import json
import subprocess
import tqdm

# Maps each input file to the JSON key that holds its text; None means the
# text has to be reassembled from the structured "text" field.
FILE_NAMES = {
    "abstracts.jsonl": "abstract",
    "faktalink.jsonl": None,
    "forfatterweb.jsonl": None,
    "reviews.jsonl": "content",
}

for file_name, text in tqdm.tqdm(FILE_NAMES.items()):
    # Count the lines up front so tqdm can show progress for the inner loop.
    length = int(subprocess.check_output(["wc", "-l", file_name]).split()[0])
    with open(f"dfm-{file_name}", "wt") as f:
        for line in tqdm.tqdm(open(file_name, "rt"), total=length, desc=file_name):
            d = json.loads(line)
            if text is None:
                # Reassemble the document from its headline, headings, and paragraphs;
                # keys starting with "empty_" are placeholders, not real headings.
                meta = d["metadata"]["@graph"][0]
                lines = [meta["headline"]]
                for key, items in d.pop("text").items():
                    if not key.startswith("empty_"):
                        lines.append(key)
                    lines.extend(items)
                # Use the page URL without its scheme as the document id.
                i, t = meta["mainEntityOfPage"].split("//", maxsplit=1)[1], "\n".join(lines)
            else:
                i, t = d.pop("id"), d.pop(text)
            e = {
                "id": i,
                "text": t,
                "source": "dbc",
                "added": datetime.datetime.now().strftime("%Y-%m-%d"),
                "created": "1991-04-18, 2024-04-18",
                "metadata": d,
            }
            print(json.dumps(e), file=f)
```
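For illustration, a single converted line might look like the record sketched below; the keys match the script above, but all field values here are hypothetical and not taken from the actual data:

```python
# Hypothetical example of one converted record in the Danish Foundation
# Models format. All values are invented for illustration only.
example = {
    "id": "dbc-abstract-00000001",         # identifier carried over from the source record
    "text": "En kort beskrivelse af bogens indhold.",
    "source": "dbc",
    "added": "2024-04-18",                 # date the record was converted
    "created": "1991-04-18, 2024-04-18",   # creation interval of the collection
    "metadata": {},                        # remaining fields of the original record
}
```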
## Uses

**Has the dataset been used for any tasks already?**

Yes, the dataset has been used to pre-train a Danish encoder-decoder model using a T5 architecture.

**Is there a repository that links to any or all papers or systems that use the dataset?**

No, but as of 2024-06-05 the dataset has not been used by anyone else.

**What (other) tasks could the dataset be used for?**

The dataset contains high-quality texts, many of which are written in Danish. Thus, the dataset could be used for pre-training Danish decoder-only and encoder-only models.

**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?**

No.
## Distribution

**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**

The data will only be available within the entity during the project. Requests regarding access to the dataset should be directed to the data owner, DBC D1G1TAL.

### Citation

If you wish to cite this work, please see our GitHub page for an up-to-date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
### References

- [1] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
---
pretty_name: dbc-faktalink
language:
- da
license: other
license_name: agreement (public models, private data)
size_categories:
- n<1K
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
---
# dbc-faktalink: all faktalink articles from DBC D1G1TAL

*Version*: 1.0.0

*Homepage*: https://github.com/centre-for-humanities-computing/danish-foundation-models

*License*: Not publicly available.

---

dbc-faktalink consists of more than five hundred articles created by DBC D1G1TAL (formerly Dansk Bibliotekscenter).

All articles are written in Danish.
## Datasheet

Following the recommendation and framework of [1], we add the following datasheet.

## Motivation

**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?**

The dataset was collected and created by DBC D1G1TAL A/S for their faktalink.dk website.

## Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?**

Instances that comprise this dataset represent articles on a variety of topics relevant to Danish society.

**How many instances are there in total (of each type, if appropriate)?**

There are 523 articles in this dataset.

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?**

The dataset contains all the instances underlying DBC D1G1TAL's faktalink.dk website.

**If the dataset is a sample from a larger set, what was the sampling strategy?**

There was no sampling involved.

**Who was involved in the data collection process?**

The data was collected by DBC D1G1TAL.

**Over what timeframe was the data collected?**

The dataset includes articles created between 1991 and 2024.

**Were any ethical review processes conducted?**

No ethical review processes were conducted.
## Preprocessing/cleaning/labeling

**Was any preprocessing/cleaning/labeling of the data done?**

The only preprocessing applied is a mostly lossless transformation from the original JSONL format to the one preferred by the Danish Foundation Models project, including the addition of timestamps.

**Is the software used to preprocess/clean/label the instances available?**

The following Python script was used to convert this and the other DBC datasets:
```python
import datetime
import json
import subprocess
import tqdm

# Maps each input file to the JSON key that holds its text; None means the
# text has to be reassembled from the structured "text" field.
FILE_NAMES = {
    "abstracts.jsonl": "abstract",
    "faktalink.jsonl": None,
    "forfatterweb.jsonl": None,
    "reviews.jsonl": "content",
}

for file_name, text in tqdm.tqdm(FILE_NAMES.items()):
    # Count the lines up front so tqdm can show progress for the inner loop.
    length = int(subprocess.check_output(["wc", "-l", file_name]).split()[0])
    with open(f"dfm-{file_name}", "wt") as f:
        for line in tqdm.tqdm(open(file_name, "rt"), total=length, desc=file_name):
            d = json.loads(line)
            if text is None:
                # Reassemble the document from its headline, headings, and paragraphs;
                # keys starting with "empty_" are placeholders, not real headings.
                meta = d["metadata"]["@graph"][0]
                lines = [meta["headline"]]
                for key, items in d.pop("text").items():
                    if not key.startswith("empty_"):
                        lines.append(key)
                    lines.extend(items)
                # Use the page URL without its scheme as the document id.
                i, t = meta["mainEntityOfPage"].split("//", maxsplit=1)[1], "\n".join(lines)
            else:
                i, t = d.pop("id"), d.pop(text)
            e = {
                "id": i,
                "text": t,
                "source": "dbc",
                "added": datetime.datetime.now().strftime("%Y-%m-%d"),
                "created": "1991-04-18, 2024-04-18",
                "metadata": d,
            }
            print(json.dumps(e), file=f)
```
## Uses

**Has the dataset been used for any tasks already?**

Yes, the dataset has been used to pre-train a Danish encoder-decoder model using a T5 architecture.

**Is there a repository that links to any or all papers or systems that use the dataset?**

No, but as of 2024-06-05 the dataset has not been used by anyone else.

**What (other) tasks could the dataset be used for?**

The dataset contains high-quality texts, all of which are written in Danish. Thus, the dataset could be used for pre-training Danish decoder-only and encoder-only models.

**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?**

The title, headings, and paragraphs have been concatenated into one text, separated by new lines. The semantic information of what constitutes a heading or a paragraph is thus lost. This should be easy enough to recover in most cases, but might be hard where a heading ends in a punctuation mark; see the sketch below.
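A minimal sketch of such a recovery heuristic, assuming that headings are short lines without terminal punctuation; this assumption about the data is ours, and the function is not part of the project's tooling:

```python
def tag_lines(text: str, max_heading_len: int = 80) -> list[tuple[str, str]]:
    """Heuristically label each line of a concatenated document as a heading
    or a paragraph. Assumes headings are short and lack terminal punctuation;
    headings that end in ".", "!", or "?" will be mislabelled."""
    tagged = []
    for line in text.split("\n"):
        # Short, unpunctuated, non-empty lines are treated as headings.
        is_heading = 0 < len(line) <= max_heading_len and not line.endswith((".", "!", "?"))
        tagged.append(("heading" if is_heading else "paragraph", line))
    return tagged
```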
## Distribution

**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**

The data will only be available within the entity during the project. Requests regarding access to the dataset should be directed to the data owner, DBC D1G1TAL.

### Citation

If you wish to cite this work, please see our GitHub page for an up-to-date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
### References

- [1] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.