
Commit

Merge branch 'main' into convert_domsdatabasen
peterbjorgensen committed May 30, 2024
2 parents 620e1e8 + 59943cb commit 08cd24a
Showing 62 changed files with 4,856 additions and 496 deletions.
8 changes: 0 additions & 8 deletions .vscode/settings.json

This file was deleted.

20 changes: 4 additions & 16 deletions README.md
@@ -43,19 +43,7 @@ You can contribute both:
- Validation tasks can even be private benchmarks where you only wish to share the performance metrics.
- And probably in many other ways

## Setting up development environment
### Method 1: Dev container
By far the easiest way is to use our included development container. If you're using VSCode:

* Ensure you have either [Orbstack](https://orbstack.dev) or [Docker](https://docker.com) installed
* Press this button: [![Open in Dev Container](https://img.shields.io/static/v1?label=Dev%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/centre-for-humanities-computing/danish-foundation-models/)
* Select "From Dockerfile"
* Press "OK" on the feature screen

### Method 2: Manual install
Install the project manually, replicating the commands in `Dockerfile.dev`.

## Current Contributors and Collaborators
This project has collaborators across industry, national institutions, and research centers. It uses compute resources supplied by [Ucloud](https://docs.cloud.sdu.dk/index.html) through the [DeiC e-infrastructure grant](https://www.deic.dk/en/supercomputing/Apply-for-HPC-resources).


# For Contributors
| | |
| ---------------------------------------------------- | ----------------------------------- |
| 🗣 [**Adding a dataset**](/docs/Adding_a_new_dataset) | A guide on how to add a new dataset |
61 changes: 61 additions & 0 deletions data-processing/configs/2024-v1/README.md
@@ -0,0 +1,61 @@
# A Common Crawl dataset + (dagw, lexdk, scandi-reddit) with a minimal amount of cleaning

The cleaning is run on ucloud `Terminal Ubuntu Jan2024`.


## Load Dependencies

```bash
module load Python/3.11.5-GCCcore-13.2.0
sudo apt-get update
# rust should be at least 1.72 (1.70 does not work)
sudo apt-get install rustc cargo
export GIT_SSH_COMMAND='ssh -i PATH/TO/PRIVATE/SSH_KEY -o IdentitiesOnly=yes'
```

## Install Data Processing Toolkit
```bash
git clone https://github.com/centre-for-humanities-computing/danish-foundation-models.git
cd danish-foundation-models/data-processing
python -m venv venv
source venv/bin/activate
pip install -e .
```

## Run Taggers
```bash
cd configs/2024-v1
```

Run the URL blocklist tagger:
```bash
dolma -c dolma_run_url_taggers_mc4da_hplt.yaml tag
```

Run paragraph-level deduplication:

```bash
dolma -c dolma_dedupe_v1.yaml dedupe
```

## Mix Dataset

Since we did not run the URL tagger on the non-common-crawl datasets, we work around this by creating empty placeholder attributes files.
For future datasets this should instead be handled in the mixer by configuring a separate stream (with its own attributes list) for those sources.
```bash
mkdir /work/dfm-data/pre-training/dagw/v1blockurltaggers/
mkdir /work/dfm-data/pre-training/scandi-reddit/v1blockurltaggers/
mkdir /work/dfm-data/pre-training/lexdk/v1blockurltaggers/
touch /work/dfm-data/pre-training/dagw/v1blockurltaggers/data.jsonl
touch /work/dfm-data/pre-training/scandi-reddit/v1blockurltaggers/scandi-reddit.jsonl
touch /work/dfm-data/pre-training/lexdk/v1blockurltaggers/lexdk_articles.jsonl
gzip /work/dfm-data/pre-training/dagw/v1blockurltaggers/data.jsonl
gzip /work/dfm-data/pre-training/scandi-reddit/v1blockurltaggers/scandi-reddit.jsonl
gzip /work/dfm-data/pre-training/lexdk/v1blockurltaggers/lexdk_articles.jsonl
```

Finally, mix the dataset:

```bash
dolma -c mix.yaml mix
```
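
To sanity-check the mixed output, you can count the documents and inspect the fields of the first record. This is only a rough sketch; the path is the `output.path` configured in `mix.yaml`:

```python
import glob
import gzip
import json

# Count mixed documents and print the fields of the first record.
# The path below is the output.path configured in mix.yaml.
n_docs = 0
for shard in sorted(glob.glob("/work/dfm-data/pre-training-clean/2024-v1/documents/*.gz")):
    with gzip.open(shard, "rt") as fh:
        for line in fh:
            if n_docs == 0:
                print(sorted(json.loads(line).keys()))
            n_docs += 1
print(f"{n_docs} documents")
```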
19 changes: 19 additions & 0 deletions data-processing/configs/2024-v1/dolma_dedupe_v1.yaml
@@ -0,0 +1,19 @@
bloom_filter:
  desired_false_positive_rate: 1.0e-08
  estimated_doc_count: 1_000_000_000
  #size_in_bytes: 100_000_000
  file: /tmp/deduper_bloom_filter_v1.bin
  read_only: false
dedupe:
  name: bff_duplicate_paragraph_spans
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true
documents:
  - /work/dfm-data/pre-training/lexdk/documents/*.jsonl.gz
  - /work/dfm-data/pre-training/scandi-reddit/documents/*.jsonl.gz
  - /work/dfm-data/pre-training/hplt/documents/*.jsonl.gz
  - /work/dfm-data/pre-training/dagw/documents/*.jsonl.gz
  - /work/dfm-data/pre-training/mC4_da/documents/*.json.gz
  - /work/dfm-data/pre-training/ncc/documents/*.jsonl.gz
processes: 16
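
As a back-of-the-envelope check, the textbook Bloom-filter formula gives the filter size implied by `estimated_doc_count` and `desired_false_positive_rate` above. This is only a rough estimate, not necessarily how dolma sizes the filter internally:

```python
import math

# Rough Bloom-filter sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
n = 1_000_000_000   # estimated_doc_count
p = 1.0e-08         # desired_false_positive_rate

bits = -n * math.log(p) / math.log(2) ** 2
print(f"filter size: ~{bits / 8 / 1e9:.1f} GB")                     # roughly 4.8 GB
print(f"optimal hash functions: {round(bits / n * math.log(2))}")   # about 27
```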
43 changes: 43 additions & 0 deletions data-processing/configs/2024-v1/dolma_run_url_taggers_mc4da_hplt.yaml
@@ -0,0 +1,43 @@
debug: false
destination: null
documents:
  - /work/dfm-data/pre-training/mC4_da/documents/*.json.gz
  - /work/dfm-data/pre-training/hplt/documents/*.jsonl.gz
  - /work/dfm-data/pre-training/ncc/documents/*.jsonl.gz
dryrun: false
experiment: v1blockurltaggers
ignore_existing: false
processes: 4
profile:
  enable: false
  lines: 100
  output: null
  sort_key: tottime
  steps: null
taggers:
  - domain_blocklist_phishing_v1
  - domain_blocklist_utp_v1
  #- oisd_big_abp_v1
  #- oisd_nsfw_abp_v1
  #- oisd_big_abp_v1
  #- blocklist_firebog_ads_v1
  #- blocklist_firebog_crypto_v1
  #- blocklist_firebog_malicious_v1
  #- blocklist_firebog_nsfw_v1
  #- blocklist_firebog_social_v1
  #- blocklist_firebog_suspicious_v1
  #- blocklist_firebog_trackers_v1
  #- blocklist_hosts_adware_malware_v1
  #- blocklist_hosts_fakenews_v1
  #- blocklist_hosts_gambling_v1
  #- blocklist_hosts_porn_v1
  #- blocklist_hosts_social_v1
  #- blocklist_project_ads_v1
  #- blocklist_project_crime_v1
  #- blocklist_project_nsfw_v1
  #- blocklist_project_social_v1
  #- blocklist_project_vice_v1
  #- brave_core_abp_v1
  #- brave_nsfw_abp_v1
  #- allowlist_wikidata_cleaned_v1
  #- allowlist_wikidata_v1
25 changes: 25 additions & 0 deletions data-processing/configs/2024-v1/mix.yaml
@@ -0,0 +1,25 @@
streams:
  - name: munin_alpha_v0.1_zero_cleaned
    documents:
      - /work/dfm-data/pre-training/dagw/documents/*.gz
      - /work/dfm-data/pre-training/hplt/documents/*.gz
      - /work/dfm-data/pre-training/lexdk/documents/*.gz
      - /work/dfm-data/pre-training/mC4_da/documents/*.gz
      - /work/dfm-data/pre-training/scandi-reddit/documents/*.gz
    output:
      path: /work/dfm-data/pre-training-clean/2024-v1/documents
      max_size_in_bytes: 1_000_000_000
    attributes:
      - v1blockurltaggers
      - bff_duplicate_paragraph_spans
    filter:
      # Remove documents that are in the utp domain blocklist
      exclude:
        - "$@.attributes[?(@.v1blockurltaggers__domain_blocklist_utp_v1__url && @.v1blockurltaggers__domain_blocklist_utp_v1__url[0] && @.v1blockurltaggers__domain_blocklist_utp_v1__url[0][2] >=1.0)]"
    # Replace duplicate lines with empty string
    span_replacement:
      - span: "$.attributes.bff_duplicate_paragraph_spans"
        min_score: 0.5
        replacement: ''

processes: 16
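
The `exclude` expression keys off the dolma attribute layout, where each tagger attribute is (to the best of our understanding) a list of `[start, end, score]` spans. The snippet below is a plain-Python illustration of the same check applied to a made-up attributes row:

```python
import json

# Hypothetical attributes row for one document, as a URL blocklist tagger might write it;
# each attribute is assumed to be a list of [start, end, score] spans.
row = json.loads(
    '{"id": "doc-1", "attributes": '
    '{"v1blockurltaggers__domain_blocklist_utp_v1__url": [[0, 57, 1.0]]}}'
)

spans = row["attributes"].get("v1blockurltaggers__domain_blocklist_utp_v1__url")
# Equivalent of the JSONPath filter above: drop the document when the first span scores >= 1.0.
if spans and spans[0] and spans[0][2] >= 1.0:
    print("document would be excluded")
```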
6 changes: 5 additions & 1 deletion data-processing/pyproject.toml
@@ -22,11 +22,15 @@ dependencies = [
"requests>=2.31.0",
"polars>0.19.1",
# Data cleaning dependencies:
"dolma[pii,code]@git+https://github.com/allenai/dolma.git@5b8109d718f1e69a87094623bca109aee1c33378", # install from git until 1.0.2 is released
"dolma[pii,code]@git+https://github.com/allenai/dolma.git@476629dc4d8d804dd2123509dc48b549e6b49dfb", # Install from git until a 1.0.2 package is released
"kenlm>=0.2.0", # Used for perplexity tagging
"blingfire>=0.1.8", # Used for perplexity tagging
"mosaicml-streaming",
"orjsonl",
"tqdm",
"zstandard",
"nlp_dedup",
"pyyaml",
]

[project.optional-dependencies]
85 changes: 85 additions & 0 deletions data-processing/scripts/convert_ai_aktindsigt_to_jsonlgz.py
@@ -0,0 +1,85 @@
"""
Script for downloading the AI-aktindsigt (Sønderborg kommune) dataset and
converting it to jsonl.gz.
The data is in xlsx format
"""
import datetime
import subprocess
from pathlib import Path
from typing import Any, Iterable

import pandas

# Git lfs must be installed
# git clone https://huggingface.co/datasets/AI-aktindsigt/Skrabet_kommunale_hjemmesider


def assume_same(x: Iterable[Any]):
    iterator = iter(x)
    first = next(iterator)

    for item in iterator:
        if pandas.isna(first):
            assert pandas.isna(item)
        else:
            assert first == item

    return first


def main():
    subprocess.run(("git", "clone", "https://huggingface.co/datasets/AI-aktindsigt/Skrabet_kommunale_hjemmesider"))

    dfs = []
    for path in sorted(Path("Skrabet_kommunale_hjemmesider").glob("*.xlsx")):
        print(f"Reading {path}")

        dtype = {
            "text": str,
            "sentence": int,
            "kommune": str,
            "url": str,
            #"klassifikation": int, # This column seems to be unused
            "sha512": str,
            "ppl_score": float,
        }
        df = pandas.read_excel(path, header=1, dtype=dtype)
        df.dropna(subset=["text"], inplace=True)  # Drop empty sentences
        # Convert all column names to lowercase (in some sheets URL is in capital letters)
        df.columns = map(str.lower, df.columns)

        dfs.append(df)

    megaframe = pandas.concat(dfs)

    print("Grouping by [id, sha512]")
    groups = megaframe.groupby(by=["id", "sha512"])

    agg_funcs = {
        "text": "\n".join,  # join the sentences with newlines.
        "sentence": lambda x: max(x) + 1,
        "kommune": assume_same,
        "url": assume_same,
        #"sha512": assume_same,
        "ppl_score": lambda x: [float(a) for a in x],
    }
    print("Aggregating frame")
    df = groups.agg(agg_funcs)

    print("Reshaping into dolma format")
    df["id"] = df.apply(lambda row: f"{row.name[0]}_{row.name[1]}", axis=1)
    df["sha512"] = df.apply(lambda row: str(row.name[1]), axis=1)
    df["source"] = "ai_aktindsigt"
    df["added"] = datetime.datetime.now(datetime.UTC).strftime("%Y-%m-%d")
    df["created"] = "1970-01-01, 2024-04-01"  # best guess creation time, between 1970 and release time

    metadata_keys = ["url", "kommune", "sentence", "ppl_score", "sha512"]
    df["metadata"] = df.apply(lambda row: {k: row[k] for k in metadata_keys}, axis=1)
    df.drop(columns=metadata_keys, inplace=True)

    print("Writing to file")
    df.to_json("ai_aktindsigt.jsonl.gz", orient="records", lines=True)
    print("Done")


if __name__ == "__main__":
    main()
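
A quick way to verify the conversion is a sketch like the following, which reads back the file the script writes and prints the fields of the first record:

```python
import gzip
import json

# The converted file should contain dolma-style documents with
# id, text, source, added, created and metadata fields.
with gzip.open("ai_aktindsigt.jsonl.gz", "rt") as fh:
    first = json.loads(next(fh))
print(sorted(first.keys()))
```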