Commit: Merge branch 'main' into convert_domsdatabasen
Showing 62 changed files with 4,856 additions and 496 deletions.
@@ -0,0 +1,61 @@
# A Common Crawl dataset + (dagw, lexdk, scandi-reddit) with a minimal amount of cleaning

The cleaning is run on ucloud (`Terminal Ubuntu Jan2024`).

## Load Dependencies

```bash
module load Python/3.11.5-GCCcore-13.2.0
sudo apt-get update
# rust should be at least 1.72 (1.70 does not work)
sudo apt-get install rustc cargo
export GIT_SSH_COMMAND='ssh -i PATH/TO/PRIVATE/SSH_KEY -o IdentitiesOnly=yes'
```

## Install Data Processing Toolkit

```bash
git clone https://github.com/centre-for-humanities-computing/danish-foundation-models.git
cd danish-foundation-models/data-processing
python -m venv venv
source venv/bin/activate
pip install -e .
```
## Run Taggers

```bash
cd configs/2024-v1
```

Run the URL blocklist taggers:

```bash
dolma -c dolma_run_url_taggers_mc4da_hplt.yaml tag
```
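
The taggers write their results to attribute files alongside the documents, under the experiment name `v1blockurltaggers`. For orientation, here is a hedged sketch of what a single attribute record might look like; the `{experiment}__{tagger}__{field}` key naming and the `[start, end, score]` span layout are assumptions inferred from the filter expression in the mixer config later in this commit, not copied from dolma's documentation.

```python
# Hypothetical attribute record for one document, under the assumptions stated above.
import json

record = {
    "id": "example-doc-0001",
    "attributes": {
        # score 1.0 is assumed to mean: the document URL matched the UTP blocklist
        "v1blockurltaggers__domain_blocklist_utp_v1__url": [[0, 24, 1.0]],
        "v1blockurltaggers__domain_blocklist_phishing_v1__url": [[0, 24, 0.0]],
    },
}
print(json.dumps(record, indent=2))
```
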
Run paragraph-level deduplication:

```bash
dolma -c dolma_dedupe_v1.yaml dedupe
```

## Mix Dataset

Since we did not run the URL taggers on the non-Common-Crawl datasets, we work around this by creating empty placeholder attribute files for them. In future datasets this should instead be configured in the mixer by using separate streams.

```bash
mkdir /work/dfm-data/pre-training/dagw/v1blockurltaggers/
mkdir /work/dfm-data/pre-training/scandi-reddit/v1blockurltaggers/
mkdir /work/dfm-data/pre-training/lexdk/v1blockurltaggers/
touch /work/dfm-data/pre-training/dagw/v1blockurltaggers/data.jsonl
touch /work/dfm-data/pre-training/scandi-reddit/v1blockurltaggers/scandi-reddit.jsonl
touch /work/dfm-data/pre-training/lexdk/v1blockurltaggers/lexdk_articles.jsonl
gzip /work/dfm-data/pre-training/dagw/v1blockurltaggers/data.jsonl
gzip /work/dfm-data/pre-training/scandi-reddit/v1blockurltaggers/scandi-reddit.jsonl
gzip /work/dfm-data/pre-training/lexdk/v1blockurltaggers/lexdk_articles.jsonl
```

Finally, mix the dataset:

```bash
dolma -c mix.yaml mix
```
@@ -0,0 +1,19 @@

```yaml
bloom_filter:
  desired_false_positive_rate: 1.0e-08
  estimated_doc_count: 1_000_000_000
  #size_in_bytes: 100_000_000
  file: /tmp/deduper_bloom_filter_v1.bin
  read_only: false
dedupe:
  name: bff_duplicate_paragraph_spans
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true
documents:
- /work/dfm-data/pre-training/lexdk/documents/*.jsonl.gz
- /work/dfm-data/pre-training/scandi-reddit/documents/*.jsonl.gz
- /work/dfm-data/pre-training/hplt/documents/*.jsonl.gz
- /work/dfm-data/pre-training/dagw/documents/*.jsonl.gz
- /work/dfm-data/pre-training/mC4_da/documents/*.json.gz
- /work/dfm-data/pre-training/ncc/documents/*.jsonl.gz
processes: 16
```
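
To make the intent of this config concrete, below is a minimal sketch of paragraph-level deduplication in plain Python. It uses an exact `set` where dolma uses the on-disk Bloom filter configured above (which trades exactness for bounded memory at the configured false-positive rate), so it only illustrates the idea of tagging previously seen paragraphs as duplicate spans, not dolma's actual implementation.

```python
# Minimal sketch, not dolma's implementation: tag paragraphs that were already
# seen in earlier documents, analogous to bff_duplicate_paragraph_spans.
seen: set[str] = set()  # stand-in for the Bloom filter at /tmp/deduper_bloom_filter_v1.bin


def tag_duplicate_paragraphs(text: str) -> list[tuple[int, int, float]]:
    """Return (start, end, score) spans for paragraphs that were seen before."""
    spans = []
    offset = 0
    for paragraph in text.split("\n"):
        end = offset + len(paragraph)
        if paragraph.strip():  # skip_empty: true
            if paragraph in seen:
                spans.append((offset, end, 1.0))
            else:
                seen.add(paragraph)
        offset = end + 1  # account for the newline
    return spans


docs = ["first paragraph\nshared paragraph", "shared paragraph\nunique paragraph"]
for doc in docs:
    print(tag_duplicate_paragraphs(doc))  # second doc yields [(0, 16, 1.0)]
```
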
data-processing/configs/2024-v1/dolma_run_url_taggers_mc4da_hplt.yaml (43 additions, 0 deletions)
@@ -0,0 +1,43 @@

```yaml
debug: false
destination: null
documents:
- /work/dfm-data/pre-training/mC4_da/documents/*.json.gz
- /work/dfm-data/pre-training/hplt/documents/*.jsonl.gz
- /work/dfm-data/pre-training/ncc/documents/*.jsonl.gz
dryrun: false
experiment: v1blockurltaggers
ignore_existing: false
processes: 4
profile:
  enable: false
  lines: 100
  output: null
  sort_key: tottime
  steps: null
taggers:
- domain_blocklist_phishing_v1
- domain_blocklist_utp_v1
#- oisd_big_abp_v1
#- oisd_nsfw_abp_v1
#- oisd_big_abp_v1
#- blocklist_firebog_ads_v1
#- blocklist_firebog_crypto_v1
#- blocklist_firebog_malicious_v1
#- blocklist_firebog_nsfw_v1
#- blocklist_firebog_social_v1
#- blocklist_firebog_suspicious_v1
#- blocklist_firebog_trackers_v1
#- blocklist_hosts_adware_malware_v1
#- blocklist_hosts_fakenews_v1
#- blocklist_hosts_gambling_v1
#- blocklist_hosts_porn_v1
#- blocklist_hosts_social_v1
#- blocklist_project_ads_v1
#- blocklist_project_crime_v1
#- blocklist_project_nsfw_v1
#- blocklist_project_social_v1
#- blocklist_project_vice_v1
#- brave_core_abp_v1
#- brave_nsfw_abp_v1
#- allowlist_wikidata_cleaned_v1
#- allowlist_wikidata_v1
```
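
The two active taggers check each document's URL against domain blocklists (a phishing list and the UTP list). The sketch below shows the rough idea under stated assumptions: a made-up blocklist, a single span covering the whole URL, and score 1.0 on a hit. dolma's real taggers are more involved (blocklist download, parsing, caching), so treat this purely as an illustration.

```python
# Illustrative only: a toy domain-blocklist check, not dolma's tagger code.
from urllib.parse import urlparse

BLOCKLIST = {"blocked.example", "spam.example"}  # hypothetical blocklist entries


def tag_url(url: str) -> list[list[float]]:
    """Return one [start, end, score] span over the URL; score 1.0 if its domain is blocked."""
    domain = urlparse(url).netloc.lower()
    # Also match parent domains, so "www.blocked.example" hits "blocked.example".
    parts = domain.split(".")
    candidates = {".".join(parts[i:]) for i in range(len(parts))}
    score = 1.0 if candidates & BLOCKLIST else 0.0
    return [[0, len(url), score]]


print(tag_url("https://www.blocked.example/page"))   # [[0, 32, 1.0]]
print(tag_url("https://fine.example.org/article"))   # [[0, 32, 0.0]]
```
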
@@ -0,0 +1,25 @@

```yaml
streams:
  - name: munin_alpha_v0.1_zero_cleaned
    documents:
      - /work/dfm-data/pre-training/dagw/documents/*.gz
      - /work/dfm-data/pre-training/hplt/documents/*.gz
      - /work/dfm-data/pre-training/lexdk/documents/*.gz
      - /work/dfm-data/pre-training/mC4_da/documents/*.gz
      - /work/dfm-data/pre-training/scandi-reddit/documents/*.gz
    output:
      path: /work/dfm-data/pre-training-clean/2024-v1/documents
      max_size_in_bytes: 1_000_000_000
    attributes:
      - v1blockurltaggers
      - bff_duplicate_paragraph_spans
    filter:
      # Remove documents that are in the utp domain blocklist
      exclude:
        - "$@.attributes[?(@.v1blockurltaggers__domain_blocklist_utp_v1__url && @.v1blockurltaggers__domain_blocklist_utp_v1__url[0] && @.v1blockurltaggers__domain_blocklist_utp_v1__url[0][2] >=1.0)]"
    # Replace duplicate lines with empty string
    span_replacement:
      - span: "$.attributes.bff_duplicate_paragraph_spans"
        min_score: 0.5
        replacement: ''

processes: 16
```
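
Read together, the two rules say: drop a document whose UTP blocklist span has a score of at least 1.0, and blank out any paragraph span tagged as a duplicate with a score of at least 0.5. Below is a minimal hand-rolled sketch of that logic for a single record; it stands in for the JSONPath evaluation that dolma's mixer actually performs, and the attribute names follow the assumed convention from earlier.

```python
# Sketch of the mixer's filter + span_replacement logic for one document record.
# This is an illustration, not dolma's mixer code.

def mix_one(doc: dict, attrs: dict) -> dict | None:
    """Return the cleaned document, or None if it is excluded."""
    url_spans = attrs.get("v1blockurltaggers__domain_blocklist_utp_v1__url", [])
    if url_spans and url_spans[0][2] >= 1.0:  # exclude rule
        return None

    text = doc["text"]
    # Replace duplicate paragraph spans (score >= min_score 0.5) with ''.
    # Apply right-to-left so earlier offsets stay valid.
    dup_spans = attrs.get("bff_duplicate_paragraph_spans", [])
    for start, end, score in sorted(dup_spans, reverse=True):
        if score >= 0.5:
            text = text[:start] + text[end:]
    return {**doc, "text": text}


doc = {"id": "x", "text": "keep this\nduplicate line"}
attrs = {"bff_duplicate_paragraph_spans": [[10, 24, 1.0]]}
print(mix_one(doc, attrs))  # {'id': 'x', 'text': 'keep this\n'}
```
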
data-processing/scripts/convert_ai_aktindsigt_to_jsonlgz.py (85 additions, 0 deletions)
@@ -0,0 +1,85 @@

```python
"""
Script for downloading the AI-aktindsigt (Sønderborg kommune) dataset and
converting it to jsonl.gz.
The source data is in xlsx format.
"""
import datetime
import subprocess
from pathlib import Path
from typing import Any, Iterable

import pandas

# Git LFS must be installed:
# git clone https://huggingface.co/datasets/AI-aktindsigt/Skrabet_kommunale_hjemmesider


def assume_same(x: Iterable[Any]):
    """Aggregate a group by asserting that all values are identical and returning the first."""
    iterator = iter(x)
    first = next(iterator)

    for item in iterator:
        if pandas.isna(first):
            assert pandas.isna(item)
        else:
            assert first == item

    return first


def main():
    subprocess.run(("git", "clone", "https://huggingface.co/datasets/AI-aktindsigt/Skrabet_kommunale_hjemmesider"))

    dfs = []
    for path in sorted(Path("Skrabet_kommunale_hjemmesider").glob("*.xlsx")):
        print(f"Reading {path}")

        dtype = {
            "text": str,
            "sentence": int,
            "kommune": str,
            "url": str,
            # "klassifikation": int,  # This column seems to be unused
            "sha512": str,
            "ppl_score": float,
        }
        df = pandas.read_excel(path, header=1, dtype=dtype)
        df.dropna(subset=["text"], inplace=True)  # Drop empty sentences
        # Convert all column names to lowercase (in some sheets URL is in capital letters)
        df.columns = map(str.lower, df.columns)

        dfs.append(df)

    megaframe = pandas.concat(dfs)

    print("Grouping by [id, sha512]")
    groups = megaframe.groupby(by=["id", "sha512"])

    agg_funcs = {
        "text": "\n".join,  # join the sentences with newlines
        "sentence": lambda x: max(x) + 1,  # keep max sentence index + 1
        "kommune": assume_same,
        "url": assume_same,
        # "sha512": assume_same,
        "ppl_score": lambda x: [float(a) for a in x],
    }
    print("Aggregating frame")
    df = groups.agg(agg_funcs)

    print("Reshaping into dolma format")
    df["id"] = df.apply(lambda row: f"{row.name[0]}_{row.name[1]}", axis=1)
    df["sha512"] = df.apply(lambda row: str(row.name[1]), axis=1)
    df["source"] = "ai_aktindsigt"
    df["added"] = datetime.datetime.now(datetime.UTC).strftime("%Y-%m-%d")
    df["created"] = "1970-01-01, 2024-04-01"  # best-guess creation time, between 1970 and release time

    metadata_keys = ["url", "kommune", "sentence", "ppl_score", "sha512"]
    df["metadata"] = df.apply(lambda row: {k: row[k] for k in metadata_keys}, axis=1)
    df.drop(columns=metadata_keys, inplace=True)

    print("Writing to file")
    df.to_json("ai_aktindsigt.jsonl.gz", orient="records", lines=True)
    print("Done")


if __name__ == "__main__":
    main()
```
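
A quick sanity check of the output (assuming the script was run in the current working directory and produced `ai_aktindsigt.jsonl.gz`) is to read the file back with pandas and confirm the dolma-style document fields are present:

```python
# Hypothetical sanity check of the converted file; assumes ai_aktindsigt.jsonl.gz
# exists in the working directory after running the script above.
import pandas

df = pandas.read_json("ai_aktindsigt.jsonl.gz", orient="records", lines=True)
expected = {"id", "text", "source", "added", "created", "metadata"}
assert expected <= set(df.columns), f"missing columns: {expected - set(df.columns)}"
print(df.iloc[0][["id", "source", "added"]])
print(df.iloc[0]["metadata"])  # dict with url, kommune, sentence, ppl_score, sha512
```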