Dec 2023 Update
droher committed Dec 5, 2023
1 parent 8fa2f43 commit 35bf42b
Showing 5 changed files with 14 additions and 9 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -102,3 +102,7 @@ venv.bak/
 
 # mypy
 .mypy_cache/
+
+etymology.csv
+etymology.csv.gz
+etymology.parquet
4 changes: 2 additions & 2 deletions README.md
@@ -1,10 +1,10 @@
 # etymology-db
-**Downloads:** (Last generated 2021-11-14)
+**Downloads:** (Last generated 2023-12-05)
 [**Gzipped CSV**](https://1drv.ms/u/s!AtpEocFNRNBWhAe7co0JFvac-OfA?e=wnJe4r)
 [**Parquet**](https://1drv.ms/u/s!AtpEocFNRNBWhhP6w5D9XfdtPH9I?e=jWRwnI)
 
 A structured, comprehensive, and multilingual etymology dataset created by parsing Wiktionary's etymology sections. Key features:
-* 3.8+ million etymological relationships between 1.8+ million terms in 2900+ languages/dialects
+* 4.2+ million etymological relationships between 2.0+ million terms in 3300+ languages/dialects
 * 31 different types of etymological relations, distinguishing between inheritance, borrowing, etc.
 * Hierarchical data that preserves relationship structures, such as the evolution of a term across languages
 
7 changes: 4 additions & 3 deletions main.py
@@ -3,7 +3,7 @@
 import csv
 import logging
 import re
-from multiprocessing import Pool
+from multiprocessing import Pool, freeze_support
 from datetime import datetime, timedelta
 from pathlib import Path
 from typing import Generator, List, Tuple
@@ -64,8 +64,9 @@ def write_all():
     elapsed = (datetime.now() - time)
     if elapsed.total_seconds() > 1:
         elapsed -= timedelta(microseconds=elapsed.microseconds)
-    print(f"Entries parsed: {entries_parsed} Time elapsed: {elapsed} "
-          f"Entries per second: {entries_parsed // elapsed.total_seconds()}{' ' * 10}", end="\r", flush=True)
+    if entries_parsed % 1000 == 0:
+        print(f"Entries parsed: {entries_parsed} Time elapsed: {elapsed} "
+              f"Entries per second: {entries_parsed // elapsed.total_seconds()}{' ' * 10}", end="\r", flush=True)
 
 
 def stream_terms() -> Generator[Tuple[str, str], None, None]:
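
The `main.py` change above does two things: it imports `freeze_support`, and it prints the progress line only on every 1000th entry instead of on every entry. The following is a minimal standalone sketch of that throttling pattern, not the repository's code. The helper names `format_progress` and `report_progress`, the `every` parameter, and the zero-elapsed guard are assumptions made for illustration, and the commit does not show where `freeze_support()` is actually called.

```python
from datetime import datetime, timedelta
from multiprocessing import freeze_support


def format_progress(entries_parsed: int, elapsed: timedelta) -> str:
    """Build the status line, dropping microseconds once a second has passed."""
    if elapsed.total_seconds() > 1:
        elapsed -= timedelta(microseconds=elapsed.microseconds)
    seconds = elapsed.total_seconds()
    # Guard against a zero-length interval (an addition in this sketch,
    # not something shown in the diff).
    rate = entries_parsed // seconds if seconds > 0 else 0
    return (f"Entries parsed: {entries_parsed} Time elapsed: {elapsed} "
            f"Entries per second: {rate}")


def report_progress(entries_parsed: int, start: datetime, every: int = 1000) -> bool:
    """Print a status line on every `every`-th entry; return True if printed."""
    if entries_parsed % every != 0:
        return False
    line = format_progress(entries_parsed, datetime.now() - start)
    # Trailing spaces overwrite leftovers from a longer previous line.
    print(line + " " * 10, end="\r", flush=True)
    return True


if __name__ == "__main__":
    # Assumed placement: has no effect unless the program is frozen
    # into an executable.
    freeze_support()
    start = datetime.now()
    for n in range(1, 5001):
        report_progress(n, start)
    print()
```

Printing with `flush=True` on every entry forces a terminal write per record, so limiting it to one write per 1000 entries plausibly cuts that overhead while keeping the display responsive.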
6 changes: 3 additions & 3 deletions requirements.txt
@@ -1,3 +1,3 @@
-lxml==4.9.1
-requests==2.26.0
-mwparserfromhell==0.6.3
+lxml==4.9.3
+requests==2.31.0
+mwparserfromhell==0.6.5
2 changes: 1 addition & 1 deletion templates.py
@@ -8,7 +8,7 @@
 from elements import Etymology
 
 
-unparsed_templates = Manager().dict()
+unparsed_templates = dict()
 
 class RelType(Enum):
     Inherited = "inherited_from"
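
The `templates.py` change above replaces a `Manager().dict()` proxy with a plain `dict()`. A plain module-level dict is not shared between worker processes, so anything a `Pool` worker writes into it stays in that worker's copy. The commit does not show whether or how such tallies are collected afterwards. One common pattern, offered here purely as a hypothetical sketch (the names `count_unparsed` and `merge_counts` and the sample data are not from the repository), is to have each worker return its own tally and merge the results in the parent:

```python
from collections import Counter
from multiprocessing import Pool, freeze_support
from typing import Iterable, List


def count_unparsed(template_names: List[str]) -> Counter:
    """Hypothetical worker: tally the template names seen in one chunk."""
    return Counter(template_names)


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Combine per-worker tallies into a single Counter in the parent."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    freeze_support()
    # Made-up sample chunks, for illustration only.
    chunks = [["foo", "bar"], ["foo"], ["baz", "foo"]]
    with Pool(2) as pool:
        unparsed = merge_counts(pool.imap_unordered(count_unparsed, chunks))
    print(unparsed.most_common())
```

Returning values from workers avoids the inter-process round trip that each access to a `Manager` proxy incurs, at the cost of merging once at the end.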