
Commit 1c90dd3

Refactor s3 (#75)

* Add path_within_bucket
* Use MasterScrapper for duplicate_vectorfile_ign
* Add a function to write the md5 to S3
* Add support for s3fs access
* Allow loading a .env file with python-dotenv (keys = token, key, secret)
* Black formatting
* Fix logger call
* Fix undefined name 'logger'
* Add black formatting to utils.dict_update.py
* Small refactor of Dataset
* Reset update_json_md5 as a Dataset method
* Add fs argument for instantiation of Dataset
* Fix Dataset docstring
* Fix bug in Dataset when download is prevented by an md5 match
* Temporary fix in s3/s3.py for multiple s3fs creation
* Fix duplicate_vectorfile_ign when the file is already up to date on s3
* Move constants creation to package init
* Update download.py
* Update s3.py
* Update __init__.py
* Update misc.write_s3
* Update docstrings + notes
* Notes/TODO on s3
* Fix exception on missing file in json
* Update write_s3.py
* Add logging configuration
* Update write_s3.py
* Reset os.chdir('cartiflette') just in case
* Move utils from s3: refactor functions to get path (both from web access and from inside s3)
* Move public functions into ad hoc subpackage
* Fix typo
* Start refactoring s3
* Update dev.py: Black formatting
* Fix typo
* Default year in download.dev
* Update download.py: fix default year in download.download.py
* Set current year as default everywhere
* Clean up corrupt files after download
* Unfinished refactoring
* Remove geometry sanitization
* Remove unnecessary functions in s3
* Fix mockups _get_last_md5
* Add magic file detection and CachedSession
* Update .gitignore
* Add CSV support (COG Insee)
* Update sources.yaml
* Update download.py
* Specify custom filetype for output
* Create csv_magic.py utility for unknown csv reading
* Refactor download: use requests-cache; refactor yaml; rename "field" argument in yaml to "territory"; handle zip; handle nested zip/7zip; handle CSV/DBF pattern (not only shapefiles); refactor tests with CachedSession patching; split download into multiple files (download, scraper, dataset)
* Add poetry and pytest to CI
* Set os-specific dependency
* Fix check test
* Add incomplete s3 refactor for building purposes
* Add feedback to test
* Fix proxy error on github tests
* Differentiate job names
* Clean up files unused since the switch to poetry
* Fix copy/paste duplicates
* Merge / upgrade standard patches on bucket: set a config file that centralizes all constants related to s3fs
* Move create_path_bucket test to separate test
* Full download pipeline
* Fix bug on pipeline with year as int
* Add configuration option for tqdm
* Update config.py
* Recreate base geodataframes directly from s3
* Remove dev
* Refactor s3 (for a start...)
* Fix _download_sources import in tests

---------

Co-authored-by: linogaliana <lino.galiana@insee.fr>
Co-authored-by: thomas.grandjean <thomas.grandjean@developpement-durable.gouv.fr>
1 parent 80b8a5a commit 1c90dd3

34 files changed: +3614 -2468 lines changed

.github/workflows/check.yml

Lines changed: 12 additions & 8 deletions
@@ -3,7 +3,7 @@ name: Test Python package
 on: [push]
 
 jobs:
-  build-linux:
+  testing:
     runs-on: ubuntu-latest
     strategy:
       max-parallel: 5
@@ -14,16 +14,20 @@ jobs:
       uses: actions/setup-python@v3
       with:
         python-version: "3.10"
+    - name: Add libmagic for python-magic on linux
+      run: sudo apt-get install libmagic1
+    - name: Install Poetry
+      uses: snok/install-poetry@v1
     - name: Install dependencies
       run: |
-        pip install -r requirements.txt
-        pip install .
+        poetry install --without dev
+        poetry add pytest
     - name: Test import
       run: |
         export AWS_ACCESS_KEY_ID=${{ secrets.S3_ACCESS_KEY }}
         export AWS_SECRET_ACCESS_KEY=${{ secrets.S3_SECRET_KEY }}
-        python example/download.py
-    # - name: Test with pytest
-    #   run: |
-    #     conda install pytest
-    #     pytest
+        # python example/download.py
+    - name: Test with pytest
+      run: |
+        poetry run pytest
+

.github/workflows/lint.yml

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@ name: Lint Python package
33
on: [push]
44

55
jobs:
6-
build-linux:
6+
lint-checking:
77
runs-on: ubuntu-latest
88
strategy:
99
max-parallel: 5
@@ -14,10 +14,13 @@ jobs:
1414
uses: actions/setup-python@v3
1515
with:
1616
python-version: "3.10"
17+
- name: Add libmagic for python-magic on linux
18+
run: sudo apt-get install libmagic1
19+
- name: Install Poetry
20+
uses: snok/install-poetry@v1
1721
- name: Install dependencies
1822
run: |
19-
pip install -r requirements.txt
20-
pip install .
23+
poetry install --without dev
2124
- name: Lint with flake8
2225
run: |
2326
cd cartiflette

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,3 +131,6 @@ dmypy.json
131131
# Setuptools vs. poetry
132132
*.lock
133133
.toml
134+
135+
*.sqlite
136+
*.sqlite*

cartiflette/__init__.py

Lines changed: 12 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,12 @@
1-
from .utils import *
2-
from .download import *
3-
from .s3 import *
1+
from cartiflette.config import (
2+
BUCKET,
3+
PATH_WITHIN_BUCKET,
4+
ENDPOINT_URL,
5+
FS,
6+
THREADS_DOWNLOAD,
7+
LEAVE_TQDM,
8+
)
9+
from cartiflette.constants import REFERENCES, DOWNLOAD_PIPELINE_ARGS
10+
from cartiflette.utils import *
11+
from cartiflette.download import *
12+
from cartiflette.s3 import *

cartiflette/config.py

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
# -*- coding: utf-8 -*-
2+
from dotenv import load_dotenv
3+
import os
4+
import s3fs
5+
6+
load_dotenv()
7+
8+
BUCKET = "projet-cartiflette"
9+
PATH_WITHIN_BUCKET = "diffusion/shapefiles-test4"
10+
ENDPOINT_URL = "https://minio.lab.sspcloud.fr"
11+
12+
kwargs = {}
13+
for key in ["token", "secret", "key"]:
14+
try:
15+
kwargs[key] = os.environ[key]
16+
except KeyError:
17+
continue
18+
FS = s3fs.S3FileSystem(client_kwargs={"endpoint_url": ENDPOINT_URL}, **kwargs)
19+
20+
THREADS_DOWNLOAD = 5
21+
# Nota : each thread may also span the same number of children threads;
22+
# set to 1 for debugging purposes (will deactivate multithreading)
23+
24+
LEAVE_TQDM = False
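
The credential loop in config.py keeps only the environment variables that are actually set, so that missing ones fall back to the filesystem's defaults. A minimal standard-library sketch of that pattern follows; the helper name `collect_s3_credentials` and the sample values are hypothetical and are not part of the commit.

```python
import os


def collect_s3_credentials(environ, names=("token", "secret", "key")):
    """Return only the credential entries that are present in `environ`.

    Hypothetical helper: mirrors the try/except KeyError loop of config.py
    with a dict comprehension.
    """
    return {name: environ[name] for name in names if name in environ}


# A plain dict stands in for os.environ here; the values are made up.
fake_env = {"key": "my-access-key", "secret": "my-secret", "HOME": "/root"}
credentials = collect_s3_credentials(fake_env)
print(credentials)

# With the real environment, the call would be:
real_credentials = collect_s3_credentials(os.environ)
```

The resulting dictionary could then be unpacked as keyword arguments, as the commit does with `**kwargs` when it builds `FS`.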

cartiflette/constants.py

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
# -*- coding: utf-8 -*-
2+
3+
import geopandas as gpd
4+
import logging
5+
from shapely.geometry import box
6+
7+
8+
logger = logging.getLogger(__name__)
9+
10+
REFERENCES = [
11+
# use : https://boundingbox.klokantech.com/
12+
{"location": "metropole", "geometry": box(-5.45, 41.26, 9.83, 51.31)},
13+
{"location": "guyane", "geometry": box(-54.6, 2.11, -51.5, 5.98)},
14+
{
15+
"location": "martinique",
16+
"geometry": box(-61.4355, 14.2217, -60.6023, 15.0795),
17+
},
18+
{
19+
"location": "guadeloupe",
20+
"geometry": box(-62.018, 15.6444, -60.792, 16.714),
21+
},
22+
{
23+
"location": "reunion",
24+
"geometry": box(55.0033, -21.5904, 56.0508, -20.6728),
25+
},
26+
{
27+
"location": "mayotte",
28+
"geometry": box(44.7437, -13.2733, 45.507, -12.379),
29+
},
30+
{
31+
"location": "saint_pierre_et_miquelon",
32+
"geometry": box(-56.6975, 46.5488, -55.9066, 47.3416),
33+
},
34+
]
35+
36+
REFERENCES = gpd.GeoDataFrame(REFERENCES, crs=4326)
37+
38+
DOWNLOAD_PIPELINE_ARGS = {
39+
"ADMIN-EXPRESS": [
40+
"IGN",
41+
"ADMINEXPRESS",
42+
"EXPRESS-COG-TERRITOIRE",
43+
[
44+
"guadeloupe",
45+
"martinique",
46+
"guyane",
47+
"reunion",
48+
"mayotte",
49+
"metropole",
50+
],
51+
],
52+
"BDTOPO": ["IGN", "BDTOPO", "ROOT", "france_entiere"],
53+
"IRIS": ["IGN", "CONTOUR-IRIS", "ROOT", None],
54+
"COG": [
55+
"Insee",
56+
"COG",
57+
[
58+
"COMMUNE",
59+
"CANTON",
60+
"ARRONDISSEMENT",
61+
"DEPARTEMENT",
62+
"REGION",
63+
"COLLECTIVITE",
64+
"PAYS",
65+
],
66+
"france_entiere",
67+
],
68+
"BV 2022": ["Insee", "BV", "FondsDeCarte_BV_2022", "france_entiere"],
69+
"BV 2012": ["Insee", "BV", "FondsDeCarte_BV_2012", "france_entiere"],
70+
}
71+
72+
# EXPRESS-COG ?
73+
# EXPRESS-COG-CARTO-TERRITOIRE ?
74+
# EXPRESS-COG-CARTO ?
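
The REFERENCES table associates each territory with a bounding box given as (min longitude, min latitude, max longitude, max latitude) in EPSG:4326. As a rough illustration of how such boxes can be used to tell which territory a point falls in, here is a standard-library sketch. It uses plain tuples instead of shapely and geopandas, copies three of the boxes from the diff, and the `locate` function is a hypothetical helper, not something the commit defines.

```python
# (min_lon, min_lat, max_lon, max_lat), copied from REFERENCES in the diff
BOUNDING_BOXES = {
    "metropole": (-5.45, 41.26, 9.83, 51.31),
    "guyane": (-54.6, 2.11, -51.5, 5.98),
    "reunion": (55.0033, -21.5904, 56.0508, -20.6728),
}


def locate(lon, lat, boxes=BOUNDING_BOXES):
    """Return the first territory whose bounding box contains the point.

    Hypothetical helper; returns None when no box matches.
    """
    for name, (min_lon, min_lat, max_lon, max_lat) in boxes.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return name
    return None


print(locate(2.35, 48.85))   # a point near Paris
print(locate(55.5, -21.1))   # a point on La Réunion
print(locate(0.0, 0.0))      # a point outside every box
```

A bounding box is only a coarse filter: a point inside the box is not necessarily inside the territory itself.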

cartiflette/download/__init__.py

Lines changed: 9 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -1,34 +1,14 @@
1-
from .dev import (
2-
# create_url_adminexpress,
3-
get_vectorfile_ign,
4-
# get_administrative_level_available_ign,
5-
store_vectorfile_ign,
6-
get_vectorfile_communes_arrondissement,
7-
# get_BV,
8-
get_cog_year,
9-
)
1+
# from cartiflette.download.dev import (
2+
# get_vectorfile_communes_arrondissement,
3+
# # get_BV,
4+
# )
5+
106

11-
from .download import (
12-
Dataset,
13-
BaseScraper,
14-
HttpScraper,
15-
FtpScraper,
16-
MasterScraper,
17-
download_sources,
7+
from cartiflette.download.pipeline import (
8+
download_all,
189
)
1910

11+
2012
__all__ = [
21-
# "create_url_adminexpress",
22-
"get_vectorfile_ign",
23-
# "get_administrative_level_available_ign",
24-
"store_vectorfile_ign",
25-
"get_vectorfile_communes_arrondissement",
26-
# "get_BV",
27-
"get_cog_year",
28-
"Dataset",
29-
"BaseScraper",
30-
"HttpScraper",
31-
"FtpScraper",
32-
"MasterScraper",
33-
"download_sources",
13+
"download_all",
3414
]
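
Because the package root imports this subpackage with a wildcard (`from cartiflette.download import *`), shrinking `__all__` to a single name also narrows what the wildcard import exposes. The sketch below illustrates that Python behaviour on a throwaway in-memory module; the module name `demo_download` and the function bodies are invented for the demonstration and do not reflect the real cartiflette code.

```python
import sys
import types

# Build a throwaway module whose __all__ lists only one public name.
demo = types.ModuleType("demo_download")
exec(
    "__all__ = ['download_all']\n"
    "def download_all():\n"
    "    return 'all sources'\n"
    "def download_sources():\n"
    "    return 'legacy entry point'\n",
    demo.__dict__,
)
sys.modules["demo_download"] = demo

# A wildcard import only copies the names listed in __all__.
namespace = {}
exec("from demo_download import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Names left out of `__all__` remain reachable through an explicit import such as `from demo_download import download_sources`; only the wildcard form is affected.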

0 commit comments
