ENH: Updates to PANGAEA collection and cleaning pipeline #55

Merged 92 commits on May 8, 2024
9a64b6e
MNT: Show which ds has None for size
scottclowe Sep 1, 2022
0460e60
ENH: Add option to control whether URL columns are required
scottclowe Dec 9, 2021
b038c4f
API: Disable ensure url option in search
scottclowe Sep 1, 2022
09006e2
MNT: Handle alt campaign name within set_metadata
scottclowe Sep 1, 2022
ac76bc7
MNT: Use dataset ID as alt instead of DOI
scottclowe Sep 1, 2022
f156f49
MNT: Skip children of parents without URL
scottclowe Sep 1, 2022
d8894f9
STY: Split arg per line with ensure_url included
scottclowe Sep 1, 2022
1392028
ENH: Get lat/lon from PanDataSet if not scraped
scottclowe Sep 2, 2022
f52e3b5
ENH: Add auth_token support
scottclowe Sep 2, 2022
4da4e30
MNT: Only print saving dataframe if verbosity high enough
scottclowe Mar 7, 2023
aec74c6
BUG: Remove unused import of IPython, not specified in requirements
scottclowe Mar 8, 2023
682fba4
BUG: Skip non-CSV files when processing outputs
scottclowe Mar 8, 2023
759b1be
MNT: Don't add dummy campaign and site columns when downloading datasets
scottclowe Mar 8, 2023
18651b8
MNT: Print message explaining errors being repeated
scottclowe Mar 8, 2023
5332481
MNT: Ignore existing dataset and site columns
scottclowe Mar 8, 2023
df49254
MNT: Change default site to be based on dataset name, not DOI
scottclowe Mar 8, 2023
a578655
BUG: Reflect latitudesouth, latitude-, longitudewest, longitude-
scottclowe Mar 8, 2023
ea5eba4
MNT: Save results for parents whose children don't have URLs
scottclowe Mar 9, 2023
8203880
MNT: Inherit verbosity from caller
scottclowe Mar 24, 2023
77d4cf1
MNT: Read CSV files without low_memory mode due to 'mixed types'
scottclowe Mar 29, 2023
db0fae0
ENH: Interpolate or extract missing lat, lon, datetime metadata
scottclowe Mar 29, 2023
2c16f24
MNT: Increase default verbosity level 0->1
scottclowe Mar 29, 2023
073efff
MNT: Exclude some more dataset titles
scottclowe Mar 29, 2023
0a0b632
MNT: Manually exclude dataset 805690, which was downloaded without it…
scottclowe Mar 29, 2023
2bbbd66
MNT: Make final report only appear if verbosity enabled
scottclowe Mar 29, 2023
37bedd6
MNT: Remove unused import of dateutil.parser
scottclowe Mar 29, 2023
f7a44ee
ENH: Use caching functionality built into PanDataSet
scottclowe Mar 29, 2023
1643a8b
MNT: Extract datetime from filename for rest of 896160 series
scottclowe Mar 29, 2023
5e4cf23
MNT: Extract from filename from two more datasets
scottclowe Mar 29, 2023
2f0d49c
MNT: Also manually exclude 803979, parent of 805690
scottclowe Mar 29, 2023
f33f564
ENH: Extract datetime from filename for datasets 371062, 371063, 371064
scottclowe Mar 30, 2023
6929f5f
MNT: C416 Unnecessary dict comprehension - rewrite using dict()
scottclowe Mar 30, 2023
4e1309c
Revert "MNT: Save results for parents whose children don't have URLs"
scottclowe Mar 30, 2023
a7e1521
DOC: Typo Scrapping -> Scraping
scottclowe Mar 30, 2023
322b20c
MNT: Check other children even if one is a restricted tabular dataset
scottclowe Mar 30, 2023
f3f8d27
RF: Better loop conditioning structure, with common code at the end
scottclowe Mar 30, 2023
24583c9
MNT: Save title as dataset_title, not Dataset column
scottclowe Mar 30, 2023
a752bdf
RF: Move auto-deleting of partial file into save_df utility
scottclowe Mar 30, 2023
03e381f
ENH: Record ds_id while acquiring each dataset
scottclowe Mar 30, 2023
2a45e0d
MNT: Save children of parents individually, not merged together
scottclowe Mar 30, 2023
93ec48c
ENH: Record child to parent dataset ID mapping
scottclowe Mar 30, 2023
c0c75cc
MNT: Fix latitude- and longitude- lookup
scottclowe Mar 30, 2023
0099de3
MNT: Fix method for merging elevation data with depth data
scottclowe Mar 30, 2023
36570fc
MNT: Redact erroneously negative depth values
scottclowe Mar 30, 2023
5760dd5
ENH: Handle heightaboveseafloor as an altitude field
scottclowe Mar 30, 2023
daedd46
ENH: Add kwargs pass-through to interpolate_by_datetime
scottclowe Mar 30, 2023
df1a855
MNT: Don't extrapolate depth beyond measured values
scottclowe Mar 30, 2023
9bfe308
ENH: Interpolate holes in depth values based on datetime
scottclowe Mar 30, 2023
10caaa0
BUG: Check if child dataframe is empty before trying to save
scottclowe Mar 30, 2023
53356a6
MNT: Save empty parent CSV for easy search download resumption
scottclowe Mar 30, 2023
b5f3da5
STY: Import from instead of aliasing
scottclowe Mar 30, 2023
ff2e9a4
ENH: Add wrapper to requests.get with 30s backoff on 429 status
scottclowe Mar 30, 2023
55a4563
ENH: Use 30s backoff on 429 status
scottclowe Mar 30, 2023
87cd24a
JNB: Fix reference to benthicnet.io utilities
scottclowe Mar 30, 2023
d6cb211
JNB: Don't use low_memory mode loading df
scottclowe Mar 30, 2023
6cb1bc5
JNB: Fix typo
scottclowe Mar 30, 2023
973b25d
JNB+BUG: Need to reset val_exception before parsing new keys
scottclowe Mar 30, 2023
b583e8d
JNB+MNT: Reflect yaxis instead of plotting negative of depth
scottclowe Mar 30, 2023
2782beb
JNB: Add title, ylabel, and print link to dataset
scottclowe Mar 30, 2023
f8663be
JNB: Highlight negative depth
scottclowe Mar 30, 2023
e809f54
JNB: Plot elevation
scottclowe Mar 30, 2023
154ff1f
BUG: Need to drop columns after handling reversed columns
scottclowe Mar 30, 2023
4a3cb25
MNT: Drop latitude-, longitude- if used
scottclowe Mar 30, 2023
42bdc53
MNT: Save depth_of_observer, bathymetry, and elevation separately
scottclowe Mar 30, 2023
eda209d
MNT: Rearrange so old columns are dropped before mapping new ones ont…
scottclowe Mar 30, 2023
4d57188
MNT: Change warning colour from red to yellow
scottclowe Mar 30, 2023
0dc20c6
MNT: Change Campaign -> campaign
scottclowe Mar 30, 2023
87452af
BUG: Add pangaea- to ds_id for dataframe output
scottclowe Apr 3, 2023
d135600
MNT: Allow photographs of tiles
scottclowe Apr 3, 2023
a3b14e2
ENH: Include parent_ds_id in output dataframe
scottclowe Apr 3, 2023
2082aa3
MNT: Remove self-imposed rate-limit so cached data is loaded immediately
scottclowe Apr 3, 2023
f236e0c
ENH: Include url_thumbnail column
scottclowe Apr 4, 2023
2873911
ENH: Find area columns encoding image area in square meters
scottclowe Apr 4, 2023
6a5c687
MNT: Find and remove additional FAVOURITE duplicate images
scottclowe Apr 4, 2023
2808ad9
MNT: Print files which had duplicated URLs resolved
scottclowe Apr 4, 2023
aadf712
Revert "MNT: Remove self-imposed rate-limit so cached data is loaded …
scottclowe Apr 5, 2023
4f152f1
MNT: Print number of records before and after dropping duplicates
scottclowe Apr 5, 2023
df0beea
MNT: Print IDs of datasets which may have label columns
scottclowe Apr 5, 2023
f9d5b48
MNT: Rename parent_ds_id -> collection
scottclowe Apr 5, 2023
45a3492
BUG: Fix nanosecond output format of datetime in pangaea-907025
scottclowe Apr 6, 2023
13080a1
MNT: Convert parent_ds_id into pangaea-IDENTIFIER like ds_id
scottclowe Apr 6, 2023
1db2a4a
DOC: Fix rate limit comment
scottclowe Apr 6, 2023
b1b2a70
ENH: Merge down metadata across rows with repeated URLs, preserving d…
scottclowe Apr 6, 2023
f631dd8
MNT: Save a copy with duplicates before removing them, so duplicates …
scottclowe Apr 6, 2023
eebb878
BUG: Need to convert datetime to string before merging (some are date…
scottclowe Apr 6, 2023
f1cf985
MNT: Rewrite any(list comp) as any(generator) instead (flake8:C419)
scottclowe May 8, 2024
4a42f49
MNT: Rename depth columns
scottclowe May 8, 2024
e5d9db0
MNT: Exclude AntGlassSponges with DOWN in their URL - not Benthic ima…
scottclowe May 8, 2024
3dfaba1
MNT: Skip missing URL cols
scottclowe May 8, 2024
a8967dc
ENH: Add process_single to cleanup metadata for a single dataset
scottclowe May 8, 2024
c4ce7ab
JNB: More EDA and new output files
scottclowe May 8, 2024
d8e0aef
DEV: Remove malfunctioning pretty-format-json
scottclowe May 8, 2024
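
For readers unfamiliar with the pattern named in commits ff2e9a4 and 55a4563 (a wrapper around requests.get with a 30s backoff on HTTP 429), here is a minimal sketch of how such a wrapper could look. It is not the code from this PR: the function name get_with_backoff and the parameters get, sleep, wait and max_retries are all hypothetical, chosen only for illustration. Only the 30-second wait and the 429 status come from the commit titles.

```python
import time


def get_with_backoff(url, get=None, sleep=time.sleep, wait=30, max_retries=5, **kwargs):
    """Fetch url, waiting `wait` seconds and retrying while the server answers 429.

    Hypothetical helper: `get` and `sleep` are injectable so the retry logic
    can be exercised without network access or real delays.
    """
    if get is None:
        # Imported lazily so a stand-in `get` can be used without requests installed.
        import requests

        get = requests.get

    for attempt in range(max_retries + 1):
        response = get(url, **kwargs)
        # Return on any status other than 429, or once the retry budget is spent.
        if response.status_code != 429 or attempt == max_retries:
            return response
        # 429 Too Many Requests: pause before the next attempt.
        sleep(wait)
```

With the defaults, a call such as get_with_backoff(url, timeout=60) would pass timeout through to requests.get and retry up to five times. A fixed wait is the simplest choice; honouring a Retry-After header, if the server sends one, would be a natural refinement.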
2 changes: 0 additions & 2 deletions .pre-commit-config.yaml
@@ -80,8 +80,6 @@ repos:
     - id: detect-private-key
     - id: end-of-file-fixer
       exclude: ^LICENSE|\.(html|csv|txt|svg|py)$
-    - id: pretty-format-json
-      args: ["--autofix", "--no-ensure-ascii", "--no-sort-keys"]
     - id: requirements-txt-fixer
     - id: trailing-whitespace
       args: [--markdown-linebreak-ext=md]