Feature/27 generic csv loader #28
base: develop
Conversation
…s, create subjects
…reate_price, create_location, begin refactoring
rhigman left a comment
Looks good! I've made a few comments about edge cases etc as they've occurred to me, but they may already be things you've thought of and chosen not to overcomplicate the logic with.
csvloader.py
Outdated
# Convert NaN to None for all fields
contributor = self.convert_nan_to_none(contributor)

if full_name not in self.all_contributors:
This could backfire for common names like "John Smith".
I totally agree, and it's also following a common pattern in the repo - 10 other loaders use the same logic. Without an ORCID, which we don't require and is frequently not present, is there another way that we can fix this?
I added some logic for checking by ORCID before full_name, which at least improves the existing logic, even if it doesn't fix the common names issue
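For reference, a minimal sketch of that kind of ORCID-first lookup (all_contributors is the existing name-keyed lookup from the PR; all_contributors_by_orcid is an assumed ORCID-keyed counterpart, not necessarily what the PR uses):

def find_existing_contributor(self, orcid, full_name):
    # Prefer the ORCID, which is unique, and only fall back to the full name,
    # which can still collide for common names like "John Smith".
    if orcid and orcid in self.all_contributors_by_orcid:
        return self.all_contributors_by_orcid[orcid]
    return self.all_contributors.get(full_name)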
else:
    contributor_id = self.all_contributors[full_name]
    logging.info(f"contributor {full_name} already in Thoth, skipping creation")
existing_contribution = next(
Note it's possible for the same contributor to be listed with multiple contributions of different types
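As a sketch, the existing-contribution lookup could match on contribution type as well as contributor, so that someone listed as both author and editor isn't collapsed into a single contribution (the attribute names contributor.contributorId and contributionType follow the pattern of the surrounding snippets but are assumptions here):

existing_contribution = next(
    (c for c in work.contributions
     if c.contributor.contributorId == contributor_id
     and c.contributionType == contribution_type),
    None,
)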
existing_affiliations = next(
    (c.affiliations for c in work.contributions if c.contributionId == contribution_id), [])
if any(a.institution.institutionId == institution_id for a in existing_affiliations):
More than one affiliation at the same institution is technically possible, I think
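If that case matters, a stricter check could compare the position as well as the institution; this sketch assumes a position value is read from the CSV row alongside the institution:

# Skip only when the same institution AND the same position already exist,
# since one contributor can hold two roles at the same institution.
if any(a.institution.institutionId == institution_id and a.position == position
       for a in existing_affiliations):
    return  # or continue, depending on where this sits in the loader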
if publication_type in ["paperback", "hardback"]:
    self.create_price(row, publication_type, publication_id)
elif publication_type in ["pdf", "epub"]:
    self.create_location(row, publication_type, publication_id)
I'm intrigued that the template presumably doesn't contain locations for paperback/hardback, when they can have landing pages if not full text URLs (and, indeed, epubs can have prices)
if pd.notna(row[f"publication_{publication_type}_price_{index}_currency_code"]):
    currency = row[f"publication_{publication_type}_price_{index}_currency_code"]

if unit_price and currency:
Is there any possibility of hitting errors here if a publisher accidentally includes more than one price with the same currency?
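One possible guard, as a sketch of the price loop (the unit-price column name and the number of price columns are assumptions based on the currency column pattern above):

seen_currencies = set()
for index in (1, 2, 3):  # however many price columns the template allows
    currency_col = f"publication_{publication_type}_price_{index}_currency_code"
    price_col = f"publication_{publication_type}_price_{index}_unit_price"
    if pd.isna(row.get(currency_col)) or pd.isna(row.get(price_col)):
        continue
    currency = row[currency_col]
    if currency in seen_currencies:
        logging.warning(f"duplicate {currency} price on {publication_type}; skipping")
        continue
    seen_currencies.add(currency)
    # create the price here as the loader normally would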
if pd.isna(series_name):
    logging.info(f"{work.fullTitle} missing series name; skipping create_series")
    return
if series_name not in self.all_series:
Again, duplicates are possible
csvloader.py
Outdated
# if series_issue_number is present in CSV, use it
if pd.notna(row["series_issue_number"]):
    issue_ordinal = row["series_issue_number"]
Worth checking that the series_issue_number doesn't happen to match an existing issue_ordinal?
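A sketch of that check, assuming the series record exposes its existing issues (series.issues and issueOrdinal follow Thoth's naming but are not taken from the PR):

existing_ordinals = {i.issueOrdinal for i in current_series.issues}
if issue_ordinal in existing_ordinals:
    logging.info(f"issue_ordinal {issue_ordinal} already used in series {series_name}; skipping")
    return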
else:
    logging.info(f"Series Issue already exists for work; skipping creating issue of {current_series}")

def convert_nan_to_none(self, data_dict):
Nice. I haven't checked the implications, but would there be any benefit to running this conversion on the whole record at the start, so as to bypass having to call this function multiple times?
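For illustration, a one-off conversion over the parsed spreadsheet could look like this (df is assumed to be the DataFrame the loader reads the CSV into; whether it fits depends on how rows are passed around afterwards):

# Cast to object first so None is kept rather than coerced back to NaN,
# then replace every NaN in a single pass when the CSV is loaded.
df = df.astype(object).where(pd.notna(df), None)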
Thanks for the comments, @rhigman! I unfortunately won't have time to address them, but I did read through them and appreciate you taking the time anyway.
Addresses #27