Contributor

@brendan-oconnell brendan-oconnell commented Aug 8, 2025

Addresses #27

@brendan-oconnell brendan-oconnell self-assigned this Aug 8, 2025
@brendan-oconnell brendan-oconnell changed the base branch from master to develop August 8, 2025 14:10
@brendan-oconnell brendan-oconnell linked an issue Aug 20, 2025 that may be closed by this pull request
@brendan-oconnell brendan-oconnell marked this pull request as ready for review August 20, 2025 13:17
@brendan-oconnell brendan-oconnell requested a review from ja573 August 20, 2025 13:18
@brendan-oconnell brendan-oconnell requested review from rhigman and removed request for ja573 November 6, 2025 16:16
Member

@rhigman rhigman left a comment

Looks good! I've made a few comments about edge cases etc. as they occurred to me, but they may already be things you've thought of and chosen not to overcomplicate the logic with.

csvloader.py Outdated
# Convert NaN to None for all fields
contributor = self.convert_nan_to_none(contributor)

if full_name not in self.all_contributors:
Member

This could backfire for common names like "John Smith".

Contributor Author

I totally agree, but it follows a common pattern in the repo: 10 other loaders use the same logic. Without an ORCID, which we don't require and which is frequently absent, is there another way we can fix this?

Contributor Author

I added some logic to check by ORCID before full_name, which at least improves on the existing logic, even if it doesn't fix the common-names issue.
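
For readers following the thread, here is a minimal sketch of what an ORCID-first lookup could look like. It is not the code from this PR: the function name `find_contributor_id` and the two lookup dicts are hypothetical, and the choice to skip the name fallback when an unknown ORCID is supplied is an assumption (it avoids merging two different "John Smith"s, at the cost of a possible duplicate when an existing record simply lacks its ORCID).

```python
from typing import Optional


def find_contributor_id(
    orcid: Optional[str],
    full_name: str,
    by_orcid: dict[str, str],
    by_name: dict[str, str],
) -> Optional[str]:
    """Return an existing contributor ID, preferring an ORCID match.

    by_orcid and by_name are hypothetical lookup tables mapping an ORCID
    or a full name to a contributor ID.
    """
    if orcid and orcid in by_orcid:
        return by_orcid[orcid]
    # Assumption: only fall back to the name when the CSV gives no ORCID.
    # A supplied but unknown ORCID is treated as a different person.
    if not orcid:
        return by_name.get(full_name)
    return None
```

A `None` result would mean "create a new contributor".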

else:
    contributor_id = self.all_contributors[full_name]
    logging.info(f"contributor {full_name} already in Thoth, skipping creation")
    existing_contribution = next(
Member

Note that the same contributor can be listed with multiple contributions of different types.
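
One way to account for this, sketched under assumptions: match an existing contribution on the pair (contributor, contribution type) rather than on the contributor alone. The dict keys `contributorId` and `contributionType` and the helper name `has_contribution` are illustrative, not taken from the PR.

```python
def has_contribution(existing, contributor_id, contribution_type):
    """True if the work already lists this contributor in this specific role.

    existing is assumed to be a list of dicts with "contributorId" and
    "contributionType" keys (a hypothetical shape for illustration).
    """
    return any(
        c["contributorId"] == contributor_id
        and c["contributionType"] == contribution_type
        for c in existing
    )
```

With this check, an author who is also listed as an editor would get a second contribution instead of being skipped.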


existing_affiliations = next(
    (c.affiliations for c in work.contributions if c.contributionId == contribution_id), [])
if any(a.institution.institutionId == institution_id for a in existing_affiliations):
Member

More than one affiliation at the same institution is technically possible, I think.
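
If that case matters, a possible refinement is to compare on more than the institution. The sketch below assumes each affiliation carries an optional `position` value that distinguishes two affiliations at the same institution; that field, the dict shape, and the helper name are assumptions made for illustration.

```python
def affiliation_exists(existing, institution_id, position=None):
    """True if an affiliation with the same institution AND position exists.

    existing is assumed to be a list of dicts with an "institutionId" key
    and an optional "position" key (hypothetical shape).
    """
    return any(
        a["institutionId"] == institution_id and a.get("position") == position
        for a in existing
    )
```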

if publication_type in ["paperback", "hardback"]:
    self.create_price(row, publication_type, publication_id)
elif publication_type in ["pdf", "epub"]:
    self.create_location(row, publication_type, publication_id)
Member

I'm intrigued that the template presumably doesn't contain locations for paperback/hardback, when they can have landing pages if not full text URLs (and, indeed, epubs can have prices).
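
If the template were ever extended that way, the branch on publication type could become a check on which columns are actually filled in. The following is only a sketch: the `publication_{type}_price_` prefix mirrors the column name visible in the diff, while the `publication_{type}_location_` prefix and both helper names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN (how pandas marks empty cells) as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def planned_actions(row: dict, publication_type: str) -> list[str]:
    """List what to create for a publication, based on populated columns."""
    price_prefix = f"publication_{publication_type}_price_"
    # Hypothetical column prefix; the real template may name these differently.
    location_prefix = f"publication_{publication_type}_location_"
    actions = []
    if any(k.startswith(price_prefix) and not is_missing(v) for k, v in row.items()):
        actions.append("price")
    if any(k.startswith(location_prefix) and not is_missing(v) for k, v in row.items()):
        actions.append("location")
    return actions
```

The caller would then invoke `create_price` and/or `create_location` for whichever actions are returned, regardless of format.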

if pd.notna(row[f"publication_{publication_type}_price_{index}_currency_code"]):
    currency = row[f"publication_{publication_type}_price_{index}_currency_code"]

if unit_price and currency:
Member

Is there any possibility of hitting errors here if a publisher accidentally includes more than one price with the same currency?
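
A defensive option, sketched here under the assumption that a second price in the same currency should be skipped and reported rather than sent to the API (whether the API would actually reject it is not confirmed in this thread). The helper name and the input shape are hypothetical.

```python
def collect_prices(raw_prices):
    """Keep the first price per currency and report repeated currencies.

    raw_prices is assumed to be an iterable of (currency_code, unit_price)
    pairs already read from the row.
    """
    prices = {}
    duplicates = []
    for currency, unit_price in raw_prices:
        if currency in prices:
            # Same currency seen before: record it instead of creating it.
            duplicates.append(currency)
            continue
        prices[currency] = unit_price
    return prices, duplicates
```

The loader could log each entry in `duplicates` as a warning so the publisher can correct the CSV.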

if pd.isna(series_name):
    logging.info(f"{work.fullTitle} missing series name; skipping create_series")
    return
if series_name not in self.all_series:
Member

Again, duplicates are possible

csvloader.py Outdated

# if series_issue_number is present in CSV, use it
if pd.notna(row["series_issue_number"]):
    issue_ordinal = row["series_issue_number"]
Member

Worth checking that the series_issue_number doesn't happen to match an existing issue_ordinal?
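
A check along those lines might look like the sketch below. The helper name is hypothetical, and the fallback of "highest existing ordinal plus one" when the CSV gives no number is an assumption, since the thread does not show what the loader does in that case.

```python
def resolve_issue_ordinal(requested, existing_ordinals):
    """Pick an issue ordinal, refusing one that is already taken.

    requested is the CSV value (None if absent); existing_ordinals are the
    ordinals already used in the series.
    """
    taken = set(existing_ordinals)
    if requested is not None:
        ordinal = int(requested)  # CSV numbers often arrive as floats, e.g. 3.0
        if ordinal in taken:
            raise ValueError(f"issue ordinal {ordinal} already exists in series")
        return ordinal
    # Assumed fallback: append after the current highest ordinal.
    return max(taken, default=0) + 1
```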

else:
    logging.info(f"Series Issue already exists for work; skipping creating issue of {current_series}")

def convert_nan_to_none(self, data_dict):
Member

Nice. I haven't checked the implications, but would there be any benefit to running this conversion on the whole record at the start, so as to bypass having to call this function multiple times?
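
To illustrate the idea, here is a sketch of a one-pass conversion applied to a whole record. It assumes each CSV row is first turned into a plain dict (for example with pandas `Series.to_dict()`); the body is a stand-in written for this example, not the `convert_nan_to_none` implementation from the PR.

```python
import math


def convert_nan_to_none(record: dict) -> dict:
    """Return a copy of the record with every float NaN replaced by None."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }
```

Existing `pd.notna(...)` guards would keep working after such a conversion, since pandas treats `None` as missing too.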

@brendan-oconnell
Contributor Author

Thanks for the comments, @rhigman! Unfortunately I won't have time to address them, but I did read through them and appreciate you taking the time.

Development

Successfully merging this pull request may close these issues.

Create generic CSV loader

3 participants