Feature/27 generic csv loader #28
base: develop
Conversation
…s, create subjects
…reate_price, create_location, begin refactoring
rhigman left a comment
Looks good! I've made a few comments about edge cases etc as they've occurred to me, but they may already be things you've thought of and chosen not to overcomplicate the logic with.
csvloader.py
Outdated
# Convert NaN to None for all fields
contributor = self.convert_nan_to_none(contributor)

if full_name not in self.all_contributors:
This could backfire for common names like "John Smith".
I totally agree, and it's also following a common pattern in the repo - 10 other loaders use the same logic. Without an ORCID, which we don't require and is frequently not present, is there another way that we can fix this?
I added some logic for checking by ORCID before full_name, which at least improves the existing logic, even if it doesn't fix the common names issue
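For reference, a minimal sketch of that kind of ORCID-first lookup (all_contributors is the existing name-keyed lookup from the PR; all_contributors_by_orcid is an assumed ORCID-keyed counterpart, not necessarily what the PR uses):

def find_existing_contributor(self, orcid, full_name):
    # Prefer the ORCID, which is unique, and only fall back to the full name,
    # which can still collide for common names like "John Smith".
    if orcid and orcid in self.all_contributors_by_orcid:
        return self.all_contributors_by_orcid[orcid]
    return self.all_contributors.get(full_name)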
else:
    contributor_id = self.all_contributors[full_name]
    logging.info(f"contributor {full_name} already in Thoth, skipping creation")
existing_contribution = next(
Note it's possible for the same contributor to be listed with multiple contributions of different types
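As a sketch, the existing-contribution lookup could match on contribution type as well as contributor, so that someone listed as both author and editor isn't collapsed into a single contribution (the attribute names contributor.contributorId and contributionType follow the pattern of the surrounding snippets but are assumptions here):

existing_contribution = next(
    (c for c in work.contributions
     if c.contributor.contributorId == contributor_id
     and c.contributionType == contribution_type),
    None,
)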
existing_affiliations = next(
    (c.affiliations for c in work.contributions if c.contributionId == contribution_id), [])
if any(a.institution.institutionId == institution_id for a in existing_affiliations):
More than one affiliation at the same institution is technically possible, I think
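If that case matters, a stricter check could compare the position as well as the institution; this sketch assumes a position value is read from the CSV row alongside the institution:

# Skip only when the same institution AND the same position already exist,
# since one contributor can hold two roles at the same institution.
if any(a.institution.institutionId == institution_id and a.position == position
       for a in existing_affiliations):
    return  # or continue, depending on where this sits in the loader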
if publication_type in ["paperback", "hardback"]:
    self.create_price(row, publication_type, publication_id)
elif publication_type in ["pdf", "epub"]:
    self.create_location(row, publication_type, publication_id)
I'm intrigued that the template presumably doesn't contain locations for paperback/hardback, when they can have landing pages if not full text URLs (and, indeed, epubs can have prices)
if pd.notna(row[f"publication_{publication_type}_price_{index}_currency_code"]):
    currency = row[f"publication_{publication_type}_price_{index}_currency_code"]

if unit_price and currency:
Is there any possibility of hitting errors here if a publisher accidentally includes more than one price with the same currency?
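One possible guard, as a sketch of the price loop (the unit-price column name and the number of price columns are assumptions based on the currency column pattern above):

seen_currencies = set()
for index in (1, 2, 3):  # however many price columns the template allows
    currency_col = f"publication_{publication_type}_price_{index}_currency_code"
    price_col = f"publication_{publication_type}_price_{index}_unit_price"
    if pd.isna(row.get(currency_col)) or pd.isna(row.get(price_col)):
        continue
    currency = row[currency_col]
    if currency in seen_currencies:
        logging.warning(f"duplicate {currency} price on {publication_type}; skipping")
        continue
    seen_currencies.add(currency)
    # create the price here as the loader normally would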
if pd.isna(series_name):
    logging.info(f"{work.fullTitle} missing series name; skipping create_series")
    return
if series_name not in self.all_series:
Again, duplicates are possible
csvloader.py
Outdated
# if series_issue_number is present in CSV, use it
if pd.notna(row["series_issue_number"]):
    issue_ordinal = row["series_issue_number"]
Worth checking that the series_issue_number doesn't happen to match an existing issue_ordinal?
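A sketch of that check, assuming the series record exposes its existing issues (series.issues and issueOrdinal follow Thoth's naming but are not taken from the PR):

existing_ordinals = {i.issueOrdinal for i in current_series.issues}
if issue_ordinal in existing_ordinals:
    logging.info(f"issue_ordinal {issue_ordinal} already used in series {series_name}; skipping")
    return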
else:
    logging.info(f"Series Issue already exists for work; skipping creating issue of {current_series}")

def convert_nan_to_none(self, data_dict):
Nice. I haven't checked the implications, but would there be any benefit to running this conversion on the whole record at the start, so as to bypass having to call this function multiple times?
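For illustration, a one-off conversion over the parsed spreadsheet could look like this (df is assumed to be the DataFrame the loader reads the CSV into; whether it fits depends on how rows are passed around afterwards):

# Cast to object first so None is kept rather than coerced back to NaN,
# then replace every NaN in a single pass when the CSV is loaded.
df = df.astype(object).where(pd.notna(df), None)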
Thanks for the comments, @rhigman! I unfortunately won't have time to address them, but I did read through them and appreciate you taking the time anyway.
Addresses #27