
Add pdc-process-base-fields script #211

Merged
bickelj merged 1 commit into main from add-process-base-fields-script
Sep 23, 2025

Conversation

@kfogel
Contributor

@kfogel kfogel commented Sep 22, 2025

@kfogel kfogel requested a review from bickelj September 22, 2025 02:08
Collaborator

@bickelj bickelj left a comment


Hurray! When I run it with --help, it prints the help I needed to successfully run it. And it did what it says it does: it produced a friendly CSV file!

The output is inconsistently quoted, which is not an issue yet but could become one, because nothing prohibits a comma character in, for example, a base field label. For maximum compatibility, I think all values should be quoted.

I see the localization JSON surrounded by quotation marks, which is great for the CSV file. However, the JSON contained therein is invalid, because it quotes its strings with single quotes (apostrophes). I am not sure of the best solution; maybe escaping these quotation marks, ugly as that is. That, or expanding the localizations into their own columns by some naming convention. Another acceptable stop-gap measure would be to omit the localizations until the script fully supports them.
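For what it's worth, Python's csv module can force the quote-everything behavior directly; a minimal sketch (the field values here are made up, not from the actual output):

```python
import csv
import io

# csv.QUOTE_ALL tells the writer to quote every field, so an embedded
# comma in a label can never break a row, regardless of content.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["shortCode", "label"])
writer.writerow(["organization_name", "Organization Name, Inc."])
print(buf.getvalue())
# "shortCode","label"
# "organization_name","Organization Name, Inc."
```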

We do not (yet) have the linter set up for Python in this repository, so I installed and ran pylint locally, with the result that several changes are needed:

$ pip3 install --user pylint
...
$ pylint ./pdc-process-base-fields
************* Module pdc-process-base-fields
pdc-process-base-fields:92:39: C0303: Trailing whitespace (trailing-whitespace)
pdc-process-base-fields:121:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
pdc-process-base-fields:1:0: C0103: Module name "pdc-process-base-fields" doesn't conform to snake_case naming style (invalid-name)
pdc-process-base-fields:90:0: C0116: Missing function or method docstring (missing-function-docstring)
pdc-process-base-fields:112:8: W0622: Redefining built-in 'input' (redefined-builtin)
pdc-process-base-fields:64:0: W0611: Unused import json (unused-import)

-----------------------------------
Your code has been rated at 7.00/10

Overall, this is useful and looks good, with some minor complaints as seen above and in-line/below.

@kfogel
Contributor Author

kfogel commented Sep 22, 2025

> The output is inconsistently quoted which is not an issue yet but could be an issue because there is no prohibition of a comma character, for example, in a base field label. For maximum compatibility I think all values should be quoted.

I think it is automatically quoting commas properly when generating the CSV -- that's what causes the localization JSON value to be double-quoted: the CSV output needs to surround the JSON with double quotes in order to protect the commas that are part of the JSON.

> I see the localization JSON surrounded by quotation marks which is great for the CSV file. However, the JSON therein contained becomes invalid when quoted with single-quotes/apostrophes. I am not sure of the best solution, but maybe escaping these quotation marks, as ugly as that is. That or expanding the localizations into their own columns by some naming convention. Another acceptable stop-gap measure would be to omit the localizations until they are fully supported by the script.

I see what's happening, yeah. The reason this doesn't happen for the empty JSON blobs (that is, all of the localization values except the one belonging to Organization Name) is that, as JSON, they don't contain any commas. They're just {}, which is of course not problematic as far as CSV format is concerned.

But shouldn't a CSV parser just treat the double-quoted blob as a string? That is, why would the double quotes be part of the value returned by a CSV parser? They're not themselves doubled -- they're just there to protect the commas inside.

I ask out of genuine puzzlement, because when I ran csvcut on the output, I get the double quotes in csvcut's output too:

$ csvcut -c "localizations" pdc-base-fields.csv 
localizations
{}
{}
{}
"{'en': {'label': 'Organization Name', 'language': 'en', 'createdAt': '2025-08-18T20:01:55.973012+00:00', 'description': '', 'baseFieldShortCode': 'organization_name'}, 'zh': {'label': '组织名称', 'language': 'zh', 'createdAt': '2025-08-18T20:01:55.973215+00:00', 'description': '组织的名称(参见「组织法定名称」)', 'baseFieldShortCode': 'organization_name'}}"
{}
{}
{}
$ 

(Obviously, I've elided a great many {} lines to make that readable, but you get the idea.)

I think, though, that this may be because csvcut produces CSV-parseable output itself by default.

IOW, are we sure there's a bug here?

I'll do some more testing with some other libraries (like Python's own CSV parser).
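A sketch of that test with the stdlib csv module (the row here is made up to mimic the output): the double quotes that protect embedded commas are delimiters, not data, so the parser strips them from the returned value.

```python
import csv
import io

# A quoted field containing commas comes back as one string,
# with the protective double quotes removed by the parser.
data = (
    "shortCode,localizations\n"
    "organization_name,\"{'en': {'label': 'Organization Name'}}\"\n"
)
rows = list(csv.reader(io.StringIO(data)))
print(rows[1][1])
# {'en': {'label': 'Organization Name'}}
```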

> We do not (yet) have the linter set up for Python in this repository, so I installed and ran pylint locally, with the result that several changes are needed:
>
> $ pip3 install --user pylint
> ...
> $ pylint ./pdc-process-base-fields
> ************* Module pdc-process-base-fields
> pdc-process-base-fields:92:39: C0303: Trailing whitespace (trailing-whitespace)
> pdc-process-base-fields:121:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
> pdc-process-base-fields:1:0: C0103: Module name "pdc-process-base-fields" doesn't conform to snake_case naming style (invalid-name)
> pdc-process-base-fields:90:0: C0116: Missing function or method docstring (missing-function-docstring)
> pdc-process-base-fields:112:8: W0622: Redefining built-in 'input' (redefined-builtin)
> pdc-process-base-fields:64:0: W0611: Unused import json (unused-import)
>
> -----------------------------------
> Your code has been rated at 7.00/10

Thanks -- I wasn't in the habit of linting. I'll fix the above and revise this PR.

> Overall, this is useful and looks good, with some minor complaints as seen above and in-line/below.

@kfogel
Contributor Author

kfogel commented Sep 22, 2025

In the meantime:

@jmergy, here is a CSV of the PDC base fields!

pdc-base-fields.csv

@kfogel
Contributor Author

kfogel commented Sep 23, 2025

> ... That or expanding the localizations into their own columns by some naming convention.

By the way, this change implements the above idea:

index 3637e1c..844ffc3 100755
--- pdc-process-base-fields
+++ pdc-process-base-fields
@@ -109,9 +109,10 @@ def main():
     args = arg_parser.parse_args()
 
     if args.output_format == "csv":
-        input = pandas.read_json(sys.stdin)
+        json_input = json.load(sys.stdin)
+        parsed_input = pandas.json_normalize(json_input)
         # Switch to index=True to include a first column that shows DataFrame index
-        input.to_csv(sys.stdout, index=False, encoding='utf-8')
+        parsed_input.to_csv(sys.stdout, index=False, encoding='utf-8')
     else:  # can't happen
         raise ValueError(
             f"ERROR: Unknown output format 'f{args.output_format}' requested")

Here's the effect it has:

$ cat base-fields.json | ./pdc-process-base-fields | grep zh
label,category,dataType,createdAt,shortCode,description,valueRelevanceHours,sensitivityClassification,localizations.en.label,localizations.en.language,localizations.en.createdAt,localizations.en.description,localizations.en.baseFieldShortCode,localizations.zh.label,localizations.zh.language,localizations.zh.createdAt,localizations.zh.description,localizations.zh.baseFieldShortCode
Organization Name,organization,string,2025-08-18T20:01:55.057921+00:00,organization_name,,,restricted,Organization Name,en,2025-08-18T20:01:55.973012+00:00,,organization_name,组织名称,zh,2025-08-18T20:01:55.973215+00:00,组织的名称(参见「组织法定名称」),organization_name
$ 

I kind of like this as a solution. I'll implement it, clean up the lint items too, rebase, and resubmit this PR.
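For illustration, here is a minimal sketch of what pandas.json_normalize does with a record shaped like a base field (the record below is made up, with only a couple of localization keys):

```python
import pandas

# A hypothetical base-field record with a nested localizations object.
records = [{
    "label": "Organization Name",
    "shortCode": "organization_name",
    "localizations": {
        "en": {"label": "Organization Name", "language": "en"},
        "zh": {"label": "组织名称", "language": "zh"},
    },
}]

# json_normalize flattens the nested dicts into dotted column names,
# e.g. localizations.en.label, localizations.zh.label, and so on.
frame = pandas.json_normalize(records)
print(sorted(frame.columns))
```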

kfogel added a commit that referenced this pull request Sep 23, 2025
As suggested by @bickelj in PR #211 (#pullrequestreview-3252907800).

The only thing in our current PDC base fields that would be affected by
this is localizations.  Before this change, "localizations" would be a
single column in the CSV, and would have a JSON structure as its value
-- but that JSON structure would incorrectly use single quotes for
string values instead of double quotes, because pandas wasn't treating
it as JSON.  As far as pandas was concerned, we inhaled some JSON, got
an internal data structure, and then wrote that data structure out to
CSV as a *Python*-syntax string.  There was a solution available
(re-encode the value as JSON before writing), but if the whole point
of converting the base fields JSON to CSV is to get *CSV*, then asking
recipients to parse further JSON out of the CSV seems non-optimal.
Better to give them more columns, in a predictable way, with values
directly readable in CSV format.

This change also takes care of the linting items @bickelj noted.
@kfogel kfogel force-pushed the add-process-base-fields-script branch from bbc471b to 9afc3dc on September 23, 2025 03:11
@kfogel
Copy link
Contributor Author

kfogel commented Sep 23, 2025

Okay, ready for re-review @bickelj! Thanks.

Collaborator

@bickelj bickelj left a comment


I see that only quoting fields that need them is typical. I stand corrected.

$ pylint ./pdc-process-base-fields

-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.00/10, +3.00)

Pylint now passes.

The expansion of localizations into columns works great; this is a big improvement.

@bickelj bickelj merged commit 5005149 into main Sep 23, 2025
4 checks passed
@bickelj bickelj deleted the add-process-base-fields-script branch September 23, 2025 16:48