
Add pdc-process-base-fields script #211

Merged
bickelj merged 1 commit into main from add-process-base-fields-script
Sep 23, 2025

Conversation

@kfogel
Contributor

@kfogel kfogel commented Sep 22, 2025

@kfogel kfogel requested a review from bickelj September 22, 2025 02:08
Collaborator

@bickelj bickelj left a comment


Hurray! When I run it with --help, it prints the help I needed to successfully run it. And it did what it says it does: it produced a friendly CSV file!

The output is inconsistently quoted, which is not an issue yet but could become one, because nothing prohibits a comma character in, for example, a base field label. For maximum compatibility, I think all values should be quoted.

I see the localization JSON surrounded by quotation marks, which is great for the CSV file. However, the JSON contained therein is invalid, because it quotes its strings with single quotes (apostrophes). I am not sure of the best solution; maybe escaping these quotation marks, ugly as that is. That, or expanding the localizations into their own columns by some naming convention. Another acceptable stop-gap measure would be to omit the localizations until the script fully supports them.
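For what it's worth, Python's csv module can force the quote-everything behavior directly; a minimal sketch (the field values here are made up, not from the actual output):

```python
import csv
import io

# csv.QUOTE_ALL tells the writer to quote every field, so an embedded
# comma in a label can never break a row, regardless of content.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["shortCode", "label"])
writer.writerow(["organization_name", "Organization Name, Inc."])
print(buf.getvalue())
# "shortCode","label"
# "organization_name","Organization Name, Inc."
```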

We do not (yet) have the linter set up for Python in this repository, so I installed and ran pylint locally, with the result that several changes are needed:

$ pip3 install --user pylint
...
$ pylint ./pdc-process-base-fields
************* Module pdc-process-base-fields
pdc-process-base-fields:92:39: C0303: Trailing whitespace (trailing-whitespace)
pdc-process-base-fields:121:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
pdc-process-base-fields:1:0: C0103: Module name "pdc-process-base-fields" doesn't conform to snake_case naming style (invalid-name)
pdc-process-base-fields:90:0: C0116: Missing function or method docstring (missing-function-docstring)
pdc-process-base-fields:112:8: W0622: Redefining built-in 'input' (redefined-builtin)
pdc-process-base-fields:64:0: W0611: Unused import json (unused-import)

-----------------------------------
Your code has been rated at 7.00/10

Overall, this is useful and looks good, with some minor complaints as seen above and in-line/below.

@kfogel
Contributor Author

kfogel commented Sep 22, 2025

> The output is inconsistently quoted which is not an issue yet but could be an issue because there is no prohibition of a comma character, for example, in a base field label. For maximum compatibility I think all values should be quoted.

I think it is automatically quoting commas properly when generating the CSV -- that's what causes the localization JSON value to be double-quoted: the CSV output needs to surround the JSON with double quotes in order to protect the commas that are part of the JSON.

> I see the localization JSON surrounded by quotation marks which is great for the CSV file. However, the JSON therein contained becomes invalid when quoted with single-quotes/apostrophes. I am not sure of the best solution, but maybe escaping these quotation marks, as ugly as that is. That or expanding the localizations into their own columns by some naming convention. Another acceptable stop-gap measure would be to omit the localizations until they are fully supported by the script.

I see what's happening, yeah. The reason this doesn't happen for the empty JSON blobs (that is, all of the localization values except the one belonging to Organization Name) is that, as JSON, they don't contain any commas. They're just {}, which is of course not problematic as far as CSV format is concerned.

But shouldn't a CSV parser just treat the double-quoted blob as a string? That is, why would the double quotes be part of the value returned by a CSV parser? They're not themselves doubled -- they're just there to protect the commas inside.

I ask out of genuine puzzlement, because when I ran csvcut on the output, I get the double quotes in csvcut's output too:

$ csvcut -c "localizations" pdc-base-fields.csv 
localizations
{}
{}
{}
"{'en': {'label': 'Organization Name', 'language': 'en', 'createdAt': '2025-08-18T20:01:55.973012+00:00', 'description': '', 'baseFieldShortCode': 'organization_name'}, 'zh': {'label': '组织名称', 'language': 'zh', 'createdAt': '2025-08-18T20:01:55.973215+00:00', 'description': '组织的名称(参见「组织法定名称」)', 'baseFieldShortCode': 'organization_name'}}"
{}
{}
{}
$ 

(Obviously, I've elided a great many {} lines to make that readable, but you get the idea.)

I think, though, that this may be because csvcut produces CSV-parseable output itself by default.

IOW, are we sure there's a bug here?

I'll do some more testing with some other libraries (like Python's own CSV parser).
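A sketch of that test with the stdlib csv module (the row here is made up to mimic the output): the double quotes that protect embedded commas are delimiters, not data, so the parser strips them from the returned value.

```python
import csv
import io

# A quoted field containing commas comes back as one string,
# with the protective double quotes removed by the parser.
data = (
    "shortCode,localizations\n"
    "organization_name,\"{'en': {'label': 'Organization Name'}}\"\n"
)
rows = list(csv.reader(io.StringIO(data)))
print(rows[1][1])
# {'en': {'label': 'Organization Name'}}
```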

> We do not (yet) have the linter set up for Python in this repository, so I installed and ran pylint locally, with the result that several changes are needed:
>
> $ pip3 install --user pylint
> ...
> $ pylint ./pdc-process-base-fields
> ************* Module pdc-process-base-fields
> pdc-process-base-fields:92:39: C0303: Trailing whitespace (trailing-whitespace)
> pdc-process-base-fields:121:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
> pdc-process-base-fields:1:0: C0103: Module name "pdc-process-base-fields" doesn't conform to snake_case naming style (invalid-name)
> pdc-process-base-fields:90:0: C0116: Missing function or method docstring (missing-function-docstring)
> pdc-process-base-fields:112:8: W0622: Redefining built-in 'input' (redefined-builtin)
> pdc-process-base-fields:64:0: W0611: Unused import json (unused-import)
>
> -----------------------------------
> Your code has been rated at 7.00/10

Thanks -- I wasn't in the habit of linting. I'll fix the above and revise this PR.

> Overall, this is useful and looks good, with some minor complaints as seen above and in-line/below.

@kfogel
Contributor Author

kfogel commented Sep 22, 2025

In the meantime:

@jmergy, here is a CSV of the PDC base fields!

pdc-base-fields.csv

@kfogel
Contributor Author

kfogel commented Sep 23, 2025

> ... That or expanding the localizations into their own columns by some naming convention.

By the way, this change implements the above idea:

index 3637e1c..844ffc3 100755
--- pdc-process-base-fields
+++ pdc-process-base-fields
@@ -109,9 +109,10 @@ def main():
     args = arg_parser.parse_args()
 
     if args.output_format == "csv":
-        input = pandas.read_json(sys.stdin)
+        json_input = json.load(sys.stdin)
+        parsed_input = pandas.json_normalize(json_input)
         # Switch to index=True to include a first column that shows DataFrame index
-        input.to_csv(sys.stdout, index=False, encoding='utf-8')
+        parsed_input.to_csv(sys.stdout, index=False, encoding='utf-8')
     else:  # can't happen
         raise ValueError(
             f"ERROR: Unknown output format 'f{args.output_format}' requested")

Here's the effect it has:

$ cat base-fields.json | ./pdc-process-base-fields | grep zh
label,category,dataType,createdAt,shortCode,description,valueRelevanceHours,sensitivityClassification,localizations.en.label,localizations.en.language,localizations.en.createdAt,localizations.en.description,localizations.en.baseFieldShortCode,localizations.zh.label,localizations.zh.language,localizations.zh.createdAt,localizations.zh.description,localizations.zh.baseFieldShortCode
Organization Name,organization,string,2025-08-18T20:01:55.057921+00:00,organization_name,,,restricted,Organization Name,en,2025-08-18T20:01:55.973012+00:00,,organization_name,组织名称,zh,2025-08-18T20:01:55.973215+00:00,组织的名称(参见「组织法定名称」),organization_name
$ 

I kind of like this as a solution. I'll implement it, clean up the lint items too, rebase, and resubmit this PR.
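For illustration, here is a minimal sketch of what pandas.json_normalize does with a record shaped like a base field (the record below is made up, with only a couple of localization keys):

```python
import pandas

# A hypothetical base-field record with a nested localizations object.
records = [{
    "label": "Organization Name",
    "shortCode": "organization_name",
    "localizations": {
        "en": {"label": "Organization Name", "language": "en"},
        "zh": {"label": "组织名称", "language": "zh"},
    },
}]

# json_normalize flattens the nested dicts into dotted column names,
# e.g. localizations.en.label, localizations.zh.label, and so on.
frame = pandas.json_normalize(records)
print(sorted(frame.columns))
```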

kfogel added a commit that referenced this pull request Sep 23, 2025
As suggested by @bickelj in PR #211 (#pullrequestreview-3252907800).

The only thing in our current PDC base fields that would be affected by
this is localizations.  Before this change, "localizations" would be a
single column in the CSV, and would have a JSON structure as its value
-- but that JSON structure would incorrectly use single quotes for
string values instead of double quotes, because pandas wasn't treating
it as JSON.  As far as pandas was concerned, we inhaled some JSON, got
an internal data structure, and then wrote that data structure out to
CSV as a *Python*-syntax string.  There was a solution available
(re-encode the value as JSON before writing), but if the whole point
of converting the base fields JSON to CSV is to get *CSV*, then asking
recipients to parse further JSON out of the CSV seems non-optimal.
Better to give them more columns, in a predictable way, with values
directly readable in CSV format.

This change also takes care of the linting items @bickelj noted.
@kfogel kfogel force-pushed the add-process-base-fields-script branch from bbc471b to 9afc3dc on September 23, 2025 03:11
@kfogel
Copy link
Contributor Author

kfogel commented Sep 23, 2025

Okay, ready for re-review @bickelj! Thanks.

Collaborator

@bickelj bickelj left a comment


I see that only quoting fields that need them is typical. I stand corrected.

$ pylint ./pdc-process-base-fields

-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.00/10, +3.00)

Pylint now passes.

The expansion of localizations into columns works great; this is a big improvement.

@bickelj bickelj merged commit 5005149 into main Sep 23, 2025
4 checks passed
@bickelj bickelj deleted the add-process-base-fields-script branch September 23, 2025 16:48