Conversation
There was a problem hiding this comment.
Hurray! When I run it with --help, it prints the help I needed to successfully run it. And it did what it says it does, it produced a friendly CSV file!
The output is inconsistently quoted which is not an issue yet but could be an issue because there is no prohibition of a comma character, for example, in a base field label. For maximum compatibility I think all values should be quoted.
I see the localization JSON surrounded by quotation marks which is great for the CSV file. However, the JSON therein contained becomes invalid when quoted with single-quotes/apostrophes. I am not sure of the best solution, but maybe escaping these quotation marks, as ugly as that is. That or expanding the localizations into their own columns by some naming convention. Another acceptable stop-gap measure would be to omit the localizations until they are fully supported by the script.
We do not (yet) have the linter set up for Python in this repository, so I installed and ran pylint locally, with the result that several changes are needed:
$ pip3 install --user pylint
...
$ pylint ./pdc-process-base-fields
************* Module pdc-process-base-fields
pdc-process-base-fields:92:39: C0303: Trailing whitespace (trailing-whitespace)
pdc-process-base-fields:121:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
pdc-process-base-fields:1:0: C0103: Module name "pdc-process-base-fields" doesn't conform to snake_case naming style (invalid-name)
pdc-process-base-fields:90:0: C0116: Missing function or method docstring (missing-function-docstring)
pdc-process-base-fields:112:8: W0622: Redefining built-in 'input' (redefined-builtin)
pdc-process-base-fields:64:0: W0611: Unused import json (unused-import)
-----------------------------------
Your code has been rated at 7.00/10Overall, this is useful and looks good, with some minor complaints as seen above and in-line/below.
I think it is automatically quoting commas properly when generating the CSV -- that's what causes the
I see what's happening, yeah. The reason this doesn't happen for the JSON blobs that are empty (that is, all of the localization values except for the one belonging to But shouldn't a CSV parser just treat the double-quoted blob as a string? That is, why would the double quotes be part of the value returned by a CSV parser? They're not themselves doubled -- they're just there to protect the commas inside. I ask out of genuine puzzlement, because when I ran (Obviously, I've elided a great many I think, though, that this may be because IOW, are we sure there's a bug here? I'll do some more testing with some other libraries (like Python's own CSV parser).
Thanks -- I wasn't in the habit of linting. I'll fix the above and revise this PR.
|
|
In the meantime: @jmergy, here is a CSV of the PDC base fields! |
By the way, this change implements the above idea: Here's the effect it has: I kind of like this as a solution. I'll implement it, clean up the lint items too, rebase, and resubmit this PR. |
As suggested by @bickelj in PR #211 (#pullrequestreview-3252907800). The only thing in our current PDC base fields that would affected by this is localizations. Before this change, "localizations" would be a single column in the CSV, and would have a JSON structure as its value -- but that JSON structure would incorrectly use single quotes for string values instead of double quotes, because pandas isn't thinking of it as JSON. As far as pandas was concerned, we inhaled some JSON, got an internal data structure, and then we were writing that data structure out as a *Python* syntax string to CSV. There was a solution available (re-encode the value as JSON before writing), but if the whole point of converting the base fields JSON to CSV is to get *CSV*, then asking recipients to parse further JSON out of the CSV seems non-optimal lose. Better to give them more columns, in a predictable way, with values directly readable in CSV format. This change also takes care of the linting items @bickelj noted.
bbc471b to
9afc3dc
Compare
|
Okay, ready for re-review @bickelj! Thanks. |
bickelj
left a comment
There was a problem hiding this comment.
I see that only quoting fields that need them is typical. I stand corrected.
$ pylint ./pdc-process-base-fields
-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.00/10, +3.00)Pylint now passes.
The expansion of localizations into columns works great, this is a big improvement.
As per this Zulip conversation with @bickelj:
https://chat.opentechstrategies.com/#narrow/channel/75-PDC/topic/base.20fields/near/238401