Change Character dtype to U1? #14
Comments
I can't find the reason that we used S1 rather than U1 in sgkit originally, but this seems like a reasonable change to me. It would be worth checking that sgkit's tests still pass with U1.
I had a quick look, and most do, @tomwhite: https://github.com/pystatgen/sgkit/pull/1208. I think you'd need to have a look at the rest, as it gets into the guts of the VCF writing code.
FWIW, downstream code in vcztools accepts either kind="U" or "S" and converts to "S" for printing to VCF. This seems like the right approach from a "be liberal in what you accept" perspective.
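To make the "accept either kind" idea concrete, here is a minimal sketch in NumPy. The helper name to_vcf_bytes is hypothetical and this is not the vcztools code; it only assumes that the values are ASCII, so that casting from kind "U" to kind "S" cannot fail.

```python
import numpy as np


def to_vcf_bytes(values):
    """Hypothetical helper: accept kind "U" or "S" arrays, return kind "S"."""
    values = np.asarray(values)
    if values.dtype.kind == "S":
        # Already bytes; nothing to convert.
        return values
    if values.dtype.kind == "U":
        # Cast encodes as ASCII; non-ASCII characters raise UnicodeEncodeError.
        return values.astype("S")
    raise TypeError(f"expected a string-like array, got dtype {values.dtype}")


# Both input kinds end up as the same bytes array.
from_text = to_vcf_bytes(np.array(["A", ".", "T"], dtype="U1"))
from_bytes = to_vcf_bytes(np.array(["A", ".", "T"], dtype="S1"))
```

With a conversion like this at the output boundary, the stored dtype could be U1 without affecting what gets printed to VCF.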
Currently the dtype for Character columns is "S1", which leads to values being returned as bytes rather than str. This tripped me up, as comparing with "." for example here doesn't find missing values.
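A small NumPy sketch of the pitfall, using a made-up three-value column (the data and variable names are illustrative, not taken from the project):

```python
import numpy as np

# Hypothetical Character column with "." marking a missing value.
calls = ["A", ".", "T"]
as_bytes = np.array(calls, dtype="S1")
as_text = np.array(calls, dtype="U1")

# Elements of an "S1" array are bytes; elements of a "U1" array are str.
bytes_values = as_bytes.tolist()
text_values = as_text.tolist()

# A str literal never equals a bytes value, so the missing marker is not found.
found_in_bytes_with_str = "." in bytes_values
found_in_bytes_with_bytes = b"." in bytes_values
found_in_text_with_str = "." in text_values

# Vectorised masks work when the literal matches the array's kind.
missing_mask_text = as_text == "."
missing_mask_bytes = as_bytes == b"."

# Casting to "U1" gives str values for downstream analysis (ASCII data assumed).
decoded = as_bytes.astype("U1")
```

With "S1" the caller has to remember to compare against b"."; with "U1" the natural comparison against "." works.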
Is there a strong reason for using S1 here rather than U1? I think it would be simpler to regard all string-like values as Unicode for downstream analysis.