Change Character dtype to U1? #14

Open
jeromekelleher opened this issue Feb 25, 2024 · 3 comments

@jeromekelleher
Contributor

Currently the dtype for Character columns is "S1", which leads to values being returned as bytes rather than str:

dask.array<open_dataset-variant_IC2, shape=(208, 2), dtype=|S1, chunksize=(208, 2), chunktype=numpy.ndarray>
Dimensions without coordinates: variants, INFO_IC2_dim                                         
Attributes:
    comment:  INFO,Type=Character,Number=2
[[b'' b'']
 [b'' b'']
 [b'' b'']
 [b'' b'']                                     
 [b'' b'']                                     

This tripped me up: comparing against the str ".", for example here, doesn't find missing values, because the stored values are bytes.

Is there a strong reason for using S1 here rather than U1? I think it would be simpler to regard all string-like values as Unicode for downstream analysis.
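To illustrate the pitfall, here is a minimal sketch using a small hand-made NumPy array rather than the actual dataset. The array contents and variable names are assumptions for illustration only. It shows that Python bytes never compare equal to str, and that the same comparison works once the values are viewed as U1.

```python
import numpy as np

# Hypothetical Character values stored as single bytes (dtype "S1").
s1 = np.array([b".", b"A", b""], dtype="S1")

# Plain Python bytes never compare equal to str, so a str "." finds nothing.
naive_hits = [value == "." for value in s1.tolist()]

# Comparing against a bytes literal works for the S1 array.
bytes_hits = s1 == b"."

# Casting to U1 (ASCII decode) lets the str comparison work as expected.
u1 = s1.astype("U1")
str_hits = u1 == "."

print(naive_hits)           # [False, False, False]
print(bytes_hits.tolist())  # [True, False, False]
print(str_hits.tolist())    # [True, False, False]
```

The cast via `astype("U1")` assumes the bytes are ASCII, which is the usual case for VCF Character fields.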

@tomwhite
Collaborator

I can't find the reason that we used S1 rather than U1 in sgkit originally, but this seems like a reasonable change to me. It would be worth checking that sgkit's tests still pass with U1.

@jeromekelleher
Contributor Author

I had a quick look, and most do @tomwhite: https://github.com/pystatgen/sgkit/pull/1208

I think you'd need to have a look at the rest; it gets into the guts of the VCF writing code.

@jeromekelleher
Contributor Author

FWIW downstream code in vcztools accepts either kind="U" or "S" and converts to "S" for printing to VCF. This seems like the right approach from a "be liberal in what you accept" perspective.
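A rough sketch of that accept-either approach, not the actual vcztools code: the helper name `as_bytes_for_vcf` is hypothetical, and ASCII content is assumed.

```python
import numpy as np


def as_bytes_for_vcf(values):
    """Return a bytes ("S") array from either a "U" or an "S" input.

    Hypothetical helper for illustration; assumes ASCII content.
    """
    arr = np.asarray(values)
    if arr.dtype.kind == "S":
        # Already bytes: pass through unchanged.
        return arr
    if arr.dtype.kind == "U":
        # Unicode: encode each element to bytes.
        return np.char.encode(arr, "ascii")
    raise TypeError(f"expected dtype kind 'U' or 'S', got {arr.dtype.kind!r}")


from_unicode = as_bytes_for_vcf(np.array(["A", ".", ""], dtype="U1"))
from_bytes = as_bytes_for_vcf(np.array([b"A", b".", b""], dtype="S1"))

print(from_unicode.tolist())  # [b'A', b'.', b'']
print(from_bytes.tolist())    # [b'A', b'.', b'']
```

With a normalisation step like this at the output boundary, the stored dtype could be either S1 or U1 without affecting what gets written.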
