Change Character dtype to U1? #14

jeromekelleher · 2024-02-25T10:50:01Z

Currently the dtype for Character columns is "S1", which leads to values being returned as bytes rather than str:

dask.array<open_dataset-variant_IC2, shape=(208, 2), dtype=|S1, chunksize=(208, 2), chunktype=numpy.ndarray>
Dimensions without coordinates: variants, INFO_IC2_dim
Attributes:
    comment:  INFO,Type=Character,Number=2
[[b'' b'']
 [b'' b'']
 [b'' b'']
 [b'' b'']
 [b'' b'']

This tripped me up, as comparing with "." here, for example, doesn't find missing values: the values are bytes, so they never equal the str ".".

Is there a strong reason for using S1 here rather than U1? I think it would be simpler to regard all string-like values as Unicode for downstream analysis.
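To illustrate the difference, here is a minimal sketch using plain NumPy arrays rather than the actual dataset. The array contents and variable names are made up for illustration; only the "S1" and "U1" dtypes come from the discussion above.

```python
import numpy as np

# Hypothetical Character column with two values per variant, stored as bytes ("S1").
as_bytes = np.array([[b".", b"A"], [b"", b"."]], dtype="S1")

# Elements come back as bytes, and a bytes object never equals a str in Python 3.
first = as_bytes.tolist()[0][0]
assert isinstance(first, bytes)
assert first != "."

# Comparing against a bytes literal works, but callers have to remember to do it.
missing_bytes = as_bytes == b"."

# Converting to "U1" decodes the ASCII bytes, so plain str comparisons behave as expected.
as_str = as_bytes.astype("U1")
missing_str = as_str == "."
```

With "U1", the natural `== "."` check finds the missing values without the caller needing to know how the column is stored.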

tomwhite · 2024-02-26T10:48:17Z

I can't find the reason that we used S1 rather than U1 in sgkit originally, but this seems like a reasonable change to me. It would be worth checking that sgkit's tests still pass with U1.

jeromekelleher · 2024-03-05T09:32:44Z

I had a quick look, and most do @tomwhite: https://github.com/pystatgen/sgkit/pull/1208

I think you'd need to have a look at the rest, as it gets into the guts of the VCF writing code.

jeromekelleher · 2024-07-09T14:42:10Z

FWIW downstream code in vcztools accepts either kind="U" or "S" and converts to "S" for printing to VCF. This seems like the right approach from a "be liberal in what you accept" perspective.
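A rough sketch of that pattern, for anyone following along. This is not the vcztools implementation; `to_vcf_bytes` is a hypothetical helper name, and it assumes the values are ASCII-only.

```python
import numpy as np


def to_vcf_bytes(values):
    """Hypothetical helper: accept a "U" or "S" array and return an "S" array."""
    arr = np.asarray(values)
    if arr.dtype.kind == "S":
        # Already bytes, nothing to convert.
        return arr
    if arr.dtype.kind == "U":
        # Assumes ASCII-only content; non-ASCII characters raise UnicodeEncodeError.
        return arr.astype("S")
    raise TypeError(f"expected a string-like array, got dtype {arr.dtype}")


from_unicode = to_vcf_bytes(np.array(["A", "."], dtype="U1"))
from_bytes = to_vcf_bytes(np.array([b"A", b"."], dtype="S1"))
```

Checking `dtype.kind` rather than the exact dtype means the helper is indifferent to the string width, so producers can switch between "S1" and "U1" without breaking the writer.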

jeromekelleher mentioned this issue Mar 5, 2024

Partial changes for VCF Character S->U sgkit-dev/sgkit#1208

Draft

jeromekelleher mentioned this issue Jun 7, 2024

simulate_genotype_call_dataset creates alleles as byte strings sgkit-dev/sgkit#1222

Open

jeromekelleher mentioned this issue Jul 9, 2024

Char fields added as Unicode not string sgkit-dev/bio2zarr#268

Closed

jeromekelleher mentioned this issue Jul 9, 2024

Fix missing format sgkit-dev/vcztools#17

Merged
