Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query list samples command #58

Merged
merged 1 commit into from
Aug 21, 2024
Merged

Conversation

Will-Tyler
Copy link
Contributor

@Will-Tyler Will-Tyler commented Aug 17, 2024

Overview

This pull request implements the bcftools query --list-samples command from bcftools in vcztools. The vcztools command loads the sample IDs from the VCF Zarr group and prints them.

This pull request closes #51.

Example usage

vcztools query -l vcz_test_cache/sample.vcf.vcz
NA00001
NA00002
NA00003
vcztools query --list-samples vcz_test_cache/sample.vcf.vcz
NA00001
NA00002
NA00003

Testing

I added some tests that compare bcftools' output with vcztools' output along with some unit tests. The code introduced in this pull request has good coverage.

The coverage tool shows no coverage on the code added in this PR because the tests run the code in a different process. Most of the VCF writer code is also not covered because of this testing approach.

References

@tomwhite
Copy link
Contributor

The coverage tool shows no coverage on the code added in this PR because the tests run the code in a different process.

I've been wondering what our testing strategy should be, and this is one reason to add unit tests that call functions directly (i.e. not via the CLI). In this case it should just be a matter of adding a unit test for list_samples.

I think we need validation tests too of course, but these are less focused on exercising test coverage and more about checking that we match bcftools across a wide range of CLI parameters. So in this case I think it's fine to keep the tests you've already added as well.

Most of the VCF writer code is also not covered because of this testing approach.

I think a lot of the missing VCF writer coverage is because we are not exercising VCF header generation yet, but that's a bigger topic - see #47 and related issues.

Copy link
Contributor

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Probably best to wait for #62 before we merge this.

list_samples(vcz_path, output=tmp_path / "sample_ids.txt")

with open(tmp_path / "sample_ids.txt") as file:
assert file.read() == expected_output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With #62 we can avoid testing every command twice (once with StringIO, and once with a file path). So I think it's OK to remove the file path test here.

import zarr


def list_samples(vcz_path, output=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can output be None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print will default to sys.stdout if the file argument is None. I could make the output default sys.stdout to make this more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, making the output default to sys.stdout breaks the tests. I think because Python evaluates sys.stdout when it reads the function definition, but then stdout is changed during the validation tests.

Anyway, I think it is okay because print's file argument can be None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK. I noticed it because view explicitly specifies sys.stdout when it calls write_vcf. Not a big deal, but it may be worth making them consistent (although that might fall out of #54 anyway).

@tomwhite tomwhite merged commit eb04b7f into sgkit-dev:main Aug 21, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add query command to list samples (-l/--list-samples)
2 participants