Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcfR2tidy can't handle files with empty INFO #200

Open
alkaZeltser opened this issue Sep 17, 2022 · 4 comments
Open

vcfR2tidy can't handle files with empty INFO #200

alkaZeltser opened this issue Sep 17, 2022 · 4 comments

Comments

@alkaZeltser
Copy link

alkaZeltser commented Sep 17, 2022

Firstly, thank you for a wonderful package!
I, unfortunately, have discovered a bug with an interesting edge-case.
I have a VCF that contains no information in the INFO column (. in each row) and accordingly no lines in the header with information on any INFO field (no ##INFO=<> lines).

This becomes a problem when calling the vcfR2tidy function. It generates the following error:
Error in strsplit(unlist(x), split = "=") : non-character argument

This error is resolved when a dummy INFO line is added to the @meta object.

I have traced the source to the following function which parses the INFO and FORMAT header information, assuming that a line starting with ##INFO exists:

vcf_field_names <- function(x, tag = "INFO") {

Reproducible example:

data(vcfR_test)
## cause error ##
# remove all INFO lines in @meta object
INFO.meta.lines <- grepl("^##INFO", vcfR_test@meta);
vcfR_test@meta <- vcfR_test@meta[!INFO.meta.lines];
# remove all data from INFO column of @fix object
INFO.col.index <- 8;
vcfR_test@fix[, INFO.col.index] <- rep(NA, nrow(vcfR_test@fix));

vcfR2tidy(vcfR_test)

Error in strsplit(unlist(x), split = "=") : non-character argument

## resolve error ##
dummy.INFO.line <- "##INFO=<ID=AF>"
vcfR_test@meta[length(vcfR_test@meta) + 1] <- dummy.INFO.line

vcfR2tidy(vcfR_test)

Extracting gt element GT
Extracting gt element GQ
Extracting gt element DP
Extracting gt element HQ

@knausb
Copy link
Owner

knausb commented Sep 30, 2022

Hi @alkaZeltser , thanks for posting this and thank you for creating a minimal reproducible example. That greatly facilitates this conversation! I have reproduced your example and agree that the behavior you report is accurate. I have some questions that I hope you can help to address.

You report VCF data where the INFO column is populated by NA (in R we use NA, the VCF specification uses '.'). Because this column has no data it is not documented in the meta region. If there were any data in the 'INFO' column I do feel that it should be described in the meta region and if it were not I would interpret this as deviating from the VCF specification, and I would not feel obligated to address it. But you're reporting no records in the meta AND missing data in the INFO column. Because I have never encountered your situation I took a look at the VCF specification to make sure you're reporting a reasonable situation. And I do believe it's reasonable. So I'm curious, how was this file created?

The issue appears to be that we have not anticipated a situation where there are zero INFO records in the meta region. Thank you for proposing a solution! If we move forward on this it seems to me we would be better off engineering a solution where it is recognized that there are no INFO records in the meta region as opposed to adding dummy data. I think I can handle this. But I thought I should address your proposal and see if you agree.

Thanks!

@alkaZeltser
Copy link
Author

Hi Brian, thank you for your reply!

So I'm curious, how was this file created?

It is indeed an unusual file from my experience; very minimalist.
As you say, the INFO column is populated by '.' in the original VCF.
This VCF file was created by a genotyping core with data from a genotyping experiment using the Global Screening Array on an Illumina machine (I don't know the exact one, unfortunately). The header records VCF format version 4.1 and processing via gtc_to_vcf 1.2.1. The core uses some of their own quality control procedures in addition to the Illumina defaults to filter certain sites and samples, and that's pretty much it.
The FORMAT column is also very minimal, containing just two fields: GT:GQ
The QUAL column is also completely empty (just '.').

If we move forward on this it seems to me we would be better off engineering a solution where it is recognized that there are no INFO records in the meta region as opposed to adding dummy data.

I completely agree, my solution was simply a quick hack/sanity check. Accounting for a case with an empty INFO column and no INFO records in the VCF header would be more appropriate.

@knausb
Copy link
Owner

knausb commented Oct 3, 2022

Hi @alkaZeltser , I'm not familiar with gtc_to_vcf. Thanks for letting me know.

I think I have a solution for your issue. It's in the master branch now. If you'd like to give it a try, the following should install it.

devtools::install_github(repo="knausb/vcfR")

Note that you will need a compiler to install from GitHub. If you do try it and notice anything, please let me know and we can try again. If you're interested in changes I made, you should be able to see the diff below.

eb7124a

Thanks and please let me know if you do not feel this addresses your issue!
Brian

@alkaZeltser
Copy link
Author

Thank you for responding so quickly! I will try to test out the new code.
For reference, I found that GTCtoVCF is an Illumina tool with documentation here: https://github.com/Illumina/GTCtoVCF

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants