fix reading VCF with no header #5206

Merged 5 commits into master from fix-vcf-upload on Dec 7, 2024

ClayBirkett
Member

@ClayBirkett ClayBirkett commented Nov 14, 2024

When the VCF has no header, the first few accessions are skipped during loading.
This modification works on the transposed file.

#5203

Checklist

  • Refactoring only
  • Documentation only
  • Fixture update only
  • [x] Bug fix
    • The relevant issue has been closed.
    • Further work is required.
  • New feature
    • Relevant tests have been created and run.
    • Data was added to the fixture
      • Data was added via a patch in /t/data/fixture/patches/.
    • User-Facing Change
      • The user manual in /docs has been updated.
    • Any new Perl has been documented using perldoc.
    • Any new JavaScript has been documented using JSDoc.
    • Any new legacy JavaScript has been moved from /js to /js/source/legacy.

@lukasmueller
Member

Instead of starting at the first line that contains number-slash-number, shouldn't the parsing simply start at the line after the one starting with #CHROM?

@ClayBirkett
Member Author

The error occurs when reading the transposed VCF, so there is no #CHROM line. The problem may lie in how the transposed file is created rather than in how the file is read.

@lukasmueller
Member

In the code it actually skips 8 entries, which is correct for the untransposed file, but probably incorrect for the transposed file?

@ClayBirkett
Member Author

I have it working now. The problem was that the parse_with_plugin() function was reading a fixed number of comment lines, and that count cannot be relied on. I changed it to read all lines starting with "##", the comment lines. The next_genotype() function then skips chrom, pos, id, ref, and alt correctly.
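For readers following the thread, the idea of consuming every "##" line instead of a fixed count can be sketched as below. This is a Python illustration only, not the Perl code in this PR; the function name `skip_meta_lines` and the sample inputs are hypothetical.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple


def skip_meta_lines(lines: Iterable[str]) -> Tuple[List[str], Iterator[str]]:
    """Consume every leading line that starts with '##'.

    Returns (meta_lines, remaining), where `remaining` yields the first
    non-'##' line and everything after it. Works for zero or many '##' lines.
    """
    it = iter(lines)
    meta: List[str] = []
    for line in it:
        if line.startswith("##"):
            meta.append(line.rstrip("\r\n"))
        else:
            # Re-attach the first non-meta line so the caller still sees it.
            return meta, itertools.chain([line], it)
    # Input ended while still inside the meta block (or was empty).
    return meta, iter(())


# Hypothetical inputs: one file with two meta lines, one with none.
with_header = ["##fileformat=VCFv4.2\n", "##source=demo\n", "#CHROM\tPOS\n", "1\t100\n"]
without_header = ["#CHROM\tPOS\n", "1\t100\n"]

meta_a, rest_a = skip_meta_lines(with_header)
meta_b, rest_b = skip_meta_lines(without_header)
first_a = next(rest_a)
first_b = next(rest_b)
```

Because nothing depends on how many "##" lines exist, both inputs leave the reader positioned at the same first non-comment line, which is the property a fixed skip count does not give.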

@lukasmueller
Member

lukasmueller commented Nov 29, 2024

When I try to upload a test file, I see the following error in the log:
[error] Attribute (tempfile) does not pass the type constraint because: Validation failed for 'Str' with value undef at /home/production/cxgn/local-lib/lib/perl5/x86_64-linux-gnu-thread-multi/Moose/Object.pm line 24
Moose::Object::new('CXGN::UploadFile', 'HASH(0x5adc3ba7e560)') called at /home/production/cxgn/sgn/bin/../lib/SGN/Controller/AJAX/GenotypesVCFUpload.pm line 380
SGN::Controller::AJAX::GenotypesVCFUpload::upload_genotype_verify_POST('SGN::Controller::AJAX::GenotypesVCFUpload=HASH(0x5adc350a9e10)', 'SGN=HASH(0x5adc3ac95260)') called at /home/production/cxgn/local-lib/lib/perl5/Catalyst/Action.pm line 358
Catalyst::Action::execute('Catalyst::Action=HASH(0x5adc354e9dd8)', 'SGN::Controller::AJAX::GenotypesVCFUpload=HASH(0x5adc350a9e10)', 'SGN=HASH(0x5adc3ac95260)') called at /home/production/cxgn/local-lib/lib/perl5/Catalyst.pm line 2060
eval {...} at /home/production/cxgn/local-lib/lib/perl5/Catalyst.pm line 2060
......

and the following error on the user interface:

The organism species you provided is not in the database! Please contact us.

NOTE: It worked after I made a genotyping project... So the genotyping project cannot be made on the fly.

@ClayBirkett
Member Author

I haven't seen that problem, but for me the web VCF upload is still renaming my stock names and causing the upload to fail. The $include_lab_numbers setting is enabled when you choose "accessions", and that removes "." from stock names. I'll have to remove or fix that code.

@alockrow
Contributor

alockrow commented Dec 2, 2024

@lukasmueller I was getting the same error until I made a new Genotyping Protocol instead of using the one already available on the fixture. It looks like the Protocol available on the fixture (GBS ApeKI genotyping v4) could be giving that error because its information format is out of date (see below).

image

@lukasmueller
Member

We should update the fixture to have the correct format for the protocol... could that be done in the context of this PR?

@isaak
Member

isaak commented Dec 5, 2024

From the CLI:

Loading works if there is at least one header line in the VCF file.

If there are no header lines, the protocol and marker data are loaded, but the allele data does not appear to be loaded. Expanding the 'Genotype Data' section hangs while retrieving the data. No obvious error is thrown.

@ClayBirkett
Member Author

If you are using cassava_test.vcf, the file itself may be causing the loading problems: its last line is truncated. I cleaned up the file to make it valid.
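A quick way to spot a truncated row like this before loading is to compare each row's column count with the first non-"##" line. The sketch below is a Python illustration, not part of this PR; it assumes a tab-delimited, untransposed VCF, and the name `find_malformed_rows` and the sample lines are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_malformed_rows(lines: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Return (line_number, found_columns, expected_columns) for each row
    whose tab-separated column count differs from the first non-'##' line
    (normally the #CHROM line)."""
    expected = None
    bad: List[Tuple[int, int, int]] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta lines
        n = len(line.split("\t"))
        if expected is None:
            expected = n  # the first non-meta line sets the expected width
            continue
        if n != expected:
            bad.append((lineno, n, expected))
    return bad


# Hypothetical sample: the last row is cut off after two columns.
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\n",
    "1\t10\trs1\n",
    "1\t20\n",
]
problems = find_malformed_rows(sample)
```

This only catches rows that lose whole columns. A row cut off in the middle of its final field keeps the right column count and would need a per-field check.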

@lukasmueller lukasmueller requested a review from isaak December 5, 2024 16:24
@lukasmueller lukasmueller merged commit 51b6389 into master Dec 7, 2024
4 checks passed
@lukasmueller lukasmueller deleted the fix-vcf-upload branch December 7, 2024 20:23
4 participants