Revisit ingestion of VIDRL flat files #161

joverlee521 · 2024-09-11T21:51:53Z

Brought up by @huddlej on Slack that OneDrive includes flat files. Ingesting the flat files should make the rest of #158 easier?

Revisit changes made in #103 and update it to work with the latest version of the flat files.

joverlee521 · 2024-10-04T23:37:06Z

Thanks @j23414 for investigating the latest flat files 🙏

Jotting down notes for updating the flat file ingest:

the vidrl_flat_file_column_map.tsv will definitely need to be updated
there is a single homologous titre column so our ingest needs to create a row per reference strain to capture these homologous titers
the reference strain uses the full strain name so we would no longer need the serum mapping 🎉
human sera pools include the reference strain so we would no longer need to keep track of the vaccine mapping 🎉

joverlee521 · 2024-10-07T22:34:35Z

there is a single homologous titre column so our ingest needs to create a row per reference strain to capture these homologous titers

Oh, there's a separate file for the reference panel results. Each flat file has a matching *_reference_panel.csv file that includes the references' homologous titers.

joverlee521 · 2024-10-07T23:04:26Z

The *_reference_panel.csv has a subset of the columns used in the main *_flat_file.csv and it only includes the shortened name of the antisera. So the antisera -> reference name mapping from the *_flat_file.csv will need to be preserved and reused when processing the matching *_reference_panel.csv file.
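A rough sketch of carrying the mapping over from one file to the other, assuming both files can be read as CSV rows. The column names (`antisera`, `reference_antigen`, `serum_strain`), the strain names, and the helper functions are all made up for illustration and are not the real VIDRL headers or anything in `tdb/vidrl_upload.py`:

```python
import csv
import io

# Toy stand-ins for a *_flat_file.csv / *_reference_panel.csv pair.
# All column names and strain names are hypothetical placeholders.
FLAT_FILE = """\
antisera,reference_antigen,test_virus,titre
EX/1,A/Example/1/2024,A/Test/10/2024,160
EX/1,A/Example/1/2024,A/Test/11/2024,320
EX/2,A/Example/2/2024,A/Test/10/2024,80
"""

REFERENCE_PANEL = """\
antisera,test_virus,titre
EX/1,A/Example/1/2024,640
EX/2,A/Example/1/2024,40
"""


def build_antisera_map(flat_rows):
    """Map each shortened antisera name to the full reference strain name."""
    mapping = {}
    for row in flat_rows:
        short, full = row["antisera"], row["reference_antigen"]
        # Fail loudly if one abbreviation points at two different strains.
        if mapping.setdefault(short, full) != full:
            raise ValueError(f"Conflicting reference names for antisera {short!r}")
    return mapping


def annotate_reference_panel(panel_rows, antisera_map):
    """Add the full reference strain name to each reference panel row."""
    annotated = []
    for row in panel_rows:
        short = row["antisera"]
        if short not in antisera_map:
            raise KeyError(f"No reference strain known for antisera {short!r}")
        annotated.append({**row, "serum_strain": antisera_map[short]})
    return annotated


antisera_map = build_antisera_map(csv.DictReader(io.StringIO(FLAT_FILE)))
panel = annotate_reference_panel(
    csv.DictReader(io.StringIO(REFERENCE_PANEL)), antisera_map
)
```

Raising on a conflicting or unknown abbreviation is a design choice for the sketch; the real ingest might prefer to log and skip those rows instead.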

huddlej · 2024-10-07T23:09:37Z

@joverlee521 I think we originally asked for the reference panel file and Sheena made it for us. Then later Sheena modified her script that produces the flat files to pull in the relevant information from the reference panel file, so we didn't have to parse that reference information separately.

Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?

We could jump on a huddle tomorrow to chat, if that's helpful. It's been a little while since I looked at these files, too...

joverlee521 · 2024-10-07T23:15:02Z

Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?

Yeah, looking at the *_flat_file.csv more closely, they are completely missing the reference titer measurements. They only include the results for test virus x reference virus, but do not include any of the reference virus x reference virus results.

huddlej · 2024-10-07T23:56:49Z

Got it. I can't see the latest files any more (curse OneDrive!), but in the last view I had of those files, they included columns for reference antigen, reference passage, and homologous titre which would represent most of the reference titer measurements we need, but maybe it isn't enough.

To get those homologous reference values into our standard format we would need to make new records for each unique combination of antigen, passage, and titer with the test virus value equal to the reference antigen, test virus passage equal to reference passage, and titre equal to homologous titre. We would be missing the antisera and ferret columns, though. We don't need antisera when it is just an abbreviation of the reference virus name, but we probably want ferret. That supports the case for parsing the separate reference panel file, if that file has that information.
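Roughly, the record-building step described above could look like the sketch below. It assumes the flat file has columns named `reference_antigen`, `reference_passage`, and `homologous_titre`; those names, the strain names, and the `homologous_records` helper are hypothetical, and `ferret` is left empty because the flat file would not supply it:

```python
import csv
import io

# Toy flat file; column names and values are hypothetical placeholders.
FLAT_FILE = """\
reference_antigen,reference_passage,homologous_titre,test_virus,test_passage,titre
A/Example/1/2024,E3,640,A/Test/10/2024,MDCK1,160
A/Example/1/2024,E3,640,A/Test/11/2024,MDCK1,320
A/Example/2/2024,SIAT2,1280,A/Test/10/2024,MDCK1,80
"""


def homologous_records(flat_rows):
    """Build one record per unique (antigen, passage, homologous titre)."""
    seen = set()
    records = []
    for row in flat_rows:
        key = (
            row["reference_antigen"],
            row["reference_passage"],
            row["homologous_titre"],
        )
        if key in seen:
            continue  # Same homologous measurement repeated on another row.
        seen.add(key)
        antigen, passage, titre = key
        records.append(
            {
                "reference_antigen": antigen,
                "reference_passage": passage,
                "test_virus": antigen,    # test virus = reference antigen
                "test_passage": passage,  # test passage = reference passage
                "titre": titre,           # titre = homologous titre
                "ferret": None,           # not available in the flat file
            }
        )
    return records


records = homologous_records(csv.DictReader(io.StringIO(FLAT_FILE)))
```

With the toy data, the two rows that share the first reference collapse into a single homologous record, so three input rows yield two records.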

joverlee521 · 2024-10-08T20:48:00Z

We chatted about this today and decided that we do need to ingest the additional reference_panel.csv. This will ensure our ingest of the flat files includes all of the same measurements as the previous Excel files.

I'll update tdb/vidrl_upload.py to work with the new flat files and test on a couple Excel/flat file pairs to get a diff of the two paths.

The column map will be more complicated with the need to ingest two slightly different flat files (_flat_file.csv and _reference_panel.csv) as discussed in #161 (comment). I also found myself constantly toggling back and forth between the separate column_map.tsv and the upload script to figure out how the columns are being used, so it makes more sense to just hard-code the column map in the script.
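As an illustration only, a hard-coded column map covering both file types might be shaped something like this. The source headers, the standardized field names, and the `standardize_row` helper are hypothetical and not taken from the actual files or from `tdb/vidrl_upload.py`:

```python
# Hypothetical column maps, keyed by file type. The left-hand headers and
# right-hand field names are placeholders, not the real VIDRL columns.
COLUMN_MAP = {
    "flat_file": {
        "Test Antigen": "virus_strain",
        "Reference Antigen": "serum_strain",
        "Titre": "titer",
    },
    "reference_panel": {
        "Test Antigen": "virus_strain",
        "Antisera": "serum_abbreviation",
        "Titre": "titer",
    },
}


def standardize_row(row, file_type):
    """Rename the mapped columns of one row and drop everything else."""
    column_map = COLUMN_MAP[file_type]
    missing = [column for column in column_map if column not in row]
    if missing:
        raise KeyError(f"{file_type} row is missing columns: {missing}")
    return {new: row[old] for old, new in column_map.items()}


example = standardize_row(
    {
        "Test Antigen": "A/Test/10/2024",
        "Reference Antigen": "A/Example/1/2024",
        "Titre": "160",
        "Unused Column": "ignored",
    },
    "flat_file",
)
```

Keeping both maps in one dict next to the code that uses them makes the differences between the two file types visible in one place, which is the main point of moving away from the separate TSV.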

joverlee521 assigned joverlee521 and j23414 Oct 4, 2024

huddlej self-assigned this Oct 9, 2024

joverlee521 linked a pull request Oct 17, 2024 that will close this issue

Update ingest for VIDRL flat files #164

Draft

5 tasks

joverlee521 mentioned this issue Nov 7, 2024

vidrl_upload: fix serum_strain used for human sera measurements #166

Closed

3 tasks
