Skip to content
This repository has been archived by the owner on Oct 30, 2021. It is now read-only.

merge "the same" variant in gvcf_variants and gvcf_variants_expanded #15

Open
deflaux opened this issue Aug 6, 2014 · 0 comments
Open

Comments

@deflaux
Copy link
Contributor

deflaux commented Aug 6, 2014

Since "the same" variant can occur on multiple records in our tables (see example records shown by query below), it breaks the experiment we are trying to do with table gvcf_variants_expanded.

In VCF format, the same variant can be encoded in multiple ways and we're not trying to solve that deeper issue here. Instead its just a simplistic grouping of data -> when the location in the genome, reference bases, and alternate bases match exactly, group those together in the same single record in the table.

SELECT
  contig_name,
  start_pos,
  reference_bases,
  alt,
  numsamples,
  dataset
FROM (
  SELECT
    contig_name,
    start_pos,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alt,
    COUNT(call.callset_name) WITHIN RECORD AS numsamples,
    'gvcf' AS dataset,
  FROM
    [google.com:biggene:test.pgp_gvcf_variants]
  WHERE
    contig_name = '2'
    AND start_pos=222324034
    AND reference_bases='CTA'),
  (
  SELECT
    contig_name,
    start_pos,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alt,
    COUNT(call.callset_name) WITHIN RECORD AS numsamples,
    'exp' AS dataset,
  FROM
    [google.com:biggene:test.pgp_gvcf_variants_expanded]
  WHERE
    contig_name = '2'
    AND start_pos=222324034
    AND reference_bases='CTA')
ORDER BY
  dataset,
  alt
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant