Issues with gene data #95

Open
ljpearlman opened this issue Aug 2, 2021 · 4 comments

Comments

@ljpearlman

I found several genes linked to both human and mouse species. The tables below list the species associated with each of those genes; the last column lists the tables in which that gene/species pair was found ("dataset_organism" means the join path gene -> dataset_gene -> dataset -> dataset_organism).
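The join path described above can be sketched roughly as follows. This is only an illustration, not the query that was actually run: the schema is a simplified guess (column names such as `dataset_gene.gene` and `dataset_organism.species` are assumptions), the rows are made up, and it uses an in-memory SQLite database rather than the real catalog.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns needed for the join path.
SCHEMA = """
CREATE TABLE gene (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE dataset (id TEXT PRIMARY KEY);
CREATE TABLE dataset_gene (dataset TEXT REFERENCES dataset(id),
                           gene TEXT REFERENCES gene(id));
CREATE TABLE dataset_organism (dataset TEXT REFERENCES dataset(id),
                               species TEXT);
"""

# gene -> dataset_gene -> dataset -> dataset_organism
GENE_SPECIES_QUERY = """
SELECT DISTINCT g.id, g.name, o.species
FROM gene AS g
JOIN dataset_gene AS dg ON dg.gene = g.id
JOIN dataset AS d ON d.id = dg.dataset
JOIN dataset_organism AS o ON o.dataset = d.id
ORDER BY g.id, o.species
"""


def build_demo_db():
    """Create an in-memory database with made-up rows (not real FaceBase data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [("EX:1", "GeneA"), ("EX:2", "GeneB")])
    conn.executemany("INSERT INTO dataset VALUES (?)", [("DS1",), ("DS2",)])
    conn.executemany("INSERT INTO dataset_gene VALUES (?, ?)",
                     [("DS1", "EX:1"), ("DS2", "EX:1"), ("DS1", "EX:2")])
    conn.executemany("INSERT INTO dataset_organism VALUES (?, ?)",
                     [("DS1", "Mus musculus"), ("DS2", "Homo sapiens")])
    return conn


def genes_with_multiple_species(conn):
    """Return {(gene id, gene name): [species, ...]} for genes seen with >1 species."""
    species_by_gene = {}
    for gene_id, name, species in conn.execute(GENE_SPECIES_QUERY):
        species_by_gene.setdefault((gene_id, name), []).append(species)
    return {gene: species for gene, species in species_by_gene.items()
            if len(species) > 1}
```

Calling `genes_with_multiple_species(build_demo_db())` reports only the made-up gene `EX:1`, because it is attached to one mouse dataset and one human dataset.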

One MGI gene with both human and mouse data; something's probably wrong with the "human" data entries:

| gene id | gene name | species | tables |
| --- | --- | --- | --- |
| MGI:99829 | Runx2 | Homo sapiens | dataset_organism |
| MGI:99829 | Runx2 | Mus musculus | gene_summary, dataset_organism, biosample |

Three HGNC genes with both human and mouse data; something's probably wrong with the "mouse" data entries:

| gene id | gene name | species | tables |
| --- | --- | --- | --- |
| HGNC:12660 | VAX1 | Homo sapiens | dataset_organism |
| HGNC:12660 | VAX1 | Mus musculus | dataset_organism |
| HGNC:6121 | IRF6 | Homo sapiens | dataset_organism |
| HGNC:6121 | IRF6 | Mus musculus | gene_summary, dataset_organism |
| HGNC:7391 | MSX1 | Homo sapiens | gene_summary, dataset_organism |
| HGNC:7391 | MSX1 | Mus musculus | gene_summary, dataset_organism |

A bunch of FaceBase-defined genes with both human and mouse data; I have no suggestions for these:

| gene id | gene name | species | tables |
| --- | --- | --- | --- |
| FACEBASE:1-4SAG | Fgfr1 | Homo sapiens | dataset_organism |
| FACEBASE:1-4SAG | Fgfr1 | Mus musculus | gene_summary, dataset_organism |
| FACEBASE:1-4SAJ | Fgfr2 | Homo sapiens | gene_summary, dataset_organism, clinical_assay |
| FACEBASE:1-4SAJ | Fgfr2 | Mus musculus | gene_summary, dataset_organism, biosample |
| FACEBASE:1-4SBW | Tgfb3 | Homo sapiens | gene_summary, dataset_organism |
| FACEBASE:1-4SBW | Tgfb3 | Mus musculus | gene_summary, dataset_organism |
| FACEBASE:1-4SBY | Wnt5a | Homo sapiens | dataset_organism |
| FACEBASE:1-4SBY | Wnt5a | Mus musculus | dataset_organism, gene_summary |
| FACEBASE:1-4SC2 | ABCA4 | Homo sapiens | dataset_organism |
| FACEBASE:1-4SC2 | ABCA4 | Mus musculus | dataset_organism |
| FACEBASE:1-4SCE | NOG | Homo sapiens | dataset_organism |
| FACEBASE:1-4SCE | NOG | Mus musculus | dataset_organism |
| FACEBASE:1-4SD0 | BMP4 | Homo sapiens | gene_summary, dataset_organism |
| FACEBASE:1-4SD0 | BMP4 | Mus musculus | dataset_organism, gene_summary |
| FACEBASE:1-QK1M | FOXE1 | Homo sapiens | dataset_organism |
| FACEBASE:1-QK1M | FOXE1 | Mus musculus | dataset_organism |
| FACEBASE:1-QK1T | WNT9B | Homo sapiens | dataset_organism |
| FACEBASE:1-QK1T | WNT9B | Mus musculus | dataset_organism |

@robes
Contributor

robes commented Aug 28, 2021

Seems like there are IDs for the BMP4 gene

If this is what we need to track down, I can pull the Gene ID out so that we can map from our existing terms.

@ljpearlman
Author

For that third group, it's possible that setting dataset_gene based on the species in dataset_organism will work; I think I just assumed there would be multiple organisms for each dataset because the dataset_organism table is many-to-many, but in practice that doesn't seem to be the case for most of them.
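A rough sketch of that idea, assuming the rows of dataset_gene and dataset_organism are available as simple (dataset, gene) and (dataset, species) pairs. The function name and the input shapes are hypothetical, not part of the actual schema or tooling. Datasets with zero or several organisms are set aside rather than guessed at.

```python
from collections import defaultdict


def infer_gene_species(dataset_genes, dataset_organisms):
    """Assign a species to each (dataset, gene) pair when the dataset has exactly one organism.

    dataset_genes: iterable of (dataset id, gene id) pairs (assumed shape).
    dataset_organisms: iterable of (dataset id, species) pairs (assumed shape).
    Returns (assigned, unresolved): a dict {(dataset, gene): species} and a list
    of (dataset, gene) pairs whose dataset has no organism or more than one.
    """
    species_by_dataset = defaultdict(set)
    for dataset, species in dataset_organisms:
        species_by_dataset[dataset].add(species)

    assigned = {}
    unresolved = []
    for dataset, gene in dataset_genes:
        species = species_by_dataset.get(dataset, set())
        if len(species) == 1:
            # Unambiguous: the dataset's only organism becomes the gene's species.
            assigned[(dataset, gene)] = next(iter(species))
        else:
            # Zero or multiple organisms: needs manual review.
            unresolved.append((dataset, gene))
    return assigned, unresolved
```

Keeping the unresolved pairs in a separate list makes it easy to see how many datasets really do have multiple organisms before committing to this approach.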

@ljpearlman
Author

This is done on fb-dev, except for the "has data" column (I have some questions about that). There's an ER diagram of the gene-related tables here.

Gene values in the data were translated like this:
For non-FACEBASE gene ids (e.g., MGI or HGNC), if there's a translation to an NCBI gene, then {source id, NCBI gene's species} is mapped to that gene.
For FACEBASE gene ids, if {source name or synonym, NCBI gene's species} has a unique case-insensitive match to an NCBI gene, then {source id, target species} is mapped to that gene.
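As a rough illustration of the two rules above (not the actual migration code), the lookup might be sketched like this. The function name, the `xrefs` translation dict, and the shape of `ncbi_genes` are all assumptions made for the example, and the ids used with it below are made up.

```python
def resolve_gene(source_id, source_names, data_species, xrefs, ncbi_genes):
    """Return the NCBI gene id for a source gene, or None if no clean mapping exists.

    source_id: e.g. an MGI, HGNC, or FACEBASE id.
    source_names: the source gene's name plus any synonyms.
    data_species: the species the gene is associated with in the data.
    xrefs: hypothetical dict {source id: NCBI gene id} for non-FACEBASE ids.
    ncbi_genes: hypothetical dict {NCBI gene id: {"species": str, "names": set of str}}.
    """
    if not source_id.startswith("FACEBASE:"):
        # Rule 1: use the id translation; the mapping only applies to the
        # NCBI gene's own species, so any other species counts as a mismatch.
        ncbi_id = xrefs.get(source_id)
        if ncbi_id is None or ncbi_genes[ncbi_id]["species"] != data_species:
            return None
        return ncbi_id

    # Rule 2: case-insensitive match on name or synonym within the species,
    # accepted only when exactly one NCBI gene matches.
    wanted = {name.casefold() for name in source_names}
    matches = [
        ncbi_id
        for ncbi_id, gene in ncbi_genes.items()
        if gene["species"] == data_species
        and wanted & {name.casefold() for name in gene["names"]}
    ]
    return matches[0] if len(matches) == 1 else None
```

Returning None for both "no match" and "several matches" keeps the sketch short; a real run would presumably record which of the two happened, since those cases need different follow-up.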

All but two of the alternate_id values in the gene table were empty. For those two, I verified that the mappings came up with the same values (one gene had a UniProt id as its value, but that id mapped to the right NCBI gene).

There were 16 genes (affecting 14 datasets) that didn't map this way: 2 HGNC genes that appeared in mouse gene_summary records, 1 MGI gene that appeared in a human dataset (the only association of a species to that dataset appears to be a dataset_organism row), 8 FACEBASE genes that didn't match any names or ids, and 5 FACEBASE genes that mapped to multiple NCBI genes.

| Gene ID | Name | Species in data | Problems | Matching NCBI IDs | Datasets (or gene summaries) |
| --- | --- | --- | --- | --- | --- |
| FACEBASE:1-4S9P | Col1 | NCBITaxon:10090 | no unique match | NCBI_Gene:12842, NCBI_Gene:12843 | 1-X5F8, 8C4 |
| FACEBASE:1-4SAW | Hand2 | NCBITaxon:10090 | no unique match | NCBI_Gene:102636514, NCBI_Gene:15111 | 1-X5K0, 8DC |
| FACEBASE:1-4SB2 | osterix/sp7 | NCBITaxon:7955 | no unique match | NULL | VHT |
| FACEBASE:1-4SCJ | panTro4 | NCBITaxon:10090 | no unique match | NULL | 1-50KP |
| FACEBASE:1-QK2C | SMC2 | NCBITaxon:9606 | no unique match | NCBI_Gene:10592, NCBI_Gene:83452 | 1-JVTJ |
| FACEBASE:1-QK2R | ARNT | NCBITaxon:9606 | no unique match | NCBI_Gene:375056, NCBI_Gene:405 | 1-JVTJ |
| FACEBASE:1-WKYJ | MIR124-3 | NCBITaxon:10090 | no unique match | NULL | 1-HW9A |
| FACEBASE:1-WKYP | mirlet7a-5 | NCBITaxon:10090 | no unique match | NULL | 1-HW9A |
| FACEBASE:1-WKYT | mirlet7b-5 | NCBITaxon:10090 | no unique match | NULL | 1-HW9A |
| FACEBASE:1-WKZ0 | mirlet7c-5 | NCBITaxon:10090 | no unique match | NULL | 1-HW9A |
| FACEBASE:1-WKZ4 | mirlet7d-5 | NCBITaxon:10090 | no unique match | NULL | 1-HW9A |
| FACEBASE:1-Y7WW | Ror2 | NCBITaxon:10090 | no unique match | NCBI_Gene:19883, NCBI_Gene:26564 | 1-Y7W6, 1-YQK8 |
| FACEBASE:1-YQW8 | Clathrin | NCBITaxon:10090 | no unique match | NULL | 1-YQW0, 2-1QKP |
| HGNC:6121 | IRF6 | NCBITaxon:10090 | no unique match, ontology/species mismatch | NULL | 8DG |
| HGNC:7391 | MSX1 | NCBITaxon:10090 | no unique match, ontology/species mismatch | NULL | 8CC |
| MGI:99829 | Runx2 | NCBITaxon:9606 | ontology/species mismatch | NULL | 1-JVTJ |

@robes
Contributor

robes commented Sep 9, 2021

I gather the issues with matching are mostly due to species mismatches. E.g., the mirlet7a-N entries seem to be mostly cases of a zebrafish gene being associated with a mouse dataset.

'panTro4' is just not a gene name. I can only guess the contributor mistook that field for the genome assembly. I can clean that up in the production database. [DONE]

How is 'has_data' populated: by a trigger, or by some out-of-band process? Could we just filter on the Dataset and Biosample associations/references using 'has value' in the facet picker?
