Commit ccdb1f2

feat: add biosample index (#769)
* Initial commit of biosample index
* Make minimal class
* Tidy up first draft of adding biosample index
* Add beginning of logic for checking if biosample from a studyindex is in biosample index
* Make early file for merging multiple biosample indices into one
* Finish adding basic iteration of biosample index, needs debugging
* Tweak slightly
* Modified the parser to accept JSON files
* Update biosample index
* Tests and docs
* Updating tests
* Revert GWAS catalog file
* fix(biosample index): update to match pre-commit standards
* fix(biosample index): merging indices fix
* fix(biosample index): update study index qc logic
* fix(biosample index): fix missing mock_biosample_index
* chore(biosample index): change datasource name from ontologies
* fix(biosample index): add dataset doc
* fix(biosample index): change dbXrefs to xrefs
* chore (biosample index): better commenting

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* fix(biosample index): various minor tweaks to biosample index
* fix(biosample index): minor bug
* fix(biosample index): fix merge shift to method
* feat(biosample index): make biosampleName not nullable

---------

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>
1 parent 148e26e commit ccdb1f2

19 files changed

+1735
-2
lines changed

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+---
+title: Biosample index
+---
+
+::: gentropy.dataset.biosample_index.BiosampleIndex
+
+## Schema
+
+--8<-- "assets/schemas/biosample_index.md"

docs/python_api/datasources/_datasources.md

Lines changed: 6 additions & 1 deletion
@@ -26,7 +26,7 @@ This section contains information about the data source harmonisation tools avai
 2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
 3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

-## Linkage desiquilibrium
+## Linkage disequilibrium

 1. [GnomAD](gnomad/_gnomad.md) v2.1.1 LD matrixes (7 ancestries)

@@ -37,3 +37,8 @@ This section contains information about the data source harmonisation tools avai
 ## Gene annotation

 1. [Open Targets Platform Target Dataset](open_targets/target.md) (derived from Ensembl)
+
+## Biological samples
+
+1. [Uberon](biosample_ontologies/_uberon.md)
+2. [Cell Ontology](biosample_ontologies/_cell_ontology.md)
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+title: Cell Ontology
+---
+
+The [Cell Ontology](http://www.obofoundry.org/ontology/cl.html) is a structured controlled vocabulary for cell types. It is used to annotate cell types in single-cell RNA-seq data and other omics data.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+title: Uberon
+---
+
+The [Uberon](http://uberon.github.io/) ontology is a multi-species anatomy ontology that integrates cross-species ontologies into a single ontology.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+title: biosample_index
+---
+
+::: gentropy.biosample_index.BiosampleIndexStep

poetry.lock

Lines changed: 2 additions & 1 deletion
Generated file; diff not rendered.
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+{
+  "type": "struct",
+  "fields": [
+    {
+      "name": "biosampleId",
+      "type": "string",
+      "nullable": false,
+      "metadata": {}
+    },
+    {
+      "name": "biosampleName",
+      "type": "string",
+      "nullable": false,
+      "metadata": {}
+    },
+    {
+      "name": "description",
+      "type": "string",
+      "nullable": true,
+      "metadata": {}
+    },
+    {
+      "name": "xrefs",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    },
+    {
+      "name": "synonyms",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    },
+    {
+      "name": "parents",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    },
+    {
+      "name": "ancestors",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    },
+    {
+      "name": "descendants",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    },
+    {
+      "name": "children",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    }
+  ]
+}
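
As a reading aid for the schema above, here is a minimal pure-Python sketch that checks a plain dictionary against the same field names, types, and nullability flags. `SCHEMA_FIELDS` and `check_record` are hypothetical names used only for illustration; they are not part of gentropy, where the schema is enforced through Spark.

```python
from __future__ import annotations

# Field name -> (is_array, nullable), transcribed by hand from the schema above.
SCHEMA_FIELDS: dict[str, tuple[bool, bool]] = {
    "biosampleId": (False, False),
    "biosampleName": (False, False),
    "description": (False, True),
    "xrefs": (True, True),
    "synonyms": (True, True),
    "parents": (True, True),
    "ancestors": (True, True),
    "descendants": (True, True),
    "children": (True, True),
}


def check_record(record: dict[str, object]) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems: list[str] = []
    for name, (is_array, nullable) in SCHEMA_FIELDS.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                problems.append(f"{name} must not be null")
            continue
        if is_array:
            # Arrays hold strings; null elements are allowed (containsNull is true).
            is_valid = isinstance(value, list) and all(
                item is None or isinstance(item, str) for item in value
            )
            if not is_valid:
                problems.append(f"{name} must be a list of strings")
        elif not isinstance(value, str):
            problems.append(f"{name} must be a string")
    # Report fields that the schema does not define.
    for name in record:
        if name not in SCHEMA_FIELDS:
            problems.append(f"unexpected field {name}")
    return problems
```

Only `biosampleId` and `biosampleName` are required; every other field may be absent or null.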

src/gentropy/biosample_index.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+"""Step to generate biosample index dataset."""
+from __future__ import annotations
+
+from gentropy.common.session import Session
+from gentropy.datasource.biosample_ontologies.utils import extract_ontology_from_json
+
+
+class BiosampleIndexStep:
+    """Biosample index step.
+
+    This step generates a Biosample index dataset from the various ontology sources. Currently Cell Ontology and Uberon are supported.
+    """
+
+    def __init__(
+        self,
+        session: Session,
+        cell_ontology_input_path: str,
+        uberon_input_path: str,
+        biosample_index_path: str,
+    ) -> None:
+        """Run Biosample index generation step.
+
+        Args:
+            session (Session): Session object.
+            cell_ontology_input_path (str): Input cell ontology dataset path.
+            uberon_input_path (str): Input uberon dataset path.
+            biosample_index_path (str): Output gene index dataset path.
+        """
+        cell_ontology_index = extract_ontology_from_json(cell_ontology_input_path, session.spark)
+        uberon_index = extract_ontology_from_json(uberon_input_path, session.spark)
+
+        biosample_index = cell_ontology_index.merge_indices([uberon_index])
+
+        biosample_index.df.write.mode(session.write_mode).parquet(biosample_index_path)

src/gentropy/config.py

Lines changed: 12 additions & 0 deletions
@@ -51,6 +51,16 @@ class GeneIndexConfig(StepConfig):
     _target_: str = "gentropy.gene_index.GeneIndexStep"


+@dataclass
+class BiosampleIndexConfig(StepConfig):
+    """Biosample index step configuration."""
+
+    cell_ontology_input_path: str = MISSING
+    uberon_input_path: str = MISSING
+    biosample_index_path: str = MISSING
+    _target_: str = "gentropy.biosample_index.BiosampleIndexStep"
+
+
 @dataclass
 class GWASCatalogStudyCurationConfig(StepConfig):
     """GWAS Catalog study curation step configuration."""
@@ -472,6 +482,7 @@ class StudyValidationStepConfig(StepConfig):
     study_index_path: list[str] = MISSING
     target_index_path: str = MISSING
     disease_index_path: str = MISSING
+    biosample_index_path: str = MISSING
     valid_study_index_path: str = MISSING
     invalid_study_index_path: str = MISSING
     invalid_qc_reasons: list[str] = MISSING
@@ -512,6 +523,7 @@ def register_config() -> None:
     cs.store(group="step", name="colocalisation", node=ColocalisationConfig)
     cs.store(group="step", name="eqtl_catalogue", node=EqtlCatalogueConfig)
     cs.store(group="step", name="gene_index", node=GeneIndexConfig)
+    cs.store(group="step", name="biosample_index", node=BiosampleIndexConfig)
     cs.store(
         group="step",
         name="gwas_catalog_study_curation",
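
The `_target_` field in `BiosampleIndexConfig` names the class to build from the remaining config fields, which looks like the Hydra structured-config pattern. The sketch below imitates that idea with the standard library only. `resolve_target`, `instantiate`, and `FractionConfig` are hypothetical names for illustration, and this is not how Hydra itself is implemented.

```python
from __future__ import annotations

import importlib
from dataclasses import dataclass, fields
from typing import Any


def resolve_target(dotted_path: str) -> Any:
    """Import the object named by a dotted path such as 'package.module.ClassName'."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr_name)


def instantiate(config: Any) -> Any:
    """Call the config's `_target_` with every other field passed as a keyword argument."""
    kwargs = {
        field.name: getattr(config, field.name)
        for field in fields(config)
        if field.name != "_target_"
    }
    return resolve_target(config._target_)(**kwargs)


@dataclass
class FractionConfig:
    """Hypothetical config that targets a standard-library class for demonstration."""

    numerator: int = 3
    denominator: int = 4
    _target_: str = "fractions.Fraction"


# Builds fractions.Fraction(numerator=3, denominator=4) from the config alone.
result = instantiate(FractionConfig())
```

Under the same reading, a `BiosampleIndexConfig` would resolve to `gentropy.biosample_index.BiosampleIndexStep` and pass the three path fields to its constructor. The `session` argument is not among the fields shown in this diff, so it would have to come from elsewhere.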
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+"""Biosample index dataset."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from functools import reduce
+from typing import TYPE_CHECKING
+
+import pyspark.sql.functions as f
+from pyspark.sql import DataFrame
+from pyspark.sql.types import ArrayType, StringType
+
+from gentropy.common.schemas import parse_spark_schema
+from gentropy.dataset.dataset import Dataset
+
+if TYPE_CHECKING:
+    from pyspark.sql.types import StructType
+
+
+@dataclass
+class BiosampleIndex(Dataset):
+    """Biosample index dataset.
+
+    A Biosample index dataset captures the metadata of the biosamples (e.g. tissues, cell types, cell lines, etc) such as alternate names and relationships with other biosamples.
+    """
+
+    @classmethod
+    def get_schema(cls: type[BiosampleIndex]) -> StructType:
+        """Provide the schema for the BiosampleIndex dataset.
+
+        Returns:
+            StructType: The schema of the BiosampleIndex dataset.
+        """
+        return parse_spark_schema("biosample_index.json")
+
+    def merge_indices(
+        self: BiosampleIndex,
+        biosample_indices : list[BiosampleIndex]
+        ) -> BiosampleIndex:
+        """Merge a list of biosample indices into a single biosample index.
+
+        Where there are conflicts, in single values - the first value is taken. In list values, the union of all values is taken.
+
+        Args:
+            biosample_indices (list[BiosampleIndex]): Biosample indices to merge.
+
+        Returns:
+            BiosampleIndex: Merged biosample index.
+        """
+        # Extract the DataFrames from the BiosampleIndex objects
+        biosample_dfs = [biosample_index.df for biosample_index in biosample_indices] + [self.df]
+
+        # Merge the DataFrames
+        merged_df = reduce(DataFrame.unionAll, biosample_dfs)
+
+        # Determine aggregation functions for each column
+        # Currently this will take the first value for single values and merge lists for list values
+        agg_funcs = []
+        for field in merged_df.schema.fields:
+            if field.name != "biosampleId":  # Skip the grouping column
+                if field.dataType == ArrayType(StringType()):
+                    agg_funcs.append(f.array_distinct(f.flatten(f.collect_list(field.name))).alias(field.name))
+                else:
+                    agg_funcs.append(f.first(f.col(field.name), ignorenulls=True).alias(field.name))
+
+        # Perform aggregation
+        aggregated_df = merged_df.groupBy("biosampleId").agg(*agg_funcs)
+
+        return BiosampleIndex(
+            _df=aggregated_df,
+            _schema=BiosampleIndex.get_schema()
+        )
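
The merge rule in `merge_indices` (first non-null value for scalar columns, distinct union for list columns, grouped by `biosampleId`) can be illustrated without Spark. The following is a minimal pure-Python analogue, not the gentropy implementation. `merge_rows` is a hypothetical helper, and rows are plain dictionaries rather than DataFrame rows.

```python
from __future__ import annotations

from typing import Any


def merge_rows(rows: list[dict[str, Any]], key: str = "biosampleId") -> list[dict[str, Any]]:
    """Group rows by `key`; keep the first non-null scalar and the distinct union of lists."""
    merged: dict[str, dict[str, Any]] = {}
    for row in rows:
        target = merged.setdefault(row[key], {key: row[key]})
        for name, value in row.items():
            if name == key or value is None:
                continue  # Skip the grouping column and null values.
            if isinstance(value, list):
                current = target.setdefault(name, [])
                for item in value:
                    # Add only unseen items so the merged list stays distinct.
                    if item not in current:
                        current.append(item)
            elif target.get(name) is None:
                # Later non-null scalars never overwrite the first one seen.
                target[name] = value
    return list(merged.values())
```

One difference from the Spark version: a plain list has a fixed order, so "first" is well defined here, whereas the row order that `f.first` sees after a `groupBy` is not guaranteed to be stable.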

src/gentropy/dataset/study_index.py

Lines changed: 36 additions & 0 deletions
@@ -19,6 +19,7 @@
     from pyspark.sql import Column, DataFrame
     from pyspark.sql.types import StructType

+    from gentropy.dataset.biosample_index import BiosampleIndex
     from gentropy.dataset.gene_index import GeneIndex


@@ -29,12 +30,14 @@ class StudyQualityCheck(Enum):
         UNRESOLVED_TARGET (str): Target/gene identifier could not match to reference - Labelling failing target.
         UNRESOLVED_DISEASE (str): Disease identifier could not match to referece or retired identifier - labelling failing disease
         UNKNOWN_STUDY_TYPE (str): Indicating the provided type of study is not supported.
+        UNKNOWN_BIOSAMPLE (str): Flagging if a biosample identifier is not found in the reference.
         DUPLICATED_STUDY (str): Flagging if a study identifier is not unique.
     """

     UNRESOLVED_TARGET = "Target/gene identifier could not match to reference."
     UNRESOLVED_DISEASE = "No valid disease identifier found."
     UNKNOWN_STUDY_TYPE = "This type of study is not supported."
+    UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."
     DUPLICATED_STUDY = "The identifier of this study is not unique."


@@ -406,3 +409,36 @@ def validate_target(self: StudyIndex, target_index: GeneIndex) -> StudyIndex:
         )

         return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema())
+
+    def validate_biosample(self: StudyIndex, biosample_index: BiosampleIndex) -> StudyIndex:
+        """Validating biosample identifiers in the study index against the provided biosample index.
+
+        Args:
+            biosample_index (BiosampleIndex): Biosample index containing a reference of biosample identifiers e.g. cell types, tissues, cell lines, etc.
+
+        Returns:
+            StudyIndex: with flagged studies if biosampleIndex could not be validated.
+        """
+        biosample_set = biosample_index.df.select("biosampleId", f.lit(True).alias("isIdFound"))
+
+        validated_df = (
+            self.df.join(biosample_set, self.df.biosampleFromSourceId == biosample_set.biosampleId, how="left")
+            .withColumn(
+                "isIdFound",
+                f.when(
+                    f.col("isIdFound").isNull(),
+                    f.lit(False),
+                ).otherwise(f.lit(True)),
+            )
+            .withColumn(
+                "qualityControls",
+                StudyIndex.update_quality_flag(
+                    f.col("qualityControls"),
+                    ~f.col("isIdFound"),
+                    StudyQualityCheck.UNKNOWN_BIOSAMPLE,
+                ),
+            )
+            .drop("isIdFound").drop("biosampleId")
+        )
+
+        return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema())
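
The left join in `validate_biosample` amounts to a membership test: a study is flagged whenever its `biosampleFromSourceId` has no match in the reference. Below is a pure-Python sketch of that logic. `flag_unknown_biosamples` is a hypothetical helper, and it assumes `update_quality_flag` simply appends the message when the condition holds, since that method's body is not part of this diff.

```python
from __future__ import annotations

from typing import Any

# Message copied from StudyQualityCheck.UNKNOWN_BIOSAMPLE in the diff above.
UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."


def flag_unknown_biosamples(
    studies: list[dict[str, Any]], known_ids: set[str]
) -> list[dict[str, Any]]:
    """Return copies of the studies, flagging those whose biosample id is not in `known_ids`."""
    flagged = []
    for study in studies:
        # Copy the existing flags so the input records are left untouched.
        controls = list(study.get("qualityControls") or [])
        # A null id never matches the reference, just as a null join key finds no partner.
        if study.get("biosampleFromSourceId") not in known_ids:
            controls.append(UNKNOWN_BIOSAMPLE)
        flagged.append({**study, "qualityControls": controls})
    return flagged
```

As with the join, a study with no `biosampleFromSourceId` at all is flagged in the same way as one with an unrecognised identifier.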
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+"""Biosample index data source."""
+
+from __future__ import annotations
