Showing 20 changed files with 1,737 additions and 10 deletions.
@@ -0,0 +1,9 @@
---
title: Biosample index
---

::: gentropy.dataset.biosample_index.BiosampleIndex

## Schema

--8<-- "assets/schemas/biosample_index.md"
docs/python_api/datasources/biosample_ontologies/_cell_ontology.md: 5 changes (5 additions & 0 deletions)
@@ -0,0 +1,5 @@
---
title: Cell Ontology
---

The [Cell Ontology](http://www.obofoundry.org/ontology/cl.html) is a structured controlled vocabulary for cell types. It is used to annotate cell types in single-cell RNA-seq data and other omics data.
@@ -0,0 +1,5 @@
---
title: Uberon
---

[Uberon](http://uberon.github.io/) is a multi-species anatomy ontology that integrates species-specific anatomy ontologies into a single cross-species ontology.
@@ -0,0 +1,5 @@
---
title: biosample_index
---

::: gentropy.biosample_index.BiosampleIndexStep
@@ -0,0 +1,83 @@
{
  "type": "struct",
  "fields": [
    {
      "name": "biosampleId",
      "type": "string",
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "biosampleName",
      "type": "string",
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "description",
      "type": "string",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "xrefs",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "synonyms",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "parents",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ancestors",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "descendants",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "children",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    }
  ]
}
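As a quick reading aid for the schema above, the following sketch parses a trimmed copy of it with the standard library and lists which fields are mandatory and which are string arrays. The trimmed `SCHEMA_JSON` string and the helper names `required_fields` and `array_fields` are illustrative assumptions and are not part of the commit.

```python
from __future__ import annotations

import json

# Trimmed, illustrative copy of the schema above (three of the nine fields).
SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"name": "biosampleId", "type": "string", "nullable": false, "metadata": {}},
    {"name": "biosampleName", "type": "string", "nullable": false, "metadata": {}},
    {"name": "synonyms",
     "type": {"type": "array", "elementType": "string", "containsNull": true},
     "nullable": true, "metadata": {}}
  ]
}
"""


def required_fields(schema: dict) -> list[str]:
    """Return the names of fields declared with nullable=false, in schema order."""
    return [field["name"] for field in schema["fields"] if not field["nullable"]]


def array_fields(schema: dict) -> list[str]:
    """Return the names of fields whose type is an array, in schema order."""
    return [
        field["name"]
        for field in schema["fields"]
        if isinstance(field["type"], dict) and field["type"].get("type") == "array"
    ]


schema = json.loads(SCHEMA_JSON)
print(required_fields(schema))
print(array_fields(schema))
```

In the full schema, only `biosampleId` and `biosampleName` are non-nullable; every relationship and alias column is a nullable array of strings.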
@@ -0,0 +1,34 @@
"""Step to generate biosample index dataset."""
from __future__ import annotations

from gentropy.common.session import Session
from gentropy.datasource.biosample_ontologies.utils import extract_ontology_from_json


class BiosampleIndexStep:
    """Biosample index step.

    This step generates a Biosample index dataset from the various ontology sources. Currently Cell Ontology and Uberon are supported.
    """

    def __init__(
        self,
        session: Session,
        cell_ontology_input_path: str,
        uberon_input_path: str,
        biosample_index_path: str,
    ) -> None:
        """Run Biosample index generation step.

        Args:
            session (Session): Session object.
            cell_ontology_input_path (str): Input cell ontology dataset path.
            uberon_input_path (str): Input uberon dataset path.
            biosample_index_path (str): Output biosample index dataset path.
        """
        cell_ontology_index = extract_ontology_from_json(cell_ontology_input_path, session.spark)
        uberon_index = extract_ontology_from_json(uberon_input_path, session.spark)

        biosample_index = cell_ontology_index.merge_indices([uberon_index])

        biosample_index.df.write.mode(session.write_mode).parquet(biosample_index_path)
@@ -0,0 +1,72 @@
"""Biosample index dataset."""

from __future__ import annotations

from dataclasses import dataclass
from functools import reduce
from typing import TYPE_CHECKING

import pyspark.sql.functions as f
from pyspark.sql import DataFrame
from pyspark.sql.types import ArrayType, StringType

from gentropy.common.schemas import parse_spark_schema
from gentropy.dataset.dataset import Dataset

if TYPE_CHECKING:
    from pyspark.sql.types import StructType


@dataclass
class BiosampleIndex(Dataset):
    """Biosample index dataset.

    A Biosample index dataset captures the metadata of the biosamples (e.g. tissues, cell types, cell lines), such as alternate names and relationships with other biosamples.
    """

    @classmethod
    def get_schema(cls: type[BiosampleIndex]) -> StructType:
        """Provide the schema for the BiosampleIndex dataset.

        Returns:
            StructType: The schema of the BiosampleIndex dataset.
        """
        return parse_spark_schema("biosample_index.json")

    def merge_indices(
        self: BiosampleIndex,
        biosample_indices: list[BiosampleIndex]
    ) -> BiosampleIndex:
        """Merge a list of biosample indices into a single biosample index.

        Where single-valued columns conflict, the first non-null value is taken. For list-valued columns, the union of all values is taken.

        Args:
            biosample_indices (list[BiosampleIndex]): Biosample indices to merge.

        Returns:
            BiosampleIndex: Merged biosample index.
        """
        # Extract the DataFrames from the BiosampleIndex objects
        biosample_dfs = [biosample_index.df for biosample_index in biosample_indices] + [self.df]

        # Merge the DataFrames
        merged_df = reduce(DataFrame.unionAll, biosample_dfs)

        # Determine aggregation functions for each column
        # Currently this will take the first value for single values and merge lists for list values
        agg_funcs = []
        for field in merged_df.schema.fields:
            if field.name != "biosampleId":  # Skip the grouping column
                if field.dataType == ArrayType(StringType()):
                    agg_funcs.append(f.array_distinct(f.flatten(f.collect_list(field.name))).alias(field.name))
                else:
                    agg_funcs.append(f.first(f.col(field.name), ignorenulls=True).alias(field.name))

        # Perform aggregation
        aggregated_df = merged_df.groupBy("biosampleId").agg(*agg_funcs)

        return BiosampleIndex(
            _df=aggregated_df,
            _schema=BiosampleIndex.get_schema()
        )
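The merge rule described in the `merge_indices` docstring can be illustrated without Spark. The sketch below is a pure-Python analogue, not the implementation in this commit: `merge_records`, the `X_1` identifier, and the sample values are all hypothetical. It also assumes records are processed in input order, whereas the ordering seen by Spark's `first` after a `groupBy` is not guaranteed.

```python
from __future__ import annotations


def merge_records(records: list[dict]) -> dict[str, dict]:
    """Merge records that share a biosampleId (illustrative analogue only).

    Scalar columns keep the first non-null value seen; list columns
    accumulate the distinct union of their items, preserving first-seen order.
    """
    merged: dict[str, dict] = {}
    for record in records:
        key = record["biosampleId"]
        out = merged.setdefault(key, {"biosampleId": key})
        for name, value in record.items():
            if name == "biosampleId" or value is None:
                continue  # skip the grouping key and null values
            if isinstance(value, list):
                existing = out.setdefault(name, [])
                for item in value:
                    if item not in existing:
                        existing.append(item)
            else:
                out.setdefault(name, value)  # first non-null value wins
    return merged


# Two hypothetical records describing the same biosample from different sources.
source_a = {"biosampleId": "X_1", "biosampleName": "name from source A",
            "description": None, "synonyms": ["a", "b"]}
source_b = {"biosampleId": "X_1", "biosampleName": "name from source B",
            "description": "text", "synonyms": ["b", "c"]}

merged = merge_records([source_a, source_b])
print(merged["X_1"])
```

Here the name from the first source is kept, the null description is filled from the second source, and the synonym lists are combined without duplicates.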
@@ -0,0 +1,3 @@
"""Biosample index data source."""

from __future__ import annotations