Commit d11112f

Merge branch 'dev' of https://github.com/opentargets/genetics_etl_python into release/0.x.x
2 parents fd199ba + ac1064f commit d11112f

55 files changed: 1729 additions and 300 deletions

.github/workflows/release.yaml

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ jobs:

 - name: Python Semantic Release
 id: release
-uses: python-semantic-release/python-semantic-release@v8.3.0
+uses: python-semantic-release/python-semantic-release@v8.7.0
 with:
 github_token: ${{ secrets.GITHUB_TOKEN }}

Makefile

Lines changed: 3 additions & 6 deletions
@@ -1,6 +1,6 @@
 PROJECT_ID ?= open-targets-genetics-dev
 REGION ?= europe-west1
-APP_NAME ?= $$(cat pyproject.toml| grep name | cut -d" " -f3 | sed 's/"//g')
+APP_NAME ?= $$(cat pyproject.toml| grep -m 1 "name" | cut -d" " -f3 | sed 's/"//g')
 VERSION_NO ?= $$(poetry version --short)
 CLEAN_VERSION_NO := $(shell echo "$(VERSION_NO)" | tr -cd '[:alnum:]')
 BUCKET_NAME=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/
@@ -35,8 +35,7 @@ build-documentation: ## Create local server with documentation
 @echo "Building Documentation..."
 @poetry run mkdocs serve

-create-dev-cluster: ## Spin up a simple dataproc cluster with all dependencies for development purposes
-@${MAKE} build
+create-dev-cluster: build ## Spin up a simple dataproc cluster with all dependencies for development purposes
 @echo "Creating Dataproc Dev Cluster"
 @gcloud config set project ${PROJECT_ID}
 @gcloud dataproc clusters create "ot-genetics-dev-${CLEAN_VERSION_NO}" \
@@ -49,8 +48,7 @@ create-dev-cluster: ## Spin up a simple dataproc cluster with all dependencies f
 --optional-components=JUPYTER \
 --enable-component-gateway

-make update-dev-cluster: ## Reinstalls the package on the dev-cluster
-@${MAKE} build
+make update-dev-cluster: build ## Reinstalls the package on the dev-cluster
 @echo "Updating Dataproc Dev Cluster"
 @gcloud config set project ${PROJECT_ID}
 gcloud dataproc jobs submit pig --cluster="ot-genetics-dev-${CLEAN_VERSION_NO}" \
@@ -61,7 +59,6 @@ make update-dev-cluster: ## Reinstalls the package on the dev-cluster
 build: clean ## Build Python package with dependencies
 @gcloud config set project ${PROJECT_ID}
 @echo "Packaging Code and Dependencies for ${APP_NAME}-${VERSION_NO}"
-@rm -rf ./dist
 @poetry build
 @tar -czf dist/config.tar.gz config/
 @echo "Uploading to Dataproc"
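
Reviewer note on the `APP_NAME` change: adding `-m 1` makes `grep` stop at the first line that contains `name`, so later `name` entries in `pyproject.toml` no longer leak into the value. The following is a minimal Python sketch that emulates the shell pipeline to show the difference. The sample file contents and the function name `app_name` are illustrative assumptions, not taken from the repository.

```python
def app_name(pyproject_text: str, first_match_only: bool) -> str:
    """Emulate: grep [-m 1] "name" | cut -d" " -f3 | sed 's/"//g'."""
    matches = [line for line in pyproject_text.splitlines() if "name" in line]
    if first_match_only:
        matches = matches[:1]  # grep -m 1 stops after the first matching line
    fields = []
    for line in matches:
        if " " not in line:
            # cut passes through lines that do not contain the delimiter
            fields.append(line)
            continue
        parts = line.split(" ")  # cut -d" " splits on every single space
        # cut prints an empty field when the line has fewer than three fields
        fields.append(parts[2] if len(parts) > 2 else "")
    # sed 's/"//g' removes every double quote
    return "\n".join(fields).replace('"', "")


# Hypothetical pyproject.toml excerpt with two lines containing "name".
SAMPLE = """[tool.poetry]
name = "otgenetics"
version = "0.0.0"

[[tool.poetry.source]]
name = "hypothetical-index"
"""

print(repr(app_name(SAMPLE, first_match_only=False)))  # old behaviour: two lines
print(repr(app_name(SAMPLE, first_match_only=True)))   # new behaviour: one value
```

With the old pipeline, the value would contain one entry per matching line, which then breaks anything that interpolates `${APP_NAME}` into a single file name.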

codecov.yml

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+codecov:
+branch: dev
+
 comment:
 layout: "reach, diff, flags, files"
 behavior: default

config/datasets/gcp.yaml

Lines changed: 0 additions & 1 deletion
@@ -24,7 +24,6 @@ catalog_sumstats_lut: ${datasets.inputs}/v2d/harmonised_list-r2023-11-24a.txt
 ukbiobank_manifest: gs://genetics-portal-input/ukb_phenotypes/neale2_saige_study_manifest.190430.tsv
 l2g_gold_standard_curation: ${datasets.inputs}/l2g/gold_standard/curation.json
 gene_interactions: ${datasets.inputs}/l2g/interaction # 23.09 data
-finngen_phenotype_table_url: https://r9.finngen.fi/api/phenos
 eqtl_catalogue_paths_imported: ${datasets.inputs}/preprocess/eqtl_catalogue/tabix_ftp_paths_imported.tsv

 # Output datasets

config/step/finngen.yaml

Lines changed: 0 additions & 3 deletions
This file was deleted.

config/step/finngen_studies.yaml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+_target_: otg.finngen_studies.FinnGenStudiesStep
+finngen_study_index_out: ${datasets.finngen_study_index}
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+_target_: otg.finngen_sumstat_preprocess.FinnGenSumstatPreprocessStep
+raw_sumstats_path: ???
+out_sumstats_path: ???
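
Reviewer note on the two step configs: assuming they are consumed by Hydra/OmegaConf, as the `_target_` and `???` conventions suggest, `_target_` names the class to construct and `???` marks a mandatory value that must be supplied before the step can run. The sketch below imitates that behaviour using only the standard library; it is not Hydra's implementation. The helper names are hypothetical, and `datetime.timedelta` stands in for the step class because the `otg` package is not importable here.

```python
import importlib
from typing import Any

MISSING = "???"  # OmegaConf's marker for a mandatory value with no default


def missing_keys(cfg: dict[str, Any]) -> list[str]:
    """Return the keys whose value is still the mandatory-value marker."""
    return [key for key, value in cfg.items() if value == MISSING]


def instantiate(cfg: dict[str, Any], **overrides: Any) -> Any:
    """Build the object named by `_target_`, passing the other keys as kwargs."""
    merged = {**cfg, **overrides}
    unresolved = missing_keys(merged)
    if unresolved:
        raise ValueError(f"Missing mandatory values: {unresolved}")
    # Split "package.module.ClassName" into module path and attribute name.
    module_path, _, attr_name = merged["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    kwargs = {k: v for k, v in merged.items() if k != "_target_"}
    return target(**kwargs)


# Stand-in config with the same shape as the preprocess step config:
# one target plus two mandatory values.
step_cfg = {"_target_": "datetime.timedelta", "days": MISSING, "hours": MISSING}

print(missing_keys(step_cfg))
print(instantiate(step_cfg, days=1, hours=2))
```

The practical consequence for the preprocess step is that both paths have to be provided at launch time (for example as command-line overrides), whereas the studies step resolves its output path from the `datasets` config.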

docs/python_api/datasource/_datasource.md

Lines changed: 18 additions & 1 deletion
@@ -4,4 +4,21 @@ title: Data Source

 # Data Source

-TBC
+This section contains information about the data sources used in Open Targets Genetics.
+
+We use GnomAD v4.0 as a source for variant annotation and GnomAD v2.1.1 as a source for linkage disequilibrium (LD) information (described in the **GnomAD** section).
+
+We rely on Open Targets as a source for the list of targets and the Gold Standard training set (described in the **Open Targets** section).
+
+## Study Sources
+
+1. GWAS catalog
+2. FinnGen
+
+## Molecular QTLs
+
+1. eQTL catalogue
+
+## Interaction / Interval-based Experiments
+
+We integrate a list of studies that focus on interaction and interval-based investigations, shedding light on the intricate relationships between genetic elements and their functional implications. For more details see section **"Intervals"**.
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+---
+title: eQTL Catalogue
+---
+
+The [eQTL Catalogue](https://www.ebi.ac.uk/eqtl/) aims to provide uniformly processed gene expression and splicing Quantitative Trait Loci (QTLs) from all available public studies on humans.
+
+It serves as the ultimate resource of eQTLs that we use for colocalization and target prioritization.
+
+We utilize data from the following study within the eQTL Catalogue:
+
+1. **GTEx v8**, 49 tissues

docs/python_api/datasource/finngen/_finngen.md

Lines changed: 3 additions & 1 deletion
@@ -12,4 +12,6 @@ title: FinnGen
 }
 </style>

-FinnGen is a research project in genomics and personalized medicine. It is large public-private partnership that has collected and analysed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases. FinnGen is a now expanding into understanding the progression and biological mechanisms of diseases. FinnGen provides a world-class resource for further breakthroughs in disease prevention, diagnosis, and treatment and a outlook into our genetic make-up.
+[FinnGen](https://www.finngen.fi/en) is a research project in genomics and personalized medicine, representing a large public-private partnership. The project has collected and analyzed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases. FinnGen is now expanding its focus to comprehend the progression and biological mechanisms of diseases. This initiative provides a world-class resource for further breakthroughs in disease prevention, diagnosis, and treatment, offering insights into our genetic makeup.
+
+For a comprehensive understanding of the dataset and methods, refer to [Kurki et al., 2023](https://www.nature.com/articles/s41586-022-05473-8).

docs/python_api/datasource/gnomad/_gnomad.md

Lines changed: 6 additions & 0 deletions
@@ -11,3 +11,9 @@ title: GnomAD
 display: none;
 }
 </style>
+
+[GnomAD](https://gnomad.broadinstitute.org/) (Genome Aggregation Database) is a comprehensive resource that provides aggregated genomic data from large-scale sequencing projects. It encompasses variants from diverse populations and is widely used for variant annotation and population genetics studies.
+
+We use **GnomAD v4.0** as a source for variant annotation, offering detailed information about the prevalence and distribution of genetic variants across different populations. This version of GnomAD provides valuable insights into the genomic landscape, aiding in the interpretation of genetic variants and their potential functional implications.
+
+Additionally, [**GnomAD v2.1.1**](https://gnomad.broadinstitute.org/news/2018-10-gnomad-v2-1/) is utilized as a source for linkage disequilibrium (LD) information.

docs/python_api/datasource/gwas_catalog/_gwas_catalog.md

Lines changed: 14 additions & 0 deletions
@@ -6,3 +6,17 @@ title: GWAS Catalog
 <img width="100" height="100" src="../../../../assets/imgs/GWAS_Catalog_circle_178x178.png">
 <h1>GWAS Catalog</h1>
 </div>
+
+The [GWAS Catalog](https://www.ebi.ac.uk/gwas/) is a comprehensive resource that aims to provide a curated collection of Genome-Wide Association Studies (GWAS) (including harmonized full GWAS summary statistics) across various traits and diseases in humans.
+
+It serves as a valuable repository of genetic associations identified in diverse populations, offering insights into the genetic basis of complex traits and diseases.
+
+We rely on the GWAS Catalog for a rich source of genetic associations, utilizing the data for analysis and interpretation.
+
+For detailed information on specific genetic associations, their significance, and associated studies, refer to the [GWAS Catalog](https://www.ebi.ac.uk/gwas/).
+
+Within our analyses, we leverage two different types of studies from the GWAS Catalog:
+
+1. **Studies with (full) GWAS summary stats**
+
+2. **Studies with top hits only - GWAS curated studies**
Lines changed: 21 additions & 3 deletions
@@ -1,7 +1,25 @@
 ---
-title: Chromatin intevals
+title: Interaction and Interval-based Studies
 ---

-# Chromatin intervals
+# List of Interaction and Interval-based Studies

-TBC
+In this section, we provide a list of studies that focus on interaction and interval-based investigations, shedding light on the intricate relationships between genetic elements and their functional implications.
+
+1. **Promoter Capture Hi-C (Javierre et al., 2016):**
+_Title:_ "Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters".
+This study presents evidence linking genetic variation to genes through the application of Promoter Capture Hi-C across each of the 17 human primary hematopoietic cell types. The method captures interactions between promoters and distal regulatory elements, providing valuable insights into the three-dimensional chromatin architecture. DOI: 10.1016/j.cell.2016.09.037
+
+2. **Enhancer-TSS Correlation (Andersson et al., 2014):**
+_Title:_ "An Atlas of Active Enhancers across Human Cell Types and Tissues".
+This study explores genetic variation's impact on genes by examining the correlation between the transcriptional activity of enhancers and transcription start sites. The findings are documented in the FANTOM5 CAGE expression atlas, offering a comprehensive view of the regulatory landscape. DOI: 10.1038/nature12787
+
+3. **DHS-Promoter Correlation (Thurman et al., 2012):**
+_Title:_ "The accessible chromatin landscape of the human genome".
+Investigating genetic variation's connection to genes, this study employs the correlation of DNase I hypersensitive sites (DHS) and gene promoters. The analysis spans 125 cell and tissue types from the ENCODE project, providing a broad understanding of the regulatory interactions across diverse biological contexts. DOI: 10.1038/nature11232
+
+4. **Promoter Capture Hi-C (Jung et al., 2019):**
+_Title:_ "A compendium of promoter-centered long-range chromatin interactions in the human genome".
+This study compiles a compendium of promoter-centered long-range chromatin interactions in the human genome. By focusing on the three-dimensional organization of chromatin, the research contributes to our understanding of the spatial arrangement of genetic elements and their implications in gene regulation. DOI: 10.1038/s41588-019-0494-8
+
+For in-depth details on each study, you may refer to the respective publications.

docs/python_api/datasource/open_targets/_open_targets.md

Lines changed: 8 additions & 2 deletions
@@ -12,6 +12,12 @@ title: Open Targets
 }
 </style>

-The Open Targets Platform is a comprehensive resource that aims to aggregate and harmonise various types of data to facilitate the identification, prioritisation, and validation of drug targets. By integrating publicly available datasets including data generated by the Open Targets consortium, the Platform builds and scores target-disease associations to assist in drug target identification and prioritisation. It also integrates relevant annotation information about targets, diseases, phenotypes, and drugs, as well as their most relevant relationships.
+The Open Targets Platform is a comprehensive resource that aims to aggregate and harmonize various types of data to facilitate the identification, prioritization, and validation of drug targets. By integrating publicly available datasets, including data generated by the Open Targets consortium, the Platform builds and scores target-disease associations to assist in drug target identification and prioritization. It also integrates relevant annotation information about targets, diseases, phenotypes, and drugs, as well as their most relevant relationships.

-Genomic data from Open Targets integrates human genome-wide association studies (GWAS) and functional genomics data including gene expression, protein abundance, chromatin interaction and conformation data from a wide range of cell types and tissues to make robust connections between GWAS-associated loci, variants and likely causal genes.
+Within our analyses, we utilize Open Targets to infer two datasets:
+
+1. **The list of targets:**
+This dataset provides a compilation of targets. In the Open Targets Platform, a target is understood as any naturally-occurring molecule that can be targeted by a medicinal product. The EMBL-EBI Ensembl database serves as the source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. For more details, refer to [this link](https://platform-docs.opentargets.org/target).
+
+2. **The list of Gold Standard Positives:**
+We use this dataset for training the Locus-to-Gene model. The current list contains 496 Gold Standard Positives.

docs/python_api/datasource/ukbiobank/_ukbiobank.md

Lines changed: 0 additions & 24 deletions
This file was deleted.

docs/python_api/datasource/ukbiobank/study_index.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

docs/python_api/method/carma.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+---
+title: CARMA
+---
+
+CARMA is a method for fine-mapping and outlier detection, originally implemented in R ([CARMA on GitHub](https://github.com/ZikunY/CARMA)).
+
+The full repository for the reimplementation of CARMA in Python can be found [here](https://github.com/hlnicholls/carmapy/tree/0.1.0).
+
+This is a simplified version of CARMA with the following features:
+
+1. It uses only Spike-slab effect size priors and Poisson model priors.
+2. C++ is re-implemented in Python.
+3. The way of storing the configuration list is changed. It uses a string with the list of indexes for causal SNPs instead of a sparse matrix.
+4. Fixed bugs in PIP calculation.
+5. No credible models.
+6. No credible sets, only PIPs.
+7. No functional annotations.
+8. Removed unnecessary parameters.
+
+:::otg.method.carma.CARMA
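
Reviewer note on the CARMA feature list: storing each causal configuration as a string of SNP indexes, and reporting only posterior inclusion probabilities (PIPs), can be sketched as below. This is an illustration under stated assumptions, not the code from `otg.method.carma` or `carmapy`: the comma-separated key format and all function names are hypothetical. The PIP of a SNP is taken in the usual sense, as the sum of normalised posterior model probabilities over the configurations that include it.

```python
import math
from typing import Iterable


def config_key(causal_indices: Iterable[int]) -> str:
    """Encode a causal configuration as a sorted, comma-separated index string."""
    return ",".join(str(i) for i in sorted(set(causal_indices)))


def key_to_indices(key: str) -> list[int]:
    """Decode a configuration key; the empty string is the null model."""
    return [int(i) for i in key.split(",")] if key else []


def pips(log_posteriors: dict[str, float], n_snps: int) -> list[float]:
    """Sum normalised model probabilities over the models containing each SNP."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_posteriors.values())
    weights = {k: math.exp(v - max_log) for k, v in log_posteriors.items()}
    total = sum(weights.values())
    result = [0.0] * n_snps
    for key, weight in weights.items():
        for idx in key_to_indices(key):
            result[idx] += weight / total
    return result


# Three visited configurations over three SNPs, keyed by index string.
models = {
    config_key([]): math.log(0.1),      # null model
    config_key([0]): math.log(0.6),     # SNP 0 alone
    config_key([2, 0]): math.log(0.3),  # SNPs 0 and 2 together
}

print(sorted(models))
print(pips(models, n_snps=3))
```

A string key makes each visited configuration hashable, so a plain dictionary can deduplicate models during the search without building a sparse indicator matrix.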

docs/python_api/step/finngen.md

Lines changed: 0 additions & 5 deletions
This file was deleted.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+title: FinnGen Studies
+---
+
+::: otg.finngen_studies.FinnGenStudiesStep
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+title: FinnGen Preprocess Summary Stats
+---
+
+::: otg.finngen_sumstat_preprocess.FinnGenSumstatPreprocessStep
