feat: GWAS Catalog harmonisation prototype #270

Merged
merged 22 commits into from
Dec 7, 2023

@d0choa d0choa commented Nov 22, 2023

This PR aims to create a process to be able to harmonise GWAS Catalog studies within GCP.

(Some) Background

  • Harmonised GWAS Catalog studies have been uploaded to GCP using an rsync strategy from EBI infrastructure (~25k at the moment; ~3h to sync all)
  • Harmonised summary stats are stored as .tsv.gz (poorly partitioned)
  • We expect the data to keep growing (incremental)
  • Datasets vary in schema
  • Additional row-wise QC is required to ensure several requirements (betas, positions, etc.)

This PR
It further harmonises the ".tsv.gz" files using the GWASCatalogSumstatsPreprocessStep, producing an appropriately partitioned Parquet file per study.

Additional context

  • Partitioning is done by sorting by chromosome and position, then chunking into a fixed number of 20 partitions. This ensures the partitions are equally sized within each study (no skew) and that partition sizes are reasonable (~10 MB on average, up to a maximum of ~100 MB). Once the 25k summary stats are processed, this will result in 0.5M partitions to read, which I believe is still reasonable, but it will require additional considerations downstream.
  • Current DAG computes the list of studies to process by looking at the filesystem. The current to-do list is created based on study IDs that haven't yet been processed (.SUCCESS), so it's not bulletproof (no timestamps, no md5, etc.).
  • Jobs are submitted to Dataproc as requests through the GCP Python API rather than through Airflow Operators. The reason is to avoid overloading GCP with status requests from Airflow and to keep things simple. Of course, this comes at the cost of not knowing the status of each individual study in Airflow.
  • The cluster geometry requires many small (n1-standard-2) workers, each processing a different summary stats file. Because of the number of workers, the master node needs to be a decent size (currently n1-highmem-64).
  • This PR can hit the default Dataproc quota of 5,000 job submissions. The quota can be adjusted, but I'm keeping the default for now as a safety net.
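
To make the partitioning idea in the first bullet concrete, here is a minimal pure-Python sketch of "sort by chromosome and position, then cut into 20 equal chunks". It is only an illustration of the technique, not the code in GWASCatalogSumstatsPreprocessStep: the `Row` type and the `chunk_sorted` helper are hypothetical, and chromosome labels are compared as plain strings.

```python
from typing import List, Tuple

# Hypothetical minimal row: (chromosome, position). Real summary stats carry more columns.
Row = Tuple[str, int]


def chunk_sorted(rows: List[Row], n_partitions: int = 20) -> List[List[Row]]:
    """Sort rows by (chromosome, position) and cut them into contiguous, near-equal chunks."""
    # Tuples sort by chromosome first, then position. Chromosomes are compared
    # as strings here, so "10" sorts before "2" (an assumption of this sketch).
    ordered = sorted(rows)
    base, extra = divmod(len(ordered), n_partitions)
    chunks: List[List[Row]] = []
    start = 0
    for i in range(n_partitions):
        # The first `extra` chunks take one additional row, so sizes differ by at most 1.
        size = base + (1 if i < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    demo = [(str(chrom), pos) for chrom in range(1, 23) for pos in range(100)]
    print([len(chunk) for chunk in chunk_sorted(demo)])
```

Because the chunks are cut from one globally sorted sequence, each chunk covers a contiguous genomic range and no chunk can be more than one row larger than another, which is what removes the skew.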

After some adjustments in the geometry, the task runs at a pace of 800-1000 studies per hour using approximately 80 workers. Considering the task, I believe this is a reasonable performance balance. This will become an incremental task.
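
As a back-of-the-envelope check on the figures above, the arithmetic can be written out as follows. The batch count assumes that each study corresponds to one job submission and that the full 25k backlog is processed at the quoted pace; both are assumptions of this sketch.

```python
# Figures quoted in the description above.
N_STUDIES = 25_000
PARTITIONS_PER_STUDY = 20
WORKERS = 80
STUDIES_PER_HOUR = (800, 1000)  # observed pace, low and high end
JOB_LIMIT = 5_000               # default Dataproc job submission quota

total_partitions = N_STUDIES * PARTITIONS_PER_STUDY
hours = tuple(N_STUDIES / rate for rate in STUDIES_PER_HOUR)
per_worker_rate = tuple(rate / WORKERS for rate in STUDIES_PER_HOUR)
batches = -(-N_STUDIES // JOB_LIMIT)  # ceiling division; assumes one job per study

print(total_partitions)  # 500000
print(hours)             # (31.25, 25.0)
print(per_worker_rate)   # (10.0, 12.5)
print(batches)           # 5
```

So a full run over the current backlog would take roughly 25 to 31 hours of cluster time, with each worker completing about 10 to 12.5 studies per hour.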

Screenshot 2023-11-22 at 15 43 26

d0choa commented Nov 22, 2023

Additional stats from @DSuveges:

GWAS Catalog summary stats:

  • Currently 25.6k (~100 new studies from yesterday)
  • 90 studies below 1MB. (I haven't considered number of rows at this point)
  • 165 studies above 1GB (up to 2.4GB; the row count ranges from 25M to 61M)

Size distribution of all:
image

Studies above 1 GB:

image

codecov-commenter commented Nov 22, 2023

Codecov Report

Merging #270 (caa8895) into main (f2f8399) will decrease coverage by 1.21%.
The diff coverage is 40.74%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #270      +/-   ##
==========================================
- Coverage   86.44%   85.24%   -1.21%     
==========================================
  Files          86       87       +1     
  Lines        1999     2053      +54     
==========================================
+ Hits         1728     1750      +22     
- Misses        271      303      +32     
Files Coverage Δ
src/airflow/dags/common_airflow.py 88.88% <25.00%> (-11.12%) ⬇️
src/airflow/dags/gwas_catalog_harmonisation.py 43.47% <43.47%> (ø)

@d0choa d0choa added the DAG label Nov 24, 2023
@d0choa d0choa changed the title from "GWAS Catalog harmonisation prototype" to "feat: GWAS Catalog harmonisation prototype" Nov 27, 2023
@d0choa d0choa self-assigned this Nov 27, 2023
@d0choa d0choa marked this pull request as ready for review December 1, 2023 11:01
@d0choa d0choa requested a review from ireneisdoomed December 1, 2023 11:01

d0choa commented Dec 1, 2023

This DAG was used to complete the harmonisation of 25k harmonised GWAS Catalog studies.

It's currently limited to 5,000 studies at a time, which I think is a good safety net.
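
A minimal sketch of how such a capped to-do list could be derived from file listings is shown below. It is not the DAG's actual code: the `studies_to_process` helper, the `GCST\d+` accession pattern, the bucket layout, and the `_SUCCESS` marker name (the description above refers to it as .SUCCESS) are all assumptions made for illustration.

```python
import re
from typing import Iterable, List

# Assumption: study accessions look like "GCST" followed by digits.
STUDY_ID = re.compile(r"GCST\d+")


def studies_to_process(
    raw_paths: Iterable[str],
    output_paths: Iterable[str],
    marker: str = "_SUCCESS",  # assumption: name of the completion marker file
    limit: int = 5000,         # safety net: at most this many studies per run
) -> List[str]:
    """Return raw files whose study has no completion marker yet, capped at `limit`."""
    done = set()
    for path in output_paths:
        match = STUDY_ID.search(path)
        if match and path.endswith("/" + marker):
            done.add(match.group())

    todo: List[str] = []
    seen = set()
    for path in raw_paths:
        match = STUDY_ID.search(path)
        # Skip files without a recognisable accession, finished studies, and repeats.
        if match and match.group() not in done and match.group() not in seen:
            seen.add(match.group())
            todo.append(path)
    return todo[:limit]


if __name__ == "__main__":
    raw = [
        "gs://example-bucket/raw/GCST001/GCST001.h.tsv.gz",
        "gs://example-bucket/raw/GCST002/GCST002.h.tsv.gz",
    ]
    out = ["gs://example-bucket/harmonised/GCST001.parquet/_SUCCESS"]
    print(studies_to_process(raw, out))
```

As noted in the PR description, a check like this only looks at whether a marker exists, so it cannot detect inputs that changed after a study was processed.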

Similar to the other DAGs, it is a pragmatic solution that does the job, not a fully bulletproof implementation.

Requesting review from @ireneisdoomed because it interfaces with downstream work, but happy to listen to anybody else

@tskir tskir left a comment

I think this is a good interim solution which allows us to ingest and start using GWAS Catalog studies quickly.

However, I strongly believe that for production use we should not use direct orchestration and submission of jobs. I did that for my first ingestion of FinnGen, and can testify that it's not CPU efficient and seems to make the cluster less stable.

As we discovered earlier, Spark is also not great at ingesting non-Parquet files and/or partitioning them.

So I suggest that going forward, we use the non-Spark preprocessing approach developed in #295 + #296 + #297. This should be extremely fast and make downstream processing incredibly simple. Coincidentally, this will also remove the need to do rsync from the EBI side, or to do anything with the EBI infrastructure at all.

@d0choa If it's all right with you, I would keep these changes in a separate branch for now. I believe that if we merge this PR, we'll have to dismantle and rewrite this part again very soon.

However, if you'd prefer merging at least some working solution into main, I won't argue the point, and if it is the case, please let me know and I'll approve it as is.

@ireneisdoomed ireneisdoomed left a comment

Great job, works as expected. This DAG is meant to be run independently from the other GWAS Catalog preprocess.

src/airflow/dags/gwas_catalog_harmonisation.py (outdated review thread, resolved)
d0choa and others added 2 commits December 5, 2023 15:33
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
@DSuveges DSuveges merged commit e0297cb into main Dec 7, 2023
1 check passed
@DSuveges DSuveges deleted the do_gwascat_harmonisation branch December 12, 2023 15:13
Daniel-Considine added a commit that referenced this pull request Jan 3, 2024
* test: refactor test_gnomad_ld

* build(deps-dev): bump mkdocs-material from 9.4.7 to 9.4.8

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.7 to 9.4.8.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.7...9.4.8)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump pytest-xdist from 3.3.1 to 3.4.0

Bumps [pytest-xdist](https://github.com/pytest-dev/pytest-xdist) from 3.3.1 to 3.4.0.
- [Changelog](https://github.com/pytest-dev/pytest-xdist/blob/master/CHANGELOG.rst)
- [Commits](pytest-dev/pytest-xdist@v3.3.1...v3.4.0)

---
updated-dependencies:
- dependency-name: pytest-xdist
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: revert testing changes

* fix: gnomad paths are not necessary after #233

* build(deps): bump wandb from 0.13.11 to 0.16.0

Bumps [wandb](https://github.com/wandb/wandb) from 0.13.11 to 0.16.0.
- [Release notes](https://github.com/wandb/wandb/releases)
- [Changelog](https://github.com/wandb/wandb/blob/main/CHANGELOG.md)
- [Commits](wandb/wandb@v0.13.11...v0.16.0)

---
updated-dependencies:
- dependency-name: wandb
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.1.4 → v0.1.5](astral-sh/ruff-pre-commit@v0.1.4...v0.1.5)
- [github.com/psf/black: 23.10.1 → 23.11.0](psf/black@23.10.1...23.11.0)
- [github.com/alessandrojcm/commitlint-pre-commit-hook: v9.7.0 → v9.8.0](alessandrojcm/commitlint-pre-commit-hook@v9.7.0...v9.8.0)
- [github.com/pre-commit/mirrors-mypy: v1.6.1 → v1.7.0](pre-commit/mirrors-mypy@v1.6.1...v1.7.0)

* fix: persist raw gwascat associations to return consistent results

* fix: coalesce variantid to assign a studylocusid

closes #3151

* chore: use a higher RAM master machine

* chore: use gs:// prefix for FinnGen input data

* chore: use more partitions for FinnGen

* chore: always error if the output data exists

* feat: add eQTL ingestion to the list of steps in DAG

* docs: update running instructions

* docs: update contributing checklist

* docs: add automatically generated docs

* chore: add configuration

* chore: unify FinnGen config with eQTL Catalogue

* feat: eQTL Catalogue main ingestion script

* feat: implement study index ingestion

* feat: implement summary stats ingestion

* refactor: update eQTL study index import

* refactor: reorganise study index ingestion for readability

* style: docstring for eQTL Catalogue summary stats ingestion

* feat: construct study ID based on all appropriate columns

* feat: map partial and full study IDs

* feat: join dataframes to add the full study ID information

* feat: populate geneId column

* refactor: move gene ID joining into the study index class

* fix: do not partition by chromosome for QTL studies

* fix: update class names

* chore: add __init__.py for eQTL Catalogue

* fix: eqtl_catalogue path in docs

* fix: name of EqtlCatalogueStep class

* chore: replace attributes with static methods for eQTL study index

* chore: replace attributes with static methods for eQTL summary stats

* fix: do not initialise session in the main class

* test: add conftests for eQTL Catalogue

* fix: include header when reading the study index

* fix: populating publicationDate

* fix: cast nSamples as long

* fix: manually specify schema for eQTL Catalogue summary stats

* fix: typo in position field name

* test: add sample eQTL Catalogue studies

* test: add sample eQTL Catalogue summary stats

* test: add test for eQTL Catalogue study index

* test: add test for eQTL Catalogue summary stats

* chore: partition output data by chromosome

* chore: read input data from Google Storage

* fix: studies sample filename

* chore: repartition data before processing

* revert: repartition call

* revert: repartition call

* build(deps-dev): bump google-cloud-dataproc from 5.6.0 to 5.7.0

Bumps [google-cloud-dataproc](https://github.com/googleapis/google-cloud-python) from 5.6.0 to 5.7.0.
- [Release notes](https://github.com/googleapis/google-cloud-python/releases)
- [Changelog](https://github.com/googleapis/google-cloud-python/blob/main/CHANGELOG.md)
- [Commits](googleapis/google-cloud-python@google-cloud-dataproc-v5.6.0...google-cloud-dataproc-v5.7.0)

---
updated-dependencies:
- dependency-name: google-cloud-dataproc
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: change definition of negative l2g evidence

* refactor: modularise logic for gold standards

* refactor: move hardcoded values to constants

* refactor: turn `OpenTargetsL2GGoldStandard` into class methods

* refactor(gold_standard): move logic to refine gold standards to `L2GGoldStandard`

* chore: remove the num_local_ssds arg which has no effect

* chore: align default values with docstring

* feat: add ability to attach local SSDs

* chore: make local SSDs a default

* test: add `test_parse_positive_curation`

* test: fix and test logic in `expand_gold_standard_with_negatives`

* test: add `test_expand_gold_standard_with_negatives_same_positives`

* test: testing for `process_gene_interactions`

* feat: parametrise autoscaling policy

* feat: finetune spark job: taskid, trigger rule, other args

* feat: new PICS step

* feat: configuration for PICS step

* docs: pics step

* feat: datasets config

* Revert "feat: datasets config"

This reverts commit a07c7c4.

* feat: gcp dataset config

* fix: wrong lines removed

* feat: raise error in `from_parquet` when df is empty

* chore: add `variantId` to gold standards schema

* chore: change `sources` in gold standards schema to a nullable

* test: add `test_filter_unique_associations`

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* feat(overlaps): add and test method to transform the overlaps as a square matrix

* chore(overlaps): chromosome and statistics are not mandatory fields in the schema

* feat(l2g_gold_standard): change `filter_unique_associations` logic

* test(l2g_gold_standard): add `test_remove_false_negatives`

* fix(l2g_gold_standard): fix logic in `remove_false_negatives`

* chore(gold_standards): define gs labels as `L2GGoldStandard` attributes

* build(deps-dev): bump apache-airflow-providers-google

Bumps [apache-airflow-providers-google](https://github.com/apache/airflow) from 10.11.0 to 10.11.1.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@providers-google/10.11.0...providers-google/10.11.1)

---
updated-dependencies:
- dependency-name: apache-airflow-providers-google
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump mypy from 1.6.1 to 1.7.0

Bumps [mypy](https://github.com/python/mypy) from 1.6.1 to 1.7.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](python/mypy@v1.6.1...v1.7.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump mkdocs-material from 9.4.8 to 9.4.10

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.8 to 9.4.10.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.8...9.4.10)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump mkdocstrings-python from 1.7.3 to 1.7.4

Bumps [mkdocstrings-python](https://github.com/mkdocstrings/python) from 1.7.3 to 1.7.4.
- [Release notes](https://github.com/mkdocstrings/python/releases)
- [Changelog](https://github.com/mkdocstrings/python/blob/main/CHANGELOG.md)
- [Commits](mkdocstrings/python@1.7.3...1.7.4)

---
updated-dependencies:
- dependency-name: mkdocstrings-python
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: pre-commit autoupdate

* feat: updating summary stats schema and ingtesion

* chore: remove beta value interval calculation

* refactor: removing odds ratio, and confidence intervals from the schema

* fix: multiply standard error by zscore in `calculate_confidence_interval`

* chore: remove reference to confidence intervals

* fix: local SSD initialisation

* feat: gitignore .venv file

* feat: ingestion supported for both new and old format of the harmonized GWAS Catalog Summary stats. (#274)

* feat: converting gwas catalog ingestion to the new format

* feat: generalizing GWAS catalog sumstas ingestion both format supported

* fix: sample file must follow the convention of the real files

* fix: woopsie, the sample file is not added

* test: adding trst for old gwas format

* feat: updating gwas summary stats ingestion step to the new way of getting data

* fix: casting types to make schema explicit

* build(deps): bump pyarrow from 11.0.0 to 14.0.1

Bumps [pyarrow](https://github.com/apache/arrow) from 11.0.0 to 14.0.1.
- [Commits](apache/arrow@go/v11.0.0...go/v14.0.1)

---
updated-dependencies:
- dependency-name: pyarrow
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: rename study_locus to credible_set for l2g

* build(deps-dev): bump ipython from 8.17.2 to 8.18.1 (#280)

Bumps [ipython](https://github.com/ipython/ipython) from 8.17.2 to 8.18.1.
- [Release notes](https://github.com/ipython/ipython/releases)
- [Commits](ipython/ipython@8.17.2...8.18.1)

---
updated-dependencies:
- dependency-name: ipython
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump mkdocstrings-python from 1.7.4 to 1.7.5 (#279)

Bumps [mkdocstrings-python](https://github.com/mkdocstrings/python) from 1.7.4 to 1.7.5.
- [Release notes](https://github.com/mkdocstrings/python/releases)
- [Changelog](https://github.com/mkdocstrings/python/blob/main/CHANGELOG.md)
- [Commits](mkdocstrings/python@1.7.4...1.7.5)

---
updated-dependencies:
- dependency-name: mkdocstrings-python
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.1.3 to 0.1.6 (#276)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.3 to 0.1.6.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.3...v0.1.6)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump mypy from 1.7.0 to 1.7.1 (#278)

Bumps [mypy](https://github.com/python/mypy) from 1.7.0 to 1.7.1.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](python/mypy@v1.7.0...v1.7.1)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs: minify plugin removed to prevent clash in local development (#284)

* fix: proper parsing of gwas catalog study accession from filename (#282)

* fix: proper parsing of gwas catalog study accession from filename

* fix: interestingly, I had to add an extra line to pass doctest on VSCode

* feat: bumping mypy version

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* fix: making standard_error column optional (#286)

* fix: making standard-error column optional

* feat: add `clump` step (#288)

* feat: join clump methods into step canonical

* refactor: keep clump step minimal

* docs: add clump step docs page

* fix: add clump config

* revert(pre-commit): downgrade mypy version to 1.7.0

* feat(clump): add attribute to collect locus in window-based clumping

* refactor(clump): make study and ld indices non mandatory

* test: add test for clumpstep when input is ss

* test: rename test_clump_py to avoid clashes

* refactor: remove data variable in clumpstep

* fix: correct and test study splitter when subStudyDescription is the same (#289)

* fix(clump): read input files recursively (#292)

* build(deps-dev): bump mkdocs-material from 9.4.10 to 9.4.14 (#300)

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.10 to 9.4.14.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.10...9.4.14)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump scipy from 1.11.3 to 1.11.4 (#299)

Bumps [scipy](https://github.com/scipy/scipy) from 1.11.3 to 1.11.4.
- [Release notes](https://github.com/scipy/scipy/releases)
- [Commits](scipy/scipy@v1.11.3...v1.11.4)

---
updated-dependencies:
- dependency-name: scipy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump pymdown-extensions from 10.3.1 to 10.5 (#301)

Bumps [pymdown-extensions](https://github.com/facelessuser/pymdown-extensions) from 10.3.1 to 10.5.
- [Release notes](https://github.com/facelessuser/pymdown-extensions/releases)
- [Commits](facelessuser/pymdown-extensions@10.3.1...10.5)

---
updated-dependencies:
- dependency-name: pymdown-extensions
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump mkdocs-git-committers-plugin-2 from 1.2.0 to 2.2.2 (#303)

Bumps [mkdocs-git-committers-plugin-2](https://github.com/ojacques/mkdocs-git-committers-plugin-2) from 1.2.0 to 2.2.2.
- [Release notes](https://github.com/ojacques/mkdocs-git-committers-plugin-2/releases)
- [Commits](ojacques/mkdocs-git-committers-plugin-2@1.2.0...2.2.2)

---
updated-dependencies:
- dependency-name: mkdocs-git-committers-plugin-2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump apache-airflow-providers-google (#302)

Bumps [apache-airflow-providers-google](https://github.com/apache/airflow) from 10.11.1 to 10.12.0.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@providers-google/10.11.1...providers-google/10.12.0)

---
updated-dependencies:
- dependency-name: apache-airflow-providers-google
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: deptry added to handle unused, missing and transitive dependencies (#304)

* feat: add deptry to the project

* feat: deptry as pre-commit and associated changes

* feat: deptry as pre-commit and associated changes

* fix: exclude deptry from pre-commit.ci

* ci: check deptry when testing

* fix: poetry lock with no update

* feat: adding unpublished studies (#290)

* feat: adding unpublished studies

* feat: updating all gwas catalog sources

* feat: generalizing study ingestion to accept list of files

* fix: removing unused configuration

* feat: adding unpublished ancestries as well

* fix: updating ancestry config name

* Apply suggestions from code review

Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

---------

Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* feat: add prettier as formatter (yaml, json, md, etc.) (#298)

* feat: prettier added to the project and precommit

* feat: recommend vscode extension

* fix: adjusted prettier

* fix: all files reformated by prettier

* fix: unnecessary comment

* chore(l2ggoldstandard): add studyId to schema (#305)

* chore(l2ggoldstandard): add studyId to schema

* fix: add `studyId` to gold standards testing fixtures

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* feat: Adding cohorts field to study index (#309)

* feat: ingesting cohort information for GWAS Catalog

* feat: adding cohort to finngen study index

* feat: add 'coalesce' and 'repartition' wrappers to 'Dataset' (#307)

* feat(dataset): add `coalesce`

* feat(clump): coalesce summary stats to 1000 partitions

* feat(dataset): change coalesce to setPartitions

* test(dataset): added `TestSetPartitions`

* refactor(dataset): split set_partitions into repartition and coalesce

* chore(clump): decrease number of sumstats partitions to 400

* chore: change number of partitions to window clumping to 4000

Number estimate on the basis of 60 workers x 16 cores x 4 partitions

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* chore(airflow): schedule_interval deprecation warning (#293)

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat: GWAS Catalog harmonisation prototype (#270)

* chore: merge main

* feat: partitioning GWAS Catalog dataset to 20 equally-sized partitions

* fix: missing config

* feat: dag with gwas catalog harmonisation

* feat: adding more primary workers to help with the task

* refactor: changed autoscaling policy

* feat: allow to specify number of preeptible workers

* fix: bugs on to_do_list

* revert: version number

* fix: gwas_catalog_sumstat_preprocess no longer needs study_id

* fix: unnecessary config causes issues

* fix: improved regexp

* refactor: generalising the config

* fix: rename cluster to prevent clashes with other dags

* Update src/airflow/dags/gwas_catalog_harmonisation.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

---------

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* test: failing doctest in different python version (#320)

* test: issue on test with different python version

* refactor: value error instead of assertion

* build(deps): bump wandb from 0.16.0 to 0.16.1 (#315)

Bumps [wandb](https://github.com/wandb/wandb) from 0.16.0 to 0.16.1.
- [Release notes](https://github.com/wandb/wandb/releases)
- [Changelog](https://github.com/wandb/wandb/blob/main/CHANGELOG.md)
- [Commits](wandb/wandb@v0.16.0...v0.16.1)

---
updated-dependencies:
- dependency-name: wandb
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* chore(deps): bump actions/setup-python from 4 to 5 (#319)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* build(deps-dev): bump mkdocs-material from 9.4.14 to 9.5.2 (#324)

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.14 to 9.5.2.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.14...9.5.2)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* test: Improvements to `test_dataset` and `test_clump_step` (#312)

* test: rename TestDataset to MockDataset

* test: output test_clumpstep_summary_stats results to temp dir

* feat: Gnomad v4 based variant annotation (#311)

* feat: gnomad4 parser and changes in schema

* fix: variant annotation schema

* feat: required changes in variant index

* test: fix schema

* test: adjust testing to absence of sift and polyphen

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: typo in schema

* fix: preventing skewed partitions

* fix: remove sift and polyphen predictions from v2g

* feat: rename gnomad3VariantId to gnomadVariantId name

* refactor: stop inheriting datasets in parsers (#313)

* refactor: stop inheriting datasets in parsers

* fix: typing issue

* refactor: include datasets in datasources

* test: fix incorrect import

* test: doctest function calls fixed

* test: doctest function calls fixed in studyindex

---------

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat: add gwas_catalog_preprocess dag (#291)

* feat: gwas_catalog step stops at ingestion

* feat: gwas_catalog step stops at ingestion

* feat: add gwas_catalog_preprocess dag

* fix: change step_id to task_id as task_id

* feat: group gwas_catalog_preprocess tasks into sumstats and curation groups

* fix: add all dependencies when ld clumping

* fix: update gwas catalog docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor(dag): extract releasebucket and sumstats paths as constants

* refactor: streamline study locus paths

* build(deps-dev): bump pre-commit from 3.5.0 to 3.6.0 (#316)

Bumps [pre-commit](https://github.com/pre-commit/pre-commit) from 3.5.0 to 3.6.0.
- [Release notes](https://github.com/pre-commit/pre-commit/releases)
- [Changelog](https://github.com/pre-commit/pre-commit/blob/main/CHANGELOG.md)
- [Commits](pre-commit/pre-commit@v3.5.0...v3.6.0)

---
updated-dependencies:
- dependency-name: pre-commit
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump numpy from 1.26.1 to 1.26.2 (#314)

Bumps [numpy](https://github.com/numpy/numpy) from 1.26.1 to 1.26.2.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst)
- [Commits](numpy/numpy@v1.26.1...v1.26.2)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump typing-extensions from 4.8.0 to 4.9.0 (#317)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.8.0 to 4.9.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.8.0...4.9.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: add `l2g_benchmark` notebook to compare with production results (#323)

* feat: finngen preprocess prototype (#272)

All steps associated with the preprocessing of FinnGen studies (PICS-road) are included in a DAG:
- summary stats harmonisation
- window-based clumping
- LD-based clumping
- PICS

Several enhancements might follow in different PRs.

---------
Co-authored-by: Irene López <irene.lopezs@protonmail.com>

* chore: create code of conduct (#327)

* Create CODE_OF_CONDUCT.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* chore: review study locus and study index configs (#326)

* chore: make studylocus and study indices configs clearer

* chore: temporarily turn off removal of redundancies due to perf

* refactor: read studyindex and studylocus recursively

* feat: track training data and feature importance (#325)

* chore: delete makefile_deprecated (#329)

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat: ruff as formatter (#322)

* ruff formatter instead of black

* refactor: ruff reformatted files

* feat: more complete ruff adjustments

* refactor: all codebase to comply with ruff rules

* chore: update lock

* feat: more stringent docstring rules

* revert: remove isort and black from Makefile

* build(deps-dev): bump google-cloud-dataproc from 5.7.0 to 5.8.0 (#330)

Bumps [google-cloud-dataproc](https://github.com/googleapis/google-cloud-python) from 5.7.0 to 5.8.0.
- [Release notes](https://github.com/googleapis/google-cloud-python/releases)
- [Changelog](https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/CHANGELOG.md)
- [Commits](googleapis/google-cloud-python@google-cloud-dataproc-v5.7.0...google-cloud-dataproc-v5.8.0)

---
updated-dependencies:
- dependency-name: google-cloud-dataproc
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.1.6 to 0.1.7 (#331)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.6 to 0.1.7.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.6...v0.1.7)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: semantic release automation (#294)

Feature needs to be fully tested

* feat: track feature missingness rates (#335)

* feat(L2GFeatureMatrix): add `features_list` as attribute

* fix: log wandb table

* feat(L2GFeatureMatrix): track missingness rate for each feature

* feat(L2GFeatureMatrix): track missingness rate for each feature

* chore(LocusToGeneModel): remove evaluation outside experiment tracking

* feat: trigger on push (#337)

* build(deps-dev): bump pytest-xdist from 3.4.0 to 3.5.0 (#333)

* build(deps-dev): bump ipykernel from 6.26.0 to 6.27.1 (#332)

* feat: yamllint to ensure yaml linting (#338)

* feat: yamllint support

* feat: updates yamllint rules

* fix: release actions fixes (#344)

* feat: metadata on toml

* feat: several fixes

* refactor: linting

* refactor: externalise python version

* fix: single quotes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* docs: finngen description v1 (#345)

* chore: upgrade checkout (#346)

* fix: github token (#348)

* fix: several issues (#349)

* feat: release branch (#350)

* fix: unnecessary option (#351)

* build(deps-dev): bump python-semantic-release from 8.3.0 to 8.5.1 (#343)

Bumps [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.3.0 to 8.5.1.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.3.0...v8.5.1)

---
updated-dependencies:
- dependency-name: python-semantic-release
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* feat: activate release process (#352)

* feat: serious release

* revert: no tests within release workflow

* build(deps-dev): bump isort from 5.13.1 to 5.13.2 (#342)

Bumps [isort](https://github.com/pycqa/isort) from 5.13.1 to 5.13.2.
- [Release notes](https://github.com/pycqa/isort/releases)
- [Changelog](https://github.com/PyCQA/isort/blob/main/CHANGELOG.md)
- [Commits](PyCQA/isort@5.13.1...5.13.2)

---
updated-dependencies:
- dependency-name: isort
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.1.7 to 0.1.8 (#341)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.7 to 0.1.8.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.7...v0.1.8)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: upload release (#353)

* feat: semantic release gh action (#354)

* revert: dispatch (#355)

* build(deps): bump pyspark from 3.3.3 to 3.3.4 (#358)

Bumps [pyspark](https://github.com/apache/spark) from 3.3.3 to 3.3.4.
- [Commits](apache/spark@v3.3.3...v3.3.4)

---
updated-dependencies:
- dependency-name: pyspark
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump python-semantic-release/python-semantic-release (#359)

Bumps [python-semantic-release/python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.3.0 to 8.5.1.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.3.0...v8.5.1)

---
updated-dependencies:
- dependency-name: python-semantic-release/python-semantic-release
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* ci: new changelog and release notes templates (#357)

Templates for CHANGELOG and release notes. To be fully tested on the next release.

* fix(l2g): `calculate_feature_missingness_rate` counts features annotated with 0 as incomplete (#364)

* chore: import wandb classes explicitly

* fix: count feature annotation with 0 as incomplete

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docs: corrected and added documentation to datasource (#362)

* docs: corrected and added documentation to datasource

* docs: corrected documentation to datasource - answering comments v1

* docs: corrections in datasource documentation

* fix: incorrect parsing of `app_name` in makefile (#367)

* fix: correct app_name in makefile

* chore: remove redundant dist cleaning

* chore: streamline make rules dependencies

* ci: set codecov default branch to dev (#368)

* feat: Finngen R10 harmonisation and preprocessing (#370)

* chore: remove unnecessary file

* fix: several fixes on finngen harmonisation and preprocess

* docs: update docs

* fix: test

* fix: uncomment line

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

---------

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat(pics): remove variants from `locus` when PICS cannot be applied (#361)

* feat(pics): variants not in locus when PIPs can't be calculated

* feat(pics): add empty_locus qc flag

* chore(pics): add  to finemappingMethod column

* refactor(pics): change definition of non picsable based on ldset

* Update src/otg/dataset/study_locus.py

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* Update src/otg/method/pics.py

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* Update tests/method/test_pics.py

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* chore(study_index): change numeric columns from long to integers (#371)

* chore(study_index): change numeric columns to integers

* chore(study_index): accommodate parsers to schema changes

* feat(l2g): add features based on predicted variant consequences (#360)

* chore: import wandb classes explicitly

* feat(l2g): add studylocusfeaturefactory._get_vep_features

* chore: accommodate project to newer features

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat(l2g): include averaged features

* chore: set cluster delete TTL (#379)

* build(deps-dev): bump apache-airflow from 2.7.3 to 2.8.0 (#373)

Bumps [apache-airflow](https://github.com/apache/airflow) from 2.7.3 to 2.8.0.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@2.7.3...2.8.0)

---
updated-dependencies:
- dependency-name: apache-airflow
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* build(deps-dev): bump mypy from 1.7.1 to 1.8.0 (#374)

Bumps [mypy](https://github.com/python/mypy) from 1.7.1 to 1.8.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](python/mypy@v1.7.1...v1.8.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump python-semantic-release/python-semantic-release (#372)

Bumps [python-semantic-release/python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.5.1 to 8.7.0.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.5.1...v8.7.0)

---
updated-dependencies:
- dependency-name: python-semantic-release/python-semantic-release
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* build(deps-dev): bump python-semantic-release from 8.5.1 to 8.7.0 (#375)

Bumps [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.5.1 to 8.7.0.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.5.1...v8.7.0)

---
updated-dependencies:
- dependency-name: python-semantic-release
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* build(deps-dev): bump mkdocs-git-revision-date-localized-plugin (#376)

Bumps [mkdocs-git-revision-date-localized-plugin](https://github.com/timvink/mkdocs-git-revision-date-localized-plugin) from 1.2.1 to 1.2.2.
- [Release notes](https://github.com/timvink/mkdocs-git-revision-date-localized-plugin/releases)
- [Commits](timvink/mkdocs-git-revision-date-localized-plugin@v1.2.1...v1.2.2)

---
updated-dependencies:
- dependency-name: mkdocs-git-revision-date-localized-plugin
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ipython from 8.18.1 to 8.19.0 (#377)

Bumps [ipython](https://github.com/ipython/ipython) from 8.18.1 to 8.19.0.
- [Release notes](https://github.com/ipython/ipython/releases)
- [Commits](ipython/ipython@8.18.1...8.19.0)

---
updated-dependencies:
- dependency-name: ipython
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Irene López <irene.lopezs@protonmail.com>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tsukanoffkirill@gmail.com>
Co-authored-by: David Ochoa <dogcaesar@gmail.com>
Co-authored-by: Yakov <yt4@sanger.ac.uk>
Daniel-Considine added a commit that referenced this pull request Jan 3, 2024
* test: refactor test_gnomad_ld

* build(deps-dev): bump mkdocs-material from 9.4.7 to 9.4.8

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.7 to 9.4.8.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.7...9.4.8)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump pytest-xdist from 3.3.1 to 3.4.0

Bumps [pytest-xdist](https://github.com/pytest-dev/pytest-xdist) from 3.3.1 to 3.4.0.
- [Changelog](https://github.com/pytest-dev/pytest-xdist/blob/master/CHANGELOG.rst)
- [Commits](pytest-dev/pytest-xdist@v3.3.1...v3.4.0)

---
updated-dependencies:
- dependency-name: pytest-xdist
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: revert testing changes

* fix: gnomad paths are not necessary after #233

* build(deps): bump wandb from 0.13.11 to 0.16.0

Bumps [wandb](https://github.com/wandb/wandb) from 0.13.11 to 0.16.0.
- [Release notes](https://github.com/wandb/wandb/releases)
- [Changelog](https://github.com/wandb/wandb/blob/main/CHANGELOG.md)
- [Commits](wandb/wandb@v0.13.11...v0.16.0)

---
updated-dependencies:
- dependency-name: wandb
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.1.4 → v0.1.5](astral-sh/ruff-pre-commit@v0.1.4...v0.1.5)
- [github.com/psf/black: 23.10.1 → 23.11.0](psf/black@23.10.1...23.11.0)
- [github.com/alessandrojcm/commitlint-pre-commit-hook: v9.7.0 → v9.8.0](alessandrojcm/commitlint-pre-commit-hook@v9.7.0...v9.8.0)
- [github.com/pre-commit/mirrors-mypy: v1.6.1 → v1.7.0](pre-commit/mirrors-mypy@v1.6.1...v1.7.0)

* fix: persist raw gwascat associations to return consistent results

* fix: coalesce variantid to assign a studylocusid

closes #3151

* chore: use a higher RAM master machine

* chore: use gs:// prefix for FinnGen input data

* chore: use more partitions for FinnGen

* chore: always error if the output data exists

* feat: add eQTL ingestion to the list of steps in DAG

* docs: update running instructions

* docs: update contributing checklist

* docs: add automatically generated docs

* chore: add configuration

* chore: unify FinnGen config with eQTL Catalogue

* feat: eQTL Catalogue main ingestion script

* feat: implement study index ingestion

* feat: implement summary stats ingestion

* refactor: update eQTL study index import

* refactor: reorganise study index ingestion for readability

* style: docstring for eQTL Catalogue summary stats ingestion

* feat: construct study ID based on all appropriate columns

* feat: map partial and full study IDs

* feat: join dataframes to add the full study ID information

* feat: populate geneId column

* refactor: move gene ID joining into the study index class

* fix: do not partition by chromosome for QTL studies

* fix: update class names

* chore: add __init__.py for eQTL Catalogue

* fix: eqtl_catalogue path in docs

* fix: name of EqtlCatalogueStep class

* chore: replace attributes with static methods for eQTL study index

* chore: replace attributes with static methods for eQTL summary stats

* fix: do not initialise session in the main class

* test: add conftests for eQTL Catalogue

* fix: include header when reading the study index

* fix: populating publicationDate

* fix: cast nSamples as long

* fix: manually specify schema for eQTL Catalogue summary stats

* fix: typo in position field name

* test: add sample eQTL Catalogue studies

* test: add sample eQTL Catalogue summary stats

* test: add test for eQTL Catalogue study index

* test: add test for eQTL Catalogue summary stats

* chore: partition output data by chromosome

* chore: read input data from Google Storage

* fix: studies sample filename

* chore: repartition data before processing

* revert: repartition call

* revert: repartition call

* build(deps-dev): bump google-cloud-dataproc from 5.6.0 to 5.7.0

Bumps [google-cloud-dataproc](https://github.com/googleapis/google-cloud-python) from 5.6.0 to 5.7.0.
- [Release notes](https://github.com/googleapis/google-cloud-python/releases)
- [Changelog](https://github.com/googleapis/google-cloud-python/blob/main/CHANGELOG.md)
- [Commits](googleapis/google-cloud-python@google-cloud-dataproc-v5.6.0...google-cloud-dataproc-v5.7.0)

---
updated-dependencies:
- dependency-name: google-cloud-dataproc
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: change definition of negative l2g evidence

* refactor: modularise logic for gold standards

* refactor: move hardcoded values to constants

* refactor: turn `OpenTargetsL2GGoldStandard` into class methods

* refactor(gold_standard): move logic to refine gold standards to `L2GGoldStandard`

* chore: remove the num_local_ssds arg which has no effect

* chore: align default values with docstring

* feat: add ability to attach local SSDs

* chore: make local SSDs a default

* test: add `test_parse_positive_curation`

* test: fix and test logic in `expand_gold_standard_with_negatives`

* test: add `test_expand_gold_standard_with_negatives_same_positives`

* test: testing for `process_gene_interactions`

* feat: parametrise autoscaling policy

* feat: finetune spark job: taskid, trigger rule, other args

* feat: new PICS step

* feat: configuration for PICS step

* docs: pics step

* feat: datasets config

* Revert "feat: datasets config"

This reverts commit a07c7c4.

* feat: gcp dataset config

* fix: wrong lines removed

* feat: raise error in `from_parquet` when df is empty

* chore: add `variantId` to gold standards schema

* chore: change `sources` in gold standards schema to a nullable

* test: add `test_filter_unique_associations`

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* Update src/otg/pics.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* feat(overlaps): add and test method to transform the overlaps as a square matrix

* chore(overlaps): chromosome and statistics are not mandatory fields in the schema

* feat(l2g_gold_standard): change `filter_unique_associations` logic

* test(l2g_gold_standard): add `test_remove_false_negatives`

* fix(l2g_gold_standard): fix logic in `remove_false_negatives`

* chore(gold_standards): define gs labels as `L2GGoldStandard` attributes

* build(deps-dev): bump apache-airflow-providers-google

Bumps [apache-airflow-providers-google](https://github.com/apache/airflow) from 10.11.0 to 10.11.1.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@providers-google/10.11.0...providers-google/10.11.1)

---
updated-dependencies:
- dependency-name: apache-airflow-providers-google
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump mypy from 1.6.1 to 1.7.0

Bumps [mypy](https://github.com/python/mypy) from 1.6.1 to 1.7.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](python/mypy@v1.6.1...v1.7.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump mkdocs-material from 9.4.8 to 9.4.10

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.8 to 9.4.10.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.8...9.4.10)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps-dev): bump mkdocstrings-python from 1.7.3 to 1.7.4

Bumps [mkdocstrings-python](https://github.com/mkdocstrings/python) from 1.7.3 to 1.7.4.
- [Release notes](https://github.com/mkdocstrings/python/releases)
- [Changelog](https://github.com/mkdocstrings/python/blob/main/CHANGELOG.md)
- [Commits](mkdocstrings/python@1.7.3...1.7.4)

---
updated-dependencies:
- dependency-name: mkdocstrings-python
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: pre-commit autoupdate

* feat: updating summary stats schema and ingestion

* chore: remove beta value interval calculation

* refactor: removing odds ratio, and confidence intervals from the schema

* fix: multiply standard error by zscore in `calculate_confidence_interval`

* chore: remove reference to confidence intervals

* fix: local SSD initialisation

* feat: gitignore .venv file

* feat: ingestion supported for both new and old format of the harmonized GWAS Catalog Summary stats. (#274)

* feat: converting gwas catalog ingestion to the new format

* feat: generalizing GWAS Catalog sumstats ingestion to support both formats

* fix: sample file must follow the convention of the real files

* fix: whoopsie, the sample file was not added

* test: adding test for old GWAS Catalog format

* feat: updating gwas summary stats ingestion step to the new way of getting data

* fix: casting types to make schema explicit

* build(deps): bump pyarrow from 11.0.0 to 14.0.1

Bumps [pyarrow](https://github.com/apache/arrow) from 11.0.0 to 14.0.1.
- [Commits](apache/arrow@go/v11.0.0...go/v14.0.1)

---
updated-dependencies:
- dependency-name: pyarrow
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: rename study_locus to credible_set for l2g

* build(deps-dev): bump ipython from 8.17.2 to 8.18.1 (#280)

Bumps [ipython](https://github.com/ipython/ipython) from 8.17.2 to 8.18.1.
- [Release notes](https://github.com/ipython/ipython/releases)
- [Commits](ipython/ipython@8.17.2...8.18.1)

---
updated-dependencies:
- dependency-name: ipython
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump mkdocstrings-python from 1.7.4 to 1.7.5 (#279)

Bumps [mkdocstrings-python](https://github.com/mkdocstrings/python) from 1.7.4 to 1.7.5.
- [Release notes](https://github.com/mkdocstrings/python/releases)
- [Changelog](https://github.com/mkdocstrings/python/blob/main/CHANGELOG.md)
- [Commits](mkdocstrings/python@1.7.4...1.7.5)

---
updated-dependencies:
- dependency-name: mkdocstrings-python
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.1.3 to 0.1.6 (#276)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.3 to 0.1.6.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.3...v0.1.6)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump mypy from 1.7.0 to 1.7.1 (#278)

Bumps [mypy](https://github.com/python/mypy) from 1.7.0 to 1.7.1.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](python/mypy@v1.7.0...v1.7.1)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs: minify plugin removed to prevent clash in local development (#284)

* fix: proper parsing of gwas catalog study accession from filename (#282)

* fix: proper parsing of gwas catalog study accession from filename

* fix: interestingly, I had to add an extra line to pass doctest on VSCode

* feat: bumping mypy version

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* fix: making standard_error column optional (#286)

* fix: making standard-error column optional

* feat: add `clump` step (#288)

* feat: join clump methods into step canonical

* refactor: keep clump step minimal

* docs: add clump step docs page

* fix: add clump config

* revert(pre-commit): downgrade mypy version to 1.7.0

* feat(clump): add attribute to collect locus in window-based clumping

* refactor(clump): make study and ld indices non mandatory

* test: add test for clumpstep when input is ss

* test: rename test_clump_py to avoid clashes

* refactor: remove data variable in clumpstep

* fix: correct and test study splitter when subStudyDescription is the same (#289)

* fix(clump): read input files recursively (#292)

* build(deps-dev): bump mkdocs-material from 9.4.10 to 9.4.14 (#300)

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.10 to 9.4.14.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.10...9.4.14)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump scipy from 1.11.3 to 1.11.4 (#299)

Bumps [scipy](https://github.com/scipy/scipy) from 1.11.3 to 1.11.4.
- [Release notes](https://github.com/scipy/scipy/releases)
- [Commits](scipy/scipy@v1.11.3...v1.11.4)

---
updated-dependencies:
- dependency-name: scipy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump pymdown-extensions from 10.3.1 to 10.5 (#301)

Bumps [pymdown-extensions](https://github.com/facelessuser/pymdown-extensions) from 10.3.1 to 10.5.
- [Release notes](https://github.com/facelessuser/pymdown-extensions/releases)
- [Commits](facelessuser/pymdown-extensions@10.3.1...10.5)

---
updated-dependencies:
- dependency-name: pymdown-extensions
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump mkdocs-git-committers-plugin-2 from 1.2.0 to 2.2.2 (#303)

Bumps [mkdocs-git-committers-plugin-2](https://github.com/ojacques/mkdocs-git-committers-plugin-2) from 1.2.0 to 2.2.2.
- [Release notes](https://github.com/ojacques/mkdocs-git-committers-plugin-2/releases)
- [Commits](ojacques/mkdocs-git-committers-plugin-2@1.2.0...2.2.2)

---
updated-dependencies:
- dependency-name: mkdocs-git-committers-plugin-2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump apache-airflow-providers-google (#302)

Bumps [apache-airflow-providers-google](https://github.com/apache/airflow) from 10.11.1 to 10.12.0.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@providers-google/10.11.1...providers-google/10.12.0)

---
updated-dependencies:
- dependency-name: apache-airflow-providers-google
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: deptry added to handle unused, missing and transitive dependencies (#304)

* feat: add deptry to the project

* feat: deptry as pre-commit and associated changes

* fix: exclude deptry from pre-commit.ci

* ci: check deptry when testing

* fix: poetry lock with no update

* feat: adding unpublished studies (#290)

* feat: adding unpublished studies

* feat: updating all gwas catalog sources

* feat: generalizing study ingestion to accept list of files

* fix: removing unused configuration

* feat: adding unpublished ancestries as well

* fix: updating ancestry config name

* Apply suggestions from code review

Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

---------

Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* feat: add prettier as formatter (yaml, json, md, etc.) (#298)

* feat: prettier added to the project and precommit

* feat: recommend vscode extension

* fix: adjusted prettier

* fix: all files reformatted by prettier

* fix: unnecessary comment

* chore(l2ggoldstandard): add studyId to schema (#305)

* chore(l2ggoldstandard): add studyId to schema

* fix: add `studyId` to gold standards testing fixtures

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* feat: Adding cohorts field to study index (#309)

* feat: ingesting cohort information for GWAS Catalog

* feat: adding cohort to finngen study index

* feat: add 'coalesce' and 'repartition' wrappers to 'Dataset' (#307)

* feat(dataset): add `coalesce`

* feat(clump): coalesce summary stats to 1000 partitions

* feat(dataset): change coalesce to setPartitions

* test(dataset): added `TestSetPartitions`

* refactor(dataset): split set_partitions into repartition and coalesce

* chore(clump): decrease number of sumstats partitions to 400

* chore: change number of partitions for window clumping to 4000

Number estimated on the basis of 60 workers x 16 cores x 4 partitions (3,840, rounded up to 4,000)

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
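
The partition estimate quoted in the window-clumping commit above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the rounding step are assumptions for illustration, not code from the repository.

```python
def estimate_partitions(workers: int, cores_per_worker: int, partitions_per_core: int) -> int:
    """Return the raw partition count for a cluster of the given shape."""
    return workers * cores_per_worker * partitions_per_core


# Figures taken from the commit message: 60 workers x 16 cores x 4 partitions.
raw_estimate = estimate_partitions(workers=60, cores_per_worker=16, partitions_per_core=4)

# Rounding to the nearest thousand is an assumption about how 4000 was chosen.
rounded_estimate = round(raw_estimate, -3)

print(raw_estimate, rounded_estimate)
```

Keeping the three factors as named parameters makes it easy to recompute the figure if the cluster shape changes.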

* chore(airflow): schedule_interval deprecation warning (#293)

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat: GWAS Catalog harmonisation prototype (#270)

* chore: merge main

* feat: partitioning GWAS Catalog dataset to 20 equally-sized partitions

* fix: missing config

* feat: dag with gwas catalog harmonisation

* feat: adding more primary workers to help with the task

* refactor: changed autoscaling policy

* feat: allow specifying the number of preemptible workers

* fix: bugs on to_do_list

* revert: version number

* fix: gwas_catalog_sumstat_preprocess no longer needs study_id

* fix: unnecessary config causes issues

* fix: improved regexp

* refactor: generalising the config

* fix: rename cluster to prevent clashes with other dags

* Update src/airflow/dags/gwas_catalog_harmonisation.py

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

---------

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>
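
The PR description for #270 explains that each study is sorted by chromosome and position and then cut into a fixed number of 20 partitions so that partitions are equally sized. The idea can be sketched in plain Python as below. This is not the Spark code from the PR: the `Row` type, the `chunk_sorted` helper and the toy data are all hypothetical and only illustrate the sort-then-chunk step.

```python
from typing import List, Tuple

# Hypothetical minimal row: (chromosome, position).
Row = Tuple[str, int]


def chunk_sorted(rows: List[Row], n_chunks: int = 20) -> List[List[Row]]:
    """Sort rows by (chromosome, position) and split them into n_chunks
    contiguous chunks whose sizes differ by at most one row."""
    # Note: chromosomes are compared as strings here, so "10" sorts before "2".
    ordered = sorted(rows)
    base, extra = divmod(len(ordered), n_chunks)
    chunks: List[List[Row]] = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if i < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


# Toy input: 3 chromosomes x 100 positions, deliberately unsorted.
rows = [(str(c), p) for c in range(1, 4) for p in range(100, 0, -1)]
chunks = chunk_sorted(rows, n_chunks=20)
print([len(chunk) for chunk in chunks])
```

Because the chunks are cut from a single sorted sequence, no chunk can end up much larger than the others, which is the "no skews" property the PR description mentions.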

* test: failing doctest in different python version (#320)

* test: issue on test with different python version

* refactor: value error instead of assertion

* build(deps): bump wandb from 0.16.0 to 0.16.1 (#315)

Bumps [wandb](https://github.com/wandb/wandb) from 0.16.0 to 0.16.1.
- [Release notes](https://github.com/wandb/wandb/releases)
- [Changelog](https://github.com/wandb/wandb/blob/main/CHANGELOG.md)
- [Commits](wandb/wandb@v0.16.0...v0.16.1)

---
updated-dependencies:
- dependency-name: wandb
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* chore(deps): bump actions/setup-python from 4 to 5 (#319)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* build(deps-dev): bump mkdocs-material from 9.4.14 to 9.5.2 (#324)

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.14 to 9.5.2.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.4.14...9.5.2)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* test: Improvements to `test_dataset` and `test_clump_step` (#312)

* test: rename TestDataset to MockDataset

* test: output test_clumpstep_summary_stats results to temp dir

* feat: Gnomad v4 based variant annotation (#311)

* feat: gnomad4 parser and changes in schema

* fix: variant annotation schema

* feat: required changes in variant index

* test: fix schema

* test: adjust testing to absence of sift and polyphen

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: typo in schema

* fix: preventing skewed partitions

* fix: remove sift and polyphen predictions from v2g

* feat: rename gnomad3VariantId to gnomadVariantId

* refactor: stop inheriting datasets in parsers (#313)

* refactor: stop inheriting datasets in parsers

* fix: typing issue

* refactor: include datasets in datasources

* test: fix incorrect import

* test: doctest function calls fixed

* test: doctest function calls fixed in studyindex

---------

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat: add gwas_catalog_preprocess dag (#291)

* feat: gwas_catalog step stops at ingestion

* feat: add gwas_catalog_preprocess dag

* fix: change step_id to task_id as task_id

* feat: group gwas_catalog_preprocess tasks into sumstats and curation groups

* fix: add all dependencies when ld clumping

* fix: update gwas catalog docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor(dag): extract releasebucket and sumstats paths as constants

* refactor: streamline study locus paths

* build(deps-dev): bump pre-commit from 3.5.0 to 3.6.0 (#316)

Bumps [pre-commit](https://github.com/pre-commit/pre-commit) from 3.5.0 to 3.6.0.
- [Release notes](https://github.com/pre-commit/pre-commit/releases)
- [Changelog](https://github.com/pre-commit/pre-commit/blob/main/CHANGELOG.md)
- [Commits](pre-commit/pre-commit@v3.5.0...v3.6.0)

---
updated-dependencies:
- dependency-name: pre-commit
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump numpy from 1.26.1 to 1.26.2 (#314)

Bumps [numpy](https://github.com/numpy/numpy) from 1.26.1 to 1.26.2.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst)
- [Commits](numpy/numpy@v1.26.1...v1.26.2)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump typing-extensions from 4.8.0 to 4.9.0 (#317)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.8.0 to 4.9.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.8.0...4.9.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: add `l2g_benchmark` notebook to compare with production results (#323)

* feat: finngen preprocess prototype (#272)

All steps associated with preprocessing FinnGen studies (PICS road) are included in a DAG:
- summary stats harmonisation
- window-based clumping
- LD-based clumping
- PICS

Several enhancements might follow in different PRs.

---------
Co-authored-by: Irene López <irene.lopezs@protonmail.com>

* chore: create code of conduct (#327)

* Create CODE_OF_CONDUCT.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* chore: review study locus and study index configs (#326)

* chore: make studylocus and study indices configs clearer

* chore: temporarily turn off removal of redundancies due to perf

* refactor: read studyindex and studylocus recursively

* feat: track training data and feature importance (#325)

* chore: delete makefile_deprecated (#329)

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat: ruff as formatter (#322)

* ruff formatter instead of black

* refactor: ruff reformatted files

* feat: more complete ruff adjustments

* refactor: all codebase to comply with ruff rules

* chore: update lock

* feat: more stringent docstring rules

* revert: remove isort and black from Makefile

* build(deps-dev): bump google-cloud-dataproc from 5.7.0 to 5.8.0 (#330)

Bumps [google-cloud-dataproc](https://github.com/googleapis/google-cloud-python) from 5.7.0 to 5.8.0.
- [Release notes](https://github.com/googleapis/google-cloud-python/releases)
- [Changelog](https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/CHANGELOG.md)
- [Commits](googleapis/google-cloud-python@google-cloud-dataproc-v5.7.0...google-cloud-dataproc-v5.8.0)

---
updated-dependencies:
- dependency-name: google-cloud-dataproc
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.1.6 to 0.1.7 (#331)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.6 to 0.1.7.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.6...v0.1.7)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: semantic release automation (#294)

Feature needs to be fully tested

* feat: track feature missingness rates (#335)

* feat(L2GFeatureMatrix): add `features_list` as attribute

* fix: log wandb table

* feat(L2GFeatureMatrix): track missingness rate for each feature

* chore(LocusToGeneModel): remove evaluation outside experiment tracking

* feat: trigger on push (#337)

* build(deps-dev): bump pytest-xdist from 3.4.0 to 3.5.0 (#333)

* build(deps-dev): bump ipykernel from 6.26.0 to 6.27.1 (#332)

* feat: yamllint to ensure yaml linting (#338)

* feat: yamllint support

* feat: updates yamllint rules

* fix: release actions fixes (#344)

* feat: metadata on toml

* feat: several fixes

* refactor: linting

* refactor: externalise python version

* fix: single quotes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* docs: finngen description v1 (#345)

* chore: upgrade checkout (#346)

* fix: github token (#348)

* fix: several issues (#349)

* feat: release branch (#350)

* fix: unnecessary option (#351)

* build(deps-dev): bump python-semantic-release from 8.3.0 to 8.5.1 (#343)

Bumps [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.3.0 to 8.5.1.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.3.0...v8.5.1)

---
updated-dependencies:
- dependency-name: python-semantic-release
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* feat: activate release process (#352)

* feat: serious release

* revert: no tests within release workflow

* build(deps-dev): bump isort from 5.13.1 to 5.13.2 (#342)

Bumps [isort](https://github.com/pycqa/isort) from 5.13.1 to 5.13.2.
- [Release notes](https://github.com/pycqa/isort/releases)
- [Changelog](https://github.com/PyCQA/isort/blob/main/CHANGELOG.md)
- [Commits](PyCQA/isort@5.13.1...5.13.2)

---
updated-dependencies:
- dependency-name: isort
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.1.7 to 0.1.8 (#341)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.7 to 0.1.8.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.7...v0.1.8)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: upload release (#353)

* feat: semantic release gh action (#354)

* revert: dispatch (#355)

* build(deps): bump pyspark from 3.3.3 to 3.3.4 (#358)

Bumps [pyspark](https://github.com/apache/spark) from 3.3.3 to 3.3.4.
- [Commits](apache/spark@v3.3.3...v3.3.4)

---
updated-dependencies:
- dependency-name: pyspark
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump python-semantic-release/python-semantic-release (#359)

Bumps [python-semantic-release/python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.3.0 to 8.5.1.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.3.0...v8.5.1)

---
updated-dependencies:
- dependency-name: python-semantic-release/python-semantic-release
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

* ci: new changelog and release notes templates (#357)

Templates for CHANGELOG and release notes. To be fully tested on the next release.

* fix(l2g): `calculate_feature_missingness_rate` counts features annotated with 0 as incomplete (#364)

* chore: import wandb classes explicitly

* fix: count feature annotation with 0 as incomplete

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docs: corrected and added documentation to datasource (#362)

* docs: corrected and added documentation to datasource

* docs: corrected documentation to datasource - answering comments v1

* docs: corrections in datasource documentation

* fix: incorrect parsing of `app_name` in makefile (#367)

* fix: correct app_name in makefile

* chore: remove redundant dist cleaning

* chore: streamline make rules dependencies

* ci: set codecov default branch to dev (#368)

* feat: Finngen R10 harmonisation and preprocessing (#370)

* chore: remove unnecessary file

* fix: several fixes on finngen harmonisation and preprocess

* docs: update docs

* fix: test

* fix: uncomment line

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

---------

Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* feat(pics): remove variants from `locus` when PICS cannot be applied (#361)

* feat(pics): variants not in locus if PIPs can't be calculated

* feat(pics): add empty_locus qc flag

* chore(pics): add  to finemappingMethod column

* refactor(pics): change definition of non picsable based on ldset

* Update src/otg/dataset/study_locus.py

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* Update src/otg/method/pics.py

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

* Update tests/method/test_pics.py

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>

---------

Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>

* chore(study_index): change numeric columns from long to integers (#371)

* chore(study_index): change numeric columns to integers

* chore(study_index): accommodate parsers to schema changes

* feat(l2g): add features based on predicted variant consequences (#360)

* chore: import wandb classes explicitly

* feat(l2g): add studylocusfeaturefactory._get_vep_features

* chore: accommodate project to newer features

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat(l2g): include averaged features

* chore: set cluster delete TTL (#379)

* build(deps-dev): bump apache-airflow from 2.7.3 to 2.8.0 (#373)

Bumps [apache-airflow](https://github.com/apache/airflow) from 2.7.3 to 2.8.0.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@2.7.3...2.8.0)

---
updated-dependencies:
- dependency-name: apache-airflow
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* build(deps-dev): bump mypy from 1.7.1 to 1.8.0 (#374)

Bumps [mypy](https://github.com/python/mypy) from 1.7.1 to 1.8.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](python/mypy@v1.7.1...v1.8.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump python-semantic-release/python-semantic-release (#372)

Bumps [python-semantic-release/python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.5.1 to 8.7.0.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.5.1...v8.7.0)

---
updated-dependencies:
- dependency-name: python-semantic-release/python-semantic-release
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* build(deps-dev): bump python-semantic-release from 8.5.1 to 8.7.0 (#375)

Bumps [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) from 8.5.1 to 8.7.0.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.5.1...v8.7.0)

---
updated-dependencies:
- dependency-name: python-semantic-release
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>

* build(deps-dev): bump mkdocs-git-revision-date-localized-plugin (#376)

Bumps [mkdocs-git-revision-date-localized-plugin](https://github.com/timvink/mkdocs-git-revision-date-localized-plugin) from 1.2.1 to 1.2.2.
- [Release notes](https://github.com/timvink/mkdocs-git-revision-date-localized-plugin/releases)
- [Commits](timvink/mkdocs-git-revision-date-localized-plugin@v1.2.1...v1.2.2)

---
updated-dependencies:
- dependency-name: mkdocs-git-revision-date-localized-plugin
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ipython from 8.18.1 to 8.19.0 (#377)

Bumps [ipython](https://github.com/ipython/ipython) from 8.18.1 to 8.19.0.
- [Release notes](https://github.com/ipython/ipython/releases)
- [Commits](ipython/ipython@8.18.1...8.19.0)

---
updated-dependencies:
- dependency-name: ipython
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Irene López <irene.lopezs@protonmail.com>
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
Co-authored-by: Daniel Suveges <daniel.suveges@protonmail.com>
Co-authored-by: Kirill Tsukanov <tskir@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirill Tsukanov <tsukanoffkirill@gmail.com>
Co-authored-by: David Ochoa <dogcaesar@gmail.com>
Co-authored-by: Yakov <yt4@sanger.ac.uk>