
These are the Jupyter notebooks for the GA4GH Federated Analysis Systems Project (FASP).

These supersede the scripts in FASP-Scripts that were used at GA4GH Plenary 2020.



FASPScripts Notebooks


The notebooks follow a basic three-step pattern used throughout FASP. Each step corresponds to a different GA4GH API, as outlined here:

  • Data Connect - to identify subjects and samples of interest based on attributes of those subjects and samples
  • Data Repository Service (DRS) - to obtain authorized access to files (genomic sequences)
  • Workflow Execution Service (WES) - to perform a workflow on those files
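
As a rough illustration of the three steps, the sketch below builds (but does not send) one HTTP request per step, using the endpoint paths defined in the GA4GH Data Connect, DRS and WES specifications. It is not the fasp package's own client code. The function names, the `example.org` hosts, the table name and the workflow URL are all hypothetical, and authorization headers are omitted.

```python
import json
from urllib.parse import quote


def data_connect_search(base_url, sql):
    """Step 1: describe a Data Connect search (POST {base}/search)."""
    return {
        "method": "POST",
        "url": f"{base_url.rstrip('/')}/search",
        "body": json.dumps({"query": sql}),
    }


def drs_access_request(base_url, object_id, access_id):
    """Step 2: describe a DRS request for an access URL to one file."""
    base = base_url.rstrip("/")
    # Identifiers are percent-encoded so they are safe as path segments.
    obj = quote(object_id, safe="")
    acc = quote(access_id, safe="")
    return {
        "method": "GET",
        "url": f"{base}/ga4gh/drs/v1/objects/{obj}/access/{acc}",
    }


def wes_run_request(base_url, workflow_url, workflow_type, type_version, params):
    """Step 3: describe a WES run submission (multipart form fields)."""
    return {
        "method": "POST",
        "url": f"{base_url.rstrip('/')}/ga4gh/wes/v1/runs",
        "form": {
            "workflow_url": workflow_url,
            "workflow_type": workflow_type,
            "workflow_type_version": type_version,
            "workflow_params": json.dumps(params),
        },
    }


if __name__ == "__main__":
    # All hosts, table names and identifiers below are placeholders.
    search = data_connect_search(
        "https://search.example.org",
        "SELECT sample_id, drs_id FROM example_catalog.samples LIMIT 3",
    )
    access = drs_access_request("https://drs.example.org", "example-drs-id", "gs")
    run = wes_run_request(
        "https://wes.example.org",
        "https://example.org/workflows/example.wdl",
        "WDL",
        "1.0",
        {"input_file": "<url returned by the DRS step>"},
    )
    for step in (search, access, run):
        print(step["method"], step["url"])
```

In a real notebook, the file URL returned by the DRS step would become an input parameter of the workflow submitted in the WES step.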

In any notebook, more than one implementation of a given API may be used at each step. This happens where different data sources need to be searched, where files are in different cloud locations, or where a workflow needs to be run local to those files.
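
One way to picture this is a small routing table that sends each file to the DRS and WES servers co-located with it. This is only a sketch of the idea, not how the notebooks are written. The `SERVICES` registry, the location labels, the `example.org` hosts and the sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical registry: the DRS and WES implementation to use per cloud location.
SERVICES = {
    "gcp": {"drs": "https://drs.gcp.example.org", "wes": "https://wes.gcp.example.org"},
    "aws": {"drs": "https://drs.aws.example.org", "wes": "https://wes.aws.example.org"},
}


def plan_runs(samples):
    """Group files by location so each group runs next to its data."""
    by_location = defaultdict(list)
    for sample in samples:
        by_location[sample["location"]].append(sample["drs_id"])

    plan = []
    for location, drs_ids in sorted(by_location.items()):
        services = SERVICES[location]  # raises KeyError for an unknown location
        plan.append(
            {
                "location": location,
                "drs": services["drs"],
                "wes": services["wes"],
                "drs_ids": drs_ids,
            }
        )
    return plan


if __name__ == "__main__":
    # Placeholder search results; real ones would come from the search step.
    samples = [
        {"drs_id": "id-1", "location": "gcp"},
        {"drs_id": "id-2", "location": "aws"},
        {"drs_id": "id-3", "location": "gcp"},
    ]
    for entry in plan_runs(samples):
        print(entry["location"], entry["wes"], entry["drs_ids"])
```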

Some notebooks use a non-GA4GH API that performs equivalent functionality. The motivation for each notebook was to search particular datasets together in a federated way. Where those data were not available through a GA4GH API, a proprietary API was used. In some cases the data sources used in the notebooks were created for demo or exploration purposes. In some cases this required creating scrambled versions of controlled access datasets. In other cases controlled access subject and specimen data were searched, but were accessed from private stores maintained under access control.

In all cases where controlled access sequence data was used, it remains under the access control of the repositories that make it available (EGA, NIH Cloud Platforms).

The table below indicates for each notebook where a GA4GH API could be used (blue) and where a proprietary API (grey) was used.

scriptGrid

**Prerequisites to run notebooks**

  • fasp package - install (e.g. with pip) from the fasp-scripts directory
  • Settings file
    • The examples directory contains a template settings file with a number of parameters for the FASP scripts. Place a copy of this file in your file system and set the environment variable FASP_SETTINGS to point to it. Edit the settings as appropriate.
  • Python 3
    • See the code for the modules required
  • A folder in your home directory called .keys containing keys for various services. Not all keys are required for all scripts.
  • The following modules are used by different scripts. Because not all scripts are likely to be relevant to all users, these modules are not installed with the fasp package. Please install those needed for the scripts you will run.
    • Google Life Sciences API enabled for your GCP account
    • BigQuery python libraries - for scripts that use BigQuery
    • Seven Bridges API
    • pyega3 - EGA client libraries for download. See also EGA documentation for client API.
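
Before running a notebook, it can help to confirm that the settings file and key folder described above are in place. The sketch below is a hypothetical preflight check, not part of the fasp package. It assumes the settings file is JSON, which should be checked against the template in the examples directory, and the helper names are invented for illustration.

```python
import json
import os
from pathlib import Path


def load_fasp_settings(env=None):
    """Load the file that FASP_SETTINGS points to (assumed here to be JSON)."""
    env = os.environ if env is None else env
    path = env.get("FASP_SETTINGS")
    if not path:
        raise RuntimeError("The FASP_SETTINGS environment variable is not set")
    settings_path = Path(path).expanduser()
    with settings_path.open() as handle:
        return json.load(handle)


def list_key_files(home=None):
    """Return the names of the files found in <home>/.keys, sorted."""
    keys_dir = (Path(home) if home else Path.home()) / ".keys"
    if not keys_dir.is_dir():
        return []
    return sorted(p.name for p in keys_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    try:
        settings = load_fasp_settings()
        print("Settings parameters:", sorted(settings))
    except (RuntimeError, OSError, ValueError) as err:
        print("Settings problem:", err)
    print("Key files in ~/.keys:", list_key_files())
```

Which key files a given notebook needs depends on the services it calls, so an empty or partial `.keys` folder is not necessarily an error.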