1 parent b5c7b2f, commit 7af2390
Showing 14 changed files with 877 additions and 197 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 0ffcf54393a856012dadc0b1563f2c43 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,219 @@ | ||
Cloud URL Examples | ||
==================== | ||
|
||
*Table of Contents*: | ||
|
||
- `HTTP <#http-section>`_ | ||
- `Local file <#local-file-section>`_ | ||
- `AWS S3 <#aws-section>`_ | ||
|
||
.. note:: | ||
|
||
The ``bed-reader`` package also supports Azure and GCP, but this page has no examples for them. | ||
|
||
To specify a file in the cloud, you provide a URL string plus optional cloud options. | ||
|
||
The exact details depend on the cloud service. We'll look at HTTP, local files, and AWS S3. | ||
|
||
.. _http-section: | ||
|
||
HTTP | ||
---- | ||
|
||
You can read \*.bed files from web sites directly. For small files, access will be fast. For medium-sized files, | ||
you may need to extend the default `timeout`. | ||
|
||
Reading from large files can also be practical and even fast under these conditions: | ||
|
||
- You need only some of the information | ||
- (Optional, but helpful) You can provide some metadata about individuals (samples) and SNPs (variants) locally. | ||
|
||
Let's first look at reading a small or medium-sized dataset. | ||
|
||
*Example:* | ||
|
||
Read an entire file and find the fraction of missing values. | ||
|
||
.. code-block:: python | ||
|
||
    >>> import numpy as np | ||
    >>> from bed_reader import open_bed | ||
    >>> with open_bed("https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/small.bed") as bed: | ||
    ...     val = bed.read() | ||
    ...     missing_count = np.isnan(val).sum() | ||
    ...     missing_fraction = missing_count / val.size | ||
    ...     missing_fraction  # doctest: +ELLIPSIS | ||
    0.1666... | ||
|
||
When reading a medium-sized file, you may need to set a `timeout` in your cloud options. With a `timeout`, | ||
you can give your code more than the default 30 seconds to read metadata from the \*.fam and \*.bim files | ||
(or genomic data from \*.bed). | ||
|
||
.. note:: | ||
|
||
See `ClientConfigKey <https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html>`_ | ||
for a list of cloud options, such as `timeout`, that you can always use. | ||
|
||
You may also wish to pass ``skip_format_check=True`` to skip the fast, | ||
early check of the \*.bed file's header. | ||
|
||
Here we print the first five iids (individual or sample ids) and the first five sids (SNP or variant ids). | ||
We then print all unique chromosome values. Finally, we read all data from chromosome 5 and print its dimensions. | ||
|
||
.. code-block:: python | ||
|
||
    >>> import numpy as np | ||
    >>> from bed_reader import open_bed | ||
    >>> with open_bed( | ||
    ...     "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.bed", | ||
    ...     cloud_options={"timeout": "100s"}, | ||
    ...     skip_format_check=True, | ||
    ... ) as bed: | ||
    ...     bed.iid[:5] | ||
    ...     bed.sid[:5] | ||
    ...     np.unique(bed.chromosome) | ||
    ...     val = bed.read(index=np.s_[:, bed.chromosome == "5"]) | ||
    ...     val.shape | ||
    array(['per0', 'per1', 'per2', 'per3', 'per4'], dtype='<U11') | ||
    array(['null_0', 'null_1', 'null_2', 'null_3', 'null_4'], dtype='<U9') | ||
    array(['1', '2', '3', '4', '5'], dtype='<U9') | ||
    (500, 440) | ||
|
||
Now, let's read from a large file containing data from over 1 million individuals (samples) and over 300,000 SNPs (variants). The file size is 91 GB. In this example, we read data for just one SNP (variant). If we know the number of individuals (samples) and SNPs (variants) exactly, we can read this SNP quickly and with just one file access. | ||
|
||
What is the mean value of the SNP (variant) at index position 100,000? | ||
|
||
.. code-block:: python | ||
|
||
    >>> import numpy as np | ||
    >>> from bed_reader import open_bed | ||
    >>> with open_bed( | ||
    ...     "https://www.ebi.ac.uk/biostudies/files/S-BSST936/genotypes/synthetic_v1_chr-10.bed", | ||
    ...     cloud_options={"timeout": "100s"}, | ||
    ...     skip_format_check=True, | ||
    ...     iid_count=1_008_000, | ||
    ...     sid_count=361_561, | ||
    ... ) as bed: | ||
    ...     val = bed.read(index=np.s_[:, 100_000], dtype=np.float32) | ||
    ...     np.mean(val)  # doctest: +ELLIPSIS | ||
    0.033913... | ||
|
||
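To see why the two counts are enough for a single, targeted read, here is a minimal sketch of the arithmetic. It assumes the standard SNP-major PLINK \*.bed layout (a 3-byte header, then one block per SNP that packs four individuals per byte). The helper name ``snp_byte_range`` is hypothetical, and the sketch illustrates the file layout only; it does not describe how ``bed-reader`` issues its requests internally.

```python
HEADER_BYTES = 3  # assumed PLINK .bed header: 2 magic bytes + 1 mode byte


def snp_byte_range(iid_count: int, sid_index: int) -> tuple[int, int]:
    """Return the [start, stop) byte range of one SNP in a SNP-major .bed file."""
    # Each individual takes 2 bits, so 4 individuals fit in a byte (rounded up).
    bytes_per_snp = (iid_count + 3) // 4
    start = HEADER_BYTES + sid_index * bytes_per_snp
    return start, start + bytes_per_snp


# Counts taken from the example above.
start, stop = snp_byte_range(iid_count=1_008_000, sid_index=100_000)
print(f"SNP 100_000 occupies bytes {start:_} to {stop:_} ({stop - start:_} bytes)")

# Sanity check: the same layout predicts the whole file's size.
total = HEADER_BYTES + 361_561 * ((1_008_000 + 3) // 4)
print(f"predicted file size: {total:_} bytes")
```

Under these assumptions the predicted size is about 91 GB, which agrees with the file size quoted above, while the one SNP of interest spans only 252,000 bytes.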
You can also download the \*.fam and \*.bim metadata files and then read from them locally while continuing to read the \*.bed file from the cloud. | ||
This gives you almost instant access to the metadata and the \*.bed file. Here is an example: | ||
|
||
.. code-block:: python | ||
|
||
    >>> from bed_reader import open_bed, sample_file | ||
    >>> import numpy as np | ||
    >>> # For this example, assume 'synthetic_v1_chr-10.fam' and 'synthetic_v1_chr-10.bim' are already downloaded | ||
    >>> # and 'local_fam_file' and 'local_bim_file' variables are set to their local file paths. | ||
    >>> local_fam_file = sample_file("synthetic_v1_chr-10.fam") | ||
    >>> local_bim_file = sample_file("synthetic_v1_chr-10.bim") | ||
    >>> with open_bed( | ||
    ...     "https://www.ebi.ac.uk/biostudies/files/S-BSST936/genotypes/synthetic_v1_chr-10.bed", | ||
    ...     fam_filepath=local_fam_file, | ||
    ...     bim_filepath=local_bim_file, | ||
    ...     skip_format_check=True, | ||
    ... ) as bed: | ||
    ...     print(f"iid_count={bed.iid_count:_}, sid_count={bed.sid_count:_}") | ||
    ...     print(f"iid={bed.iid[:5]}...") | ||
    ...     print(f"sid={bed.sid[:5]}...") | ||
    ...     print(f"unique chromosomes = {np.unique(bed.chromosome)}") | ||
    ...     val = bed.read(index=np.s_[:10, :: bed.sid_count // 10]) | ||
    ...     print(f"val={val}") | ||
    iid_count=1_008_000, sid_count=361_561 | ||
    iid=['syn1' 'syn2' 'syn3' 'syn4' 'syn5']... | ||
    sid=['chr10:10430:C:A' 'chr10:10483:A:C' 'chr10:10501:G:T' 'chr10:10553:C:A' | ||
     'chr10:10645:G:A']... | ||
    unique chromosomes = ['10'] | ||
    val=[[0. 1. 0. 2. 0. 1. 0. 0. 0. 0. 0.] | ||
     [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.] | ||
     [0. 0. 2. 2. 0. 1. 0. 2. 0. 0. 0.] | ||
     [0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.] | ||
     [0. 0. 1. 2. 0. 1. 0. 1. 0. 0. 0.] | ||
     [0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0.] | ||
     [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] | ||
     [0. 0. 0. 2. 0. 0. 0. 1. 0. 0. 0.] | ||
     [0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.] | ||
     [0. 0. 0. 2. 0. 1. 0. 1. 0. 0. 0.]] | ||
|
||
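The printed matrix has 11 columns rather than 10. A short sketch, using only the ``sid_count`` value printed above, shows where the extra column comes from: integer division rounds the step down, so one more index fits before the end of the SNP axis.

```python
sid_count = 361_561  # value printed in the example above

# The slice `:: sid_count // 10` starts at 0 and advances by this step.
step = sid_count // 10
selected = range(0, sid_count, step)

print(f"step={step:_}, columns={len(selected)}, last index={selected[-1]:_}")
```

If exactly 10 evenly spaced SNPs were wanted, one option would be to build the index explicitly (for example with ``np.linspace(..., dtype=int)``) instead of relying on a slice step.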
.. _local-file-section: | ||
|
||
Local File | ||
---------- | ||
|
||
We can specify a local file as if it were in the cloud. This is a great way to test cloud functions. For real work and better efficiency, however, | ||
use the file's path rather than its URL. | ||
|
||
Local File URL | ||
++++++++++++++ | ||
|
||
The URL for a local file takes the form `file:///{encoded_file_name}`. No cloud options are needed. | ||
|
||
*Example:* | ||
|
||
.. code-block:: python | ||
|
||
    >>> import numpy as np | ||
    >>> from bed_reader import open_bed, sample_file | ||
    >>> from urllib.parse import urljoin | ||
    >>> from pathlib import Path | ||
    >>> file_name = str(sample_file("small.bed")) | ||
    >>> print(f"file name: {file_name}")  # doctest: +ELLIPSIS | ||
    file name: ...small.bed | ||
    >>> url = urljoin("file:", Path(file_name).as_uri()) | ||
    >>> print(f"url: {url}")  # doctest: +ELLIPSIS | ||
    url: file:///.../small.bed | ||
    >>> with open_bed(url) as bed: | ||
    ...     val = bed.read(index=np.s_[:, 2], dtype=np.float64) | ||
    ...     print(val) | ||
    [[nan] | ||
     [nan] | ||
     [ 2.]] | ||
|
||
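The "encoded" part of `file:///{encoded_file_name}` matters when a path contains spaces or other reserved characters. The following minimal sketch uses only the standard library; the directory name ``my files`` is a made-up example, and no file needs to exist at that location. It shows that ``Path.as_uri()`` percent-encodes the path and that the original path text can be recovered from the URL.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse

# Hypothetical absolute path containing a space; the file itself is never opened.
local_path = Path(tempfile.gettempdir()).resolve() / "my files" / "small.bed"

# as_uri() requires an absolute path and percent-encodes reserved characters.
url = local_path.as_uri()
print(url)

# Decoding the URL's path component recovers the original spelling.
decoded = unquote(urlparse(url).path)
print(decoded)
```

Because ``as_uri()`` already returns a complete ``file:///...`` URL, the result can be passed to ``open_bed`` directly.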
.. _aws-section: | ||
|
||
AWS S3 | ||
------ | ||
|
||
Let's look next at reading a file (or part of a file) from AWS S3. | ||
|
||
The URL for an AWS S3 file takes the form `s3://{bucket_name}/{s3_path}`. | ||
|
||
AWS forbids putting some needed information in the URL. Instead, that information must go into a string-to-string | ||
dictionary of cloud options. Specifically, we'll put `"aws_region"`, `"aws_access_key_id"`, and `"aws_secret_access_key"` in | ||
the cloud options. | ||
For security, we pull the last two option values from a file rather than hard-coding them into the program. | ||
|
||
See `ClientConfigKey <https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html>`_ for a list of cloud options, such as ``timeout``, that you can always use. | ||
See `AmazonS3ConfigKey <https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html>`_ for a list of AWS-specific options. | ||
See `AzureConfigKey <https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html>`_ for a list of Azure-specific options. | ||
See `GoogleConfigKey <https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html>`_ for a list of Google-specific options. | ||
|
||
*Example:* | ||
|
||
.. note:: | ||
|
||
I can run this example, but others can't, because the bucket requires authentication. | ||
|
||
.. code-block:: python | ||
|
||
    import os | ||
    import configparser | ||
    from bed_reader import open_bed | ||
|
||
    config = configparser.ConfigParser() | ||
    _ = config.read(os.path.expanduser("~/.aws/credentials")) | ||
    cloud_options = { | ||
        "aws_region": "us-west-2", | ||
        "aws_access_key_id": config["default"].get("aws_access_key_id"), | ||
        "aws_secret_access_key": config["default"].get("aws_secret_access_key"), | ||
    } | ||
    with open_bed("s3://bedreader/v1/toydata.5chrom.bed", cloud_options=cloud_options) as bed: | ||
        val = bed.read(dtype="int8") | ||
        print(val.shape) | ||
    # Expected output: (500, 10000) |
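If the credentials file or one of its keys is missing, the example above would pass ``None`` values or fail with a bare ``KeyError``. As a sketch only, the credential lookup could be wrapped in a small helper that fails early with a clearer message. The helper name ``aws_cloud_options`` and its parameters are hypothetical and not part of ``bed-reader``; only the three option keys come from the text above. The demonstration writes placeholder credentials to a temporary file so that it runs without a real AWS account.

```python
import configparser
import os
import tempfile


def aws_cloud_options(region, profile="default", credentials_path="~/.aws/credentials"):
    """Build a string-to-string cloud-options dict from an AWS credentials file."""
    config = configparser.ConfigParser()
    path = os.path.expanduser(credentials_path)
    # ConfigParser.read returns the list of files it could parse.
    if not config.read(path):
        raise FileNotFoundError(f"cannot read credentials file: {path}")
    if profile not in config:
        raise KeyError(f"profile {profile!r} not found in {path}")

    options = {"aws_region": region}
    for key in ("aws_access_key_id", "aws_secret_access_key"):
        value = config[profile].get(key)
        if value is None:
            raise KeyError(f"{key!r} missing from profile {profile!r}")
        options[key] = value
    return options


# Demonstration with placeholder (not real) credentials in a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credentials")
    with open(demo_path, "w") as f:
        f.write(
            "[default]\n"
            "aws_access_key_id = EXAMPLEKEY\n"
            "aws_secret_access_key = EXAMPLESECRET\n"
        )
    demo_options = aws_cloud_options("us-west-2", credentials_path=demo_path)

print(sorted(demo_options))
```

The returned dictionary has the same shape as ``cloud_options`` in the example above and could be passed to ``open_bed`` in its place.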