add options_etc.md
CarlKCarlK committed Jan 15, 2024
1 parent 6c974ef commit 377d208
Showing 7 changed files with 240 additions and 82 deletions.
9 changes: 3 additions & 6 deletions README-rust.md
@@ -17,7 +17,7 @@ Features
* Supports many indexing methods. Slice data by individuals (samples) and/or SNPs (variants).
* The Python-facing APIs for this library are used by [PySnpTools](https://github.com/fastlmm/PySnpTools), [FaST-LMM](https://github.com/fastlmm/FaST-LMM), and [PyStatGen](https://github.com/pystatgen).
* Supports [PLINK 1.9](https://www.cog-genomics.org/plink2/formats).
* Read data from the cloud, efficiently and directly.
* Read data locally or from the cloud, efficiently and directly.

Examples
--------
@@ -97,13 +97,10 @@ use bed_reader::{BedCloud, ReadOptions, assert_eq_nan, sample_url, EMPTY_OPTIONS
# use {bed_reader::BedErrorPlus, tokio::runtime::Runtime}; // '#' needed for doctest
# Runtime::new().unwrap().block_on(async {
let url = sample_url("small.bed")?;
println!("{url:?}"); // For example, "file://C:\\Users\\carlk\\AppData\\Local\\fastlmm\\bed-reader\\cache\\small.bed"
let options = EMPTY_OPTIONS; // map of authentication keys, etc., if needed.
let mut bed_cloud = BedCloud::new(url, options).await?;
let val = ReadOptions::builder()
.sid_index(2)
.f64()
.read_cloud(&mut bed_cloud)
.await?;
let val = ReadOptions::builder().sid_index(2).f64().read_cloud(&mut bed_cloud).await?;
assert_eq_nan(&val, &nd::array![[f64::NAN], [f64::NAN], [2.0]]);
# Ok::<(), Box<dyn std::error::Error>>(())
# }).unwrap();
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@ Features:
* Supports all Python indexing methods. Slice data by individuals (samples) and/or SNPs (variants).
* Used by [PySnpTools](https://github.com/fastlmm/PySnpTools), [FaST-LMM](https://github.com/fastlmm/FaST-LMM), and [PyStatGen](https://github.com/pystatgen).
* Supports [PLINK 1.9](https://www.cog-genomics.org/plink2/formats).
* Read data from the cloud, efficiently and directly.
* Read data locally or from the cloud, efficiently and directly.

Install
====================
108 changes: 68 additions & 40 deletions src/bed_cloud.rs

Large diffs are not rendered by default.

7 changes: 6 additions & 1 deletion src/lib.rs
@@ -21,7 +21,7 @@
//! | Function | Description |
//! | -------- | ----------- |
//! | [`Bed::new`](struct.Bed.html#method.new) or [`Bed::builder`](struct.Bed.html#method.builder) | Open a local PLINK .bed file for reading genotype data and metadata. |
//! | [`BedCloud::from_object_path`](struct.BedCloud.html#method.from_object_path) or [`BedCloud::builder_cmk`](struct.BedCloud.html#method.builder_cmk) | Open a cloud PLINK .bed file for reading genotype data and metadata. |
//! | [`BedCloud::new`](struct.BedCloud.html#method.new) or [`BedCloud::builder`](struct.BedCloud.html#method.builder) | Open a cloud PLINK .bed file for reading genotype data and metadata. |
//! | [`ReadOptions::builder`](struct.ReadOptions.html#method.builder) | Read genotype data from a local or cloud file. Supports indexing and options. |
//! | [`WriteOptions::builder`](struct.WriteOptions.html#method.builder) | Write values to a local file in PLINK .bed format. Supports metadata and options. |
//!
@@ -54,6 +54,7 @@
//! When using [`ReadOptions::builder`](struct.ReadOptions.html#method.builder) to read genotype data, use these options to
//! specify a desired numeric type,
//! which individuals (samples) to read, which SNPs (variants) to read, etc.
//! cmk add links to cloud options
//!
//! | Option | Description |
//! | -------- | ----------- |
@@ -7190,3 +7191,7 @@ where
{
Ok(STATIC_FETCH_DATA.fetch_files(path_list)?)
}

pub mod supplemental_documents {
#![doc = include_str!("supplemental_documents/options_etc.md")]
}
104 changes: 104 additions & 0 deletions src/supplemental_documents/options_etc.md
@@ -0,0 +1,104 @@
# Options, Options, Options

Within this crate, the term "options" can refer to three levels of options: [Cloud](#cloud-options), [Bed/BedCloud](#bedbedcloud-options), and [ReadOptions](#readoptions).

## Cloud options

When specifying a file in the cloud via a URL, we use methods [`BedCloud::new(url, options)`](../struct.BedCloud.html#method.new) and
[`BedCloud::builder(url, options)`](../struct.BedCloud.html#method.builder).

The cloud providers forbid putting some needed information in the URL. Instead, that information must
go into `options`.
For example, AWS S3 requires that information about `"aws_region"`, `"aws_access_key_id"`, and `"aws_secret_access_key"` be placed in the options.

Here is an AWS example:

> **Note:** I can run this, but others can't because of the authentication checks.
```rust
use bed_reader::{BedCloud, BedErrorPlus};
use tokio::runtime::Runtime;
use rusoto_credential::{CredentialsError, ProfileProvider, ProvideAwsCredentials};

Runtime::new().unwrap().block_on(async {
    // Read my AWS credentials from file ~/.aws/credentials
    let credentials = if let Ok(provider) = ProfileProvider::new() {
        provider.credentials().await
    } else {
        Err(CredentialsError::new("No credentials found"))
    };

    let Ok(credentials) = credentials else {
        eprintln!("Skipping test because no AWS credentials found");
        return Ok(());
    };

    let url = "s3://bedreader/v1/toydata.5chrom.bed";
    let options = [
        ("aws_region", "us-west-2"),
        ("aws_access_key_id", credentials.aws_access_key_id()),
        ("aws_secret_access_key", credentials.aws_secret_access_key()),
    ];

    let mut bed_cloud = BedCloud::new(url, options).await?;
    let val = bed_cloud.read::<i8>().await?;
    assert_eq!(val.shape(), &[500, 10_000]);
    Ok::<(), Box<BedErrorPlus>>(())
});
Ok::<(), Box<BedErrorPlus>>(())
```

We can also read from local files as though they are in the cloud. In that case, no cloud options are needed, so we use `EMPTY_OPTIONS`:

```rust
use ndarray as nd;
use bed_reader::{BedCloud, ReadOptions, assert_eq_nan, sample_url, EMPTY_OPTIONS};
# use {bed_reader::BedErrorPlus, tokio::runtime::Runtime}; // '#' needed for doctest
# Runtime::new().unwrap().block_on(async {
let url = sample_url("small.bed")?;
println!("{url:?}"); // For example, "file://C:\\Users\\carlk\\AppData\\Local\\fastlmm\\bed-reader\\cache\\small.bed"
let options = EMPTY_OPTIONS; // map of authentication keys, etc., if needed.
let mut bed_cloud = BedCloud::new(url, options).await?;
let val = ReadOptions::builder().sid_index(2).f64().read_cloud(&mut bed_cloud).await?;
assert_eq_nan(&val, &nd::array![[f64::NAN], [f64::NAN], [2.0]]);
# Ok::<(), Box<dyn std::error::Error>>(())
# }).unwrap();
```

Other cloud services, for example, Azure and Google Cloud, also need cloud options. Their options are similar to, but not identical to, the options for AWS S3. You will need to research the details to use any cloud service.
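
Whatever the service, the examples above pass the options as a short list of key/value string pairs. As a rough illustration of that shape, here is a minimal sketch that uses only the standard library. It is not this crate's implementation: `collect_options` and `NO_OPTIONS` are hypothetical names (the latter a stand-in for `EMPTY_OPTIONS`), and the credential values are placeholders.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the crate's `EMPTY_OPTIONS`: an empty list of pairs.
const NO_OPTIONS: [(&str, &str); 0] = [];

// Hypothetical helper: accept any iterable of string-like key/value pairs
// and copy it into an owned map.
fn collect_options<I, K, V>(options: I) -> HashMap<String, String>
where
    I: IntoIterator<Item = (K, V)>,
    K: AsRef<str>,
    V: Into<String>,
{
    options
        .into_iter()
        .map(|(key, value)| (key.as_ref().to_owned(), value.into()))
        .collect()
}

fn main() {
    // An array of pairs, in the same style as the AWS example (placeholder values).
    let aws = collect_options([
        ("aws_region", "us-west-2"),
        ("aws_access_key_id", "PLACEHOLDER_ID"),
        ("aws_secret_access_key", "PLACEHOLDER_SECRET"),
    ]);
    println!("{} options, region = {}", aws.len(), aws["aws_region"]);

    // A local file needs no cloud options at all.
    let local = collect_options(NO_OPTIONS);
    println!("{} options for a local file", local.len());
}
```

Accepting any iterable of pairs means an array literal, a `Vec`, or a `HashMap` can all be passed without conversion, and an empty list covers the local-file case.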

## Bed/BedCloud-Options

When you open a local file for reading, you can set options via [`Bed::builder`](../struct.Bed.html#method.builder). When you open a cloud file for reading, you can set options via [`BedCloud::builder`](../struct.BedCloud.html#method.builder).

The options, [listed here](../struct.BedBuilder.html#implementations), can:

* set the path of the .fam and/or .bim file
* override some metadata, for example, replace the individual ids
* set the number of individuals (samples) or SNPs (variants)
* control checking the validity of the .bed file’s header
* skip reading selected metadata

For example, here we replace the `iid` in the file with our own list:

```rust
use bed_reader::{Bed, sample_bed_file};
let file_name = sample_bed_file("small.bed")?;
let mut bed = Bed::builder(file_name)
    .iid(["sample1", "sample2", "sample3"])
    .build()?;
println!("{:?}", bed.iid()?); // Outputs ndarray ["sample1", "sample2", "sample3"]
# use bed_reader::BedErrorPlus;
# Ok::<(), Box<BedErrorPlus>>(())
```

## ReadOptions

When reading genotype data, use [`ReadOptions::builder`](../struct.ReadOptions.html#method.builder) to specify:

* a desired numeric type,
* which individuals (samples) to read,
* which SNPs (variants) to read,
* etc.

See this crate's introductory material for [a complete table](../index.html#readoptions).
90 changes: 57 additions & 33 deletions tests/tests_api_cloud.rs
@@ -1,4 +1,3 @@
use std::collections::HashMap;
use std::collections::HashSet;
use std::panic::catch_unwind;
use std::sync::Arc;
@@ -266,7 +265,7 @@ async fn bad_header_cloud_url() -> Result<(), Box<BedErrorPlus>> {
println!("start");
let url = sample_url("badfile.bed")?;
println!("{:?}", url);
let bed_cloud = BedCloud::builder_from_url(&url, EMPTY_OPTIONS)?
let bed_cloud = BedCloud::builder(&url, EMPTY_OPTIONS)?
.skip_early_check()
.build()
.await?;
@@ -579,7 +578,7 @@ async fn fam_and_bim_cloud_url() -> Result<(), Box<BedErrorPlus>> {
let mut deb_maf_mib = sample_urls(["small.deb", "small.maf", "small.mib"])?;

// Build BedCloud with custom fam and bim paths
let mut bed_cloud = BedCloud::builder_from_url(deb_maf_mib.remove(0), EMPTY_OPTIONS)?
let mut bed_cloud = BedCloud::builder(deb_maf_mib.remove(0), EMPTY_OPTIONS)?
.fam_from_url(deb_maf_mib.remove(0), EMPTY_OPTIONS)? // Note: indexes shift
.bim_from_url(deb_maf_mib.remove(0), EMPTY_OPTIONS)? // Note: indexes shift
.build()
@@ -2211,32 +2210,58 @@ async fn dyn_cloud() -> Result<(), Box<BedErrorPlus>> {
Ok(())
}

// cmk requires aws credentials
#[tokio::test]
async fn s3_url_cloud() -> Result<(), Box<BedErrorPlus>> {
use rusoto_credential::{ProfileProvider, ProvideAwsCredentials};
// Try to get credentials and return Ok(()) if not available
let credentials = match ProfileProvider::new() {
Ok(provider) => match provider.credentials().await {
Ok(creds) => creds,
Err(_) => return Ok(()), // No credentials, return Ok(())
},
Err(_) => return Ok(()), // Unable to create ProfileProvider, return Ok(())
// Read my AWS credentials from file ~/.aws/credentials
use rusoto_credential::{CredentialsError, ProfileProvider, ProvideAwsCredentials};
let credentials = if let Ok(provider) = ProfileProvider::new() {
provider.credentials().await
} else {
Err(CredentialsError::new("No credentials found"))
};

let Ok(credentials) = credentials else {
eprintln!("Skipping test because no AWS credentials found");
return Ok(());
};

let url = "s3://bedreader/v1/toydata.5chrom.bed";
// let url = "file:///O:/programs/br/bed_reader/tests/data/toydata.5chrom.bed";
let cloud_options: HashMap<&str, &str> = [
let options = [
("aws_region", "us-west-2"),
("aws_access_key_id", credentials.aws_access_key_id()),
("aws_secret_access_key", credentials.aws_secret_access_key()),
("aws_region", "us-west-2"),
]
.iter()
.cloned()
.collect();
];

let mut bed_cloud = BedCloud::new(url, options).await?;
let val = bed_cloud.read::<i8>().await?;
assert_eq!(val.shape(), &[500, 10_000]);
Ok(())
}

#[tokio::test]
async fn s3_url_cloud2() -> Result<(), Box<BedErrorPlus>> {
// Read my AWS credentials from file ~/.aws/credentials
use rusoto_credential::{CredentialsError, ProfileProvider, ProvideAwsCredentials};
let credentials = if let Ok(provider) = ProfileProvider::new() {
provider.credentials().await
} else {
Err(CredentialsError::new("No credentials found"))
};

let Ok(credentials) = credentials else {
eprintln!("Skipping test because no AWS credentials found");
return Ok(());
};

let url = "s3://bedreader/v1/toydata.5chrom.bed";
let options = [
("aws_region", "us-west-2"),
("aws_access_key_id", credentials.aws_access_key_id()),
("aws_secret_access_key", credentials.aws_secret_access_key()),
];
let url = Url::parse(url).unwrap();
let (object_store, store_path): (Box<dyn ObjectStore>, StorePath) =
object_store::parse_url_opts(&url, cloud_options).unwrap();
object_store::parse_url_opts(&url, options).unwrap();
// print!("{:?}", object_store);
// print!("{:?}", store_path);
// // let store_path: StorePath = "/v1/toydata.5chrom.bed".into();
@@ -2252,24 +2277,21 @@ async fn s3_url_cloud() -> Result<(), Box<BedErrorPlus>> {
Ok(())
}

#[tokio::test]
async fn object_path_2() {
#[test]
fn object_path_2() -> Result<(), Box<BedErrorPlus>> {
use bed_reader::{sample_bed_file, BedErrorPlus, ObjectPath};
use object_store::{local::LocalFileSystem, path::Path as StorePath};
use std::sync::Arc;
use tokio::runtime::Runtime;

Runtime::new()
.unwrap()
.block_on(async {
let arc_object_store = Arc::new(LocalFileSystem::new()); // Arc-wrapped ObjectStore
let file_path = sample_bed_file("plink_sim_10s_100v_10pmiss.bed")?; // regular Rust PathBuf
let store_path = StorePath::from_filesystem_path(&file_path)?; // StorePath
let object_path = ObjectPath::new(arc_object_store, store_path); // ObjectPath
assert_eq!(object_path.size().await?, 303);
Ok::<(), Box<BedErrorPlus>>(())
})
.unwrap();
Runtime::new().unwrap().block_on(async {
let arc_object_store = Arc::new(LocalFileSystem::new()); // Arc-wrapped ObjectStore
let file_path = sample_bed_file("plink_sim_10s_100v_10pmiss.bed")?; // regular Rust PathBuf
let store_path = StorePath::from_filesystem_path(&file_path)?; // StorePath
let object_path = ObjectPath::new(arc_object_store, store_path); // ObjectPath
assert_eq!(object_path.size().await?, 303);
Ok::<(), Box<BedErrorPlus>>(())
})
}

// cmk requires aws credentials
@@ -2312,6 +2334,7 @@ fn read_me_cloud() -> Result<(), Box<BedErrorPlus>> {
use {assert_eq_nan, bed_reader::BedErrorPlus, tokio::runtime::Runtime}; // '#' needed for doctest
Runtime::new().unwrap().block_on(async {
let url = sample_url("small.bed")?;
println!("{url:?}"); // For example, "file://C:\\Users\\carlk\\AppData\\Local\\fastlmm\\bed-reader\\cache\\small.bed"
let options = EMPTY_OPTIONS; // map of authentication keys, etc., if needed.
let mut bed_cloud = BedCloud::new(url, options).await?;
let val = ReadOptions::builder()
@@ -2333,3 +2356,4 @@ fn read_me_cloud() -> Result<(), Box<BedErrorPlus>> {
// cmk Rules: use tokio testing
// cmk Rules: Make strings, maps, etc as generic as you can
// cmk Rules: allow user to control concurrency and buffer size
// cmk Rules: Much larger binary for Python
2 changes: 1 addition & 1 deletion useful.bat
@@ -22,4 +22,4 @@ pytest --doctest-modules bed_reader\_open_bed.py
maturin build --release

# show docs
cargo doc --open
cargo doc --no-deps --open
