PyRHE is a unified and efficient Python framework for genomics heritability estimation. It provides a modular and extensible platform for implementing various genetic architecture estimation models and computation optimizations for large-scale genomic data.
The full documentation is available at the PyRHE Documentation.
- High computational efficiency through distributed jackknife subsamples, parallelized genotype I/O, and parallelized large-scale matrix operations.
- Tensor-based computation with automatic conversion of large matrices to PyTorch tensors, designed to run efficiently on both CPU and CUDA-enabled GPU architectures.
- Memory-efficient streaming support through the StreamingBase class for processing large-scale genomic data.
- Modular, extensible design with abstract base classes (Base, StreamingBase) that provide interfaces for adding new models.
- Multiple models in one framework, including RHE, RHE-DOM, and GENIE, all sharing common infrastructure.
pip install pyrhe
# Also install the appropriate version of PyTorch for your system from https://pytorch.org/
from pyrhe.models import (
    RHE,
    StreamingRHE,
    GENIE,
    StreamingGENIE,
    RHE_DOM,
    StreamingRHE_DOM,
)
# Standard RHE
rhe_model = RHE(
    geno_file="path/to/genotype",
    annot_file="path/to/annotation",
    pheno_file="path/to/phenotype",
    # other arguments...
)
rhe_results = rhe_model()
# Streaming RHE
streaming_rhe_model = StreamingRHE(
    geno_file="path/to/genotype",
    annot_file="path/to/annotation",
    pheno_file="path/to/phenotype",
    # other arguments...
)
streaming_results = streaming_rhe_model()

Each model (RHE, GENIE, RHE_DOM, and their streaming versions) follows the same pattern: initialize it with file paths and options, then call the model instance to run the estimation and return the results.
After installing the package, you can run PyRHE directly from the command line:
python run_rhe.py <command-line arguments>
Alternatively, you may run PyRHE using a newline-separated config file:
python run_rhe.py --config <config file>
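When PyRHE is launched from another script (for example, to sweep over phenotypes), the command line can be assembled programmatically. The following is a minimal sketch, not part of PyRHE: `build_command` and its `extra_args` parameter are hypothetical helpers, and only options whose short flag is listed in this README are mapped. Options without a documented short flag (such as `model`) are passed through verbatim via `extra_args`, because their exact spelling is not given here.

```python
import shlex

# Short flags exactly as listed in this README's parameter reference.
SHORT_FLAGS = {
    "genotype": "-g",
    "phenotype": "-p",
    "covariate": "-c",
    "annotation": "-annot",
    "num_vec": "-k",
    "num_block": "-jn",
    "output": "-o",
    "seed": "-s",
}


def build_command(options, script="run_rhe.py", extra_args=()):
    """Return an argv list for the given options (hypothetical helper)."""
    argv = ["python", script]
    for name, value in options.items():
        if name not in SHORT_FLAGS:
            raise KeyError(f"no documented short flag for option {name!r}")
        argv += [SHORT_FLAGS[name], str(value)]
    # Anything else is appended unchanged; the caller supplies the spelling.
    argv += list(extra_args)
    return argv


cmd = build_command(
    {
        "genotype": "path/to/genotype",
        "phenotype": "path/to/phenotype",
        "annotation": "path/to/annotation",
        "num_vec": 10,
        "num_block": 100,
        "output": "path/to/output_prefix",
    }
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell-quoting problems with paths that contain spaces.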
model: The model to run (e.g., rhe, rhe_dom, genie).
genotype (-g): The path to the PLINK BED genotype file.
phenotype (-p): The path to the phenotype file.
covariate (-c): The path to the covariate file.
annotation (-annot): The path to the genotype annotation file.
num_vec (-k): The number of random vectors (10 is recommended).
num_block (-jn): The number of jackknife blocks (100 is recommended).
The higher the number of jackknife blocks, the higher the memory usage.
output (-o): The path of the output file prefix.
streaming: Whether to use the streaming version of the model.
num_workers: The number of workers.
seed (-s): The random seed.
device: The device to use (cpu or gpu).
Running on CPU already gives good performance; a GPU can improve it further.
cuda_num: The CUDA device number of the GPU to use.
geno_impute_method: How to impute missing genotypes: "binary" (binary imputation) or "mean" (mean imputation).
cov_impute_method: How to handle missing covariates: "ignore" (exclude individuals with missing covariates) or "mean" (mean imputation).
samp_prev: Sample prevalence of a binary phenotype (used for conversion to the liability scale).
pop_prev: Population prevalence of a binary phenotype (used for conversion to the liability scale).
trace (-tr): Save the stochastic trace estimates as trace summary statistics (.trace) with metadata (.MN).
trace_dir: The directory in which to save the trace estimates.
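To make the role of num_vec and the stochastic trace estimates more concrete, here is a toy sketch of a Hutchinson-style trace estimator: the trace of a matrix A is approximated by averaging z^T A z over random vectors z. This illustrates the general technique only. It uses Rademacher (±1) probe vectors and a small dense matrix as assumptions made for this example; it is not PyRHE's implementation, and the function name `hutchinson_trace` is hypothetical.

```python
import random


def hutchinson_trace(matrix, num_vec=10, seed=0):
    """Estimate trace(matrix) as the mean of z^T A z over random +/-1 vectors z."""
    rng = random.Random(seed)
    n = len(matrix)
    total = 0.0
    for _ in range(num_vec):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Matrix-vector product A z, one row at a time.
        az = [sum(a * zj for a, zj in zip(row, z)) for row in matrix]
        # Quadratic form z^T (A z).
        total += sum(zi * v for zi, v in zip(z, az))
    return total / num_vec


A = [
    [2.0, 0.5, 0.0],
    [0.5, 3.0, 0.1],
    [0.0, 0.1, 1.0],
]
estimate = hutchinson_trace(A, num_vec=10, seed=0)
exact = sum(A[i][i] for i in range(len(A)))
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

Only matrix-vector products are needed, so the matrix never has to be formed or stored explicitly when it is defined through a large genotype matrix. Using more random vectors reduces the variance of the estimate at proportionally higher cost.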
Please refer to the example for a list of configuration files for running RHE, RHE_DOM, and GENIE, and their respective outputs.
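For binary phenotypes, samp_prev and pop_prev are the inputs needed to move a heritability estimate from the observed scale to the liability scale. The sketch below shows one widely used transformation (the correction of Lee et al. for ascertained case-control data); whether PyRHE applies exactly this formula is an assumption, and the function name `observed_to_liability` is hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_obs, samp_prev, pop_prev):
    """Convert observed-scale heritability to the liability scale.

    h2_liab = h2_obs * [K(1-K)]^2 / (P(1-P) * z^2), where K is the
    population prevalence, P the sample prevalence, and z the standard
    normal density at the liability threshold.
    """
    std_normal = NormalDist()
    # Threshold t such that a fraction K of the population lies above it.
    threshold = std_normal.inv_cdf(1.0 - pop_prev)
    z = std_normal.pdf(threshold)
    k, p = pop_prev, samp_prev
    return h2_obs * (k * (1.0 - k)) ** 2 / (p * (1.0 - p) * z ** 2)


# Example: a balanced case-control sample of a disease with 1% prevalence.
h2_liab = observed_to_liability(h2_obs=0.30, samp_prev=0.5, pop_prev=0.01)
print(f"liability-scale h2 = {h2_liab:.3f}")
```

The conversion factor depends strongly on the assumed population prevalence, so it is worth reporting the pop_prev value alongside any liability-scale estimate.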