MEMO: MEM-based pangenome indexing for k-mer queries

Maximal Exact Match Ordered (MEMO) is a pangenome indexing method based on maximal exact matches (MEMs) between genomes. A single MEMO index can handle k-mer queries for arbitrary k over pangenomic windows. MEMO supports two query types: membership queries, which report per-genome k-mer presence/absence, and conservation queries, which report the number of genomes containing each k-mer in a window. MEMO achieves smaller index sizes and faster queries than k-mer-based approaches like KMC3 and PanKmer. See the small example on running MEMO to visualize sequence conservation.

Installation

Docker/Singularity Container

MEMO is available as a Docker image on DockerHub.

### Docker:
docker pull hwangstephen/memo:latest
docker run hwangstephen/memo:latest memo -h
### Singularity:
singularity pull docker://hwangstephen/memo:latest
./memo_latest.sif memo -h

Build from source

MEMO relies on the following dependencies:

  • Python:
    • python (>=3.10)
    • pandas
    • plotnine
    • pyarrow
    • numba
    • numpy
  • Others:

Compile MONI from repo:

sudo apt-get install -y build-essential cmake git python3 zlib1g-dev
git clone https://github.com/maxrossi91/moni
cd moni
mkdir build
cd build
cmake ..
make
make install

After downloading/building the required dependencies, clone and run MEMO from its repo:

git clone https://github.com/StephenHwang/MEMO.git
cd MEMO/src
./memo -h
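
Before running MEMO from source, it can help to confirm that the Python dependencies listed above are importable. The following is a minimal sketch, not part of MEMO: the helper names missing_modules and python_ok are hypothetical, and the check assumes the interpreter MEMO will use is the python3 found on your PATH.

```shell
# Print every module name (one per line) that python3 cannot import.
missing_modules() {
  for mod in "$@"; do
    python3 -c "import $mod" 2>/dev/null || printf '%s\n' "$mod"
  done
}

# Succeed only if python3 is version 3.10 or newer.
python_ok() {
  python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'
}

if ! python_ok; then
  echo "python >= 3.10 not found" >&2
fi

# Module names taken from the dependency list in this README.
missing=$(missing_modules pandas plotnine pyarrow numba numpy)
[ -z "$missing" ] || echo "missing Python modules: $missing" >&2
```

The script only reports problems on standard error; it does not install anything.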

Usage

Index Creation

To create a MEMO conservation index, specify a list of genomes -g, an output location -o, and an output prefix -p. To create the MEMO membership index, include the -m flag. Each line of genome_list.txt is the path to one genome in the pangenome; the first genome listed is the pangenome pivot.

./memo index \
  -g genome_list.txt \
  -o output_dir \
  -p output_prefix
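
The genome list is a plain text file with one path per line and the pivot first, so it can be generated with a small helper. The following is a minimal sketch, not part of MEMO: the function name make_genome_list and the FASTA file names in the usage comment are hypothetical.

```shell
# make_genome_list PIVOT OUTFILE GENOME...
# Write PIVOT on the first line of OUTFILE, then every other GENOME path,
# one per line. PIVOT is skipped if it also appears among the GENOME arguments.
make_genome_list() {
  pivot=$1
  out=$2
  shift 2
  {
    printf '%s\n' "$pivot"
    for genome in "$@"; do
      [ "$genome" = "$pivot" ] || printf '%s\n' "$genome"
    done
  } > "$out"
}

# Example (hypothetical paths):
#   make_genome_list genomes/pivot.fa genome_list.txt genomes/*.fa
```

Passing the pivot explicitly keeps it on the first line regardless of how the shell sorts the remaining file names.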

Querying k-mer membership and conservation

Once you have created your indexes, specify the index -b, the k-mer length -k, the genomic region -r, and the total number of genomes in your pangenome (including the pivot) -n. Then run memo query for the conservation query. To run the membership query, include the -m flag.

./memo query \
  -b index.parquet \
  -k k \
  -n num_genomes \
  -r chr:start-end \
  -o memo_c_out.txt
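
A malformed region string is an easy mistake to make, so it may be worth checking the chr:start-end format before calling memo query. The following is a minimal sketch, not part of MEMO: the function name check_region is hypothetical, and the requirement that start be no greater than end is an assumption rather than documented MEMO behavior.

```shell
# check_region REGION
# Succeed if REGION has the form chr:start-end, where start and end are
# non-negative integers and start <= end (the ordering rule is an assumption).
check_region() {
  region=$1
  case "$region" in
    *:*-*) ;;
    *) return 1 ;;
  esac
  chrom=${region%:*}     # everything before the last colon
  range=${region##*:}    # everything after the last colon
  start=${range%-*}
  end=${range#*-}
  [ -n "$chrom" ] && [ -n "$start" ] && [ -n "$end" ] || return 1
  case "$start$end" in
    *[!0-9]*) return 1 ;;
  esac
  [ "$start" -le "$end" ]
}

# Example (hypothetical region):
#   check_region "chr6:100-200" || echo "bad region" >&2
```

The check covers only the syntax of the region string; it does not verify that the sequence name or coordinates exist in the pivot genome.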

Visualizing sequence conservation

Figure (hprc_hla_seq_conservation): 31-mer sequence conservation of the Human Leucocyte Antigen locus in the HPRC pangenome.

After the conservation query, use MEMO to visualize sequence conservation:

./memo view \
  -i memo_c_out.txt \
  -o out.png \
  -n num_genomes \
  -b num_bins

Citing MEMO

Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, Ben Langmead. MEM-based pangenome indexing for k-mer queries (2024). bioRxiv.