This package is intended to be used in two ways:
- by the Snakemake workflows in the phage-seq repository to batch process raw sequencing data from Phage-seq experiments into feature tables. For this usage, the Snakemake workflows will install nbseq automatically as needed at the appropriate steps.
- interactively within Jupyter notebooks to query, calculate, and visualize the resulting data structures.
To explore the code used in our paper (Grun et al., Nat. Commun. 2024), start with the phage-seq repository. That repository also includes several demonstration notebooks and datasets to explore the functionality of this library. Follow the instructions there to create or obtain an example dataset, then return to this repository for instructions on how to install nbseq for interactive analysis.
- First, perform preprocessing of raw data using the Snakemake workflow(s) in the phage-seq repository, following instructions there. The relevant steps within the Snakemake workflows will install the nbseq package; it is not necessary to manually install the nbseq package for this step.
- Second, for interactive analysis, it is recommended to create a dedicated conda environment for use with the nbseq package.
  - If you have not already done so, install the Mamba (or Conda) package manager. I recommend using the miniforge distribution.
  - Create and activate a new conda environment for nbseq and its dependencies. You have two options:
    - Minimal installation: installs only the required core dependencies:

          wget https://github.com/caseygrun/nbseq/raw/main/environment-min.yaml
          conda env create -f environment-min.yaml
          conda activate nbseq-min

    - Full installation of all optional dependencies:

          wget https://github.com/caseygrun/nbseq/raw/main/environment.yaml
          conda env create -f environment.yaml
          conda activate nbseq

    In both cases, you do not need to clone this repository. You only need to download the .yaml file(s) using the steps above; the remaining files will be downloaded and installed by conda.
  - Install JupyterLab or the Jupyter Notebook, if you have not already; you also have two choices for this:
    - I recommend creating a separate dedicated conda environment for JupyterLab and using the nb_conda_kernels package. This will allow you to install and update JupyterLab separately from nbseq and its many dependencies; the nb_conda_kernels package lets you access the nbseq environment (and any other conda environments you create) from within JupyterLab:

          conda deactivate
          conda create -n jupyter jupyterlab nb_conda_kernels panel
          conda activate jupyter

    - Alternatively, you can install JupyterLab directly into the same environment as nbseq:

          conda install jupyterlab

  - Launch JupyterLab and follow the instructions below in "Usage":

        jupyter lab

Note: nbseq is tested only on 64-bit Linux.
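Before opening a notebook, it can help to confirm that the active environment matches the setup described above. The following is a minimal sketch using only the Python standard library. The helper name check_environment is hypothetical and is not part of nbseq. It reports whether the interpreter is running on 64-bit Linux (the only tested platform) and whether the nbseq package can be found.

```python
import importlib.util
import platform
import sys


def check_environment(package="nbseq"):
    """Report platform and package availability (hypothetical helper, not part of nbseq)."""
    return {
        # nbseq is tested only on 64-bit Linux
        "is_linux": platform.system() == "Linux",
        # sys.maxsize exceeds 2**32 only on 64-bit Python builds
        "is_64bit": sys.maxsize > 2**32,
        # find_spec returns None when a top-level package cannot be found
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run this with the nbseq (or nbseq-min) environment activated. If package_found is False, the environment was probably not activated or the installation did not complete.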
The main entry point for interactive analysis is the nbseq.Experiment class, which loads and organizes feature tables, phylogenetic trees, metadata, and databases for a given experiment. nbseq.Experiment.from_files can load data from the directory structure created by the phage-seq Snakemake workflows. Consult the docstring (?nbseq.Experiment.from_files) for a more detailed description of the options.
>>> import nbseq
>>> ex = nbseq.Experiment.from_files(
...     # Skip loading the larger `aa` feature table (in which each VHH amino
...     # acid sequence is a distinct column) and its phylogenetic tree; by
...     # default, the function loads both the `cdr3` and `aa` feature tables.
...     ft_aa=None, tree_aa=None,
...     metadata='config/metadata_full.csv')
Loading experiment panning-extended from '/vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended'...
- Reading metadata from config/metadata_full.csv ...
- Reading phenotypes from config/phenotypes.csv ...
- Reading Config from config/config.yaml ...
- Using SQL database at 'sqlite:////vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended/intermediate/aa/asvs.db'
- Reading feature data for table 'cdr3' from results/tables/cdr3/asvs.csv (2.6 MB)...
- Reading aa feature table from results/tables/aa/feature_table.biom (350.4 MB)...
- Reading cdr3 feature table from results/tables/cdr3/feature_table.biom (8.4 MB)...
- Warning: phylogeny for space 'aa' at 'intermediate/aa/features/top_asvs/alpaca/asvs.nwk' does not exist!
- Warning: phylogeny for space 'cdr3' at 'intermediate/cdr3/features/top_asvs/alpaca/asvs.nwk' does not exist!
- Using mmseqs2 database 'aa' at 'intermediate/aa/features_db/features'
- Warning: mmseqs2 database for space 'cdr3' at 'intermediate/cdr3/features_db/features' does not exist!
- Reading enrichment model (conditional ECDF) for space cdr3 from results/tables/cdr3/enrichment/null/ecdf.pickle (307.6 kB)...
Finished in 20.29 seconds
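The log above names the files that from_files read for this particular experiment. As a rough pre-flight check, the sketch below reports which of those relative paths are missing under an experiment directory. The path list is copied from the example log and is an assumption about this one run, not a documented default of nbseq; find_missing_inputs is a hypothetical helper.

```python
from pathlib import Path

# Relative paths copied from the example log above; assumed, not authoritative.
EXPECTED_INPUTS = [
    "config/metadata_full.csv",
    "config/phenotypes.csv",
    "config/config.yaml",
    "results/tables/cdr3/asvs.csv",
    "results/tables/aa/feature_table.biom",
    "results/tables/cdr3/feature_table.biom",
]


def find_missing_inputs(experiment_dir, expected=EXPECTED_INPUTS):
    """Return the expected relative paths that do not exist under experiment_dir."""
    root = Path(experiment_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_inputs(".")
    if missing:
        print("Missing inputs:")
        for rel in missing:
            print(f"  - {rel}")
    else:
        print("All expected inputs are present.")
```

Run it from the experiment directory produced by the phage-seq workflows. A missing file is not necessarily an error, since the log shows that absent phylogenies and databases only produce warnings.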
Displaying the Experiment object shows a summary:
>>> ex
Experiment('panning-extended') with feature spaces ['aa', 'cdr3']:
obs: ['plate.x' 'well.x' 'depth' 'expt' 'round' 'sample' 'phage_library'
'notes' 'r' 'io' 'kind' 'selection' 'replicate' 'name_full' 'name'
'well_027e' 'sel_plate_027i' 'sel_well_027i' 'selection_027j' 'plate.y'
'well.y' 'category' 'antigen' 'genotype_pair' 'gene_CS' 'gene_S'
'genotype_CS' 'background_CS' 'strain_CS' 'loc_CS' 'cond_CS' 'genotype_S'
'background_S' 'strain_S' 'loc_S' 'cond_S' 'cond_notes' 'bflm' 'swim'
'twitch' 'swarm' 'PMB-R' 'FEP-R' 'TET-R' 'CIP-R' 'CHL-R' 'GEN-R' 'ERY-R'
'IPM-R' 'cdiGMP' 'FliC' 'FliCa' 'FliCb' 'FlgEHKL' 'PilQ' 'PilA' 'PilB'
'LasA' 'LasB' 'Apr' 'XcpQ' 'ToxA' 'EstA' 'LepA' 'PlpD' 'Phz' 'Pcn' 'Pvd'
'Hcn' 'Rhl' 'T3SS' 'T6SS' 'Pel' 'Psl' 'CdrB' 'SCV' 'Mucoid' 'Alginate'
'OprM' 'OprJ' 'OprN' 'OprOP' 'OpdH' 'OprD' 'OprL' 'OprF' 'OprG' 'OprH'
'OprB' 'MexAB' 'MexCD' 'MexEF' 'MexJK' 'MexXY' 'MexGHI' 'PirA' 'Pfu' 'TonB'
'FptA' 'FpvA' 'PfeA' 'CupB5' 'CupA' 'CupB' 'CupC' 'CupD' 'LPS-LipidA-
Palmitoyl' 'L-LipidA-Ara4N' 'LPS-CPA' 'LPS-OSA' 'LPS-galU' 'LPS-rough'
'LPS' 'description']
- aa : 439 samples x 5134305 features, database: None
var: ['reads' 'nsamples']
- cdr3 : 439 samples x 40292 features, database: None
var: ['reads' 'nsamples']
SQL: sqlite:////vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended/intermediate/aa/asvs.db
From there, you can access various visualizations via the experiment visualizer, ex.viz, e.g.:
>>> ex.viz.top_feature_barplot(f"expt == '027j' & FlgEHKL == 1", select_from_round=None, n=100).facet(column='selection')
Or load additional interactive visualizations using the nbseq.viz package, e.g.:
>>> import nbseq.viz.dash
>>> nbseq.viz.dash.selection_group_dashboard(
... ex, starting_phenotype='FlgEHKL',
... global_query=(
... "expt == '027j' & io == 'i' & kind == '+'")
... )
See the phage-seq repository for additional examples: panning-minimal and panning-extended.
The nbseq package contains the following sub-modules:
- nbseq: Experiment class that collects and organizes data for one or more Phage-seq experiments. Namely, Experiment loads and organizes trees, metadata, and feature tables in multiple feature spaces (e.g. VHH, CDR3, etc.) and facilitates projecting between them. Experiment also provides an interface for interactive visualization of the entire experiment or subsets thereof.
- utils: utility functions
- asvs: process VHH sequences: calculate residue frequencies, consensus sequences, query for similar sequences, project between feature spaces (e.g. CDR3 counts to full-length amino acid sequence counts)
- ft: read and process feature tables (sparse matrices of sample x feature [i.e. VHH, CDR3, etc.])
- select: perform calculations relevant to phage display selection (e.g. enrichment, amplification bias); calculate null models of enrichment probabilities
- norm: normalize feature table data to remove the effect of variable library sizes
- ordination: perform ordination/dimensionality reduction on feature tables
- design: create design matrices for inference and machine learning
- pheno: compare and visualize phenotypes of samples
- msa: perform multiple sequence alignment with mafft
- viz: generate various visualizations: feature bar plots, rank-abundance curves (Whittaker plots), abundance curves, 2D/3D ordination plots, sequence logos, receiver operating characteristic curves, etc.
- predict: perform machine learning prediction on feature tables
- resynth: choose and resynthesize recombinant VHH genes as gene fragments. Includes routines for identifying consensus sequences, trimming and adding adapter sequences, etc.
- cloning: simulate cloning recombinant VHHs into destination vectors
- prep: utilities to aid in HTS library preparation
Required and recommended dependencies can be installed using conda via the included environment.yaml file. The versions listed below are the ones that have been tested; later versions may work but have not been tested.
- Required dependencies:
  - anndata=0.9.2
  - biom-format=2.1.15
  - humanize=4.7.0
  - natsort=8.4.0
  - numpy=1.24.4
  - pandas=2.0.3
  - pysam=0.21.0
  - pyyaml=6.0
  - scikit-bio=0.5.9
  - scipy=1.10.0
  - statsmodels=0.14.0
- Optional dependencies:
  - For machine learning:
    - scikit-learn=1.3.0
    - scikit-optimize=0.9.0
    - xgboost=1.5.1
  - For database-accelerated feature queries:
    - connectorx=0.3.1
    - mmseqs2=14.7e284
    - sqlalchemy=2.0.19
    - sqlite=3.42.0
  - For recombinant sequence optimization and cloning:
    - dna_features_viewer=3.1.2
    - dnachisel=3.2.11
    - pydna=3.1.0
    - python-codon-tables=0.1.12
  - For processing Sanger sequencing chromatograms:
    - bioconvert=1.1.1
  - For visualizations:
    - altair==5.1.0.dev0
    - logomaker=0.8
    - matplotlib=3.7.2
    - plotly=5.16.0
    - pygments=2.16.1
    - plotnine=0.12.2
    - seaborn=0.12.2
    - patchworklib=0.6.3
    - pip: mnemonicode=1.4.5
  - For interactive "dashboard" visualizations:
    - altair-transform=0.2.0
    - bokeh=3.2.2
    - ipykernel=6.25.1
    - ipywidgets=8.1.0
    - panel=1.2.1
  - For normalization using the scran package:
    - r=4.1
    - bioconductor-biomformat=1.22.0
    - bioconductor-scran=1.22.1
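To see how an existing environment compares with the tested versions of the required dependencies listed above, a short standard-library script can query the installed package metadata. This is a sketch under two assumptions: the conda package names above match the names of the installed Python distributions, and the helper names (installed_version, compare_versions) are hypothetical and not part of nbseq.

```python
from importlib import metadata

# Required dependencies and tested versions, copied from the list above.
# Assumption: conda package names match the installed distribution names.
TESTED_VERSIONS = {
    "anndata": "0.9.2",
    "biom-format": "2.1.15",
    "humanize": "4.7.0",
    "natsort": "8.4.0",
    "numpy": "1.24.4",
    "pandas": "2.0.3",
    "pysam": "0.21.0",
    "pyyaml": "6.0",
    "scikit-bio": "0.5.9",
    "scipy": "1.10.0",
    "statsmodels": "0.14.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(tested=TESTED_VERSIONS):
    """Map each package name to a (tested, installed) pair; installed may be None."""
    return {name: (version, installed_version(name)) for name, version in tested.items()}


if __name__ == "__main__":
    for name, (tested, installed) in compare_versions().items():
        status = "ok" if installed == tested else "differs"
        print(f"{name}: tested {tested}, installed {installed} ({status})")
```

A "differs" result is not necessarily a problem, since later versions may work; it only flags packages that fall outside the tested configuration.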