Single-cell sequencing of barcoded pdmH1N1 influenza virus; David Bacsik and Jesse Bloom.
A pre-print of the results, titled "Influenza virus transcription and progeny production are poorly correlated in single cells," is available at https://www.biorxiv.org/content/10.1101/2022.08.30.505828v1.
A static version of the repository used to generate the figures in this pre-print is tagged at: https://github.com/jbloomlab/barcoded_flu_pdmH1N1/releases/tag/bioRxiv_v1.
All data used in this study is available in GEO under accession number GSE214938.
The workflow for this project has two main steps. First, the Snakemake pipeline is run; it takes raw sequencing data as input and generates a CSV containing information about viral transcription and progeny production in single influenza-infected cells. Then, the final_analysis.py.ipynb notebook is run manually to visualize the results.
For a summary of the Snakemake pipeline, see the report.html file that is placed in the ./results/ subdirectory.
This repository is organized as follows (based loosely on this example Snakemake repository):
- environment.yml and environment_unpinned.yml give the version-pinned and unpinned conda environments used to run the Snakemake pipeline.
- config.yaml contains the configuration for the analysis.
- cluster.yaml contains the cluster configuration for running the analysis on the Fred Hutch cluster.
- ./notebooks/ contains Jupyter notebooks that are run by Snakefile using the Snakemake notebook functionality.
- ./scripts/ contains scripts used by Snakefile.
- ./pymodules/ contains Python modules with some functions used by Snakefile.
- ./report/ contains the workflow description and captions used to create the Snakemake report.
- ./data/ contains the input data, specifically:
  - ./data/flu_sequences/ gives the flu sequences used in the experiment. See the README in that subdirectory for details.
  - ./data/flu_sequences/pacbio_amplions gives the amplicon sequences generated for PacBio sequencing. See the README in that subdirectory for details.
- ./results/ is a created directory with all results, most of which are not tracked in this repository.
  - ./results/figures/ contains the figures generated for the manuscript.
  - ./results/viral_fastq10x/ contains two CSV files containing key processed data:
    - integrate_data.csv contains viral transcription and genotype information for all cells in the dataset.
    - complete_measurement_cells_data.csv contains progeny production, viral transcription, and genotype information for the set of cells with complete sequencing and progeny-production measurements.
The conda environment for the pipeline in this repo is specified in environment.yml; note also that an unpinned version of this environment is specified in environment_unpinned.yml. If you are on the Hutch cluster and set up to use the BloomLab conda installation, then this environment is already built and you can activate it simply with:
conda activate barcoded_flu_pdmH1N1
Otherwise you need to first build the conda environment from environment.yml and then activate it as above.
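If you are not using the pre-built BloomLab installation, the build-and-activate step uses standard conda commands; a minimal sketch (the environment name comes from environment.yml):

```shell
# Build the pinned conda environment from the spec in this repo,
# then activate it for the pipeline run.
conda env create -f environment.yml
conda activate barcoded_flu_pdmH1N1
```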
In addition to building and activating the conda environment, you also need to install cellranger and bcl2fastq into the current path; the current analysis uses cellranger version 4.0.0 and bcl2fastq version 2.20.
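How you put cellranger and bcl2fastq on your path depends on where they are installed; the paths below are hypothetical placeholders, not locations defined by this repo:

```shell
# Example only: adjust these to wherever cellranger 4.0.0 and
# bcl2fastq 2.20 are actually installed on your system.
export PATH=/path/to/cellranger-4.0.0:$PATH
export PATH=/path/to/bcl2fastq2-v2.20/bin:$PATH

# Confirm the expected versions are found first on the path.
cellranger --version
bcl2fastq --version
```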
Once the barcoded_flu_pdmH1N1 conda environment and other software have been activated, simply enter the commands to run Snakefile and then generate a Snakemake report at ./results/report.html.
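Run outside the cluster configuration, those two steps could look like the sketch below; the exact flags used for this project are in the cluster shell script, so treat this only as an illustration of the general Snakemake invocation:

```shell
# Run the pipeline defined in Snakefile, using all available cores.
snakemake --cores all

# After the pipeline completes, generate the HTML summary report.
snakemake --report results/report.html
```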
These commands, with the configuration for the Fred Hutch cluster, are in the shell script run_Hutch_cluster.bash.
You probably want to submit the script itself via sbatch, using:
sbatch run_Hutch_cluster.sbatch
When the Snakemake pipeline has run completely, the processed output data is exported to a CSV file at results/viral_fastq10x/{expt}_integrate_data.csv. A stable version of this file is available at https://github.com/jbloomlab/barcoded_flu_pdmH1N1/blob/main/results/viral_fastq10x/scProgenyProduction_trial3_integrate_data.csv and can be used to re-analyze the data without running the Snakemake pipeline.
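To re-analyze without running the pipeline, you can fetch that stable CSV directly; a sketch assuming GitHub's standard raw-content URL pattern for files tracked on the main branch:

```shell
# Download the stable processed-data CSV from the main branch.
curl -L -O https://raw.githubusercontent.com/jbloomlab/barcoded_flu_pdmH1N1/main/results/viral_fastq10x/scProgenyProduction_trial3_integrate_data.csv
```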
In this repo, the CSV file is used to perform the final analysis and generate figures in the final_analysis.py.ipynb notebook. This notebook is run manually, and must be run with the barcoded_flu_pdmH1N1_final_analysis conda environment activated.
To activate this environment, first build it from envs/barcoded_flu_pdmH1N1_final_analysis.yml and then activate it with:
conda activate barcoded_flu_pdmH1N1_final_analysis
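Building and activating that environment follows the same pattern as the main pipeline environment:

```shell
# Build the final-analysis environment from its spec, then activate it
# before launching the notebook.
conda env create -f envs/barcoded_flu_pdmH1N1_final_analysis.yml
conda activate barcoded_flu_pdmH1N1_final_analysis
```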
Ideally, before a new branch is committed, you should run the linting in lint.bash with the command:
bash ./lint.bash
This script runs several linters, including linting of the Jupyter notebooks. For the notebook linting, it may be easiest to lint while you are still developing the notebook with run cells, rather than waiting until you put the empty notebook in ./notebooks/, since the linting results are labeled by cell run number.