Skip to content

Commit a872335

Browse files
authored
Merge pull request #5 from mmolari/marco-dev
# changes: - migrated to nextflow DSL 2 - added analysis of secondary and supplementary alignments - interesting trajectories are the ones with highest max - min frequency, not final - initial frequency - multiple improvements in analysis plots
2 parents 901ba1c + 058505b commit a872335

30 files changed

+1674
-660
lines changed

.gitignore

+11-9
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,12 @@
1-
test_dataset
2-
.vscode
3-
local
4-
work
5-
.nextflow*
6-
reports
7-
results
8-
makefile
9-
figures
1+
/test_dataset
2+
/.vscode
3+
/local
4+
/work
5+
/.nextflow*
6+
/reports
7+
/results
8+
/makefile
9+
/figures
10+
/playgrounds
11+
/conda_envs/conda
1012
__pycache__/

README.md

+27-7
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,18 @@ Code for the analysis of morbidostat data. The analysis consists for now of the
44

55
- map reads against assembled genome from the first timepoint
66
- produce a pileup of the mapped reads, that can be explored with IGV
7-
8-
## content of the repository
9-
10-
- The folder `notes` contains notes and descrioptions of the steps implemented in the pipeline. To be used as guides for pipeline implementation.
11-
- The `scripts` folder contains scripts used by the various workflows.
7+
- analyzes the variation over time of:
8+
- coverage
9+
- consensus frequency
10+
- gaps
11+
- insertions
12+
- secondary/supplementary mappings
13+
- clipped reads
14+
15+
It consists of three workflows:
16+
- `pileup.nf`: maps the reads against the reference genomes and produces pileups and other summary files (see `notes/pipeline_description.md` for a more detailed description).
17+
- `extract_stats.nf`: produces databases containing the variation of consensus and gap frequency over time.
18+
- `plots.nf`: produces different plots for the analysis (see `notes/plots.md` for a more detailed description).
1219

1320
## test dataset
1421

@@ -20,6 +27,16 @@ Substituting `username` with the appropriate username.
2027

2128
The test dataset has been extracted from the morbidostat run labeled `2022-02-08-RT_test`. It contains 2 vials (vial 2 and 3), each one having 3 timepoints (timepoint 1, 2 and 5). For timepoint 1 the annotated genome is included. For the rest of the timepoints, only fastq reads are provided in the `reads.fastq.gz` files. These files contain only a subset of all the reads (first 10'000 lines of the original fastq file). This corresponds to ~3% of all the reads, and these files are roughly 20-30 Mb each.
2229

30+
In general input data must be placed in an input directory with nested structure:
31+
32+
```
33+
input_dir/vial_XX/time_YY/
34+
├── assembled_genome
35+
│   ├── assembly.fna
36+
│   └── assembly.gbk
37+
└── reads.fastq.gz
38+
```
39+
2340
## workflow: building a pileup of mapped reads
2441

2542
The workflow `pileup.nf` builds a pileup of the reads. For every vial, it maps reads for any timepoints to a reference genome, which is the one assembled from the initial timepoint for each vial. It generates the sorted bam file of mappings, together with a pileup matrix and a list of insertions and positions of clipped reads. Usage is as follows:
@@ -39,17 +56,20 @@ The input parameters are:
3956
- `input_fld` : folder containing the input reads. It must have the nested `vial_n/time_n` structure of archived morbidostat experiment data.
4057
- `time_beg` : the label corresponding to the first timepoint, whose assembled genome is used as reference (e.g. `1` for timepoint_1`)
4158
- `qual_min` : parameters of the script used to build the pileup. Only reads with quality higher than the threshold are used, and only insertions that have at least 70% of the sites meeting this threshold.
42-
- `clip_minL` : minimum length of clips that are added to the clip dictionary. Shorter clips are ignored.
59+
- `clip_minL` : minimum length of clips that are added to the clip dictionary. Shorter clips are ignored. Nb: only clips from primary mappings are considered in the clip analysis.
4360

44-
Nb: only primary reads are considered, and secondary or supplementary reads are excluded. In future updates we might want to consider secondary reads for clip positions.
4561

4662
results are saved in the `results` folder, with a structure that mirrors the `vial_n/timepoint_n` structure of the input folder. Each of these subfolders will contain the following files:
4763
- `reads.sorted.bam` and `reads.sorted.bam.bai` : sorted `bam` file (and corresponding index) containing the mapping of the reads against the reference genome.
4864
- `pileup/allele_counts.npz` : pileup of the reads. This is a numpy tensor with dimension (2,6,L) corresponding to (1) forward-reverse reads, (2) allele `["A", "C", "G", "T", "-", "N"]` and (3) position.
4965
- `pileup/insertions.pkl.gz` : a nested dictionary of insertions, saved in pickle format and compressed with gzip. The structure is `position -> sequence -> [n. forward reads, n. reverse reads]`.
5066
- `pileup/clips.pkl.gz` : a pickle file containing two dictionaries (accessible with the "count" and "seqs" tags). The first has the form `{position -> [fwd clips, rev clips, tot fwd read ends, tot rev read ends]}`. Contains the number of forward and reverse soft-clips ecountered in the reads at any given position, provided that the clipped part is longer than the threshold `clip_minL`. It also contains the total number of reads that end at that position, even if they do not meet the threshold. On a separate dictionary (under the "seqs" tag) for each position we save the clipped nucleotide sequence. This dictionary has the form `{position -> {0 or 1 for fwd / rev -> seq }}`.
67+
- `pileup/unmapped.{csv,fastq.gz}`: dataframe with list of unmapped reads. It contains the read-id, length, flag and average quality. The fastq.gz file contains these reads.
68+
- `pileup/non_primary.csv`: dataframe with list of mapping info for reads that have *supplementary* or *secondary* mappings. The primary one is included too. The structure of this file is described in `notes/read_mapping.md`.
5169
- `ref_genome.fa` and `ref_genome.gbk` : symlink to the reference genome used for mapping the reads (same vial, first timepoint), both in genbank and fasta format.
5270

71+
See also `notes/pipeline_description.md` for a detailed description of the commands/tools used and output files structure.
72+
In the current version of the pipeline this workflow also produces plots for secondary/supplementary mappings (see `notes/plots.md`).
5373

5474
## workflow: extract statistics
5575

conda_env.yml

-198
This file was deleted.

0 commit comments

Comments
 (0)