mmolari
diff --git a/.gitignore b/.gitignore (+11 -9)
diff --git a/README.md b/README.md (+27 -7)
diff --git a/conda_env.yml b/conda_env.yml (-198)
@@ -1,10 +1,12 @@
-test_dataset
-.vscode
-local
-work
-.nextflow*
-reports
-results
-makefile
-figures
+/test_dataset
+/.vscode
+/local
+/work
+/.nextflow*
+/reports
+/results
+/makefile
+/figures
+/playgrounds
+/conda_envs/conda
 __pycache__/
@@ -4,11 +4,18 @@ Code for the analysis of morbidostat data. The analysis consists for now of the
 
 - map reads against assembled genome from the first timepoint
 - produce a pileup of the mapped reads, that can be explored with IGV
-
-## content of the repository
-
-- The folder `notes` contains notes and descrioptions of the steps implemented in the pipeline. To be used as guides for pipeline implementation.
-- The `scripts` folder contains scripts used by the various workflows.
+- analyze the variation over time of:
+  - coverage
+  - consensus frequency
+  - gaps
+  - insertions
+  - secondary/supplementary mappings
+  - clipped reads
+
+It consists of three workflows:
+- `pileup.nf`: maps the reads against the reference genomes and produces pileups and other summary files (see `notes/pipeline_description.md` for a more detailed description).
+- `extract_stats.nf`: produces databases containing the variation of consensus and gap frequency over time.
+- `plots.nf`: produces different plots for the analysis (see `notes/plots.md` for a more detailed description). 
 
 ## test dataset
 
@@ -20,6 +27,16 @@ Substituting `username` with the appropriate username.
 
 The test dataset has been extracted from the morbidostat run labeled `2022-02-08-RT_test`. It contains 2 vials (vial 2 and 3), each one having 3 timepoints (timepoint 1, 2 and 5). For timepoint 1 the annotated genome is included. For the rest of the timepoints, only fastq reads are provided in the `reads.fastq.gz` files. These files contain only a subset of all the reads (first 10'000 lines of the original fastq file). This corresponds to ~3% of all the reads, and these files are roughly 20-30 Mb each.
 
+In general, input data must be placed in an input directory with the following nested structure:
+
+```
+input_dir/vial_XX/time_YY/
+├── assembled_genome
+│   ├── assembly.fna
+│   └── assembly.gbk
+└── reads.fastq.gz
+```
+
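The nested layout above can be checked before launching a workflow. The following is a minimal sketch, not part of the repository: the function name `find_samples` is hypothetical, and it assumes that `XX` and `YY` are integer labels and that a sample folder is usable only if it contains `reads.fastq.gz`.

```python
import re
from pathlib import Path


def find_samples(input_dir):
    """Return a sorted list of (vial, time, folder) tuples.

    Only folders matching input_dir/vial_XX/time_YY (integer labels, an
    assumption of this sketch) that contain reads.fastq.gz are reported.
    """
    samples = []
    for folder in Path(input_dir).glob("vial_*/time_*"):
        vial = re.fullmatch(r"vial_(\d+)", folder.parent.name)
        time = re.fullmatch(r"time_(\d+)", folder.name)
        if vial and time and (folder / "reads.fastq.gz").is_file():
            samples.append((int(vial.group(1)), int(time.group(1)), folder))
    return sorted(samples)
```

Calling `find_samples("test_dataset")` would then list the vial/timepoint pairs that have reads available, which makes missing or misnamed folders easy to spot.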
 ## workflow: building a pileup of mapped reads
 
 The workflow `pileup.nf` builds a pileup of the reads. For every vial, it maps the reads from every timepoint to a reference genome, namely the one assembled from the initial timepoint of that vial. It generates the sorted bam file of mappings, together with a pileup matrix and a list of insertions and positions of clipped reads. Usage is as follows:
@@ -39,17 +56,20 @@ The input parameters are:
 - `input_fld` : folder containing the input reads. It must have the nested `vial_n/time_n` structure of archived morbidostat experiment data.
 - `time_beg` : the label corresponding to the first timepoint, whose assembled genome is used as reference (e.g. `1` for `timepoint_1`)
 - `qual_min` : quality threshold used by the script that builds the pileup. Only reads with quality higher than the threshold are used, and insertions are kept only if at least 70% of their sites meet this threshold.
-- `clip_minL` : minimum length of clips that are added to the clip dictionary. Shorter clips are ignored.
+- `clip_minL` : minimum length of clips that are added to the clip dictionary. Shorter clips are ignored. Nb: only clips from primary mappings are considered in the clip analysis.
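The 70% rule for insertions described under `qual_min` can be illustrated as follows. This is only a sketch of the rule as stated in the parameter list, not the pipeline's code: the function name `insertion_passes`, the `min_fraction` argument and the use of `>=` for "meeting the threshold" are assumptions.

```python
def insertion_passes(quals, qual_min, min_fraction=0.7):
    """Return True if enough sites of an insertion meet the quality threshold.

    quals: per-base quality scores of the inserted sequence.
    A site is counted as good when its quality is >= qual_min (assumed here).
    """
    if not quals:
        return False
    n_good = sum(q >= qual_min for q in quals)
    return n_good >= min_fraction * len(quals)
```

For example, with `qual_min=25` an insertion with qualities `[30, 30, 30, 10]` is kept (3 of 4 sites pass), while `[30, 30, 10, 10]` is discarded (2 of 4 sites pass).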
 
-Nb: only primary reads are considered, and secondary or supplementary reads are excluded. In future updates we might want to consider secondary reads for clip positions.
 
 Results are saved in the `results` folder, with a structure that mirrors the `vial_n/timepoint_n` structure of the input folder. Each of these subfolders will contain the following files:
 - `reads.sorted.bam` and `reads.sorted.bam.bai` : sorted `bam` file (and corresponding index) containing the mapping of the reads against the reference genome.
 - `pileup/allele_counts.npz` : pileup of the reads. This is a numpy tensor with dimension (2,6,L) corresponding to (1) forward-reverse reads, (2) allele `["A", "C", "G", "T", "-", "N"]` and (3) position.
 - `pileup/insertions.pkl.gz` : a nested dictionary of insertions, saved in pickle format and compressed with gzip. The structure is `position -> sequence -> [n. forward reads, n. reverse reads]`.
 - `pileup/clips.pkl.gz` : a pickle file containing two dictionaries (accessible with the "count" and "seqs" tags). The first has the form `{position -> [fwd clips, rev clips, tot fwd read ends, tot rev read ends]}`. It contains the number of forward and reverse soft-clips encountered in the reads at any given position, provided that the clipped part is longer than the threshold `clip_minL`. It also contains the total number of reads that end at that position, even if they do not meet the threshold. In a separate dictionary (under the "seqs" tag) we save the clipped nucleotide sequence for each position. This dictionary has the form `{position -> {0 or 1 for fwd / rev -> seq }}`.
+- `pileup/unmapped.{csv,fastq.gz}`: the csv file is a dataframe listing the unmapped reads, with their read-id, length, flag and average quality. The fastq.gz file contains the reads themselves.
+- `pileup/non_primary.csv`: dataframe with list of mapping info for reads that have *supplementary* or *secondary* mappings. The primary one is included too. The structure of this file is described in `notes/read_mapping.md`.
 - `ref_genome.fa` and `ref_genome.gbk` : symlink to the reference genome used for mapping the reads (same vial, first timepoint), both in genbank and fasta format. 
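To make the (2,6,L) layout of the allele-count tensor described above concrete, here is a minimal sketch that derives coverage and consensus frequency from such an array. The function name and the definitions used (coverage as the number of A/C/G/T calls on both strands, consensus frequency as the share of the most common nucleotide) are assumptions for illustration and may differ from what `extract_stats.nf` computes.

```python
import numpy as np

ALLELES = ["A", "C", "G", "T", "-", "N"]


def coverage_and_consensus(pileup):
    """Summarize a pileup array of shape (2, 6, L).

    Returns (coverage, consensus, frequency), each of length L.
    Frequency is NaN where coverage is zero; the consensus letter at such
    positions is arbitrary and should be ignored.
    """
    counts = pileup.sum(axis=0)[:4]   # merge strands, keep A, C, G, T -> (4, L)
    coverage = counts.sum(axis=0)     # nucleotide calls per position
    top = counts.max(axis=0)          # count of the most common nucleotide
    frequency = np.full(coverage.shape, np.nan)
    np.divide(top, coverage, out=frequency, where=coverage > 0)
    consensus = np.array(ALLELES[:4])[counts.argmax(axis=0)]
    return coverage, consensus, frequency
```

The name of the array stored inside `allele_counts.npz` is not stated here; after `archive = np.load(...)`, the attribute `archive.files` lists the available keys.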
 
+See also `notes/pipeline_description.md` for a detailed description of the commands/tools used and output files structure.
+In the current version of the pipeline this workflow also produces plots for secondary/supplementary mappings (see `notes/plots.md`).
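The gzip-compressed pickle outputs listed above (`insertions.pkl.gz`, `clips.pkl.gz`) can be read with the standard library alone. The sketch below shows one way to load such a file and to total the reads per position from the `position -> sequence -> [n. forward reads, n. reverse reads]` structure. The helper names are hypothetical and not taken from the repository's scripts.

```python
import gzip
import pickle


def save_pkl_gz(obj, path):
    """Write an object as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pkl_gz(path):
    """Read a gzip-compressed pickle, e.g. pileup/insertions.pkl.gz."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def insertion_totals(insertions):
    """Map each position to the total number of reads (fwd + rev) carrying any insertion."""
    return {
        pos: sum(fwd + rev for fwd, rev in seqs.values())
        for pos, seqs in insertions.items()
    }
```

With a toy dictionary `{120: {"AT": [3, 2], "G": [1, 0]}, 4501: {"CCG": [0, 4]}}`, `insertion_totals` returns `{120: 6, 4501: 4}`. As with any pickle, only files from a trusted source should be loaded.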
 
 ## workflow: extract statistics