Commit 113604e

Merge pull request #8 from mmolari/marco-dev
fix: max insertion frequency
2 parents a872335 + f32dec0 commit 113604e

2 files changed, +11 -6 lines changed
notes/improvements.md (+1)

@@ -3,6 +3,7 @@
 - [ ] Rename records in original fasta/genbank assembled genome as `vialXX_timeYY_clZZ`?
 - [x] Group together correlated deletion trajectories that span adjacen intervals.
 - [ ] parallelize insertion fisher test evaluation.
+- [ ] because of quality filtering the number of insertions might be greater than the number of reads in the pileup. To solve for this in the insertion analysis we decrease the number of insertions in this case, to have an insertion frequency not greater than one. However this might bias the estimation for insertion frequency on the rest of the sites. A proper solution would require also keeping information on the unfiltered number of reads (also an unfiltered pileup).
 - [x] Select trajectories based on delta (max - min) frequency on timepoints with high confidence.
 - [x] Use secondary/supplementary reads to find duplicated/chimeric region bridges.
 - [ ] Make the pipeline less reliant on folder structure. Pass the files directly in channels.
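The note added in this hunk can be illustrated with a small numeric sketch. The counts below are made up, and the names `n_ins`, `n_reads_filtered` and `n_reads_unfiltered` are hypothetical (they do not appear in the repository). The sketch compares the capped frequency with what an unfiltered pileup depth would give, assuming such a depth were available.

```python
import numpy as np

# Made-up counts for four sites (illustrative only, not data from the pipeline).
n_ins = np.array([2, 5, 12, 0])                 # insertions observed per site
n_reads_filtered = np.array([10, 5, 8, 0])      # pileup depth after quality filtering
n_reads_unfiltered = np.array([11, 6, 15, 0])   # hypothetical depth before filtering


def frequency(ins, reads, cap=False):
    """Return ins / reads per site, 0 where reads == 0.

    With cap=True the insertion count is first limited to the read count,
    so the resulting frequency cannot exceed one.
    """
    if cap:
        ins = np.minimum(ins, reads)
    out = np.zeros(len(ins), dtype=float)
    np.divide(ins, reads, out=out, where=reads > 0)
    return out


raw = frequency(n_ins, n_reads_filtered)                 # may exceed 1
capped = frequency(n_ins, n_reads_filtered, cap=True)    # bounded by 1
unfiltered = frequency(n_ins, n_reads_unfiltered)        # the "proper solution" in the note

print(raw, capped, unfiltered)
```

At the third site the raw ratio is 12/8, the capped value is 1, and the hypothetical unfiltered depth would give 12/15. Capping bounds the frequency but does not recover the value an unfiltered depth would give, which is the bias the note points at.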

scripts/plot_insertions.py (+10 -6)

@@ -54,22 +54,26 @@ def L_tot(x):
 df["If"], df["Ir"] = I[:, 0], I[:, 1]
 df["It"] = I.sum(axis=1)

+# average read length
+Ltot = np.vstack([L_tot(ins[p]) for p in pos])
+df["Lf"] = safe_division(Ltot[:, 0], df["If"])
+df["Lr"] = safe_division(Ltot[:, 1], df["Ir"])
+df["Lt"] = Ltot.sum(axis=1) / df["It"]
+
 # number of reads
 df["Nf"] = stats_table.N(t, kind="fwd")[pos]
 df["Nr"] = stats_table.N(t, kind="rev")[pos]
 df["Nt"] = df["Nf"] + df["Nr"]

+# renormalize because of quality filtering (the two might differ in some places)
+df["If"] = np.minimum(df["If"],df["Nf"])
+df["Ir"] = np.minimum(df["Ir"],df["Nr"])
+
 # frequency of insertions
 df["Ff"] = safe_division(df["If"], df["Nf"])
 df["Fr"] = safe_division(df["Ir"], df["Nr"])
 df["Ft"] = safe_division(df["It"], df["Nt"])

-# average read length
-Ltot = np.vstack([L_tot(ins[p]) for p in pos])
-df["Lf"] = safe_division(Ltot[:, 0], df["If"])
-df["Lr"] = safe_division(Ltot[:, 1], df["Ir"])
-df["Lt"] = Ltot.sum(axis=1) / df["It"]
-
 # build dataframe
 dfs[t] = pd.DataFrame(df, index=pos)
 # df = pd.concat(columns, axis=1).fillna(0).astype(int)
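The order of operations in this hunk can be sketched as a standalone function: lengths are averaged over the uncapped insertion counts, then the counts are capped at the read depth, then frequencies are computed. This is a sketch under stated assumptions, not the script's code. `safe_division` here is a hypothetical stand-in (the real helper is not shown in the diff and may handle zero denominators differently), `insertion_stats` and its `N` argument are invented for illustration, and the recomputation of `It` from the capped counts is an assumption as well, since the diff leaves `It` at its uncapped value.

```python
import numpy as np


def safe_division(a, b):
    """Hypothetical stand-in for the script's helper: a / b, NaN where b == 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.full(a.shape, np.nan)
    np.divide(a, b, out=out, where=b != 0)
    return out


def insertion_stats(I, Ltot, N):
    """I, Ltot, N are (n_sites, 2) arrays with forward/reverse columns."""
    df = {}
    df["If"], df["Ir"] = I[:, 0], I[:, 1]

    # average length, computed from the uncapped insertion counts
    df["Lf"] = safe_division(Ltot[:, 0], df["If"])
    df["Lr"] = safe_division(Ltot[:, 1], df["Ir"])
    df["Lt"] = safe_division(Ltot.sum(axis=1), I.sum(axis=1))

    # number of reads
    df["Nf"], df["Nr"] = N[:, 0], N[:, 1]
    df["Nt"] = df["Nf"] + df["Nr"]

    # cap insertions at the read depth so that no frequency exceeds one
    df["If"] = np.minimum(df["If"], df["Nf"])
    df["Ir"] = np.minimum(df["Ir"], df["Nr"])
    # assumption: derive the total from the capped counts (not done in the diff)
    df["It"] = df["If"] + df["Ir"]

    # frequency of insertions
    df["Ff"] = safe_division(df["If"], df["Nf"])
    df["Fr"] = safe_division(df["Ir"], df["Nr"])
    df["Ft"] = safe_division(df["It"], df["Nt"])
    return df


# Made-up input: the second site has more forward insertions than forward reads.
I = np.array([[3, 1], [6, 2], [0, 0]])
Ltot = np.array([[30, 12], [60, 20], [0, 0]])
N = np.array([[10, 10], [4, 5], [0, 0]])
stats = insertion_stats(I, Ltot, N)
print(stats["If"], stats["Ff"], stats["Ft"])
```

Computing the averages before the cap keeps `Lf` and `Lr` tied to the number of insertions actually observed, which appears to be why the hunk moves that block upward.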
