Workflow_Sanchez_et_al._2017_Nat Comm.txt
###
# DISCLAIMER:
# These were the workflows for solving several bioinformatic analyses for Sanchez et al. (2017) Nat Comm.
# I used standard open-source software on the Linux command line, and custom-made Python scripts (I kept the original names of the script files here to match my records).
# It also involved some manual organization of files, manual assessment of results and manual processing to reach the final results (e.g. sequence reconstruction).
# Although it solved the problems faced at the time, note that this is beginner's code, as I'm not a professional bioinformatician or data scientist.
# As such, it may look overly complicated, and more advanced solutions are likely simpler or more effective.
# The workflows also display some historical contingencies, as I was learning while doing.
# None of this work would have been possible without the collaboration of Herve Gaubert,
# and the kind bioinformatics support of Varodom Charoensawan, Hugo Tavares, Jeremy Gruel, Hajk-Georg Drost, Anna Gogleva and Yassin Refahi.
#
# Diego H. Sanchez (diego.sanchez@slcu.cam.ac.uk)
###
#################################################################################################################################################
# A) Workflow for LTR reconstruction of ONSEN insertions in nrpd1-3 (example for the nrpd1 2L plant from Gaubert et al. (2017))
#################################################################################################################################################
#########
STEP 1
#########
# Use python script to extend the areas around insertion points by 1000 bp (since mean library fragment size was around 350 bp, I estimated that 1000 bp would recover all possible informative reads)
# For this I input a .txt file with genomic coordinates of insertions from data reported in Gaubert et al. (2017) Genetics (chromosome, start, end; in my records the file name is input_2L.txt)
#FILE: input text extend strands and generates new bed.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
start_time=time.time() # to record how long the processing takes
# Files&path
InBEDFile = "input_2L.txt"
OutBEDFile = "insertion_points_2L_extended.bed"
path = "~/PATH/"
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (filename):
with open ("%s%s" %(path,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile)
return count
############
### MAIN ###
############
# Count lines in files
counts_bed = CountLines (InBEDFile)
print 'Number of lines in BED file: ', counts_bed, "--- %s minutes run ---" %((time.time()-start_time)/60)
# Input data from .bed in matrix
input_bed_handle = open("%s%s" %(path,InBEDFile), "r") # opens the input file to be read
output_handle = open("%s%s" %(path,OutBEDFile), "w") # opens the file to be written
count=0
for line in input_bed_handle:
chromosome, start, end = line.strip("\n").split('\t')
start_=int(start)
end_=int(end)
if start_>end_:
end=start_
start=end_
print start, end,
new_start=int(start)-1000; new_end=int(end)+1000 # extends the genomic coordinates by 1000 bp on each side
name = chromosome+':'+str(new_start)+'-'+str(new_end)
print name
out_line = [chromosome, '\t', str(new_start), '\t', str(new_end), '\t', name, '\n'] # generates the components of the .bed4 line
for item in out_line:
output_handle.write(item) # Writing each component to a new .bed4 format
input_bed_handle.close()
output_handle.close()
counts_bed = CountLines (OutBEDFile) # check content of the OutBEDFile
print 'Number of lines in new BED file: ', counts_bed, "--- %s minutes run ---" %((time.time()-start_time)/60)
# END
print 'file parsed in ', "--- %s minutes run ---" %((time.time()-start_time)/60)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
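# For illustration only (hypothetical coordinates, not from the actual dataset): an input line such as
#   Chr1   5000000   5000400
# would be written to the extended .bed4 file (fields tab-separated) as
#   Chr1   4999000   5001400   Chr1:4999000-5001400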
#########
STEP 2
#########
# Use python script to recover NGS reads around each new ONSEN insertion point, but only those having a mapped read/unmapped mate.
# Most input files were generated in Gaubert et al. (2017) Genetics, when finding the genomic coordinates of new ONSEN insertions.
# The script will loop over each extended position from the previous step, retrieving the mapping reads.
# For each new insertion, the script will generate: (i) a .sam file with the NGS reads, but only those having a mapped read/unmapped mate,
# (ii) a report with the names of the reads recovered from this .sam, and (iii) two fasta files with the sequences of these reads,
# obtained from the left or right LTRs, which were used to reconstruct the LTR sequence of each new insertion (this is possible because both 5' and 3' LTRs are identical in new insertions).
# Incidentally, these data also allow recognizing the orientation of the new insertion.
# NOTE: the script uses NumPy arrays to store information in memory, avoiding looping over files, which is a far more time-consuming process.
# However, this solution may not work if available memory is limited. Also, 'samtools view' requires an index of the .bam file.
# Prepare an index and a .bed file from the .bam mapping file (obtained in Gaubert et al. (2017) Genetics, mapping reads to an ONSEN-masked genome).
$samtools index 2L_trimmo_paired_bowtie2_verysensitive_TAIR10_chr_all_maskedONSEN_sorted_rmdup_picard.bam &
$bamToBed -i 2L_trimmo_paired_bowtie2_verysensitive_TAIR10_chr_all_maskedONSEN_sorted_rmdup_picard.bam > 2L_trimmo_paired_bowtie2_verysensitive_TAIR10_chr_all_maskedONSEN_sorted_rmdup_picard.bed &
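# For reference, bamToBed writes one BED6 line per mapped read, e.g. (hypothetical read shown):
#   Chr1   4999123   4999223   HWUSI-EAS100R:6:73:941:1973#0/1   42   +
# i.e. chromosome, start, end, read name, mapping quality and strand; the script below uses column 4 (read name) and column 6 (strand).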
# Run python script
#FILE: Modif_Analysis_1xsam_1xbed_1xfasta_ONSEN_reads_around_each_coordinate_for_commandline_2Lsample.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
import numpy as np
from collections import Counter
from Bio import SeqIO
from Bio.Seq import Seq
import subprocess
start_time=time.time() # to record how long the processing takes
# to run in command line: $python Modif_Analysis_1xsam_1xbed_1xfasta_ONSEN_reads_around_each_coordinate_for_commandline_2Lsample.py > log_analysis.txt &
# Path
path = "~/PATH/"
# Files
MappingFile_bam = "2L_trimmo_paired_bowtie2_verysensitive_TAIR10_chr_all_maskedONSEN_sorted_rmdup_picard.bam" # mapped reads to ONSEN masked genome, from Gaubert et al. (2017) Genetics
MappingFile_bed = "2L_trimmo_paired_bowtie2_verysensitive_TAIR10_chr_all_maskedONSEN_sorted_rmdup_picard.bed" # obtained from previous .bam file
InputBEDFile = "insertion_points_2L_extended.bed" # obtained in previous step
FilenameSAM = "Reads_2L_GenomeUnmapped_verysensitive_fasta_ONSEN_sorted_MappedtoONSEN.sam" # .sam file with reads mapping to ONSEN, from Gaubert et al. (2017) Genetics
FastaIn = "Reads_2L_GenomeUnmapped.fasta" # .fasta with sequence reads, from Gaubert et al. (2017) Genetics
d_type = 'a48' # max length of NGS read name, in number of characters, for the matrix
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (filename):
with open ("%s%s" %(path,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile)
return count
################################################################################
###define fx to run samtools view in commandline
################################################################################
def sam_view (InFile,OutFile,coordinates):
p = subprocess.Popen(["samtools view -f 8 -F 4 -o %s%s %s%s %s" %(path,OutFile,path,InFile,coordinates)], shell=True) # subprocess.call
p.wait()
return OutFile # returns name of the OutFile for report
################################################################################
### defines fx ParseBED and stores strand data
################################################################################
def ParseBED (input_file, number_lines):
BED_Matrix = np.array(range(number_lines), dtype=d_type) # array of strings of at most d_type characters, to store NGS read names
STRAND_Matrix = np.array(range(number_lines), dtype='a2') # array of 2-character strings, to store the strand of each read
i=0 # counts position in matrix
for record in input_file:
if record.startswith("@"): continue # historical contingency; this just disregards a SAM header, just in case
else:
line = record.strip().split("\t")
name_sam = line[3]; name_sam=name_sam.replace("/1",""); name_sam=name_sam.replace("/2","") # line[3] of .bed is read name
BED_Matrix[i] = name_sam
STRAND_Matrix[i] = line [5] # stores strand data from line [5]
i +=1
return BED_Matrix, STRAND_Matrix
################################################################################
### defines fx ParseSAM
################################################################################
def ParseSAM (input_file, number_lines):
SAM_Matrix = np.array(range(number_lines), dtype=d_type) # generates a matrix with strings of at most d_type characters, to store NGS read names
i=0 # counts position in matrix
for record in input_file:
if record.startswith("@"): continue # disregards SAM header
else:
line = record.strip().split("\t")
name_sam = line[0]; name_sam=name_sam.replace("/1",""); name_sam=name_sam.replace("/2","") # line[0] of .sam is NGS read name
SAM_Matrix[i] = name_sam
i +=1
SAM_Matrix = np.unique(SAM_Matrix) # just in case, removes duplicated reads and sorts
return SAM_Matrix
############
### MAIN ###
############
counts_3 = CountLines (MappingFile_bed); print 'File BED count: ', counts_3
input_handle_3 = open("%s%s" %(path,MappingFile_bed), "r") # opens .bed file
Data_File3, Strand_File3 = ParseBED (input_handle_3, counts_3) # applies ParseBED fx
input_handle_3.close()
print "Strand data input from BED mapping file in --- %s minutes run ---" %((time.time()-start_time)/60)
# Generates matrix with names of all reads mapping to ONSEN
counts_2 = CountLines (FilenameSAM); print 'File SAM count: ', counts_2
input_handle_2 = open("%s%s" %(path,FilenameSAM), "r") # opens .sam file
Data_File2 = ParseSAM (input_handle_2, counts_2) # applies ParseSAM fx
input_handle_2.close()
print "SAM with mapped reads to ONSEN parsed in --- %s minutes run ---" %((time.time()-start_time)/60)
# Parse .bed file
input_handle_bed = open("%s%s" %(path,InputBEDFile), "r") # opens .bed file
for record in input_handle_bed:
Chr, start,end, coordinates_bed = record.strip("\n").split("\t") # coordinates is the name in .bed file
name_bed = str(Chr)+'_'+str(start)+'_'+str(end) # generates a name for the out files
ReadsFile = "2L_reads_around_insertion_%s.sam" %name_bed # .this will be the .sam with results of reads
# Recover NGS reads with mapped read/unmapped mate around particular insertion point
sam_view (MappingFile_bam,ReadsFile,coordinates_bed)
print "Recovered reads around", coordinates_bed, "in --- %s minutes run ---" %((time.time()-start_time)/60)
# Parse the ReadsFile .sam, to get reads names around insertion point
counts_1 = CountLines (ReadsFile); print 'File .sam count: ', counts_1
input_handle_1 = open("%s%s" %(path,ReadsFile), "r") # opens .sam file
Data_File1 = ParseSAM (input_handle_1, counts_1) # applies ParseSAM, recovers read names
input_handle_1.close()
print "file ", ReadsFile, " parsed in --- %s minutes run ---" %((time.time()-start_time)/60)
# Concatenate matrixes & match data
Final_Matrix = np.concatenate((Data_File1,Data_File2), axis=0)
count=0
Shared_array = np.array(([item for item,count in Counter(Final_Matrix).iteritems() if count > 1]), dtype=d_type)
Shared_array.sort()
print "Total reads match for ", ReadsFile, "is count", len(Shared_array), " in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Write report
ReportFile = "report_3L_reads_around_insertion_%s.txt" %name_bed
output_handle = open("%s%s" %(path,ReportFile), "w") # opens ReportFile to be written
for out_line in Shared_array:
output_handle.write("%s\n" %out_line) # writes .sam line plus \n
output_handle.close()
print "Report file ", ReportFile, " generated in --- %s minutes run ---" %((time.time()-start_time)/60)
# Input .txt data in matrix
counts_text = CountLines (ReportFile)
print "Number of lines in report file: ", counts_text, "--- %s minutes run ---" %((time.time()-start_time)/60)
Text_List = np.zeros( (counts_text), dtype=d_type) # make matrix with counts_text items at most
input_text_handle = open("%s%s" %(path,ReportFile), "r") # opens ReportFile
count=0
for record in input_text_handle:
record = record.strip()
record = record.strip("\t")
record = record.strip("\n")
Text_List[count] = record
count +=1
input_text_handle.close()
Text_List.sort()
print "text data in matrix ", "--- %s minutes run ---" %((time.time()-start_time)/60)
# Loop .txt data in the input fasta file
ReadsFastaFile_left = "Fasta_reads_2L_reads_around_insertion_%s_left.fasta" %name_bed
ReadsFastaFile_right = "Fasta_reads_2L_reads_around_insertion_%s_right.fasta" %name_bed
input_handle = open("%s%s" %(path,FastaIn), "rU") #opens .fastq file
output_handle_left = open("%s%s" %(path,ReadsFastaFile_left), "w") # opens the file to be written LEFT for + strand
output_handle_right = open("%s%s" %(path,ReadsFastaFile_right), "w") # opens the file to be written RIGH for - strand
count=0
for record in SeqIO.parse(input_handle, "fasta") :
if record.id in Text_List:
count+=1
index_data = np.where(Data_File3 == record.id) # find NGS read's position in matrix, to then recognize the strand stored in the strand matrix
print index_data,
index_strand = index_data[0][0]
read_strand = Strand_File3[index_strand]
if read_strand == '+':
SeqIO.write(record, output_handle_left, "fasta")
if read_strand == '-':
SeqIO.write(record, output_handle_right, "fasta")
input_handle.close()
output_handle_left.close()
output_handle_right.close()
print "number of output in fasta Files is: ", count
# END
print "Process time required: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#########
STEP 3
#########
# Use the output .fasta files to reconstruct the new ONSEN LTR sequence at each insertion point; we used Geneious software.
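# Not part of the original pipeline, but as a minimal sketch (hypothetical file names for one insertion point), the per-insertion left/right
# read files can be pooled into a single fasta before importing into the assembler:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from Bio import SeqIO
# hypothetical file names, following the naming pattern of the previous step
left_fasta = "Fasta_reads_2L_reads_around_insertion_Chr1_4999000_5001400_left.fasta"
right_fasta = "Fasta_reads_2L_reads_around_insertion_Chr1_4999000_5001400_right.fasta"
pooled_fasta = "Assembly_input_2L_Chr1_4999000_5001400.fasta"
with open(pooled_fasta, "w") as out_handle:
    for filename in (left_fasta, right_fasta):
        for record in SeqIO.parse(filename, "fasta"):
            SeqIO.write(record, out_handle, "fasta") # copies every read into the pooled file
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~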
#################################################################################################################################################
# B) Workflow for estimating transcript levels with specific sequence 'addresses', from RNA-seq data
#################################################################################################################################################
#########
STEP 1
#########
# Use python workflow to recover NGS reads mapping to the genes of interest or ONSEN.
# The example is shown for housekeeping genes and stress genes.
# For this I used a .bed file with genomic coordinates and samtools view, on the TopHat .bam mapping file of each sample
#FILE: Workflow_recover_reads_TOPHAT_DUPLICATED_housekeepingANDstress_genes_SAMPLE.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
start_time=time.time() # to record how long the processing takes
import subprocess
# run in linux command line: $python Workflow_recover_reads_housekeepingANDstress_genes.py > log_recover_reads_housekeepingANDstress_genes.txt &
# path
path_input_sample = "~/PATH/"
############
### MAIN ###
############
# Recover reads
InFile = "RNA_SAMPLE_tophat_TAIR10_sorted.bam"
OutFile = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10.bam"
with open ("%s%s" %(path_input_sample,OutFile), 'w') as myfile:
p = subprocess.Popen(["samtools view -b -h -F4 %s%s -L housekeeping_stress_genes.txt" %(path_input_sample,InFile)], shell=True, stdout = myfile) #subprocess.call(["samtools view -bS %s%s" %(path,MappingFile)], shell=True )
p.wait()
myfile.flush()
print "Recovered reads in: ", "--- %s minutes run ---" %((time.time()-start_time)/60)
# Sort
InFile = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10.bam"
OutFile = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.bam"
p = subprocess.Popen(["samtools sort %s%s -o %s%s" %(path_input_sample,InFile,path_input_sample,OutFile)], shell=True)
p.wait()
print "Sorted BAM file in: ", "--- %s minutes run ---" %((time.time()-start_time)/60)
# Index just in case
InFile = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.bam"
p = subprocess.Popen(["samtools index %s%s" %(path_input_sample,InFile)], shell=True)
p.wait()
print "Index generated in: ", "--- %s minutes run ---" %((time.time()-start_time)/60)
# Generate fastq from bam
InFile = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.bam"
OutFile = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.fastq"
p = subprocess.Popen(["bedtools bamtofastq -i %s%s -fq %s%s" %(path_input_sample,InFile,path_input_sample,OutFile)], shell=True)
p.wait()
print "Generated .fastq file in: --- %s minutes run ---" %((time.time()-start_time)/60)
# END
print "Process time required: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Generate a .fasta file from .fastq file in command line
$awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.fastq > housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.fasta &
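# Alternatively, a minimal Biopython sketch for the same fastq-to-fasta conversion (same file names assumed; quality scores are dropped):
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from Bio import SeqIO
# converts the recovered reads from fastq to fasta
SeqIO.convert("housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.fastq", "fastq",
              "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.fasta", "fasta")
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~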
#########
STEP 2
#########
# Use python script which generates a report of counts per target feature, based on perfect string matching.
# It requires a .fasta file with the addresses and a .fasta file with the NGS reads
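# To illustrate the idea with toy data (hypothetical address and reads, not from the real dataset), the core of the counting is plain substring matching:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from collections import Counter
addresses = {"GENE1_address": "ACGTACGTACGT"} # one short diagnostic 'address' per gene (toy sequence)
reads = {"read1/1": "TTACGTACGTACGTAA", "read2/1": "GGGGCCCCGGGGCCCC"} # toy NGS reads
# count, for each address, how many reads contain it as an exact substring
counts = Counter(addr for addr, addr_seq in addresses.items()
                 for read_seq in reads.values() if addr_seq in read_seq)
print(counts) # Counter({'GENE1_address': 1}); the real script below also checks the reverse complement and keeps only addresses with more than one hit
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~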
#FILE: Search_for_housekeepingANDstress_noMULTI_adresses_counter_itteritem_SAMPLE.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Seq
from collections import Counter
from operator import itemgetter
#THREADS = 6 # number of cores to parallelize the analysis if necessary; will require a slight modification of the code at Dummy_List = map(Busqueda, Genome_Dict.items()); see below.
start_time=time.time() # to record how long the processing takes
# Files & path
InputFasta_Genome = "housekeeping_stress_adress.txt" # this is fasta with the adresses of each gene
InputFasta_NGS = "housekeepingANDstress_genes_in_RNA_SAMPLE_tophat_TAIR10_sorted.fasta" # this is the fasta with the NGS reads from SAMPLE
Out_Report = "Report_housekeepingANDstress_RNA_SAMPLE_tophat_adress.txt" # this is the .txt outfile with the count of hits in each gene
path = "~/PATH/"
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (pathy,filename):
with open ("%s%s" %(pathy,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile if line.startswith(">"))
return count
################################################################################
### Define Busqueda fx; if required, enables parallelization for faster string matching
################################################################################
def Busqueda (key_value): # data in dictionary enters a tuple (key,value)
Tuple_List = []
item_genome = Seq.Seq(key_value[1])
rev_item_genome = item_genome.reverse_complement()
for record_ngs in NGS_Dict:
seq_ngs = Seq.Seq(str(NGS_Dict[record_ngs]))
if str(item_genome) in seq_ngs:
Tuple_List.append ((str(key_value[0]), str(record_ngs)))
if str(rev_item_genome) in seq_ngs:
Tuple_List.append ((str(key_value[0]), str(record_ngs))) # a match to the reverse complement of the address also counts as a hit
return set(Tuple_List) # returns a set of (address, read) tuples; empty if there was no match. This is handled later with Final_List.extend()
############
### MAIN ###
############
# Count entries in the input fasta files
counts = CountLines (path,InputFasta_Genome); print "Fasta file with addresses: ", counts, " counts"
counts = CountLines (path,InputFasta_NGS); print "Fasta file with NGS reads: ", counts, " counts"
# Unpack addresses
Genome_Dict ={} # Dict with genomic addresses
input_handle_genome = open("%s%s" %(path,InputFasta_Genome), "rU") # opens .fasta file
for genome_item in SeqIO.parse(input_handle_genome, "fasta"):
Genome_Dict[genome_item.id] = str(genome_item.seq)
input_handle_genome.close()
print "Genome sequences unpacked in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Unpack NGS reads
NGS_Dict ={} # dict with NGS sequences
ID_List = [] # list with the record.id, to find the reads/mate with same name
input_handle_NGS = open("%s%s" %(path,InputFasta_NGS), "rU") # opens .fasta file
for NGS_item in SeqIO.parse(input_handle_NGS, "fasta"):
ID = str(NGS_item.id)+'/1' # rename the reads for read/mate pairs having the same read name but different sequences; this accounts for reads recovered with samtools, where they lose their read/mate info
if ID in ID_List: ID = str(NGS_item.id)+'/2'
NGS_Dict[ID] = str(NGS_item.seq)
ID_List.append(ID)
input_handle_NGS.close()
print len (NGS_Dict), "is the number of NGS reads"
print "NGS data unpacked in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Check if the addresses occur in the NGS reads. This step is designed to be parallelized if necessary.
Final_List = []
Dummy_List = map(Busqueda, Genome_Dict.items()) # .items() of a dictionary generates a list of tuples (key,value)
for items in Dummy_List: Final_List.extend(items) # transforms list of list of tuples in list of tuples
print len (Final_List),"is number of redundant sequences in Final List"
print "Final List generated in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Process the Final_List into NonTransient_List, keeping addresses showing more than one hit
NonTransient_List = [] # stores address IDs with more than one hit, considering a single hit as the basal noise level
NonUnique_List = [] # will store (address, read) data from Final_List for those addresses with more than one hit
for item,count in Counter(x for x,y in Final_List).iteritems():
if count>1: NonTransient_List.append(item)
# Make NonUnique_List from the Final_List, selecting for addresses showing more than one hit appended in NonTransient_List
for adress,read in Final_List:
if adress in NonTransient_List:
NonUnique_List.append((adress,read))
NonUnique_List = sorted(NonUnique_List, key=itemgetter(0)) # sorts list according to the 1st element of the tuple, which is the address ID
print len (NonUnique_List),"is number of hits in NonUnique List"
print "NonUnique List generated in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Write report with the count of NGS reads per address
output_handle = open("%s%s" %(path,Out_Report), "w") #opens the file to be written
for item,count in Counter(x for x,y in NonUnique_List).iteritems(): # counts hits per address ID from the (address_ID, NGS_ID) tuples in NonUnique_List
print item, count
output_handle.write("%s\t%s\n" %(item,count))
output_handle.close()
# END
print "Total NGS reads match in time: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The file "Report_housekeepingANDstress_RNA_SAMPLE_tophat_adress.txt" contains the list of read counts hitting the address of each gene
#########
STEP 3
#########
# Manually generate a list of counts per gene across samples and normalize to the TopHat non-deduplicated mapped library size (see the sketch below)
# A similar workflow was used with recovered ONSEN-mapping DNA sequencing data to infer ecDNA signals
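# A minimal sketch of the manual normalization (hypothetical counts and library sizes; here expressed as reads per million mapped reads):
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
raw_counts = {"sample_A": 250, "sample_B": 400} # hypothetical address hit counts for one gene, taken from the reports
library_size = {"sample_A": 21500000, "sample_B": 34000000} # hypothetical TopHat non-deduplicated mapped reads per sample
normalized = {s: raw_counts[s] * 1e6 / library_size[s] for s in raw_counts} # scale each count by its library size
print(normalized)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~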
#################################################################################################################################################
# C) Workflow for revealing recombination in re-sequenced samples
#################################################################################################################################################
#########
STEP 1
#########
# Generate virtual library of fragments containing all possible recombination products between ONSEN members
# The input file FINAL_ONSEN_consensus_with_GAPs.txt was manually prepared and contains the consensus between ONSEN members, including GAPs, in FASTA format
# The script outputs FINAL_first_round_recombinations_GAPs_145_1.fasta, a library of redundant fragments containing all possible recombination products
# The extend_ambiguous_dna and slidingWindow functions were taken from publicly available sources (a small usage illustration of extend_ambiguous_dna follows the script)
#FILE: New_version_ONSEN_Generate_recombination_sequences_145_1.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Seq
from itertools import product
from Bio.Alphabet import IUPAC
start_time=time.time() # to record how long the processing takes
# Paths
path = "~/PATH/"
path_genome = "~/PATH/" # path to the folder with the A. thaliana genome fasta (needed for InputFasta_genome below)
# Files
InputFasta = "FINAL_ONSEN_consensus_with_GAPs.txt" # .fasta with manually prepared consensus sequence of ONSEN, with Ns
InputFasta_genome = "TAIR10_chr_all.fas" # .fasta with A.thaliana genome
OutFasta = "FINAL_first_round_recombinations_GAPs_145_1.fasta" # .fasta outfile with the name/sequences of recombination sequences
WINSIZE = 145
STEP = 1
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (pathy,filename):
with open ("%s%s" %(pathy,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile if line.startswith(">"))
return count
################################################################################
###define extend_ambiguous_dna fx
################################################################################
def extend_ambiguous_dna(sequencia):
"""return list of all possible sequences given an ambiguous DNA input"""
"""enter the sequence as a string"""
d = Seq.IUPAC.IUPACData.ambiguous_dna_values
r = []
for i in product(*[d[j] for j in sequencia]):
r.append("".join(i))
return r
################################################################################
###define slidingWindow fx
################################################################################
def slidingWindow(sequence,winSize,step):
"""Returns a generator that will iterate through
the defined chunks of input sequence. Input sequence
must be iterable."""
# Verify the inputs
try: it = iter(sequence)
except TypeError:
raise Exception("**ERROR** sequence must be iterable.")
if not ((type(winSize) == type(0)) and (type(step) == type(0))):
raise Exception("**ERROR** type(winSize) and type(step) must be int.")
if step > winSize:
raise Exception("**ERROR** step must not be larger than winSize.")
if winSize > len(sequence):
raise Exception("**ERROR** winSize must not be larger than sequence length.")
# Pre-compute number of chunks to emit
numOfChunks = ((len(sequence)-winSize)/step)+1
# Do the work
for i in range(0,numOfChunks*step,step):
yield sequence[i:i+winSize]
############
### MAIN ###
############
counts = CountLines (path,InputFasta); print "Fist fasta file: ", counts, " counts"
counts = CountLines (path_genome,InputFasta_genome); print "Second fasta file: ", counts, " counts"
# Unpack consensus seq data
Consensus_Dict ={} # dict with consensus sequences
input_handle_input = open("%s%s" %(path,InputFasta), "rU") # opens .fasta file
for input_item in SeqIO.parse(input_handle_input, "fasta"):
if len(input_item.seq)>=WINSIZE: # check that there is no error in size
Consensus_Dict[input_item.id] = str(input_item.seq)
input_handle_input.close()
# Unpack genome
Genome_Dict ={} # dict with genome sequences
input_handle_genome = open("%s%s" %(path_genome,InputFasta_genome), "rU") # opens .fasta file
for genome_item in SeqIO.parse(input_handle_genome, "fasta"):
Genome_Dict[genome_item.id] = str(genome_item.seq)
input_handle_genome.close()
# Check if putative recombination sequences occur in the genome
count = 0
output_handle = open("%s%s" %(path,OutFasta), "w") # opens .fasta to be written
for record_input in Consensus_Dict:
for each_sequence in slidingWindow(Consensus_Dict[record_input],WINSIZE,STEP):
recombination_list = extend_ambiguous_dna(str(each_sequence))
# Check that the sequence doesn't exist in the genome
for item in recombination_list:
item_seq = Seq.Seq(item)
rev_item_seq = item_seq.reverse_complement()
for record_genome in Genome_Dict:
seq_genome = Genome_Dict[record_genome]
if str(item_seq) in seq_genome or str(rev_item_seq) in seq_genome:
print "one prescence!"
break
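# note: the 'else' below belongs to the inner for loop (for-else) and only runs when no genomic match was found, i.e. the loop finished without a break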
else:
count=count+1; print count,
output_handle.write(">%s\n%s\n" %(count,item))
output_handle.close()
# END
counts = CountLines (path,OutFasta); print "Output fasta with final: ", counts, " counts"
print "Total sequences generate:", count, " in time: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
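# A quick usage illustration of the extend_ambiguous_dna fx above (hypothetical 3-mer; the IUPAC table is imported here via Bio.Data.IUPACData rather than through Bio.Seq):
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from itertools import product
from Bio.Data import IUPACData
def extend_ambiguous_dna(sequencia):
    # expand every IUPAC ambiguity code into its compatible bases and take the cartesian product
    d = IUPACData.ambiguous_dna_values
    return ["".join(i) for i in product(*[d[j] for j in sequencia])]
print(extend_ambiguous_dna("ART")) # ['AAT', 'AGT'] -- R stands for A or G
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~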
#########
STEP 2
#########
# Remove redundancy in the library (a simpler set-based sketch is shown after the script, for reference)
#FILE: FINAL_Check_generation_recombinants_avoid_duplicates_GAPs_145_1.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Seq
from itertools import product
from Bio.Alphabet import IUPAC
start_time=time.time() # to record how long the processing takes
# Path
path = "~/PATH/"
# Files
InputFasta = "FINAL_first_round_recombinations_GAPs_145_1.fasta" # .fasta from previous step
OutFasta2 = "FINAL_putative_recombinations_GAPs_145_1.fasta" # final library
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (filename):
with open ("%s%s" %(path,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile if line.startswith(">"))
return count
############
### MAIN ###
############
# Unpack data
counts = CountLines (InputFasta); print "First redundant library with recombination sequences: ", counts, " counts"
input_handle = open("%s%s" %(path,InputFasta), "rU") # opens .fasta file
Check_Dict = {}
for record in SeqIO.parse(input_handle, "fasta"):
Check_Dict [record.id] = str(record.seq)
input_handle.close()
print "Data in dictionary in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Check for duplicated sequences within the library
Black_List = [] # to keep track of those repeated sequences and avoid writing them
output_handle = open("%s%s" %(path,OutFasta2), "w") # opens the file to be written
input_handle = open("%s%s" %(path,InputFasta), "rU") # opens .fasta file
count = 0
count_yes = 0
for record in SeqIO.parse(input_handle, "fasta"):
print record.id,
seq = str(record.seq)
repeat = "No"
for key in Check_Dict:
if seq == Check_Dict [key]:
if record.id == key:
print "present! record: ", record.id, " is key: ", key
continue
else:
Black_List.append(key)
print "repeated!"
print record.id, seq
print key, Check_Dict [key]
repeat = "Yes"; count_yes +=1
break
if repeat == "No" and record.id not in Black_List:
count=count+1
SeqIO.write(record, output_handle, "fasta")
input_handle.close()
output_handle.close()
# END
counts = CountLines (OutFasta2); print "Final file with recombinant sequences: ", counts, " counts"
print "Total repeated sequences: ", count_yes
print "Total reads match: ", count, " in time: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
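# For reference only, a simpler single-pass alternative with a set keeps the first copy of each sequence; note this is not equivalent to the script above,
# which discards every sequence occurring more than once rather than keeping one representative (hypothetical output file name):
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from Bio import SeqIO
seen = set()
with open("FINAL_putative_recombinations_GAPs_145_1_setversion.fasta", "w") as out_handle:
    for record in SeqIO.parse("FINAL_first_round_recombinations_GAPs_145_1.fasta", "fasta"):
        if str(record.seq) not in seen: # first time this sequence is seen
            seen.add(str(record.seq))
            SeqIO.write(record, out_handle, "fasta")
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~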
#########
STEP 3
#########
# Search for occurrence of virtual library sequences in ONSEN-mapped NGS reads
#FILE: MULTI_ONSEN_Search_recombination_GAPs_SAMPLE_145_1.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Seq
import multiprocessing
THREADS = 16 # number of cores to parallelize the analysis
start_time=time.time() # to record how long the processing takes
# Path
path = "~/PATH/"
# Files
InputFasta_RecSeq = "FINAL_putative_recombinations_GAPs_145_1.fasta" # virtual library with putative recombination sequences
InputFasta_Genome = "fasta_ONSEN.fa" # .fasta with sequences of ONSEN members
InputFasta_NGS = "Reads_SAMPLE_TAIR10_chr_all_maskedONSEN_Unmapped_mapONSEN.fasta" # .fasta with the NGS reads from SAMPLE
OutFasta_Reads = "FINAL_found_in_reads_SAMPLE_GAPs_145_1.fasta" # .fasta outfile with the name/sequences of the NGS reads with matching sequences to library
OutFasta_Recomb = "FINAL_recombinations_found_GAPs_SAMPLE_145_1.fasta" # .fasta outfile with the recombinations from library having matching sequences to reads
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (pathy,filename):
with open ("%s%s" %(pathy,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile if line.startswith(">"))
return count
################################################################################
### Define Busqueda fx, parallelization enabled
################################################################################
def Busqueda (key_value): # data from dictionary enters a tuple (key,value)
Tuple_List = []
item_recseq = Seq.Seq(key_value[1])
rev_item_seq = item_recseq.reverse_complement()
for record_ngs in NGS_Dict:
seq_ngs = Seq.Seq(str(NGS_Dict[record_ngs]))
rev_seq_ngs = seq_ngs.reverse_complement()
if str(item_recseq) in seq_ngs:
Tuple_List.append ((str(key_value[0]), str(item_recseq), str(record_ngs),str(seq_ngs)))
if str(rev_item_seq) in seq_ngs:
Tuple_List.append ((str(key_value[0]), str(item_recseq), str(record_ngs),str(rev_seq_ngs))) # if the reverse complement shows the match, then the reverse complement of the NGS sequence is written
return Tuple_List # returns a list of tuples. Also returns [] if there was no match. This is handled later with a Final_List.extend()
############
### MAIN ###
############
# Count files
counts = CountLines (path,InputFasta_RecSeq); print "First .fasta file with library sequences: ", counts, " counts"
counts = CountLines (path,InputFasta_Genome); print "Second .fasta file with ONSEN sequences: ", counts, " counts"
counts = CountLines (path,InputFasta_NGS); print "Third .fasta file with NGS reads: ", counts, " counts"
# Unpack library sequences
RecSeq_Dict ={} # dict with library sequences
input_handle_RecSeq = open("%s%s" %(path,InputFasta_RecSeq), "rU") # opens .fasta file
for input_item in SeqIO.parse(input_handle_RecSeq, "fasta"):
RecSeq_Dict[input_item.id] = str(input_item.seq)
input_handle_RecSeq.close()
print len (RecSeq_Dict), "is the number of library sequences"
print "Recombination Sequences unpacked in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Unpack ONSEN sequences
Genome_Dict ={} # dict with ONSEN sequences
input_handle_genome = open("%s%s" %(path,InputFasta_Genome), "rU") # opens .fasta file
for genome_item in SeqIO.parse(input_handle_genome, "fasta"):
Genome_Dict[genome_item.id] = str(genome_item.seq)
input_handle_genome.close()
print "ONSEN sequences unpacked in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Unpack NGS reads
NGS_Dict ={} # dict with NGS sequences
ID_List = [] # list with the record.id, to find the reads/mate with same name
input_handle_NGS = open("%s%s" %(path,InputFasta_NGS), "rU") # opens .fasta file
for NGS_item in SeqIO.parse(input_handle_NGS, "fasta"):
ID = str(NGS_item.id)+'/1' # rename reads; this accounts for recovering reads with samtools
if ID in ID_List: ID = str(NGS_item.id)+'/2'
NGS_Dict[ID] = str(NGS_item.seq)
ID_List.append(ID)
input_handle_NGS.close()
print len (NGS_Dict), "is the number of NGS reads"
print "NGS data unpacked in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Re-check if library sequences are in ONSEN, just in case
Black_List = []
for record_input in RecSeq_Dict:
seq_input = str(RecSeq_Dict[record_input])
for record_genome in Genome_Dict:
seq_genome = Seq.Seq(str(Genome_Dict[record_genome]))
rev_seq_genome = str(seq_genome.reverse_complement()) # rev_seq_genome is the reverse-complement
if seq_input in seq_genome or seq_input in rev_seq_genome: # test if sequence is ONSEN sequence
print record_input, " is a genomic sequence"
Black_List.append(record_input) # stores the ID of the recombination sequence in Black_List
# NOTE: Black_List should always be empty if the library was checked against the genome assembly
print len (Black_List), "is the number of library sequences present in ONSEN"
print "Check if sequences were in genome, in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Check if items of the recombination library occur in the NGS reads
# Parallelize analysis
if __name__ == '__main__':
Final_List = []
p = multiprocessing.Pool(THREADS)
Dummy_List = p.map(Busqueda, RecSeq_Dict.items()) # .items() of a dictionary generates a list of tuples (key,value)
p.close()
p.join()
for items in Dummy_List: Final_List.extend(items) # transforms list of list of tuples in list of tuples
print len (Final_List),"is number of sequences in Final List"
print "Paralelized and Final List generated in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Write results
List_Reads = [] # to append NGS ID and avoid reporting again
List_Recomb = [] # to append library sequences and avoid reporting again
count=0 # to count number of reported non-redundant NGS reads, just in case
output_handle_Reads = open("%s%s" %(path,OutFasta_Reads), "w") # opens file to be written, with NGS reads
output_handle_Recomb = open("%s%s" %(path,OutFasta_Recomb), "w") # opens file to be written, with library sequences
for tuple_item in Final_List: # Final_List is a list of tuples, each containing (RecSeq_ID, RecSeq_seq, NGS_ID, NGS_seq)
RecSeq_ID, RecSeq_seq, NGS_ID, NGS_seq = str(tuple_item[0]), str(tuple_item[1]), str(tuple_item[2]),str(tuple_item[3]) # unpack tuple
if RecSeq_ID not in Black_List: # Black_List stores library sequence IDs found in the ONSEN sequences
if NGS_ID not in List_Reads:
count+=1
List_Reads.append (NGS_ID) # when a library seq occurs in an NGS read, appends the NGS read ID to List_Reads. Each read will thus be reported only once.
output_handle_Reads.write (">%s\n%s\n" %(NGS_ID, NGS_seq)) # write NGS read to outfile
if RecSeq_ID not in List_Recomb: # store library sequences with hits and write to outfile
List_Recomb.append (RecSeq_ID)
output_handle_Recomb.write (">%s\n%s\n" %(RecSeq_ID, RecSeq_seq))
output_handle_Reads.close()
output_handle_Recomb.close()
# END
counts = CountLines (path,OutFasta_Reads); print "Outfile with NGS reads: ", counts, " counts"
counts = CountLines (path,OutFasta_Recomb); print "Outfile with recombination sequences: ", counts, " counts"
print "Total NGS reads match: ", count, " in time: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#################################################################################################################################################
# D) Workflow for finding ONSEN-mapping reads with perfect or non-perfect match to ONSEN sequences
#################################################################################################################################################
# Use python script to find 'perfect match' and 'no match' NGS reads against the genome.
# Because the input is reads mapping to ONSEN, the no-match reads represent ONSEN reads with SNPs (i.e. 'non-perfect match' or 'imperfect match')
#FILE: MODIF_FINAL_ONSEN_Search_PERFECT_MATCH_Control_noBlackList.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Seq
from Bio.Alphabet import IUPAC
# File is 'MODIF' because it avoids SeqRecord objects and packs the data in memory
start_time=time.time() # to record how long the processing takes
# Path
path = "~/PATH/"
path_genome = "~/Reference_genomes/Arabidopsis/PATH/" # path to Arabidopsis genome
# Files
InputFasta_genome = "TAIR10_chr_all.fas"# fasta with Arabidopsis genome sequences
InputFasta_reads = "Reads_SAMPLE_TAIR10_chr_all_maskedONSEN_Unmapped_mapONSEN.fasta" # fasta with the re-sequencing reads from SAMPLE, unmapped to masked genome and re-mapped to Onsen
OutFasta_match = "FINAL_reads_PERFECT_MATCH_SAMPLE_noBlackList.fasta" # fasta outfile with the name/sequences of the reads with PERFECT matching sequences
OutFasta_nomatch = "FINAL_reads_UNPERFECT_MATCH_SAMPLE_noBlackList.fasta" # fasta outfile with the name/sequences of the reads with NON-PERFECT matching sequences
################################################################################
###define CountLines fx to count the lines of a file
################################################################################
def CountLines (pathy,filename):
with open ("%s%s" %(pathy,filename), 'r') as myfile: # opens input file
count=sum(1 for line in myfile if line.startswith(">"))
return count
################################################################################
############
### MAIN ###
############
# Unpack genome in memory
Genome_Dict = {} # dict with genome in memory
input_handle_genome = open("%s%s" %(path_genome,InputFasta_genome), "rU") # opens .fasta
for record_genome in SeqIO.parse(input_handle_genome, "fasta"): # parse handle_input
Genome_Dict[record_genome.id] = str(record_genome.seq)
input_handle_genome.close()
print "Unpacked genome sequences, in time: --- %s minutes run ---" %((time.time()-start_time)/60)
# Unpack reads in memory
Reads_Dict = {} # dict with reads in memory
ID_List = [] # list with the record.id, to find the reads/mate with same name
input_handle_reads = open("%s%s" %(path,InputFasta_reads), "rU") # opens .fasta
for record_reads in SeqIO.parse(input_handle_reads, "fasta"): # parse handle_input
ID = str(record_reads.id)+'/1' # rename the reads for read/mate pairs having the same read name but different sequences; this accounts for reads recovered with samtools, where they lose their read/mate info
if ID in ID_List:
ID = str(record_reads.id)+'/2'
Reads_Dict[ID] = str(record_reads.seq)
ID_List.append(ID)
input_handle_reads.close()
print "unpacked reads, in time: --- %s minutes run ---" %((time.time()-start_time)/60)
print len (Reads_Dict)
# Match reads
count_match=0
count_nomatch=0
output_handle_match = open("%s%s" %(path,OutFasta_match), "w") # opens outfile with perfect match
output_handle_nomatch = open("%s%s" %(path,OutFasta_nomatch), "w") # opens outfile with non-perfect match
for record_input in Reads_Dict: # parse reads
seq_input = Seq.Seq(Reads_Dict[record_input]) # seq_input is the sequence of the read
rev_seq_input = str(seq_input.reverse_complement())
for record_target in Genome_Dict: # parse in genome
seq_target = Seq.Seq(Genome_Dict[record_target]) # seq_target is the sequence of the genome
# now test if the sequence of the read is in the genome
if seq_input in seq_target or rev_seq_input in seq_target:
count_match+=1; print count_match, # print number of reads found so far
output_handle_match.write (">%s\n%s\n" %(record_input, seq_input)) # write read to output if its present in genome
break
# for-else: the clause below runs only after checking all of the genome's chromosomes without finding a match
else:
count_nomatch+=1;
output_handle_nomatch.write (">%s\n%s\n" %(record_input, seq_input)) # write read to output when it is NOT present in genome
output_handle_match.close()
output_handle_nomatch.close()
print "/n"
# Check input files
counts = CountLines (path_genome,InputFasta_genome); print "Fasta file with genome sequences: ", counts, " counts"
counts = CountLines (path,InputFasta_reads); print "Fasta file with re-sequencing reads: ", counts, " counts"
# END
print "Total reads match: ", count_match
counts = CountLines (path,OutFasta_match); print "Fasta file with perfect match sequences: ", counts, " counts"
print "Total reads no-match: ", count_nomatch
counts = CountLines (path,OutFasta_nomatch); print "Fasta file with non-perfect match sequences: ", counts, " counts"
print "Total reads match+no-match: ", count_nomatch+count_match
print "Total time for process: --- %s minutes run ---" %((time.time()-start_time)/60)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#################################################################################################################################################
# E) Workflow for finding NGS 'recombinant molecules' in ecDNA from heat-stressed samples
#################################################################################################################################################
#########
STEP 1
#########
# Use python script to find NGS reads with a blunt end at the ONSEN LTR start sequence, diagnostic of linear blunt-ended ecDNA molecules
#FILE: FINAL_UNPACKED_Search_blunt_end_ONSEN_Parse_3xfasta_search for sequences in fasta files_blunt_end_SMALL_EDGE.py
#Script:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
from Bio import SeqIO
from Bio.Seq import Seq
start_time=time.time() # to record how long the processing takes
# Path
path = "~/PATH/"