Research: Determine whether there is a relationship between a study's metadata and the contamination present in the resulting sequencing data.
requirements.txt
Use at your discretion. Provided as a reference for reproducing the working environment.
Assumption 1: Public Study Accession ID names begin with "PRJ"
data
├── bam_files
|   ├── PRJ######
|   |   └── <Run Accession>.bam
├── csv_files
|   ├── PRJ######
|   |   └── PRJ######_<Run Accession>.csv
├── sam_files
|   ├── PRJ######
|   |   └── <Run Accession>.sam
└── PRJ######
    ├── <Run Accession>_1.fastq
    └── <Run Accession>_2.fastq
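
The layout above can be expressed as a small path helper. This is a minimal sketch, not code from the notebooks: `study_paths` is a hypothetical name, and the accession values used with it are placeholders. It assumes that each file sits inside its study's PRJ###### folder and that single reads are stored as a plain .fastq file, as noted elsewhere in this README.

```python
from pathlib import Path


def study_paths(data_dir, study_id, run_accession, paired=True):
    """Return the expected file locations for one run (hypothetical helper)."""
    # Assumption 1 from this README: study accession IDs begin with "PRJ".
    if not study_id.startswith("PRJ"):
        raise ValueError(
            f"expected a study accession starting with 'PRJ', got {study_id!r}"
        )
    root = Path(data_dir)
    if paired:
        # Dual reads: <Run Accession>_1.fastq and <Run Accession>_2.fastq
        fastq = [root / study_id / f"{run_accession}_{i}.fastq" for i in (1, 2)]
    else:
        # Single reads: <Run Accession>.fastq
        fastq = [root / study_id / f"{run_accession}.fastq"]
    return {
        "fastq": fastq,
        "sam": root / "sam_files" / study_id / f"{run_accession}.sam",
        "bam": root / "bam_files" / study_id / f"{run_accession}.bam",
        "csv": root / "csv_files" / study_id / f"{study_id}_{run_accession}.csv",
    }
```

Calling `study_paths("data", "PRJNA000000", "SRR0000001")` returns the FASTQ, SAM, BAM and CSV locations for that placeholder run, and raises `ValueError` if the study ID does not start with "PRJ".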
Notebook to download FASTQ Files from The European Nucleotide Archive (ENA)
Public Study Accession ID ( Typically in Form PRJ##### )
FASTQ files are unpacked into $PWD/< Public Study Accession ID >/*.fastq
Note: For dual-read runs, the second read file will sometimes try to download twice (this does not affect storage)
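
How the notebook queries ENA is not documented here. One possible approach is sketched below: it parses a tab-separated file report from the ENA Portal API and collects each run's FASTQ locations, skipping any location already seen for that run so that no file is fetched twice. The function name `fastq_urls`, the chosen fields and the `ftp://` prefix are assumptions for illustration, and the deduplication is only a possible guard, not a statement about why the repeated download happens.

```python
import csv
import io

# ENA Portal API "filereport" endpoint. The field list is an assumption about
# what a downloader needs, not a record of what the notebook requests.
ENA_FILEREPORT = (
    "https://www.ebi.ac.uk/ena/portal/api/filereport"
    "?accession={study}&result=read_run&fields=run_accession,fastq_ftp&format=tsv"
)


def fastq_urls(report_tsv):
    """Map each run accession to its unique FASTQ URLs, in report order."""
    urls = {}
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    for row in reader:
        seen = urls.setdefault(row["run_accession"], [])
        # fastq_ftp holds one or more locations separated by ";"
        for path in filter(None, row["fastq_ftp"].split(";")):
            url = path if "://" in path else "ftp://" + path
            if url not in seen:  # skip locations already listed for this run
                seen.append(url)
    return urls
```

The report text itself would come from fetching `ENA_FILEREPORT.format(study=...)` with any HTTP client; that step is left out so the parsing logic can be checked offline.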
Notebook to process FASTQ files through HISAT2 to produce .sam files
Genome Reference Consortium Human Build 38
Directory Path
FASTQ files in $PWD/< Public Study Accession ID >/< Run Accession >.fastq data structure
Note: Single reads are typically stored as .fastq, while dual reads are stored as _1.fastq and _2.fastq
SAM file in .sam format. Files are stored in the $PWD/sam_files/< Public Study Accession ID >/.sam data structure
The Genome Reference Consortium
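
The alignment step can be sketched as a function that builds the HISAT2 command line for one run. This is an illustration under assumptions, not the notebook's code: `hisat2_command` is a hypothetical helper, and the index prefix passed to it must point at a HISAT2 index built from the GRCh38 reference. It uses `-1`/`-2` for dual reads and `-U` for single reads.

```python
def hisat2_command(index_prefix, fastq_files, sam_path, threads=1):
    """Build a hisat2 argument list for one run (hypothetical helper)."""
    cmd = ["hisat2", "-p", str(threads), "-x", str(index_prefix)]
    if len(fastq_files) == 2:
        # Dual reads: <Run Accession>_1.fastq and <Run Accession>_2.fastq
        cmd += ["-1", str(fastq_files[0]), "-2", str(fastq_files[1])]
    elif len(fastq_files) == 1:
        # Single reads: <Run Accession>.fastq
        cmd += ["-U", str(fastq_files[0])]
    else:
        raise ValueError("expected one (single) or two (dual) FASTQ files")
    cmd += ["-S", str(sam_path)]  # write alignments as SAM
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once HISAT2 is installed and the output directory exists.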
Notebook to convert Sequence Alignment/Map (SAM) files to Binary Alignment/Map (BAM) files. The BAM file is also sorted during conversion
SAM files in the $PWD/sam_files/< Public Study Accession ID >/.sam data structure
BAM file in .bam format. Files are stored in the $PWD/bam_files/< Public Study Accession ID >/.bam data structure
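
A conversion like this can be done in one step with `samtools sort`, which reads SAM and writes sorted BAM. The sketch below assumes samtools is the tool in use, which this README does not state (the notebook may rely on another tool). `sam_to_sorted_bam_command` is a hypothetical helper that derives the output path from the input path.

```python
from pathlib import Path


def sam_to_sorted_bam_command(sam_path, bam_root="bam_files"):
    """Return (command, bam_path) for sorting one SAM file into BAM."""
    sam_path = Path(sam_path)
    # The parent folder of the SAM file is taken to be the study accession ID.
    study_id = sam_path.parent.name
    bam_path = Path(bam_root) / study_id / (sam_path.stem + ".bam")
    cmd = ["samtools", "sort", "-O", "bam", "-o", str(bam_path), str(sam_path)]
    return cmd, bam_path
```

Create `bam_path.parent` before running the command, for example with `bam_path.parent.mkdir(parents=True, exist_ok=True)`, then call `subprocess.run(cmd, check=True)`.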
Processing Pipeline
capstoneUtils.py : Script containing additional processing functions
Directory in the $PWD/bam_files/< Public Study Accession ID >/.bam data structure
Directory in the $PWD/csv_files/< Public Study Accession ID >_.csv data structure
Run Directory structure
runs
└── K_< K >
    ├── images
    |   └── *.png
    ├── models
    |   └── lda_model_Ntopic< Number of Topics>_K<K>*
    ├── cluster_sample_K< K >.csv
    ├── crosstab_K< K >.csv
    ├── filtered_crosstab_K< K >.csv
    ├── K< K >_library.txt
    ├── kmer_df_K< K >.csv
    ├── objs_K< K >.pkl
    ├── sizes.csv
    ├── test.csv
    └── train.csv
- cluster_sample_K< K >.csv
  - 100,000 rows of K-mer length sequences and their estimated topic
- crosstab_K< K >.csv
  - Cross tab with K-mer length sequences as rows and < Public Study Accession ID > as columns, before chi-squared filtering
- filtered_crosstab_K< K >.csv
  - Cross tab with K-mer length sequences as rows and < Public Study Accession ID > as columns, after chi-squared filtering
- K< K >_library.txt
  - List of K-mer length sequences derived from the training set after chi-squared filtering
- kmer_df_K< K >.csv
  - K-mer length sequences derived from the training set. Used to generate the model
- objs_K< K >.pkl
  - Pickle file in the format K, kmer_dictionary. The kmer_dictionary format is: {< Public Study Accession ID and Run Accession > : K-mer length sequences as a list}
- sizes.csv
  - Shows the number of sequences in the raw data and the Pandas DataFrame size. Useful for estimating RAM and hard disk space requirements
- test.csv
  - Test set of sequences after the initial import of raw data
- train.csv
  - Training set of sequences after the initial import of raw data
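
To make the kmer_dictionary and cross tab formats concrete, here is a minimal standard-library sketch. It is not the code in capstoneUtils.py: the function names are hypothetical, the dictionary keys are assumed to join the study and run accession with an underscore, and the pickle is assumed to hold a (K, kmer_dictionary) tuple. Chi-squared filtering and LDA modelling are not covered.

```python
import pickle
from collections import Counter


def kmers(sequence, k):
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_dictionary(reads_by_sample, k):
    """Map each sample key to the list of k-mers found in its reads."""
    return {
        sample: [km for read in reads for km in kmers(read, k)]
        for sample, reads in reads_by_sample.items()
    }


def crosstab(kmer_dictionary, study_of):
    """Count k-mers per study as {kmer: {study_id: count}}."""
    table = {}
    for sample, sample_kmers in kmer_dictionary.items():
        study = study_of(sample)
        for km, n in Counter(sample_kmers).items():
            row = table.setdefault(km, {})
            row[study] = row.get(study, 0) + n
    return table


def dump_objs(k, kmer_dictionary):
    """Serialize K and the dictionary together (assumed tuple layout)."""
    return pickle.dumps((k, kmer_dictionary))
```

With keys such as "PRJA_R1", passing `study_of=lambda key: key.split("_")[0]` groups the counts by study, giving rows per K-mer and columns per study accession ID. `pickle.loads(dump_objs(k, kmer_dictionary))` returns the same pair.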