A pipeline for ChIP-Seq and HT-SELEX motif benchmarking for HOCOMOCO v12. It requires additional software to be placed in the specified directories.
Your Python version must be 3.8 or higher.
You must have these Python packages installed:
- NumPy
- pandas
- SciPy
- scikit-learn
The following additional software is also required:
- MACRO-PERFECTOS-APE: place the file ape.jar into the directory ./external_programs.
- Bedtools: place the file bedtools.static into the directory ./external_programs.
- SPRY-SARUS v2.0.2: place the file sarus-2.0.2.jar into the directory ./external_programs.
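The requirements above can be checked before launching the pipeline. The following is a minimal sketch, not part of the benchmark itself: the function names are hypothetical, and it assumes the Python packages are importable as numpy, pandas, scipy and sklearn and that the three files sit directly in ./external_programs.

```python
import importlib.util
import sys
from pathlib import Path

# Import names of the Python packages listed above
# (scikit-learn is imported under the name "sklearn").
REQUIRED_PACKAGES = ["numpy", "pandas", "scipy", "sklearn"]

# File names listed above, expected directly in ./external_programs.
REQUIRED_FILES = ["ape.jar", "bedtools.static", "sarus-2.0.2.jar"]


def python_is_supported(minimum=(3, 8)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_packages(packages=REQUIRED_PACKAGES):
    """Return the packages that the import system cannot locate."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def missing_files(directory="./external_programs", files=REQUIRED_FILES):
    """Return the required files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in files if not (base / name).is_file()]


def main():
    # Print a short report; nothing is installed or modified.
    print("Python 3.8+:", python_is_supported())
    print("Missing packages:", missing_packages())
    print("Missing external programs:", missing_files())


if __name__ == "__main__":
    main()
```

The check only confirms that the files exist. It does not verify their versions or that Java is available to run the .jar files.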
To increase the number of computing threads, write the exact number into ./procfile without any other symbols. The default value is 1, which means single-threaded computing.
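As an illustration, the thread count can be written from Python. This is a sketch with a hypothetical helper name. It writes bare digits with no trailing newline, on the assumption that "without any other symbols" is meant strictly.

```python
from pathlib import Path


def set_thread_count(n_threads, procfile="./procfile"):
    """Write the thread count to the procfile as bare digits."""
    # Reject booleans, non-integers and values below 1.
    if isinstance(n_threads, bool) or not isinstance(n_threads, int) or n_threads < 1:
        raise ValueError("thread count must be a positive integer")
    # Assumption: no newline or whitespace is written after the number.
    Path(procfile).write_text(str(n_threads))
    return procfile
```

For example, `set_thread_count(4)` leaves ./procfile containing only the character 4.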
Execute ./autorun.sh in this very directory.
Ten motifs of the transcription factor FOXA2 were chosen as demonstration motifs.
These matrices are placed in the ./pwm directory.
- You can benchmark your own models by placing them into the ./pwm directory.
- The file name must start with the transcription factor name, separated from the rest of the PWM name by the @ symbol. The file extension must be .pwm.
- ADASTRA and HT-SELEX data for a custom transcription factor must be placed in the directories ./adastra/TF and ./selex/batchX respectively, where batchX refers to batch1 or batch2 depending on the set of HT-SELEX experiments. The names of these files must include the transcription factor name only. The file extension must be .tsv.
- Genome files must be placed in the ./assembly directory. These must be the GRCh37 (hg19) and GRCh38 (hg38) human genome assemblies. The file names must be hg19.fa and hg38.fa respectively.
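The PWM naming rule above can be checked with a short script before running the benchmark. This is a sketch only: the helper names and the example file name FOXA2@demo_model.pwm are hypothetical, and the extension check is assumed to be case-sensitive.

```python
from pathlib import Path


def tf_name_from_pwm(file_name):
    """Return the transcription factor name encoded in a PWM file name."""
    path = Path(file_name)
    if path.suffix != ".pwm":
        raise ValueError(f"{path.name}: extension must be .pwm")
    # Everything before the first "@" is taken as the transcription factor name.
    tf_name, separator, _rest = path.name.partition("@")
    if not separator or not tf_name:
        raise ValueError(f"{path.name}: expected '<TF>@<model name>.pwm'")
    return tf_name


def group_pwms_by_tf(directory="./pwm"):
    """Map each transcription factor name to its PWM file names."""
    groups = {}
    for path in sorted(Path(directory).glob("*.pwm")):
        groups.setdefault(tf_name_from_pwm(path.name), []).append(path.name)
    return groups
```

Calling `tf_name_from_pwm("FOXA2@demo_model.pwm")` returns `"FOXA2"`. That is the name the corresponding .tsv files in ./adastra/TF and ./selex/batchX would need to carry.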
The output data is stored in the ./results directory.
The results for ChIP-Seq and both batches of HT-SELEX are written to the files adastra_motifs.tsv and selex_motifs.tsv respectively.
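The two result files can be loaded for inspection with the standard library. The sketch below assumes only that the files are tab-separated and located in ./results. Their column layout is not documented here, so rows are returned as plain lists of strings, and the function names are hypothetical.

```python
import csv
from pathlib import Path

# Result file names as given above.
RESULT_FILES = ("adastra_motifs.tsv", "selex_motifs.tsv")


def read_tsv_rows(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def load_results(results_dir="./results"):
    """Load whichever of the two result files exist in `results_dir`."""
    results = {}
    for name in RESULT_FILES:
        path = Path(results_dir) / name
        if path.is_file():
            results[name] = read_tsv_rows(path)
    return results
```

Since pandas is already a requirement, `pandas.read_csv(path, sep="\t")` is an alternative once the presence or absence of a header row is known.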
The benchmark was written by Mikhail Nikonov.