ddcap edited this page Sep 2, 2014 · 18 revisions

Halvade - Hadoop Aligner & Variant Detection

SYNOPSIS

hadoop jar HalvadeWithLibs.jar -I /halvade/in/ -O /halvade/out/ -B /halvade/ -R /halvade/ref/hg19.fasta -D /halvade/dbsnp/dbsnp.vcf -nodes 15 -mem 62 -vcores 48

DESCRIPTION

Halvade is a Hadoop MapReduce implementation of the Broad Institute's best-practices pipeline for DNA sequencing. It supports both whole-genome (DNA-seq) and exome (exome-seq) analysis; the output is a set of VCF files, one per region. Halvade can only be run with Hadoop, either on a local cluster or on Amazon EMR. GATK requires Java v1.7 or newer, so this must be installed on every node. If Java v1.7 is not the default, use the -J option to set the location of the Java v1.7 binary.
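Since every node must provide Java v1.7 or newer, a quick pre-flight check can help decide whether -J is needed. The sketch below is illustrative (the helper name `version_ok` is hypothetical, not part of Halvade); it checks a `major.minor` version string against the GATK requirement:

```shell
# Hypothetical pre-flight helper: decide whether a Java version string
# (e.g. "1.7" or "1.8.0" as reported by `java -version`) is at least 1.7.
version_ok() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}
    minor=${rest%%.*}       # text between the first and second dot
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 7 ]; }; then
        echo "ok"           # default Java suffices
    else
        echo "too old: pass -J /path/to/java17/bin/java"
    fi
}

version_ok 1.7   # prints "ok"
version_ok 1.6   # prints "too old: pass -J /path/to/java17/bin/java"
```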

The program needs a reference FASTA file and the corresponding BWT index built with BWA v0.6 or later. The filenames of the index files must start with the full name of the FASTA reference. For variant detection a DBSNP file is also required; it contains known SNPs for the reference FASTA file. The GATK website offers the latest human genome reference bundle, containing a complete FASTA file and a corresponding DBSNP file. This FASTA file contains many partial chromosomes, which can be removed from the reference to decrease runtime.
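As a sketch of the index preparation (assuming `bwa` v0.6+ is on the PATH and `hg19.fasta` is the reference from the GATK bundle), the index files land next to the FASTA and carry its full name as prefix. The `shares_prefix` helper below is hypothetical, added only to illustrate the naming convention Halvade expects:

```shell
# Build the BWT index next to the reference (run once, on the reference FASTA);
# this produces hg19.fasta.bwt, hg19.fasta.pac, ... in the same directory:
#   bwa index /halvade/ref/hg19.fasta

# Hypothetical helper: verify an index file starts with the full reference name.
shares_prefix() {
    case "$2" in
        "$1"*) echo "ok" ;;
        *)     echo "prefix mismatch" ;;
    esac
}

shares_prefix hg19.fasta hg19.fasta.bwt   # prints "ok"
shares_prefix hg19.fasta hg18.fasta.bwt   # prints "prefix mismatch"
```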

OPTIONS

Argument | Description
------------- | -------------
-B STR | Binary directory. Required. The directory where bin.tar.gz is located. This can be on HDFS or S3; a directory on S3 requires the s3://bucketname/ prefix.
-D STR | DBSNP file location. Required. The absolute filename of the DBSNP file; this file must be compatible with the reference FASTA file given by the -R option. This can be on HDFS or S3; S3 requires the s3://bucketname/ prefix.
-I STR | Input directory. Required. The directory containing the preprocessed input, on HDFS or S3. For S3 directories the s3://bucketname/ prefix is required.
-mem INT | Memory size. Required. The total memory, in GB, available on each node in the cluster. When running the script for Amazon EMR this is set automatically.
-nodes INT | Node count. Required. The total number of nodes in the local cluster, or the number of nodes to request when using Amazon EMR. Amazon limits the number of nodes to 20 unless you reserve them for a longer time period. When running the script for Amazon EMR this is set automatically.
-O STR | Output directory. Required. The directory that will contain the output VCF files of Halvade. This can be on HDFS or S3; an S3 directory requires the s3://bucketname/ prefix.
-R STR | Reference location. Required. The absolute filename of the reference in FASTA format. The corresponding index files built with BWA must be in the same directory and have the same prefix as the reference FASTA file. This can be on HDFS or S3; a location on S3 requires the s3://bucketname/ prefix.
-vcores INT | Vcore count. Required. The number of threads that can be used per node on the cluster. If hyper-threading is enabled on the nodes, these threads should also be counted for optimal performance. When running the script for Amazon EMR this is set automatically.
-b | Bedtools. Enables the use of Bedtools to filter the DBSNP file so that only SNPs present in the interval of the current chromosome region are kept. If the regions are too small, the overhead of first filtering DBSNP can exceed the time saved; only use this if the regions are big enough.
-bwamem | BWA mem. With this option Halvade uses BWA mem to perform the alignment in the map phase. By default Halvade uses BWA aln & sampe.
-chr STR | Chromosomes. Sets the chromosomes to be used during the pipeline, given as a ","-separated list. Halvade calculates the region size based on the sizes of the given chromosomes.
-cov INT | Coverage. Sets the coverage of the input over the entire genome; used to estimate the optimal region size per chromosome.
-exome STR | Exome. Enables the exome-seq pipeline; the string points to a BED file for this exome sample. The BED file is used to select regions of interest for GATK, which increases overall performance.
-hc | HaplotypeCaller. With this option Halvade uses the HaplotypeCaller tool from GATK instead of the default UnifiedGenotyper. This newer variant caller is slower but more accurate.
-id STR | Read Group ID. Sets the Read Group ID used when adding Read Group information to the intermediate results. [GROUP1]
-J STR | Java. Sets the location of the Java v1.7 binary; this file must be present on every node in the cluster. If this is not set, Halvade uses the default Java, which should be v1.7 or newer.
-justalign | Just align. Only align the data, primarily for testing purposes. The aligned reads are written to the output folder set with the -O option.
-keep | Keep intermediate files. Stores all intermediate files in the temporary folder set by -tmp, allowing the user to inspect the data after processing.
-lb STR | Read Group Library. Sets the Read Group Library used when adding Read Group information to the intermediate results. [LIB1]
-P | Picard. Use Picard in the preprocessing steps; by default elPrep is used, which is a more efficient implementation of the algorithms called in Picard. Picard, however, requires less memory and can be useful on some clusters.
-pl STR | Read Group Platform. Sets the Read Group Platform used when adding Read Group information to the intermediate results. [ILLUMINA]
-pu STR | Read Group Platform Unit. Sets the Read Group Platform Unit used when adding Read Group information to the intermediate results. [UNIT1]
-r INT | Region size. Sets the region size used to split the data for the reduce phase. Halvade attempts to optimize this automatically; use this option only with a different reference whose chromosome sizes are very small.
-rjvm | Reuse JVM. Some versions of Hadoop allow Java Virtual Machines to be reused; set this option if that is the case. The BWA instance can then persist across several map tasks, which improves performance.
-scc INT | stand_call_conf. The value used for stand_call_conf when calling the GATK variant caller (UnifiedGenotyper by default).
-sec INT | stand_emit_conf. The value used for stand_emit_conf when calling the GATK variant caller (UnifiedGenotyper by default).
-s | Single-end reads. Treats the input as single-end reads. By default Halvade reads paired-end interleaved FASTQ files.
-sm STR | Read Group Sample Name. Sets the Read Group Sample Name used when adding Read Group information to the intermediate results. [SAMPLE1]
-tmp STR | Temporary directory. The location where intermediate files are stored. This should be on a local disk on every node for optimal performance.
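By way of illustration, several of these options can be combined in one run. The following hypothetical exome-seq invocation (all paths and the BED filename are placeholders, not taken from this page) enables HaplotypeCaller, Bedtools filtering of DBSNP, and a local temporary directory:

```shell
# Hypothetical exome-seq run on a 15-node local cluster; paths are examples.
hadoop jar HalvadeWithLibs.jar \
    -I /halvade/in/ -O /halvade/out/ -B /halvade/ \
    -R /halvade/ref/hg19.fasta -D /halvade/dbsnp/dbsnp.vcf \
    -nodes 15 -mem 62 -vcores 48 \
    -exome /halvade/ref/exome_targets.bed \
    -hc -b \
    -tmp /tmp/halvade/
```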

PREPROCESSING

SYNOPSIS

hadoop jar HalvadeUploaderWithLibs.jar -M /dir/to/input.manifest -O /halvade/out/ -t 8

hadoop jar HalvadeUploaderWithLibs.jar -M /dir/to/input.manifest -O s3://bucketname/halvade/out/ -cred /dir/to/credentials.txt -t 8

DESCRIPTION

Preprocessing the FASTQ files interleaves the paired-end reads and splits the files into pieces of 60 MB (by default; this can be changed with the -s option). HalvadeUploaderWithLibs.jar automatically uploads these preprocessed files to HDFS or S3, depending on the output directory.
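The manifest consumed by the uploader lists one FASTQ pair per line, the two mates separated by a tab (see the -M option). A minimal sketch of producing and sanity-checking such a file, with illustrative paths:

```shell
# Write a hypothetical two-sample manifest: one tab-separated FASTQ pair per line.
printf '/data/sampleA_1.fastq\t/data/sampleA_2.fastq\n'  > input.manifest
printf '/data/sampleB_1.fastq\t/data/sampleB_2.fastq\n' >> input.manifest

# Every line should contain exactly two tab-separated fields:
awk -F '\t' 'NF != 2 { bad = 1 } END { print (bad ? "malformed" : "manifest ok") }' input.manifest
# prints "manifest ok"

# Then preprocess and upload (command from the synopsis above):
#   hadoop jar HalvadeUploaderWithLibs.jar -M input.manifest -O /halvade/in/ -t 8
```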

OPTIONS

Argument | Description
------------- | -------------
-O STR | Output directory. Required. The directory where the output files will be put. This can be on HDFS or S3; a directory on S3 requires the s3://bucketname/ prefix.
-M STR | Manifest file. Required. The absolute path of the manifest file. The manifest contains one line per file pair, the two filenames separated by a tab: /dir/to/fastq1.fastq /dir/to/fastq2.fastq
-cred STR | Credentials file. The path of the credentials file used to access S3. This should be configured when installing the Amazon EMR command line interface.
-s INT | Size. Sets the maximum file size of each interleaved file. [60MB]
-t INT | Threads. Sets the number of threads used to preprocess the input data.