Dries Decap edited this page May 26, 2015 · 18 revisions

Halvade - Hadoop Aligner & Variant Detection

SYNOPSIS

hadoop jar HalvadeWithLibs.jar -I /halvade/in/ -O /halvade/out/ -B /halvade/bin.tar.gz -R /halvade/ref/hg19 -D /halvade/dbsnp/dbsnp.vcf -nodes 15 -mem 62 -vcores 48

DESCRIPTION

Halvade is a Hadoop MapReduce implementation of the Broad Institute's best practices pipeline for DNA seq. It supports both DNA seq and exome seq analysis, and its output consists of the VCF files for each region. Halvade can only be run with Hadoop, on either a local cluster or Amazon EMR. GATK only works with Java v1.7 or newer, so this must be installed on every node. If this is not the default Java, use the -J option to set the location of the Java v1.7 binary.
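Since every node needs Java v1.7 or newer, it can help to check the default Java before deciding whether -J is needed. The following is a minimal sketch, not part of Halvade: the helper names `default_java_version` and `java_is_17_or_newer` are hypothetical, and the check assumes a version string in the form printed by `java -version` (for example `1.7.0_80` or `11.0.2`).

```shell
# Hypothetical helper: prints the quoted version string from `java -version`,
# which writes its report to stderr.
default_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeeds when the given version string is 1.7 or newer.
# Handles both the legacy "1.x" scheme (1.7.0_80) and the newer one (11.0.2).
java_is_17_or_newer() {
    _major=${1%%.*}      # text before the first dot
    _rest=${1#*.}        # text after the first dot
    _minor=${_rest%%.*}  # second version component
    if [ "$_major" -gt 1 ]; then
        return 0
    fi
    [ "$_major" -eq 1 ] && [ "$_minor" -ge 7 ]
}

# Example use on a node (commented out so the sketch has no side effects):
# if java_is_17_or_newer "$(default_java_version)"; then
#     echo "default Java is recent enough"
# else
#     echo "pass -J with the path to a Java v1.7 binary"
# fi
```

The check has to pass on every node in the cluster, not only on the node that submits the job.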

Halvade needs a reference FASTA file and the corresponding BWT index built with BWA v0.6 or later. The filenames of the index files need to start with the full name of the FASTA reference. Variant detection also requires a dbSNP file, which contains known SNPs for the reference FASTA file. The latest human genome reference, containing a complete FASTA file and the corresponding dbSNP file, can be downloaded from the GATK website. This FASTA file contains many partial chromosomes, which can be removed from the reference to decrease runtime.
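Before uploading a reference, the file naming can be checked on a local disk. The sketch below is only an illustration: the helper `missing_reference_files` is hypothetical, and it assumes the index files are named `<prefix>.fasta.<ext>` with the five extensions that `bwa index <prefix>.fasta` writes (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). Adjust the list if your layout differs.

```shell
# Hypothetical helper: prints every expected reference file that is missing.
# ASSUMPTION: the index was built with `bwa index <prefix>.fasta`, so each
# index file is named <prefix>.fasta.<ext>. Works on local paths only;
# for HDFS, list the directory with `hadoop fs -ls` instead.
missing_reference_files() {
    _prefix=$1
    for _suffix in .fasta .fasta.amb .fasta.ann .fasta.bwt .fasta.pac .fasta.sa; do
        [ -e "${_prefix}${_suffix}" ] || echo "${_prefix}${_suffix}"
    done
}

# Example use (commented out; /halvade/ref/hg19 is the prefix from the synopsis):
# missing_reference_files /halvade/ref/hg19
```

An empty result means all six files share the prefix that would be passed to -R.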

OPTIONS

Required Arguments

Argument | Description
------------- | -------------
-B STR | Binary location. Required. This string gives the location of bin.tar.gz. This can be on either HDFS or S3, where a location on S3 requires the s3://bucketname/ prefix.
-D STR | dbSNP file location. Required. This string gives the absolute filename of the dbSNP file, which needs to be compatible with the reference FASTA file provided by the -R option. This can be on either HDFS or S3, where S3 requires the s3://bucketname/ prefix.
-I STR | Input directory. Required. This string points to the directory containing the preprocessed input on either HDFS or S3. S3 directories require the s3://bucketname/ prefix.
-mem INT | Memory size. Required. This gives the total memory of each node in the cluster, in GB.
-nodes INT | Node count. Required. This gives the total number of nodes in the local cluster, or the number of nodes you want to request when using Amazon EMR. Amazon limits the number of nodes to 20 if you don’t reserve them for a longer time period.
-O STR | Output directory. Required. This string points to the directory that will contain the output VCF files of Halvade. This can be on either HDFS or S3, where a directory on S3 requires the s3://bucketname/ prefix.
-R STR | Reference location. Required. This string gives the prefix (without .fasta) of the absolute filename of the reference in FASTA format. The corresponding index files built with BWA need to be in the same directory and have the same prefix as the reference FASTA file. This can be on either HDFS or S3, where a location on S3 requires the s3://bucketname/ prefix.
-vcores INT | Vcores count. Required. This gives the number of cores that can be used per node on the cluster (to enable simultaneous multithreading, use the -smt option). When running the script for Amazon EMR this is set automatically.
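All eight required arguments have to be present on every run, so a small wrapper can guard against leaving one out. This is a sketch under stated assumptions: `halvade_command` is a hypothetical helper, not something shipped with Halvade, and it only prints the command instead of executing it. Paths containing spaces are not handled.

```shell
# Hypothetical wrapper: prints a Halvade command line built from the eight
# required arguments. It does not run Hadoop.
# Arguments: IN OUT BIN REF DBSNP NODES MEM_GB VCORES
halvade_command() {
    if [ "$#" -ne 8 ]; then
        echo "usage: halvade_command IN OUT BIN REF DBSNP NODES MEM_GB VCORES" >&2
        return 1
    fi
    echo "hadoop jar HalvadeWithLibs.jar -I $1 -O $2 -B $3 -R $4 -D $5 -nodes $6 -mem $7 -vcores $8"
}

# Prints the command from the synopsis.
halvade_command /halvade/in/ /halvade/out/ /halvade/bin.tar.gz \
    /halvade/ref/hg19 /halvade/dbsnp/dbsnp.vcf 15 62 48
```

For data on S3 the same call applies, with each location written with its bucket prefix as described in the table.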

Optional Arguments

Argument | Description
------------- | -------------
-b | Bedtools. This option enables the use of Bedtools to filter the dbSNP file so that only SNPs present in the interval for this chromosome region are kept. If the regions are too small, the overhead of first filtering the dbSNP file can be greater than the overall time it would take without filtering. Only use this if the regions are big enough.
-bam | BAM input. This option enables reading aligned BAM input, which avoids realigning. If a realignment is required, the data needs to be transformed to FASTQ files, shuffled and preprocessed for Halvade.
-bwamem | BWA mem. With this option Halvade will use BWA mem to perform the alignment in the map phase. By default Halvade uses BWA aln & sampe to do the alignment.
-c | Combine VCF. With this option Halvade will combine the VCF files in the input directory and not perform variant calling. Combining is done by default after variant calling.
-CA STR=STR | Custom arguments. This option allows the tools run by Halvade to be run with additional arguments. The arguments are given in this form: toolname=extra arguments. All options must be correct for the tool in question. Multiple arguments can be added by giving a quoted string and separating the arguments with a space. Possible toolnames are bwa_aln, bwa_mem, bwa_sampe, star, elprep, samtools_view, bedtools_bdsnp, bedtools_exome, picard_buildbamindex, picard_addorreplacereadgroup, picard_markduplicates, picard_cleansam, gatk_realignertargetcreator, gatk_indelrealigner, gatk_baserecalibrator, gatk_printreads, gatk_combinevariants, gatk_variantcaller, gatk_variantannotator, gatk_variantfiltration, gatk_splitncigarreads.
-chr STR | Chromosomes. This option sets the chromosomes to be used during the pipeline. The string contains all chromosomes that need to be used, separated by “,”. Halvade will calculate the region size based on the sizes of the given chromosomes.
-cov INT | Coverage. This option overrides the estimated coverage of the input over the entire genome. The coverage is used to estimate the optimal region sizes per chromosome.
-drop | Drop. Halvade will drop all paired-end reads where the pairs are aligned to different chromosomes.
-dryrun | Dry run. This will initialize Halvade, which calculates the task sizes and region sizes of the chromosomes, but Halvade will not execute the Hadoop job.
-exome STR | Exome. This option starts the exome seq pipeline in Halvade. The string points to the location of a BED file for the exome used. This BED file is used to select regions of interest for GATK to increase the overall performance.
-hc | HaplotypeCaller. With this option Halvade will use the HaplotypeCaller tool from GATK instead of the UnifiedGenotyper tool, which is used by default. HaplotypeCaller is the newer variant caller, which is slower but more accurate.
-id STR | Read Group ID. This string sets the Read Group ID that will be used when adding Read Group information to the intermediate results. [GROUP1]
-J STR | Java. This string sets the location of the Java v1.7 binary, which should be present on every node in the cluster. If this is not set, Halvade will use the default Java, which should be v1.7 or newer.
-justalign | Just align. This option is used to only align the data. The aligned reads are written to the output folder set with the -O option.
-keep | Keep intermediate files. This option keeps all intermediate files in the temporary folder set by -tmp. This allows the user to check the data after processing.
-lb STR | Read Group Library. This string sets the Read Group Library that will be used when adding Read Group information to the intermediate results. [LIB1]
-mpn INT | Maps per node. This overrides the number of map tasks that are run simultaneously on each node. Only use this when the number of map containers per node does not make sense for your cluster.
-P | Picard. Use Picard in the preprocessing steps. By default elPrep is used, which is a more efficient execution of the algorithms called in Picard. Picard, however, requires less memory and can be useful on some clusters.
-pl STR | Read Group Platform. This string sets the Read Group Platform that will be used when adding Read Group information to the intermediate results. [ILLUMINA]
-pu STR | Read Group Platform Unit. This string sets the Read Group Platform Unit that will be used when adding Read Group information to the intermediate results. [UNIT1]
-redistribute | Redistribute cores. This is an optimization to better utilize the CPU cores at the end of the map phase and improve load balancing. Only use this when the number of cores per container is less than 4.
-refdir STR | Reference directory. This sets the reference directory, which Halvade uses to find existing references on each node. This directory needs to be accessible by all nodes, but can be a local disk or a network disk. Halvade finds the reference files by looking in the directory or its subdirectories for files with these suffixes: .bwa_ref, .gatk_ref, .star_ref, .dbsnp.
-report_all | Report all output. This option writes all VCF output records to the merged output file. By default, the VCF record with the highest score is kept if multiple records are found at the same location.
-rna | RNA pipeline. This option enables Halvade to run the RNA seq pipeline instead of the default DNA seq pipeline. It requires the additional argument -SG, which points to the STAR genome directory.
-rpn INT | Reduces per node. This overrides the number of reduce tasks that are run simultaneously on each node. Only use this when the number of reduce containers per node does not make sense for your cluster.
-scc INT | stand_call_conf. The value of this option is used for stand_call_conf when calling the GATK variant caller (UnifiedGenotyper by default).
-sec INT | stand_emit_conf. The value of this option is used for stand_emit_conf when calling the GATK variant caller (UnifiedGenotyper by default).
-SG STR | STAR genome. This gives the directory of the STAR genome reference. This can be on either HDFS or S3, where a directory on S3 requires the s3://bucketname/ prefix.
-s | Single-end reads. This option sets the input to be single-end reads. By default Halvade reads paired-end interleaved FASTQ files.
-sm STR | Read Group Sample Name. This string sets the Read Group Sample Name that will be used when adding Read Group information to the intermediate results. [SAMPLE1]
-smt | Simultaneous multithreading. This option enables Halvade to use simultaneous multithreading on each node.
-tmp STR | Temporary directory. This string gives the location where intermediate files will be stored. For optimal performance this should be on a local disk on every node.
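Optional arguments are appended to the required ones. The sketch below shows three combinations built on the synopsis command and prints them without running Hadoop. The variable names and the paths `/halvade/exome/targets.bed` and `/halvade/ref/star` are hypothetical placeholders, not files that Halvade provides.

```shell
# Required arguments shared by the examples (values from the synopsis).
BASE="hadoop jar HalvadeWithLibs.jar -I /halvade/in/ -O /halvade/out/ -B /halvade/bin.tar.gz -R /halvade/ref/hg19 -D /halvade/dbsnp/dbsnp.vcf -nodes 15 -mem 62 -vcores 48"

# Dry run: calculate task and region sizes without executing the Hadoop job.
DRYRUN_CMD="$BASE -dryrun"

# Exome seq pipeline with BWA mem and HaplotypeCaller.
# ASSUMPTION: the BED file path is a placeholder.
EXOME_CMD="$BASE -exome /halvade/exome/targets.bed -bwamem -hc"

# RNA seq pipeline: -rna needs -SG pointing at the STAR genome directory.
# ASSUMPTION: the STAR genome directory is a placeholder.
RNA_CMD="$BASE -rna -SG /halvade/ref/star"

# Print the commands for inspection; run one by pasting it into a shell.
printf '%s\n' "$DRYRUN_CMD" "$EXOME_CMD" "$RNA_CMD"
```

Starting with the -dryrun variant is a cheap way to see the calculated task and region sizes before committing cluster time.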