# Getting started
## Setup data for Halvade
Before Halvade can be used, some data needs to be put onto HDFS or S3. The data needed by Halvade is:
- A fasta reference, including the corresponding .dict and fasta.fai files and the BWA index built from this fasta reference. All these files need to be in the same folder with the same prefix.
- A SNP database for the same reference fasta genome.
- The zipped binary file (bin.tar.gz), which contains all binaries Halvade needs to run (this is provided with the releases).
How to put this data on HDFS or S3 can be found here: HDFS and Amazon S3.
To put files on HDFS, the `hdfs` command of Hadoop (version 2.0 or newer) can be used as follows:

```
hdfs dfs -put /path/to/local/filename /path/on/hdfs/
hdfs dfs -put /path/to/local/filename /path/on/hdfs/custom_filename
```

If you want to make a new folder that will contain the data, this command can be used:

```
hdfs dfs -mkdir /path/to/new/folder/
```
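As a concrete sketch, uploading the three data items to HDFS could look like the commands below; the folder layout and file names (e.g. `ucsc.hg19`, `dbsnp_138.hg19.vcf`) are placeholders for illustration, not names fixed by Halvade:

```shell
# Placeholder paths and file names; reference files share one folder and one prefix
hdfs dfs -mkdir -p /halvade/ref /halvade/dbsnp /halvade/bin
hdfs dfs -put ucsc.hg19.fasta     /halvade/ref/
hdfs dfs -put ucsc.hg19.dict      /halvade/ref/
hdfs dfs -put ucsc.hg19.fasta.fai /halvade/ref/
# ...plus the BWA index files built from the same fasta (same prefix)
hdfs dfs -put dbsnp_138.hg19.vcf  /halvade/dbsnp/
hdfs dfs -put bin.tar.gz          /halvade/bin/
```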
To put files on Amazon S3, a bucket has to be created first; instructions can be found on Amazon. Once a bucket has been made, files can be uploaded using the Amazon console (instructions from Amazon). An alternative way to upload files to S3 is `s3cmd`, which can be downloaded here; instructions on how to use `s3cmd` can be found here.
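With `s3cmd`, creating a bucket and uploading a file could look like this sketch (the bucket name and paths are placeholders, and `s3cmd` must be configured with your credentials first via `s3cmd --configure`):

```shell
# Placeholder bucket name and paths
s3cmd mb s3://my-halvade-bucket
s3cmd put bin.tar.gz s3://my-halvade-bucket/bin/bin.tar.gz
s3cmd ls s3://my-halvade-bucket/bin/
```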
To preprocess the input data (paired-end FASTQ files), a tool is provided with Halvade. This tool, HalvadeUploader.jar, will preprocess the input data and upload it onto HDFS or S3 depending on the output directory. For more information on how to run the preprocessing tool, [go here](Halvade-Manual#preprocessing).
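An invocation might look like the sketch below. The option names shown (`-1`, `-2`, `-O`) are assumptions for illustration, as are the paths; check the preprocessing section of the manual for the exact arguments:

```shell
# Sketch only: option names and paths are assumptions, see the preprocessing
# section of the manual for the exact arguments.
hadoop jar HalvadeUploader.jar \
  -1 /data/sample_reads1.fastq \
  -2 /data/sample_reads2.fastq \
  -O /halvade/in/
```

The output directory determines the destination: an HDFS path uploads to HDFS, an `s3://` URI uploads to S3.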
To run the program a script is provided, which reads its configuration from two files:
- halvade.conf - contains the configuration for the cluster (details here) / Amazon EMR (details here)
- halvade_run.conf - a list of options for a DNA-seq run (details here)

To set an option, remove the `#` before the line and add an argument (between `"..."` if the option is a string) if necessary. After all options are set, run runHalvade.py and wait until completion.
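For example, a halvade_run.conf in which only the mandatory input and output options have been enabled might look like this sketch (the option syntax follows the "remove the `#`, add the argument" rule above; the paths and the disabled example option are placeholders):

```
# disabled option: the # stays in place
#some_option "value"
# enabled options: # removed, string arguments between "..."
I "/halvade/in/"
O "/halvade/out/"
```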
### Local cluster
To configure a cluster, the options in halvade.conf need to be set. For a local cluster these are:
- `nodes`: sets the number of nodes in the cluster
- `vcores`: sets the number of threads that can run on each node
- `B`: sets the absolute path to the HDFS directory containing the bin.tar.gz file
- `D`: sets the absolute path to the SNP database file on HDFS
- `R`: sets the absolute path of the fasta file of the reference on HDFS; all other reference files should be in the same folder with this path as prefix

Make sure that all options for Amazon EMR are disabled by commenting them out (add a `#` before the line).

Once this is set for your cluster, you only need to change halvade_run.conf for the jobs you want to run. Two mandatory options are the input `I`, which gives the path to the input directory, and the output `O`, which gives the path to the output directory. With this all options are set and Halvade can be run.
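Putting the local-cluster options together, a minimal halvade.conf might look like this sketch (the node count, thread count, and HDFS paths are placeholders, and the exact syntax may differ; see the details pages):

```
nodes "5"
vcores "24"
B "/halvade/bin/"
D "/halvade/dbsnp/dbsnp_138.hg19.vcf"
R "/halvade/ref/ucsc.hg19.fasta"
# all Amazon EMR options stay commented out, e.g.:
#emr_type "c3.8xlarge"
```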
### Amazon EMR
To run on Amazon EMR, the Amazon EMR command line interface (instructions from Amazon) needs to be installed. To configure a cluster, the options in halvade.conf need to be set. For Amazon EMR these are:
- `nodes`: sets the number of nodes in the cluster
- `vcores`: sets the number of threads that can run on each node
- `B`: sets the absolute path to the directory containing the bin.tar.gz file
- `D`: sets the absolute path to the SNP database file on S3
- `R`: sets the absolute path of the fasta file of the reference; all other reference files should be in the same folder with this path as prefix
- `emr_jar`: sets the absolute path of _HalvadeWithLibs.jar on S3
- `emr_script`: sets the absolute path of halvade_bootstrap.sh on S3
- `emr_type`: sets the Amazon EMR instance type (e.g. "c3.8xlarge")
- `emr_ami_v`: sets the AMI version for Amazon EMR, which should be set to "3.1.0" or newer
- `tmp`: this should be set to "/mnt/halvade/"
For locations on S3, a URI of this form should be used: `s3://bucketname/directory/to/file`
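The bucket name and key below are placeholders; this small shell sketch just shows how such a URI is assembled from its parts:

```shell
# Placeholder bucket and key, assembled into the s3:// form used above
BUCKET="mybucket"
KEY="halvade/ref/ucsc.hg19.fasta"
URI="s3://${BUCKET}/${KEY}"
echo "$URI"
```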
Once this is set for your cluster, you only need to change halvade_run.conf for the jobs you want to run. Two mandatory options are the input `I`, which gives the path to the input directory, and the output `O`, which gives the path to the output directory. With this all options are set and Halvade can be run.
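Analogously to the local-cluster case, an EMR-flavoured halvade.conf might look like this sketch (the bucket name, paths, and sizes are placeholders, and the exact syntax may differ; see the details pages):

```
nodes "5"
vcores "32"
B "s3://mybucket/halvade/bin/"
D "s3://mybucket/halvade/dbsnp/dbsnp_138.hg19.vcf"
R "s3://mybucket/halvade/ref/ucsc.hg19.fasta"
emr_jar "s3://mybucket/halvade/HalvadeWithLibs.jar"
emr_script "s3://mybucket/halvade/halvade_bootstrap.sh"
emr_type "c3.8xlarge"
emr_ami_v "3.1.0"
tmp "/mnt/halvade/"
```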