Efficiently Copying Files into HDFS
GATK Spark tools generally run faster when they read their input files from HDFS rather than from a Google Cloud Storage (GCS) bucket. Copying a file from GCS into HDFS can itself be expensive, so this optimization matters most on persistent clusters, where the same data may be processed several times by GATK Spark tools. Under some conditions, it might actually be faster to download inputs into HDFS and run on them than to rely on the GCS adapter, as can be explored in this pull request. To copy input into HDFS efficiently, the following command should suffice:
$ hadoop distcp gs://my/gcs/path.file hdfs:///my/hdfs/path.file
The GATK tool ParallelCopyGCSDirectoryIntoHDFSSpark may perform even better, since it can split large files into blocks and copy them in parallel:
$ gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
    --inputGCSPath gs://my/gcs/path \
    --outputHDFSDirectory hdfs:///my/hdfs/path
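On a persistent cluster, where the same inputs may be reused across runs, it can help to skip the copy when the file is already in HDFS. The following is a minimal sketch of that idea rather than part of GATK: the function name `stage_to_hdfs` is hypothetical, and it relies only on the standard `hadoop fs -test -e` and `hadoop distcp` commands.

```shell
#!/usr/bin/env bash

# stage_to_hdfs SRC DST
# Hypothetical helper (not part of GATK or Hadoop): copies SRC into HDFS
# at DST with distcp, unless DST already exists.
stage_to_hdfs() {
    local src="$1" dst="$2"
    # `hadoop fs -test -e` exits with status 0 when the path exists.
    if hadoop fs -test -e "$dst"; then
        echo "already staged: $dst"
    else
        hadoop distcp "$src" "$dst"
    fi
}
```

After sourcing the function, it could be called as `stage_to_hdfs gs://my/gcs/path.file hdfs:///my/hdfs/path.file`. Note that this checks only for existence, so a partially copied or stale file would still be treated as staged.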