README.txt

Readme    18-Mar-2016

Running the program
    Main class is com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer
    VMOptions -Xmx8g -Dlog4j.configuration=conf/log4j.properties
    arguments SparkLocalClusterEg3.properties input_searchGUI20.xml
    working directory <path_to_sparkhydra>\sparkhydra\data

    where  SparkLocalClusterEg3.properties looks like

    #
    # These are properties to be set on the spark cluster
    #

    # NOTE this is system specific but should be the full path to
    # the directory where files are stored
    # prepend to path
     com.lordjoe.distributed.PathPrepend=E:/SparkHydra/data/eg3/


    com.lordjoe.distributed.hydra.BypassScoring=false
    com.lordjoe.distributed.hydra.KeepBinStatistics=true
    com.lordjoe.distributed.hydra.doGCAfterBin=false
     # End   SparkLocalClusterEg3.properties ===================
     #=============================================
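The effect of com.lordjoe.distributed.PathPrepend can be illustrated with a minimal Java sketch. This is not the project's code: the class PathPrependSketch, its methods, and the file name spectra.mgf are hypothetical, and the sketch assumes the prefix is simply concatenated in front of a relative file name.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of sparkhydra.
class PathPrependSketch {
    static final String PATH_PREPEND_KEY = "com.lordjoe.distributed.PathPrepend";

    // Read a properties file; leading whitespace before a key is ignored by Properties.load.
    public static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Assumption: the prefix is prepended verbatim to a relative file name.
    public static String resolve(Properties props, String fileName) {
        String prefix = props.getProperty(PATH_PREPEND_KEY, "").trim();
        return prefix + fileName;
    }

    public static void main(String[] args) throws IOException {
        String text = "# prepend to path\n"
                + " com.lordjoe.distributed.PathPrepend=E:/SparkHydra/data/eg3/\n"
                + "com.lordjoe.distributed.hydra.BypassScoring=false\n";
        Properties props = load(new StringReader(text));
        // spectra.mgf is a made-up file name used only for illustration
        System.out.println(resolve(props, "spectra.mgf"));
    }
}
```

Because the prefix ends with a slash, the same concatenation would apply unchanged to a prefix such as hdfs://daas/steve/eg3/.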

     When running on the cluster, the command line looks like
     spark-submit --class com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer SteveSpark.jar ~/SparkClusterEupaG.properties input_searchGUI_scan1000.xml
     Where
     SteveSpark.jar is a jar generated by calling com.lordjoe.distributed.hydra.HydraDeployer with the argument SteveSpark.jar
     HydraDeployer is my code, which is known to generate a jar that works properly

     ~/SparkClusterEupaG.properties (see example below) describes the cluster and sets cluster specific properties

      input_searchGUI_scan1000.xml describes the search

      NOTE - there is an assumption that hdfs is mounted somewhere on the file system such that com.lordjoe.distributed.PathPrepend can describe a path to reach it. Prefixes like
      hdfs:// will work. The prefix s3:// might work, but this code has not been tested with it.

       where  SparkClusterEupaG.properties looks like
      #
      # These are properties to be set on the spark cluster
      #
      #
      # prepend to path
      com.lordjoe.distributed.PathPrepend=hdfs://daas/steve/eg3/

      spark.mesos.coarse=true
      spark.mesos.executor.memoryOverhead=3128


      com.lordjoe.distributed.hydra.BypassScoring=true
      com.lordjoe.distributed.hydra.KeepBinStatistics=true
      com.lordjoe.distributed.hydra.doGCAfterBin=false

      # give executors more memory
      spark.executor.memory=12g

      # Spark shuffle properties
      spark.shuffle.spill=false
      spark.shuffle.memoryFraction=0.4
      spark.shuffle.consolidateFiles=true
      spark.shuffle.file.buffer.kb=1024
      spark.reducer.maxMbInFlight=128

      spark.storage.memoryFraction=0.3
      spark.shuffle.manager=sort
      spark.default.parallelism=360
      spark.hadoop.validateOutputSpecs=false

      #spark.rdd.compress=true
      #spark.shuffle.compress=true
      spark.shuffle.spill.compress=true
      spark.io.compression.codec=lz4
      spark.shuffle.sort.bypassMergeThreshold=100

      # try to divide the problem into this many partitions
      com.lordjoe.distributed.number_partitions=360
      # End   SparkClusterEupaG.properties ===================
      #=============================================
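The cluster file above mixes spark.* keys with com.lordjoe.* keys. A minimal Java sketch of separating the two groups follows. It is an illustration only: the class ClusterPropertiesSketch and its methods are hypothetical, and it assumes (the README does not confirm this) that spark.* keys are handed to Spark while com.lordjoe.* keys are read by the application.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of sparkhydra.
class ClusterPropertiesSketch {
    // Collect every property whose key starts with the given prefix, sorted by key.
    public static Map<String, String> withPrefix(Properties props, String prefix) {
        Map<String, String> out = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                out.put(key, props.getProperty(key).trim());
            }
        }
        return out;
    }

    // Read the requested partition count, or use the fallback when the key is absent.
    public static int numberPartitions(Properties props, int fallback) {
        String raw = props.getProperty("com.lordjoe.distributed.number_partitions");
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) throws IOException {
        String text = "com.lordjoe.distributed.PathPrepend=hdfs://daas/steve/eg3/\n"
                + "spark.executor.memory=12g\n"
                + "spark.default.parallelism=360\n"
                + "#spark.rdd.compress=true\n"   // commented out, so it is not loaded
                + "com.lordjoe.distributed.number_partitions=360\n";
        Properties props = new Properties();
        props.load(new StringReader(text));
        System.out.println("Spark settings: " + withPrefix(props, "spark."));
        System.out.println("Application settings: " + withPrefix(props, "com.lordjoe."));
        System.out.println("Partitions: " + numberPartitions(props, 1));
    }
}
```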


Output of running
mainclass com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer
vmoptions -Xmx8g -Dlog4j.configuration=conf/log4j.properties
args SparkLocalClusterEg3.properties input_searchGUI20.xml
 working directory <path_to_sparkhydra>\sparkhydra\data

Should be
===================================================
Total Scans Scored 0
Finished Scoring in 44.346 sec
=========================================
====  Accululators              =========
=========================================
TotalPeptidesScored 105
TotalSpectraScored 411
AddIndexToSpectrum  totalCalls:20 totalTime:      0.08 msec machines:1 variance 0
AppendScanStringToWriter  totalCalls:2 totalTime:      0.00 msec machines:1 variance 0
CombineCometScoringResults  totalCalls:8 totalTime:      0.08 msec machines:1 variance 0
DigestProteinFunction  totalCalls:56 totalTime:   8396.52 msec machines:1 variance 0
MapToCometSpectrum  totalCalls:20 totalTime:      0.32 msec machines:1 variance 0
ParsedProteinToProtein  totalCalls:56 totalTime:      6.81 msec machines:1 variance 0
ScoreSpectrumAndPeptideWithCogroupWithoutHash  totalCalls:378 totalTime:    832.06 msec machines:1 variance 0
SplitMapPolypeptidesToBin  totalCalls:376K totalTime:    710.59 msec machines:1 variance 0
mapMeasuredSpectraToBins  totalCalls:20 totalTime:      0.68 msec machines:1 variance 0
GCTimeAccumulator
GC Time Max Allocation 34M

LogRareEventsAccumulator
 None
MemoryUsage
Mem Use Max Allocation 247M
500M	378
Allocated
200M	378

MemoryUseAccumulatorAndBinSize
max memoryUse=247M memoryAllocated=9394K, numberSpectra=1, numberPeptides=10
max memoryUse=247M memoryAllocated=9394K, numberSpectra=1, numberPeptides=10

NotScoredBins

PeptideDistribution
Max value 0 total 0

SpectrumDistribution
Max value 0 total 0

Total all Functions
 totalCalls:376K totalTime:   9947.13 msec machines:1 variance 0
Total Run Time in 44.347 sec
===================================================
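When comparing a run against the expected output above, the per-function timing lines can be checked mechanically. The following Java sketch is an assumption-laden illustration, not part of the program: the class AccumulatorReportSketch and the regular expression are hypothetical and rely only on the line layout shown in the listing. The per-function totalTime values in the listing add up to 9947.14 msec, which agrees with the reported total of 9947.13 msec to within rounding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of sparkhydra.
class AccumulatorReportSketch {
    // Groups: function name, call count (may carry a K suffix), total time in msec.
    private static final Pattern TIMING_LINE = Pattern.compile(
            "(\\S+)\\s+totalCalls:(\\S+)\\s+totalTime:\\s*([0-9]+(?:\\.[0-9]+)?)\\s+msec.*");

    // Sum totalTime over lines that start with a function name.
    // The summary line under "Total all Functions" has no name and is skipped.
    public static double sumTotalTimeMsec(List<String> lines) {
        double sum = 0.0;
        for (String line : lines) {
            Matcher m = TIMING_LINE.matcher(line.trim());
            if (m.matches()) {
                sum += Double.parseDouble(m.group(3));
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> report = Arrays.asList(
                "AddIndexToSpectrum  totalCalls:20 totalTime:      0.08 msec machines:1 variance 0",
                "AppendScanStringToWriter  totalCalls:2 totalTime:      0.00 msec machines:1 variance 0",
                "CombineCometScoringResults  totalCalls:8 totalTime:      0.08 msec machines:1 variance 0",
                "DigestProteinFunction  totalCalls:56 totalTime:   8396.52 msec machines:1 variance 0",
                "MapToCometSpectrum  totalCalls:20 totalTime:      0.32 msec machines:1 variance 0",
                "ParsedProteinToProtein  totalCalls:56 totalTime:      6.81 msec machines:1 variance 0",
                "ScoreSpectrumAndPeptideWithCogroupWithoutHash  totalCalls:378 totalTime:    832.06 msec machines:1 variance 0",
                "SplitMapPolypeptidesToBin  totalCalls:376K totalTime:    710.59 msec machines:1 variance 0",
                "mapMeasuredSpectraToBins  totalCalls:20 totalTime:      0.68 msec machines:1 variance 0",
                " totalCalls:376K totalTime:   9947.13 msec machines:1 variance 0");
        System.out.printf("Sum of per-function times: %.2f msec%n", sumTotalTimeMsec(report));
    }
}
```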
Test Status
    Several tests are present but commented out - currently