Readme 18-Mar-2016
Running the program
Main class is com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer
VMOptions -Xmx8g -Dlog4j.configuration=conf/log4j.properties
arguments SparkLocalClusterEg3.properties input_searchGUI20.xml
working directory <path_to_sparkhydra>\sparkhydra\data
where SparkLocalClusterEg3.properties looks like
#
# These are properties to be set on the spark cluster
#
# NOTE this is system specific but should be the full path to
# the directory where files are stored
# prepend to path
com.lordjoe.distributed.PathPrepend=E:/SparkHydra/data/eg3/
com.lordjoe.distributed.hydra.BypassScoring=false
com.lordjoe.distributed.hydra.KeepBinStatistics=true
com.lordjoe.distributed.hydra.doGCAfterBin=false
# End SparkLocalClusterEg3.properties ===================
#=============================================
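As a minimal sketch of how a file in this format can be read, the Java below uses the standard java.util.Properties class. The class LocalPropertiesSketch and its methods parse and readFlag are hypothetical names used for illustration only; they are not part of SparkHydra, and how SparkCometScanScorer actually reads the file is not described here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of SparkHydra.
class LocalPropertiesSketch {

    // Parse key=value text in the same format as the properties file above.
    // Lines starting with '#' are treated as comments by Properties.load.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Read a boolean flag, falling back to a default when the key is absent.
    static boolean readFlag(Properties props, String key, boolean defaultValue) {
        String value = props.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "# prepend to path",
                "com.lordjoe.distributed.PathPrepend=E:/SparkHydra/data/eg3/",
                "com.lordjoe.distributed.hydra.BypassScoring=false",
                "com.lordjoe.distributed.hydra.KeepBinStatistics=true");
        Properties props = parse(text);
        System.out.println(props.getProperty("com.lordjoe.distributed.PathPrepend"));
        System.out.println(readFlag(props, "com.lordjoe.distributed.hydra.BypassScoring", true));
    }
}
```

Note that java.util.Properties treats a backslash as an escape character, which is one reason to write Windows paths with forward slashes, as the example file does.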
When running on the cluster, the command line looks like
spark-submit --class com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer SteveSpark.jar ~/SparkClusterEupaG.properties input_searchGUI_scan1000.xml
Where
SteveSpark.jar is a jar generated by calling com.lordjoe.distributed.hydra.HydraDeployer with the argument SteveSpark.jar
HydraDeployer is my code, which is known to generate a jar that works properly
~/SparkClusterEupaG.properties (see example below) describes the cluster and sets cluster specific properties
input_searchGUI_scan1000.xml describes the search
NOTE - there is an assumption that hdfs is mounted somewhere on the file system such that com.lordjoe.distributed.PathPrepend can describe a path to reach it. Prefixes like
hdfs:// will work; the prefix s3:// might work, but this has not been tested
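The prefix handling described in the note can be sketched as follows, assuming the prefix is simply concatenated with a relative file name. The class PathPrependSketch, its methods, and the file name spectra.mgf are hypothetical and used only for illustration; the actual resolution logic in SparkHydra may differ.

```java
// Hypothetical helper for illustration; not part of SparkHydra.
class PathPrependSketch {

    // Join the configured prefix and a relative file name with exactly one '/'.
    static String resolve(String prepend, String fileName) {
        if (prepend == null || prepend.isEmpty()) {
            return fileName;
        }
        String base = prepend.endsWith("/") ? prepend : prepend + "/";
        String name = fileName.startsWith("/") ? fileName.substring(1) : fileName;
        return base + name;
    }

    // Return the scheme before "://" (for example "hdfs" or "s3"),
    // or "file" when the path has no such prefix.
    static String scheme(String path) {
        int idx = path.indexOf("://");
        return idx > 0 ? path.substring(0, idx) : "file";
    }

    public static void main(String[] args) {
        // "spectra.mgf" is a made-up file name.
        String full = resolve("hdfs://daas/steve/eg3/", "spectra.mgf");
        System.out.println(full + " uses scheme " + scheme(full));
        System.out.println(scheme("E:/SparkHydra/data/eg3/"));
    }
}
```

The scheme check looks for "://" rather than a bare colon so that a Windows drive letter such as E: is not mistaken for a scheme.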
where SparkClusterEupaG.properties looks like
#
# These are properties to be set on the spark cluster
#
#
# prepend to path
com.lordjoe.distributed.PathPrepend=hdfs://daas/steve/eg3/
spark.mesos.coarse=true
spark.mesos.executor.memoryOverhead=3128
com.lordjoe.distributed.hydra.BypassScoring=true
com.lordjoe.distributed.hydra.KeepBinStatistics=true
com.lordjoe.distributed.hydra.doGCAfterBin=false
# give executors more memory
spark.executor.memory=12g
# Spark shuffle properties
spark.shuffle.spill=false
spark.shuffle.memoryFraction=0.4
spark.shuffle.consolidateFiles=true
spark.shuffle.file.buffer.kb=1024
spark.reducer.maxMbInFlight=128
spark.storage.memoryFraction=0.3
spark.shuffle.manager=sort
spark.default.parallelism=360
spark.hadoop.validateOutputSpecs=false
#spark.rdd.compress=true
#spark.shuffle.compress=true
spark.shuffle.spill.compress=true
spark.io.compression.codec=lz4
spark.shuffle.sort.bypassMergeThreshold=100
# try to divide the problem into this many partitions
com.lordjoe.distributed.number_partitions=360
# End SparkClusterEupaG.properties ===================
#=============================================
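The cluster file above mixes Spark settings (keys starting with spark.) with application settings (keys starting with com.lordjoe.). A minimal sketch of separating the two groups by key prefix is shown below. Whether SparkHydra splits them this way is an assumption, and PropertySplitSketch and withPrefix are hypothetical names.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of SparkHydra.
class PropertySplitSketch {

    // Collect every property whose key starts with the given prefix,
    // sorted by key so the result is stable when printed.
    static Map<String, String> withPrefix(Properties props, String prefix) {
        Map<String, String> out = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                out.put(key, props.getProperty(key));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("spark.executor.memory", "12g");
        props.setProperty("spark.default.parallelism", "360");
        props.setProperty("com.lordjoe.distributed.number_partitions", "360");

        // Settings that would be handed to Spark itself.
        System.out.println(withPrefix(props, "spark."));
        // Settings read by the application code.
        System.out.println(withPrefix(props, "com.lordjoe."));
    }
}
```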
Output of running
mainclass com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer
vmoptions -Xmx8g -Dlog4j.configuration=conf/log4j.properties
args SparkLocalClusterEg3.properties input_searchGUI20.xml
working directory <path_to_sparkhydra>\sparkhydra\data
Should be
===================================================
Total Scans Scored 0
Finished Scoring in 44.346 sec
=========================================
==== Accululators =========
=========================================
TotalPeptidesScored 105
TotalSpectraScored 411
AddIndexToSpectrum totalCalls:20 totalTime: 0.08 msec machines:1 variance 0
AppendScanStringToWriter totalCalls:2 totalTime: 0.00 msec machines:1 variance 0
CombineCometScoringResults totalCalls:8 totalTime: 0.08 msec machines:1 variance 0
DigestProteinFunction totalCalls:56 totalTime: 8396.52 msec machines:1 variance 0
MapToCometSpectrum totalCalls:20 totalTime: 0.32 msec machines:1 variance 0
ParsedProteinToProtein totalCalls:56 totalTime: 6.81 msec machines:1 variance 0
ScoreSpectrumAndPeptideWithCogroupWithoutHash totalCalls:378 totalTime: 832.06 msec machines:1 variance 0
SplitMapPolypeptidesToBin totalCalls:376K totalTime: 710.59 msec machines:1 variance 0
mapMeasuredSpectraToBins totalCalls:20 totalTime: 0.68 msec machines:1 variance 0
GCTimeAccumulator
GC Time Max Allocation 34M
LogRareEventsAccumulator
None
MemoryUsage
Mem Use Max Allocation 247M
500M 378
Allocated
200M 378
MemoryUseAccumulatorAndBinSize
max memoryUse=247M memoryAllocated=9394K, numberSpectra=1, numberPeptides=10
max memoryUse=247M memoryAllocated=9394K, numberSpectra=1, numberPeptides=10
NotScoredBins
PeptideDistribution
Max value 0 total 0
SpectrumDistribution
Max value 0 total 0
Total all Functions
totalCalls:376K totalTime: 9947.13 msec machines:1 variance 0
Total Run Time in 44.347 sec
===================================================
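When comparing a run against the expected output above, it can help to check that the per-function times agree with the "Total all Functions" line. The nine per-function totalTime values listed sum to about 9947.14 msec, which matches the reported 9947.13 msec up to rounding. The sketch below extracts totalTime from lines in this format; AccumulatorLineSketch and its methods are hypothetical names and not part of SparkHydra.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of SparkHydra.
class AccumulatorLineSketch {

    // Matches the "totalTime: <number> msec" part of an accumulator line.
    private static final Pattern TOTAL_TIME =
            Pattern.compile("totalTime:\\s*([0-9]+(?:\\.[0-9]+)?)\\s+msec");

    // Return the totalTime in msec, or -1 when the line has no timing field.
    static double totalTimeMsec(String line) {
        Matcher m = TOTAL_TIME.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }

    // Sum totalTime over all lines that carry a timing field.
    static double sumTimes(String[] lines) {
        double sum = 0.0;
        for (String line : lines) {
            double t = totalTimeMsec(line);
            if (t >= 0) {
                sum += t;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = {
            "DigestProteinFunction totalCalls:56 totalTime: 8396.52 msec machines:1 variance 0",
            "ScoreSpectrumAndPeptideWithCogroupWithoutHash totalCalls:378 totalTime: 832.06 msec machines:1 variance 0",
            "SplitMapPolypeptidesToBin totalCalls:376K totalTime: 710.59 msec machines:1 variance 0"
        };
        System.out.printf("three largest functions: %.2f msec%n", sumTimes(lines));
    }
}
```

Timings will differ from machine to machine, so the counts (for example TotalPeptidesScored and TotalSpectraScored) are the more useful values to compare exactly.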
Test Status
Several tests are present but commented out - currently