SubmittingJobs
A list of samples to be processed is maintained in the files:
- AnalysisStep/test/prod/samples_2015.csv (data)
- AnalysisStep/test/prod/samples_2016mc_miniaod_v1.csv (MINIAOD v1)
- AnalysisStep/test/prod/samples_2016mc_miniaod_v2.csv (MINIAOD v2)
- AnalysisStep/test/prod/samples_2016mc_anomalous_miniaod_v2.csv (non-SM samples)
For 2015 data and MC, all samples are in the file samples_2015.csv.
These files specify, for each sample, the name assigned to it, the cross section, the BR, the dataset name, the number of files per job, and additional customization (JSONs, special parameters, etc.); see below for more details.
Lines can be commented out to select which samples to run on.
- Create the job scripts (for example, the 2016 MCs):
batch.py samples_2016mc_miniaod_v2.csv
The job scripts are created in a folder named PROD_<name_of_csv_file>, e.g. PROD_samples_2016mc_miniaod_v2 in this case.
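For reference, a minimal sketch of this step, assuming the CMSSW environment is already set up and the package is checked out under $CMSSW_BASE/src (the ZZAnalysis directory name is an assumption; adjust it to your area):
```
# go to the production area (path prefix is an assumption, adjust to your checkout)
cd $CMSSW_BASE/src/ZZAnalysis/AnalysisStep/test/prod
# create the job scripts for the selected (uncommented) samples
batch.py samples_2016mc_miniaod_v2.csv
# the jobs are created in a PROD_* folder named after the csv file
ls -d PROD_samples_2016mc_miniaod_v2
```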
- Submit the jobs (from lxplus; does not work from private machines):
cd PROD_samples_2016mc_miniaod_v2
resubmit.csh
- Check how many jobs are still running or pending:
bjobs | grep -c RUN
bjobs | grep -c PEND
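If useful, the counts for all states can also be obtained in one go; this one-liner assumes the default LSF bjobs output format, where the job status is the third column:
```
# summary of all your jobs by status (RUN, PEND, DONE, EXIT, ...)
bjobs 2>/dev/null | awk 'NR>1 {print $3}' | sort | uniq -c
```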
- Once the jobs are done, from the same folder:
checkProd.csh
This checks all jobs and moves all those which finished correctly to a subfolder named AAAOK.
A failure rate of ~2% is normal due to transient problems with worker nodes. To resubmit failed jobs (repeat until all jobs succeed for data; a small fraction of failures is acceptable for MC):
cleanup.csh
resubmit.csh
# wait for all jobs to finish
checkProd.csh
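As an illustration, one resubmission round can be scripted along these lines; this is a minimal bash sketch using only the commands above, assuming no other LSF jobs of yours are running in parallel, and with an arbitrary polling interval:
```
# resubmit the failed jobs and wait for them to finish
cleanup.csh
resubmit.csh
while [ "$(bjobs 2>/dev/null | grep -cE 'RUN|PEND')" -gt 0 ]; do
    sleep 600   # check every 10 minutes
done
# move the jobs that finished correctly to AAAOK
checkProd.csh
```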
- Once all jobs are finished, log in to lxcms03, and run the following script (from the submission folder) to move the trees to the standard path and add all chunks:
moveTrees.csh <YYMMDD>
(This takes a while.)
After this operation, all the hadd’ed trees are located in /data3/Higgs/<YYMMDD> and accessible via xrootd at root://lxcms03//data/Higgs/<YYMMDD> .
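For illustration, the merged trees can be browsed and opened over xrootd roughly as follows; the per-sample directory layout and file name below are placeholders, so check the actual content of the production directory:
```
# list the production directory over xrootd
xrdfs lxcms03 ls /data/Higgs/<YYMMDD>
# open one merged tree directly in ROOT (sample and file names are placeholders)
root -l 'root://lxcms03//data/Higgs/<YYMMDD>/<sample>/<tree_file>.root'
```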
The fields of the sample csv files are:
- identifier: name given to the sample.
- process: reserved for future development.
- crossSection, BR: the cross-section weight in the trees is set to crossSection*BR. The two values are kept separate only for bookkeeping, since just their product is used; one can therefore set the product directly in the crossSection field, use BR for the filter efficiency, etc.
- execute: reserved for future development.
- dataset, prefix: if prefix = dbs, dataset is assumed to be the dataset name and the list of files is taken from DAS. Otherwise, the string "prefix+dataset" is assumed to be an EOS path.
- pattern: reserved for future development.
- splitLevel: number of files per job.
- ::variables: list of additional python variables to be set for this specific sample.
- ::pyFragments: list of additional python snippets to be loaded for this specific sample (e.g. a JSON file); assumed to be found in the pyFragments subdirectory.
- comment: a free-text comment.
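For orientation only, a hypothetical entry could look like the line below; the column order, separators, and values shown here are assumptions, so copy an existing line of the csv files as a template rather than this one:
```
# hypothetical example entry (field order is an assumption; unused fields left empty)
# identifier, process, crossSection, BR, execute, dataset, prefix, pattern, splitLevel, ::variables, ::pyFragments, comment
mySample,,<crossSection>,<BR>,,</dataset/name/MINIAODSIM>,dbs,,2,,,"short text comment"
```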
The script submits all jobs through bsub. In addition:
- it logs the job number in the file "jobid" (useful in case the status of a specific job needs to be checked, or the job needs to be killed);
- it checks whether a valid grid proxy is available and, if so, makes it available to the worker node. This is needed to run on samples that are not available on EOS. See here for details on how to get a grid proxy.
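A valid proxy can typically be obtained with the standard grid tools before submitting, for example (the validity option below is just a common choice):
```
# create a VOMS proxy for CMS (run before submitting the jobs)
voms-proxy-init --voms cms --valid 192:00
# check that the proxy exists and how long it remains valid
voms-proxy-info --all
```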
The checkProd.csh script checks the cmsRun exit status (recorded in the exitStatus.txt file) and the validity of the root file after it has been copied back to AFS.
It can safely be run even while jobs are still running; job folders are moved only when they are fully completed.
In addition, if "mf" is specified as an argument, the script moves jobs that failed with cmsRun exit status !=0 to a folder named AAAFAIL. These can be resubmitted with the usual scheme (cd in, cleanup, resubmit); see the sketch below. Be careful not to resubmit jobs that are still running! [FIXME: add protection in resubmit.csh, e.g. to abort if a "jobid" file is found in any of the job folders].
Note that this does not catch all failures, as some jobs fail without reporting an exit status !=0. [FIXME: the exit status value could be overridden when failures are found in the final checks of the root file.]
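A sketch of this workflow, using only the scripts described above (make sure none of the jobs in AAAFAIL is still running before resubmitting):
```
# move jobs that failed with cmsRun exit status != 0 to AAAFAIL
checkProd.csh mf
# resubmit them with the usual scheme
cd AAAFAIL
cleanup.csh
resubmit.csh
cd ..
```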
Some additional notes:
- Sometimes jobs get stuck on the processing nodes. This can be checked with bjobs -l: if a job has been running for a long time but appears to be idle (check e.g. IDLE_FACTOR), it is probably best to kill it, wait until bjobs reports it as finished (which can take a while), and resubmit it (see the example commands after these notes).
- Sometimes failures are permanent (a job keeps failing); in this case the cause must be investigated by looking at the log.txt.gz file. In rare cases this turns out to be due to an EOS failure. If a file is reported as inaccessible:
  - try opening it in plain ROOT with xrootd; that should give a reproducible error message;
  - check eos fileinfo <file>: a possible problem is that both servers holding the file replicas are down, or that one of the two replicas is reported with incorrect information. Open an EOS ticket with a clear pointer to the file and to the eos fileinfo message; it is normally addressed within a few hours.
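The checks described in these notes can be performed with commands along the following lines; the job id comes from the jobid file or the bjobs listing, and the file path and EOS redirector are placeholders/assumptions for a file on CERN EOS:
```
# inspect a job that seems stuck (look e.g. at IDLE_FACTOR), and kill it if needed
bjobs -l <jobid>
bkill <jobid>

# try to open a suspicious file directly via xrootd; this should reproduce the error
root -l 'root://eoscms.cern.ch//eos/cms/<path_to_file>'
# inspect the EOS replica information for the file
eos fileinfo /eos/cms/<path_to_file>
```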