Roberto Covarelli edited this page Jan 22, 2020 · 41 revisions

Submitting jobs

A list of samples to be processed is maintained in the csv files.

These files specify the name assigned to the sample, cross section, BR, dataset name, the number of files per job, and additional customization for each sample (JSONs, special parameters, etc); see below for more details.

Lines can be commented out to select which samples to run on.

Steps for a full production:

  1. Set up your grid credentials:

    voms-proxy-init -voms cms
    
  2. Create the job scripts (for example, the 2018 Data):

    batch_Condor.py samples_2018_Data.csv
    

    The job scripts are created in a folder named PROD_<name_of_csv_file>_<CJLST_framework_commit_number>, e.g. PROD_samples_2018_Data_b738728 in this case.

  3. Submit the jobs (from lxplus; does not work from private machines):

    cd PROD_samples_2018_Data_b738728
    resubmit_Condor.csh
    
  4. Check how many jobs are still running or pending on Condor:

    condor_q
    
  5. Once the jobs are done, from the same folder:

    checkProd.csh
    

    This checks all jobs and moves all those which finished correctly to a subfolder named AAAOK.

    It can be safely run even while jobs are still running; jobs folders are moved only when they are fully completed.

    A failure rate of ~2% is normal, due to transient problems on the worker nodes. To resubmit failed jobs (repeat until all jobs succeed for data; a small fraction of failures is acceptable for MC):

    cleanup.csh
    resubmit_Condor.csh
    checkProd.csh
    # wait for all jobs to finish
    
  6. Once all jobs are finished, the trees from the different jobs need to be hadded. This can be done either locally (for private tests and production) or into standard storage (for a central production):

    • for local adding, run in the same directory:

      haddChunks.py AAAOK
      
    • NEW: For a central production: choose or create a subdirectory in your CERNBox directory /eos/user/<initial>/<username> and copy the hadd'ed trees from the step above there. On the CERNBox website (http://cernbox.cern.ch), find this subdirectory and share it with the e-group "CJLST-hzz4l@cern.ch", then let the production coordinator know the location of the trees.

    • OBSOLETE: For a central production: log in to lxcms03, and run the following script (from the submission folder) to move the trees to the standard path and add all chunks:

      moveTrees.csh <YYMMDD>
      

      (This takes a while.) After this operation, all the hadd’ed trees are located in /data3/Higgs/<YYMMDD> and accessible via xrootd at root://lxcms03//data/Higgs/<YYMMDD> .
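The folder naming convention from step 2 and the bookkeeping from step 5 can be sketched in Python as follows. This is only an illustration, not the code used by batch_Condor.py or checkProd.csh: the function names prod_folder_name and production_progress are hypothetical, and the sketch assumes that every subfolder of the production folder whose name does not start with "AAA" is a job folder that has not yet been moved.

```python
import os


def prod_folder_name(csv_file, commit):
    """Build PROD_<name_of_csv_file>_<commit>, dropping the .csv extension."""
    base = os.path.splitext(os.path.basename(csv_file))[0]
    return "PROD_{}_{}".format(base, commit)


def _subdirs(path):
    """List the immediate subdirectories of path (empty if path is missing)."""
    if not os.path.isdir(path):
        return []
    return [d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]


def production_progress(prod_dir, ok_subdir="AAAOK"):
    """Return (n_ok, n_left): job folders inside AAAOK vs. still outside it.

    Assumption: any subfolder not starting with "AAA" is a pending,
    running, or failed job folder.
    """
    n_ok = len(_subdirs(os.path.join(prod_dir, ok_subdir)))
    n_left = len([d for d in _subdirs(prod_dir) if not d.startswith("AAA")])
    return n_ok, n_left


if __name__ == "__main__":
    print(prod_folder_name("samples_2018_Data.csv", "b738728"))
```

Calling production_progress on the production folder after each checkProd.csh pass gives a quick view of how many jobs still need attention before hadding.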

Notes for advanced usage

CSV file field specification

identifier: name given to the sample.
process: reserved for future development.
crossSection, BR: The xsection weight in the trees will be set using xsection*BR. The two values are kept separate just for logging reasons as only their product is used, so one can directly set the product in the crossSection field, use BR for the filter efficiency, etc.
execute: reserved for future development.
dataset, prefix: if prefix = dbs, dataset is assumed to be the dataset name and the list of files is taken from DAS. Otherwise, the string "prefix+dataset" is assumed to be an EOS path.
pattern: reserved for future development.
splitLevel: number of files per job.
::variables: list of additional python variables to be set for this specific sample.
::pyFragments: list of additional python snippets to be loaded for this specific sample (e.g. a JSON); these are assumed to be found in the pyFragments subdir.
comment: A text comment
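
As a rough illustration of how these fields are interpreted, the Python sketch below reads a sample list, skips commented lines, computes the xsection*BR weight, and resolves the dataset/prefix pair. It is not the parser used by batch_Condor.py. It assumes a header row carrying the field names and "#" as the comment marker, neither of which is specified here, and the function names are hypothetical.

```python
import csv


def read_samples(text, comment_char="#"):
    """Parse CSV text into a list of dicts, skipping blank and commented lines.

    Assumptions: the first remaining line is a header with the field names,
    and a leading comment_char marks a commented-out sample.
    """
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith(comment_char)
    ]
    return list(csv.DictReader(lines))


def xsec_weight(row):
    """Only the product crossSection*BR is used for the tree weight."""
    return float(row["crossSection"]) * float(row["BR"])


def input_source(row):
    """Resolve where the input files come from.

    prefix == "dbs": dataset is a dataset name, files are looked up in DAS.
    otherwise:       prefix + dataset is taken as an EOS path.
    """
    if row["prefix"] == "dbs":
        return ("das", row["dataset"])
    return ("eos", row["prefix"] + row["dataset"])


if __name__ == "__main__":
    # Hypothetical sample list, for illustration only.
    example = "\n".join([
        "identifier,crossSection,BR,dataset,prefix,splitLevel",
        "sampleA,2.0,0.5,/Hypothetical/Dataset/MINIAOD,dbs,4",
        "#sampleB,1.0,1.0,/Skipped/Dataset/MINIAOD,dbs,2",
    ])
    for row in read_samples(example):
        print(row["identifier"], xsec_weight(row), input_source(row))
```

Because only the product enters the weight, setting crossSection to the full product and BR to 1 gives the same result as keeping the two values separate.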

resubmit.csh and grid certificates.

The script submits all jobs through bsub. In addition:

  • it logs the job number in the file "jobid" (useful in case the status of a specific job needs to be checked, or the job needs to be killed)
  • it checks if a valid grid proxy is available, and in that case it makes it available to the worker node. This is needed to run on samples that are not available on EOS. See here for details on how to get a grid proxy.

checkProd.csh and handling of failures

The checkProd.csh script checks the cmsRun exit status (which is recorded in the exitStatus.txt file) and the validity of the root file after it has been copied back to AFS.

It can be safely run even while jobs are still running; jobs folders are moved only when they are fully completed.

In addition, if "mf" is specified as an argument, the script moves jobs that failed with cmsRun exit status !=0 to a folder named AAAFAIL. These can be resubmitted with the usual scheme (cd in, cleanup, resubmit), but it's a good idea to rename the folder beforehand (e.g. mv AAAFAIL AAARESUB1) if other jobs are still pending; otherwise rerunning checkProd.csh mf will add more failed jobs to the same directory. Be careful not to resubmit jobs that are already running! There is currently no protection against this.

Note that checkProd.csh does not catch all possible failures, as some jobs fail without reporting an exit status !=0.
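
The exit-status part of this check can be sketched in Python as below. This is a simplified illustration and not the logic of checkProd.csh itself: it assumes exitStatus.txt contains a single integer, the function name job_state is hypothetical, and the validation of the root file is not reproduced.

```python
import os


def job_state(job_dir, status_file="exitStatus.txt"):
    """Classify a job folder from its recorded cmsRun exit status.

    Returns one of:
      "no_status"  - no status file yet (job still running, or it died
                     before recording a status)
      "unreadable" - status file present but not a plain integer
      "ok"         - exit status 0
      "failed"     - exit status != 0 (candidate for AAAFAIL)
    Assumption: the status file holds a single integer.
    """
    path = os.path.join(job_dir, status_file)
    if not os.path.isfile(path):
        return "no_status"
    with open(path) as handle:
        content = handle.read().strip()
    try:
        code = int(content)
    except ValueError:
        return "unreadable"
    return "ok" if code == 0 else "failed"
```

A job classified as "ok" here can still be bad, which is why the real script also validates the output file, and why some failures escape both checks.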

Some additional notes:

  • Sometimes jobs get stuck on the processing nodes. This can be checked with bjobs -l: if a job has been running for a long time but appears to be idle (check e.g. IDLE_FACTOR), it is probably best to kill it, wait until bjobs reports it as finished (this can take a while), and resubmit it.
  • Sometimes failures are permanent (the job keeps failing); in this case the cause must be investigated by looking at the log.txt.gz file. In rare cases, this turns out to be due to an EOS failure. If a file is reported as inaccessible:
    • try opening it in plain root with xrootd; that should give a reproducible error message 
      
    • check eos fileinfo <file>: a possible problem is that both servers holding file replicas are down, or that one of the two replicas is reported with incorrect information. Open an EOS ticket with a clear pointer to the file and to the eos fileinfo message; it is normally addressed within a few hours.
      