SubmittingJobs
A list of samples to be processed is maintained in the csv files.
These files specify the name assigned to the sample, the cross section, the BR, the dataset name, the number of files per job, and additional customizations for each sample (JSONs, special parameters, etc.); see below for more details.
One can comment out lines to select which samples to run on.
- Set up your GRID credentials:
voms-proxy-init --voms cms
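To check that a valid proxy is now in place and how long it remains valid, you can run:
voms-proxy-info --timeleft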
- Create the job scripts (for example, for the 2018 data):
batch_Condor.py samples_2018_Data.csv
The job scripts are created in a folder named PROD_<name_of_csv_file>_<CJLST_framework_commit_number>, e.g. PROD_samples_2018_Data_b738728 in this case. Note that enough quota must be available in the submit area to hold all output; this is typically not the case in AFS home areas. Since submission from /eos/ areas is not permitted as of today (3/2023), it is possible to specify a destination folder in /eos/home/ to which the jobs will transfer their output. With this option, the output is transferred directly to /eos at the end of the job, which also avoids excessive load on AFS. Example:
batch_Condor.py samples_2018_Data.csv -t /eos/user/j/john/230426
When using this option, please note that:
- The destination folder must be under /eos/user/.
- Subfolders are created in the destination folder, matching the production directory tree (PROD_samples[...]/Chunks).
- Only the .root file and the job output log are transferred; the CONDOR logs are still created in the submission folder.
- A link to the transfer area is created in the production folder.
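If the -t option is used, it may be worth making sure the destination folder exists before creating the job scripts; a minimal sketch using the hypothetical path from the example above:
mkdir -p /eos/user/j/john/230426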
- Submit the jobs (from lxplus; this does not work from private machines):
cd PROD_samples_2018_Data_b738728
resubmit_Condor.csh
BONUS: just before submitting the jobs, it is a time-saving trick to check the status of the servers that will submit on your behalf (called bigbirdXX). First, type:
condor_status -schedd
Then, look for the bigbirdXX with the fewest jobs running and pass its schedd number XX to the submit command:
resubmit_Condor.csh XX
You can also switch your machine to that schedd by setting:
export _condor_SCHEDD_HOST="bigbirdXX.cern.ch"
export _condor_CREDD_HOST="bigbirdXX.cern.ch"
- Check how many jobs are still running or pending on Condor:
condor_q
If a specific schedd was selected in the step above, you have to either specify it with -name bigbirdXX or set the _CONDOR_SCHEDD_HOST environment variable (see the example below).
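For example, with the schedd chosen above, either of the following works:
condor_q -name bigbirdXX.cern.ch
export _CONDOR_SCHEDD_HOST="bigbirdXX.cern.ch"; condor_q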
- Once the jobs are done, run from the same folder (regardless of whether the -t option mentioned above was used or not):
checkProd.csh
This checks all jobs and moves all those which finished correctly to a subfolder named AAAOK.
It can be safely run even while jobs are still running; job folders are moved only once they are fully completed.
A failure rate of ~2% is normal, due to transient problems with worker nodes. To resubmit failed jobs, run the commands below (repeat until all jobs succeed for data; a small fraction of failures is acceptable for MC):
cleanup.csh
resubmit_Condor.csh
checkProd.csh # run after waiting for all jobs to finish
- Once all jobs are finished, the trees of the different jobs need to be hadded. This can be done either locally (for private tests and productions) or onto standard storage for a central production:
- For local hadding (no -t option specified), run in the submission directory:
haddChunks.py AAAOK
- If the -t option has been used to transfer the output directly to EOS, the command should be run from the EOS folder:
haddChunks.py PROD_samples<XXX>
- For a central production: either use -t or copy the hadded trees to a subdirectory in your CERNBox directory /eos/user/<initial>/<username>. From the CERNBox website (http://cernbox.cern.ch), find this subdirectory and share with the e-group "CJLST-hzz4l@cern.ch", then let the production coordinator know the location of the trees.
- For data,
haddData.py
can be used to merge the different PDs.
The fields of the csv files are the following:
- identifier: name given to the sample.
- process: reserved for future development.
- crossSection, BR: the xsection weight in the trees will be set using crossSection*BR. The two values are kept separate only for logging reasons, as just their product is used; one can therefore directly set the product in the crossSection field, use BR for the filter efficiency, etc.
- execute: reserved for future development.
- dataset, prefix: if prefix = dbs, dataset is assumed to be a dataset name and the list of files is taken from DAS. Otherwise, the string "prefix+dataset" is assumed to be an EOS path.
- pattern: reserved for future development.
- splitLevel: number of files per job.
- ::variables: list of additional python variables to be set for this specific sample.
- ::pyFragments: list of additional python snippets to be loaded for this specific sample (e.g. a JSON); these are assumed to be found in the pyFragments subdirectory.
- comment: a free-text comment.
The submission script submits all jobs through CONDOR. In addition:
- it logs the job number in the log files under log/ (useful in case the status of a specific job needs to be checked, or the job killed);
- it checks whether a valid grid proxy is available and, if so, makes it available to the worker node. This is needed to run on samples that are not available on EOS. See here for details on how to get a grid proxy.
The checkProd.csh script checks the cmsRun exit status (recorded in the exitStatus.txt file) and the validity of the root file after it has been copied back to AFS.
It can be safely run even while jobs are still running; job folders are moved only once they are fully completed.
In addition, if "mf" is specified as argument, the scripts move jobs that failed with cmsRun exit status !=0 to a folder named AAAFAIL. These can be resubmitted with the usual scheme (cd in, cleanup, resubmit); but it's a good idea to rename the folder beforehand (eg mv AAAFAIL AAARESUB1) if other jobs are still pending (otherwise rerunning checkProd.csh mf will add more failed job to the same directory. Be careful not to resubmit jobs that are already running!! There is currently no protection against this.
Note that checkProd.csh does not catch all possible failures, as some jobs fail without reporting an exit status != 0.
If a large fraction of chunks fail, please inspect the log files to determine the cause of the failures.
- The files in the log/ subfolder contain the CONDOR logs and errors. These may be useful to understand what happened on the CONDOR node, but generally do not contain the details of the error.
- The actual output of the job is in the file log.txt.gz. Note that this file is found in the destination folder on EOS if the -t option has been specified. The actual error must be searched for here (file access errors, segmentation violations, runtime errors, etc.).
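For example, to look at the end of a chunk's job output (a sketch; <chunk_folder> is a placeholder, and with the -t option the file is in the EOS destination folder instead):
zcat <chunk_folder>/log.txt.gz | tail -n 100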
Failures fall broadly into a few categories:
- Files that are not accessible: an error appears right after attempting to open the file, possibly after several retries and/or after the open is attempted at different sites. Specifying the global redirector may help in this case (see below) as it may pick a replica in a different site.
- Corrupted files: the file is opened and some events are processed, but a read error happens in the middle. This can be due to corrupted replicas at some sites, or to communication issues with the remote sites. Also in this case, specifying the global redirector may help.
- CONDOR kills the job because it exceeds its resource usage limits; this is generally apparent from the files in the log/ folder.
- Genuine job crashes due to configuration issues or problems in the code; these need to be debugged by experts.
- In order to determine the source of the problem, start by inspecting the log.txt.gz file of each failing chunk:
- Start from the end of the file and go back to find the actual error. Note that the last error message is not necessarily the cause; for example, corrupted files generally result in a file read error followed by a crash, and the crash is the effect, not the cause, of the problem.
- If the log output stops with no error message, the problem may be due to a CONDOR problem, timeout, etc. Check the files in the log/ folder, which should report what happened on the CONDOR side.
- In some cases, in particular when running with the EOS transfer (-t) option, the log.txt.gz file may be missing. In this case, try one of the following:
  - check the files in the log/ folder for any hint;
  - run the job interactively, redirecting the output (cmsRun run_cfg.py |& tee log.txt); see also the sketch after this list;
  - resubmit the job;
  - if the output is still missing, try remaking the chunks in a different folder without the -t option and resubmit the failing one alone.
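For the interactive rerun mentioned above, a minimal sketch (the chunk folder name is a placeholder, and the CMSSW environment must be set up first):
cmsenv                          # run from within your CMSSW release area
cd <chunk_folder>               # placeholder for the failing chunk's folder
cmsRun run_cfg.py |& tee log.txt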
If the failure is due to a file access error, do the following:
- Specify a redirector explicitly and resubmit the job. This is generally the first thing to try.
A script to patch chunks is available; within the folder containing the Chunks to be modified, run
forceRedirector.csh
- If you suspect the file is not accessible or is corrupted, try copying it locally:
xrdcp -d 1 -f root://cms-xrd-global.cern.ch//store/[...] $TMP
- If the transfer fails, report it to JIRA (see below).
- Even if the transfer succeeds, the file may still be corrupted; this is likely the case if, in log.txt.gz, the file is opened successfully and an error appears after a number of events have been read. To verify this, obtain the checksum of the transferred file with:
xrdadler32 $TMP/[file]
and compare it with the value reported by DAS (search for the file path, click on "show" and look for the "adler32" value).
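As an alternative to the web interface, the checksum reported by DAS can also be queried from the command line with dasgoclient (a sketch; substitute the actual file path):
dasgoclient --query="file file=/store/[...] | grep file.adler32"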
- Sometimes, due to network problems, reads from remote sites keep failing even if copying the files with xrdcp as described above succeeds. In this case, it is possible to tweak the affected chunks to prefetch the input files before starting the processing. For this purpose, run:
forcePrefetch.csh
IMPORTANT: this is sufficient for job sets created with the -t (transfer to EOS) option.
If the production is set up to write the output back to the submission area, CONDOR will try to copy back the prefetched files as well. To avoid this, please add the following to the batchScript.sh script within each of the concerned chunks, before the last line (exit $cmsRunStatus):
find . -type f -name "*.root" ! -name 'ZZ4lAnalysis.root' -exec rm -f {} \;
- Please note that opening the file from ROOT via xrootd may fail for no good reason if the file is specified as a command-line argument; if you want to try this, please use
TFile::Open("root://cms-xrd-global.cern.ch//store/[...]")
instead. Even in this case, the fact that the file is opened correctly does not exclude that it is bad (e.g. truncated) on the remote server.
- Note that a file may have different replicas at different sites, so running at different times may lead to different outcomes because xrootd picks a different replica. Some details on accessing a specific replica with rucio can be found in this thread.
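A minimal way to run the interactive check mentioned above is from a ROOT session (a sketch; substitute the actual file path):
root -l
// at the ROOT prompt:
TFile *f = TFile::Open("root://cms-xrd-global.cern.ch//store/[...]");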
- It is possible to ignore file access failures by adding the following at the end of run_cfg.py:
process.source.skipBadFiles = True
This can be useful if a job keeps failing for a given file, and one wants to process the rest of the job's files nevertheless. If using this, keep in mind that the job will succeed even if not all input files have been processed!
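For example, the option can be appended to an existing chunk's configuration from within the chunk folder (a minimal sketch):
echo 'process.source.skipBadFiles = True' >> run_cfg.py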
Please open a ticket on JIRA for files you have verified to be inaccessible or corrupted (bad checksum). An example of such a report can be found here.
In case a sample is present only on tape, one can try transferring it to T1 or T2 sites via rule auto-approval. First you need to set up the RUCIO client following these steps. Basically, setting up the RUCIO client comes down to these four commands, which you should run from the lxplus command line:
source /cvmfs/cms.cern.ch/cmsset_default.sh #Optional: Not required by members of the zh group
source /cvmfs/cms.cern.ch/rucio/setup-py3.sh
voms-proxy-init -voms cms -rfc -valid 192:00
export RUCIO_ACCOUNT=`whoami` # Or your CERN username if it differs
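To verify that the RUCIO client is set up correctly, a quick sanity check is:
rucio whoami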
As a second step, you need to run these commands for each sample that is on tape (here shown for one specific sample):
rucio add-rule \
cms:/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/RunIISummer20UL18NanoAODv9-20UL18JMENano_106X_upgrade2018_realistic_v16_L1v1-v1/NANOAODSIM \
1 \
'rse_type=DISK&cms_type=real\tier=3\tier=0' \
--lifetime 86400 \
--activity "User AutoApprove" \
--ask-approval \
--grouping 'ALL' \
--comment "Details for use, ticket reference if any"
In this example, a transfer request was made for one specific sample (/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/RunIISummer20UL18NanoAODv9-20UL18JMENano_106X_upgrade2018_realistic_v16_L1v1-v1/NANOAODSIM). What these lines mean, and which other options exist, can be found here. In short: note that each line ends with a backslash (\) for line continuation. The flag --grouping 'ALL' prevents fragmentation of the data across multiple RSEs. The lifetime should be set carefully, as it is the lifetime of the rule, expressed in seconds. The expression 'rse_type=DISK&cms_type=real\tier=3\tier=0' is the recommended one and excludes Tier 3 and Tier 0, meaning the sample will be transferred to either a Tier 1 or a Tier 2 disk. It may happen that some samples need to be transferred to a specific site, e.g. a T2 (excluding T1 as well). The transfer should be completed within a couple of days at most. Upon running these commands, you will be notified via email about the submission of the request. This email contains a link to the page of the created rule, where you can see the status of the sample and also modify the rule.
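Besides the web page linked in the notification email, the status of your rules can also be checked from the command line (a sketch; the rule ID is printed when the rule is created):
rucio list-rules --account $RUCIO_ACCOUNT
rucio rule-info <rule_id>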
If jobs take forever to start, or go to HOLD state, please refer to this debugging guide.