
Run Global Workflow


Clone and build global-workflow

This section will provide details on cloning, building, linking, setting up, and running the global-workflow with the ROCOTO workflow manager.

Quick instructions

Quick clone/build/link instructions (more detailed instructions below):

> git clone https://github.com/NOAA-EMC/global-workflow.git
> cd global-workflow/sorc
> sh checkout.sh
> sh build_all.sh
> sh link_fv3gfs.sh emc [dell][cray][hera]

1) Clone workflow and component repositories

Workflow:

https method:

git clone https://github.com/NOAA-EMC/global-workflow.git

ssh method (using a password protected SSH key):

git clone git@github.com:NOAA-EMC/global-workflow.git

Check what you just cloned (by default you will have only the develop branch):

> cd global-workflow
> git branch
* develop

You now have a cloned copy of the global-workflow git repository. To checkout a branch or tag in your clone:

git checkout BRANCH_NAME

Note: the branch must already exist for this command to work. If it does not, create a new branch by adding the “-b” flag:

git checkout -b BRANCH_NAME

The “checkout” command will checkout BRANCH_NAME and switch your clone to that branch. Example:

> git checkout my_branch          ← checkout the ‘my_branch’ branch into clone
> git branch
* my_branch                       ← now your clone is the ‘my_branch’ branch
  develop

Components:

Once you have cloned the workflow repository it's time to checkout/clone its components. The components will be checked out under the /sorc folder via a script called checkout.sh. Run the script with no arguments:

> cd sorc
> sh checkout.sh

Each component cloned via checkout.sh will have a log (checkout-COMPONENT.log). Check the screen output and logs for clone errors.
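
To quickly scan those logs for problems, something like the following works (a minimal sketch; run it from the /sorc folder):

> grep -iE "error|fatal" checkout-*.log

No output means no obvious clone errors were recorded.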

2) Build components

Under the /sorc folder is a script to build all components called build_all.sh. After running checkout.sh, run this script to build all component codes:

sh build_all.sh

A partial build option is also available via two methods:

a) modify the fv3gfs_build.cfg config file to disable/enable particular builds and then rerun build_all.sh

b) run the individual build scripts, also available in the /sorc folder, for each component or group of codes (see the sketch below)
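
For example, to rebuild only a single component you might run its individual build script directly. The script name below is illustrative; list the /sorc folder to see which build scripts exist in your clone:

> ls build_*.sh          ← list the individual build scripts available
> sh build_fv3.sh        ← illustrative: rebuild just the forecast model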

3) Link components

At runtime the global-workflow needs all pieces in place within the main superstructure. To establish this, a link script creates symlinks from the top-level folders down to the component files checked out under the /sorc folder.

After running the checkout and build scripts run the link script:

sh link_fv3gfs.sh $RUN_ENVIR $MACHINE

...where:

  • RUN_ENVIR is either "emc" or "nco". The "nco" option is only used by NCO during installation into production. Users should use the "emc" option otherwise.
  • MACHINE is the HPC/platform/machine you're on. Options are: dell, cray, hera

Example:

> sh link_fv3gfs.sh emc hera

Prepare initial conditions

There are two types of initial conditions for the global-workflow:

  1. Warm start: these ICs are taken directly from either the GFS in production or an experiment "warmed" up (at least one cycle in).
  2. Cold start: any ICs converted to a new resolution or grid (e.g. GSM-GFS -> FV3GFS). These ICs are often prepared by global_chgres (change resolution).

Most users will initiate their experiments with cold start ICs unless running high resolution (C768 deterministic with C384 EnKF) for a date with warm starts available. It is not recommended to run high resolution unless required or as part of final testing.

Cold starts

The following information is for users needing to generate initial conditions for a cycled experiment that will run at a different resolution or number of layers than the operational GFS (C768 deterministic, C384 EnKF, 64 layers).

The new chgres_cube code is available from the UFS_UTILS repository on GitHub (maintained by George Gayno) and can be used to convert GFS ICs to a different resolution or number of layers. The chgres_cube code/scripts currently support the following GFS inputs:

  • pre-GFSv14 (GFS-GSM)
  • GFSv14 (GFS-GSM)
  • GFSv15 (FV3GFS)

Clone UFS_UTILS:

git clone --recursive https://github.com/NOAA-EMC/UFS_UTILS.git

Build UFS_UTILS:

sh build_all.sh
cd fix
sh link_fixdirs.sh emc $MACHINE

...where $MACHINE is "cray", "dell", "hera", or "jet".

Configure your conversion:

cd util/gdas_init
vi config

Read the doc block at the top of the config and adjust the variables to meet your needs (e.g. yy, mm, dd, hh for SDATE).
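
For orientation, the date and output settings might look roughly like this (a minimal sketch; values are illustrative, and any variable names beyond yy, mm, dd, hh, and OUTDIR should be taken from the doc block in your copy of the config):

yy=2019                              # year of SDATE
mm=09                                # month of SDATE
dd=09                                # day of SDATE
hh=00                                # cycle hour of SDATE
OUTDIR=/path/to/write/converted/ICs  # where the chgres'd output will land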

Submit conversion script:

./driver.$MACHINE.sh

...where $MACHINE is currently "dell" or "cray" or "hera". Additional options will be available as support for other machines expands.

90 small jobs will be submitted:

  • 9 jobs to pull inputs off HPSS (1 for deterministic and 8 for the EnKF ensemble members)
  • 81 jobs to run chgres (1 for deterministic/hires and 80 for each EnKF ensemble member)

The chgres jobs will have a dependency on the data-pull jobs and will wait to run until all data-pull jobs have completed.

Check output:

In the config you will have defined an output folder called $OUTDIR. The converted output, including the needed abias and radstat initial condition files, will be found there. The files are already in the directory structure expected by the global-workflow system, so you can move the contents of your $OUTDIR directly into your $ROTDIR/$COMROT.
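
For example, assuming $OUTDIR and $ROTDIR are already set in your shell, moving the converted ICs into place could be as simple as:

mv $OUTDIR/* $ROTDIR/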

Report bugs:

This is a preliminary version of the new chgres_cube code/scripts. Please report bugs to George Gayno (george.gayno@noaa.gov) and Kate Friedman (kate.friedman@noaa.gov).

Warm starts (from production)

The GFSv15 was implemented into production on June 12th, 2019 at 12z. The GFS was spun up ahead of that cycle, so production output for the system is available from the 00z cycle (2019061200) onward. Production output tarballs from the prior GFSv14 system are located in the same place on HPSS but have "hps" in the name, denoting that it ran on the Cray, whereas the GFS now runs in production on the Dell and has "dell1" in the tarball name.

See production output in the following location on HPSS:

/NCEPPROD/hpssprod/runhistory/rhYYYY/YYYYMM/YYYYMMDD

Example location:

/NCEPPROD/hpssprod/runhistory/rh2019/201907/20190704

Example listing for 2019070400 production tarballs:

[Kate.Friedman@m72a2 ~]$ hpsstar dir /NCEPPROD/hpssprod/runhistory/rh2019/201907/20190704 | grep gfs | grep 20190704_00
[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
******************************************************************
*   Welcome to the NESCC High Performance Storage System         *
*                                                                *
*   Current HPSS version: 7.4.3 Patch 2                          *
*                                                                *
*                                                                *
*           Please Submit Helpdesk Request to                    *
*              rdhpcs.hpss.help@noaa.gov                         *
*                                                                *
*  Announcements:                                                *
******************************************************************
Username: Kate.Friedman  UID: 2391  Acct: 2391(2391) Copies: 1 Firewall: off [hsi.5.0.2.p5 Thu Apr 26 13:19:38 UTC 2018]
/NCEPPROD/hpssprod/runhistory/rh2019/201907:
drwxr-xr-x    2 nwprod    prod           12800 Jul 10 07:39 20190704
[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
-rw-r-----    1 nwprod    rstprod  24201632768 Jul  6 10:39 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas.tar
-rw-r--r--    1 nwprod    prod           11040 Jul  6 10:39 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 15:20 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp1.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 15:20 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp1.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 15:39 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp2.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 15:39 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp2.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 15:57 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp3.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 15:57 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp3.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 16:17 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp4.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 16:17 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp4.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 16:38 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp5.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 16:38 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp5.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 16:58 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp6.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 16:58 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp6.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 17:17 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp7.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 17:17 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp7.tar.idx
-rw-r-----    1 nwprod    rstprod  104316883456 Jul  6 17:36 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp8.tar
-rw-r--r--    1 nwprod    prod          246560 Jul  6 17:36 gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190704_00.enkfgdas_restart_grp8.tar.idx
-rw-r-----    1 nwprod    rstprod   8213389824 Jul  6 04:57 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas.tar
-rw-r--r--    1 nwprod    prod          305440 Jul  6 04:57 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas.tar.idx
-rw-r--r--    1 nwprod    prod       760274432 Jul  6 04:57 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_flux.tar
-rw-r--r--    1 nwprod    prod            4896 Jul  6 04:57 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_flux.tar.idx
-rw-r--r--    1 nwprod    prod     95334748160 Jul  6 05:22 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_nemsio.tar
-rw-r--r--    1 nwprod    prod            8480 Jul  6 05:22 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_nemsio.tar.idx
-rw-r--r--    1 nwprod    prod      3623646720 Jul  6 04:57 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_pgrb2.tar
-rw-r--r--    1 nwprod    prod           31520 Jul  6 04:57 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_pgrb2.tar.idx
-rw-r-----    1 nwprod    rstprod  40406691840 Jul  6 05:04 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_restart.tar
-rw-r--r--    1 nwprod    prod           26400 Jul  6 05:04 gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190704_00.gdas_restart.tar.idx
-rw-r-----    1 nwprod    rstprod  21489377280 Jul  6 05:26 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs.tar
-rw-r--r--    1 nwprod    prod         2031392 Jul  6 05:26 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs.tar.idx
-rw-r--r--    1 nwprod    prod     46592740864 Jul  6 05:34 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_flux.tar
-rw-r--r--    1 nwprod    prod          214816 Jul  6 05:34 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_flux.tar.idx
-rw-r--r--    1 nwprod    prod     294403269120 Jul  6 07:01 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_nemsioa.tar
-rw-r--r--    1 nwprod    prod           23328 Jul  6 07:01 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_nemsioa.tar.idx
-rw-r--r--    1 nwprod    prod     336908471296 Jul  6 08:05 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_nemsiob.tar
-rw-r--r--    1 nwprod    prod           26912 Jul  6 08:05 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_nemsiob.tar.idx
-rw-r--r--    1 nwprod    prod     63337960960 Jul  6 05:44 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_pgrb2.tar
-rw-r--r--    1 nwprod    prod          400672 Jul  6 05:44 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_pgrb2.tar.idx
-rw-r--r--    1 nwprod    prod     43709473792 Jul  6 05:52 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_pgrb2b.tar
-rw-r--r--    1 nwprod    prod          400160 Jul  6 05:52 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_pgrb2b.tar.idx
-rw-r--r--    1 nwprod    prod     12637940736 Jul  6 05:55 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_restart.tar
-rw-r--r--    1 nwprod    prod            5408 Jul  6 05:55 gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190704_00.gfs_restart.tar.idx

The warm starts and other output from production are at C768 deterministic and C384 EnKF. The warm start files must be converted to your desired resolution(s) using global_chgres if you wish to run a different resolution. If you are running a C768/C384 experiment you can use them as is.

What files should you pull for starting a new experiment with warm starts from production?

That depends on which mode you want to run: free-forecast or cycled. Whichever mode you choose, navigate to the top of your COMROT and pull the entirety of the tarball(s) listed below for that mode. The files within the tarballs are already in the $CDUMP.$PDY/$CYC folder format expected by the system.

free-forecast

Two tarballs to pull:

File #1 (for starting cycle SDATE):

/NCEPPROD/hpssprod/runhistory/rhYYYY/YYYYMM/YYYYMMDD/gpfs_dell1_nco_ops_com_gfs_prod_gfs.YYYYMMDD_CC.gfs_restart.tar

File #2 (for prior cycle GDATE=SDATE-06):

/NCEPPROD/hpssprod/runhistory/rhYYYY/YYYYMM/YYYYMMDD/gpfs_dell1_nco_ops_com_gfs_prod_gdas.YYYYMMDD_CC.gdas_restart.tar
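
For example, for SDATE 2019090900 (GDATE 2019090818) and the same hypothetical COMROT used in the cycled example below, the pull would look like:

cd /scratch1/NCEPDEV/stmp4/Joe.Schmo/comrot/mytest
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190909_00.gfs_restart.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190908_18.gdas_restart.tar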

cycled

There are 18 tarballs to pull (9 for SDATE and 9 for GDATE (SDATE-06)):

HPSS path: /NCEPPROD/hpssprod/runhistory/rhYYYY/YYYYMM/YYYYMMDD/

Tarballs per cycle:

gpfs_dell1_nco_ops_com_gfs_prod_gdas.YYYYMMDD_CC.gdas_restart.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp1.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp2.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp3.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp4.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp5.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp6.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp7.tar
gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.YYYYMMDD_CC.enkfgdas_restart_grp8.tar

Go to the top of your COMROT/ROTDIR and pull the contents of all tarballs there. The tarballs already contain the needed directory structure.

Example for SDATE 2019090900 using the hpsstar utility:

cd /scratch1/NCEPDEV/stmp4/Joe.Schmo/comrot/mytest
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190909_00.gdas_restart.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp1.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp2.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp3.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp4.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp5.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp6.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp7.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190909/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190909_00.enkfgdas_restart_grp8.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_gdas.20190908_18.gdas_restart.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp1.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp2.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp3.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp4.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp5.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp6.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp7.tar
hpsstar get /NCEPPROD/hpssprod/runhistory/rh2019/201909/20190908/gpfs_dell1_nco_ops_com_gfs_prod_enkfgdas.20190908_18.enkfgdas_restart_grp8.tar

Warm starts (from pre-production parallels)

The most recent pre-implementation parallel series was for GFS v15 (Q2FY19):

  • What resolution are warm-starts available for? Warm-start ICs are saved at the resolution the model was run at (C768/C384) and can only be used to run at the same resolution combination. If you need to run a different resolution you will need to make your own cold-start ICs. See cold start section above.
  • What dates have warm-start files saved? Unfortunately the save frequency changed enough during the runs that it is hard to provide a definitive list.
  • What files? All warm-starts are saved in separate tarballs which include “restart” in the name. You need to pull the entirety of each tarball; all files included in the restart tarballs are needed.
  • Where are these tarballs? See below for the location on HPSS for each Q2FY19 pre-implementation parallel.
  • What tarballs do I need to grab for my experiment? Tarballs from two cycles are required. The tarballs are listed below, where $CDATE is your starting cycle and $GDATE is one cycle prior.
    • Free-forecast:
      • ../$CDATE/gfs_restarta.tar
      • ../$GDATE/gdas_restartb.tar
    • Cycled w/EnKF:
      • ../$CDATE/gdas_restarta.tar
      • ../$CDATE/enkfgdas_restarta_grp##.tar (where ## is 01 through 08) (note, older tarballs may include a period between enkf and gdas: "enkf.gdas")
      • ../$GDATE/gdas_restartb.tar
      • ../$GDATE/enkfgdas_restartb_grp##.tar (where ## is 01 through 08) (note, older tarballs may include a period between enkf and gdas: "enkf.gdas")
  • Where do I put the warm-start initial conditions? Extraction should occur right inside your COMROT. You may need to rename the enkf folder (enkf.gdas.$PDY -> enkfgdas.$PDY). See the example pull after the table below.
Time Period                                         Parallel Name     Archive Location on HPSS (/NCEPDEV/emc-global/5year/...)
Real-time (05/25/2018 ~ 06/12/2019)                 prfv3rt1          .../emc.glopara/WCOSS_C/Q2FY19/prfv3rt1
2017/2018 Winter/Spring (11/25/2017 ~ 05/31/2018)   fv3q2fy19retro1   .../Fanglin.Yang/WCOSS_DELL_P3/Q2FY19/fv3q2fy19retro1
2017 Summer/Fall Part 1 (05/25/2017 ~ 08/31/2017)   fv3q2fy19retro2   .../emc.glopara/WCOSS_C/Q2FY19/fv3q2fy19retro2
2017 Summer/Fall Part 2 (08/02/2017 ~ 11/30/2017)   fv3q2fy19retro2   .../Fanglin.Yang/WCOSS_DELL_P3/Q2FY19/fv3q2fy19retro2
2016/2017 Winter/Spring (11/25/2016 ~ 05/31/2017)   fv3q2fy19retro3   .../Fanglin.Yang/WCOSS_DELL_P3/Q2FY19/fv3q2fy19retro3
2016 Summer/Fall Part 1 (05/22/2016 ~ 08/25/2016)   fv3q2fy19retro4   .../emc.glopara/WCOSS_C/Q2FY19/fv3q2fy19retro4
2016 Summer/Fall Part 2 (08/17/2016 ~ 11/30/2016)   fv3q2fy19retro4   .../emc.glopara/WCOSS_DELL_P3/Q2FY19/fv3q2fy19retro4
2015/2016 Winter/Spring (11/25/2015 ~ 05/31/2016)   fv3q2fy19retro5   .../emc.glopara/WCOSS_DELL_P3/Q2FY19/fv3q2fy19retro5
2015 Summer/Fall (05/03/2015 ~ 11/30/2015)          fv3q2fy19retro6   .../emc.glopara/WCOSS_DELL_P3/Q2FY19/fv3q2fy19retro6
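
As an illustration, pulling cycled warm starts from the prfv3rt1 real-time parallel into your COMROT might look like the following sketch ($CDATE/$GDATE are placeholders for your starting and prior cycles, since the exact cycles with saved restarts vary; older runs may also include a period in the tarball names, e.g. enkf.gdas_restarta_grp01.tar):

cd /path/to/your/comrot/$PSLOT
hpsstar get /NCEPDEV/emc-global/5year/emc.glopara/WCOSS_C/Q2FY19/prfv3rt1/$CDATE/gdas_restarta.tar
hpsstar get /NCEPDEV/emc-global/5year/emc.glopara/WCOSS_C/Q2FY19/prfv3rt1/$CDATE/enkfgdas_restarta_grp01.tar   (repeat for grp02 through grp08)
hpsstar get /NCEPDEV/emc-global/5year/emc.glopara/WCOSS_C/Q2FY19/prfv3rt1/$GDATE/gdas_restartb.tar
hpsstar get /NCEPDEV/emc-global/5year/emc.glopara/WCOSS_C/Q2FY19/prfv3rt1/$GDATE/enkfgdas_restartb_grp01.tar   (repeat for grp02 through grp08)
mv enkf.gdas.$PDY enkfgdas.$PDY   (only if the extracted folders use the older enkf.gdas naming)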

Run setup scripts to generate experiment

If running with Rocoto make sure to have a Rocoto module loaded before running setup scripts:

module load rocoto

Free-forecast experiment

Scripts that will be used:

  • ush/rocoto/setup_expt_fcstonly.py
  • ush/rocoto/setup_workflow_fcstonly.py

1) Run experiment generator script (creates EXPDIR and COMROT)

NOTE: The following command examples include variables for reference but users should not use environment variables to submit the commands. Exporting variables like EXPDIR to your environment causes an error when the python scripts run. Please spell out the argument values explicitly when running both setup scripts.

cd ush/rocoto
./setup_expt_fcstonly.py --pslot $PSLOT --configdir $CONFIGDIR --idate $IDATE --edate $EDATE --res $RES --gfs_cyc $GFS_CYC --comrot $COMROT --expdir $EXPDIR

...where:

  • $PSLOT is the name of your experiment
  • $CONFIGDIR is the path to the /config folder under the copy of the system you're using (i.e. $PATH_TO_CLONE/parm/config/)
  • $IDATE is the initial start date of your run (first cycle CDATE, YYYYMMDDCC)
  • $EDATE is the ending date of your run (YYYYMMDDCC) and is the last cycle that will complete
  • $RES is the resolution of the forecast (i.e. 768 for C768)
  • $GFS_CYC is the forecast frequency (0 = none, 1 = 00z only [default], 2 = 00z & 12z, 4 = all cycles)
  • $COMROT is the path to your experiment output directory. DO NOT include PSLOT folder at end of path, it’ll be built for you.
  • $EXPDIR is the path to your experiment directory where your configs will be placed and where you will find your workflow monitoring files (i.e. rocoto database and xml file). DO NOT include PSLOT folder at end of path, it will be built for you.

Example:

cd ush/rocoto
./setup_expt_fcstonly.py --pslot test --configdir /home/Joe.Schmo/git/global-workflow/parm/config --idate 2020010100 --edate 2020010118 --res 384 --gfs_cyc 4 --comrot /some_large_disk_area/Joe.Schmo/comrot --expdir /some_safe_disk_area/Joe.Schmo/expdir

2) Set user and experiment settings

Go to your EXPDIR and check/change the following variables within your config.base now before running the next script:

  • ACCOUNT
  • HOMEDIR
  • STMP
  • PTMP
  • ARCDIR (location on disk for online archive used by verification system)
  • HPSSARCH (YES turns on archival)
  • HPSS_PROJECT (project on HPSS if archiving)
  • ATARDIR (location on HPSS if archiving)

Some of those variables will be found within a machine-specific if-block so make sure to change the correct ones for the machine you'll be running on.
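
For orientation, the relevant lines in config.base look roughly like the sketch below. The values shown are illustrative only; edit the entries already present in your config.base (inside the if-block for your machine) rather than pasting these in:

export ACCOUNT="fv3-cpu"                                   # your HPC project/account
export HOMEDIR="/path/to/your/home/area"
export STMP="/path/to/fast/scratch/space"
export PTMP="/path/to/fast/scratch/space"
export ARCDIR="/path/to/online/archive/$PSLOT"             # used by the verification system
export HPSSARCH="YES"                                      # YES turns on HPSS archival
export HPSS_PROJECT="emc-global"                           # your HPSS project
export ATARDIR="/NCEPDEV/$HPSS_PROJECT/1year/$USER/$PSLOT" # HPSS archive location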

Now is also the time to change any other variables/settings you wish to change in config.base or other configs. Do that now. Once done making changes to the configs in your EXPDIR go back to your clone to run the second setup script.

3) Run workflow generator script (creates ROCOTO xml in EXPDIR)

./setup_workflow_fcstonly.py --expdir $EXPDIR/$PSLOT

Example:

./setup_workflow_fcstonly.py --expdir /some_safe_disk_area/Joe.Schmo/expdir/test

4) Resulting files from setup scripts

You will now have a rocoto xml file in your EXPDIR ($PSLOT.xml) and a crontab file generated for your use. If you do not have a crontab file you may not have had the rocoto module loaded; to fix this, load a rocoto module and rerun the setup_workflow*.py script. Cron is handled differently on WCOSS-Dell (Mars/Venus), so follow the separate WCOSS-Dell cron instructions further below.

Cycled experiment

Scripts that will be used:

  • ush/rocoto/setup_expt.py
  • ush/rocoto/setup_workflow.py

1) Run experiment generator script (creates EXPDIR and COMROT)

NOTE: The following command examples include variables for reference but users should not use environment variables to submit the commands. Exporting variables like EXPDIR to your environment causes an error when the python scripts run. Please spell out the argument values explicitly when running both setup scripts.

cd ush/rocoto
./setup_expt.py --pslot $PSLOT --configdir $CONFIGDIR --idate $IDATE --edate $EDATE --comrot $COMROT --expdir $EXPDIR [ --icsdir $ICSDIR --resdet $RESDET --resens $RESENS --nens $NENS --gfs_cyc $GFS_CYC ]

Example:

cd ush/rocoto
./setup_expt.py --pslot test --configdir /home/Joe.Schmo/git/global-workflow/parm/config --idate 2020010100 --edate 2020010118 --comrot /some_large_disk_area/Joe.Schmo/comrot --expdir /some_safe_disk_area/Joe.Schmo/expdir --resdet 384 --resens 192 --nens 80 --gfs_cyc 4

...where:

  • $PSLOT is the name of your experiment
  • $CONFIGDIR is the path to the /config folder under the copy of the system you're using (i.e. $PATH_TO_CLONE/parm/config/)
  • $IDATE is the initial start date of your run (first cycle CDATE, YYYYMMDDCC)
  • $EDATE is the ending date of your run (YYYYMMDDCC) and is the last cycle that will complete
  • $ICSDIR is the path to the ICs for your run if generated separately.
  • $COMROT is the path to your experiment output directory. Do not use noscrub space on Cray for COMROT, use ptmp. DO NOT include PSLOT folder at end of path, it’ll be built for you.
  • $EXPDIR is the path to your experiment directory where your configs will be placed and where you will find your workflow monitoring files (i.e. rocoto database and xml file). DO NOT include PSLOT folder at end of path, it will be built for you.
  • $RESDET is the resolution of the deterministic forecast (i.e. ‘--resdet 768’, optional, default is C384)
  • $RESENS is the resolution of the ensemble (EnKF) forecast (i.e. ‘--resens 384’, optional, default is C192)
  • $NENS is the number of ensemble members (optional, default is 20)
  • $GFS_CYC is the cycle frequency of the long GFS forecast (0 = none, 1 = 00z only [default], 2 = 00z & 12z, 4 = all cycles)

Example setup_expt.py on WCOSS_C:

SURGE-slogin1 > ./setup_expt.py --pslot fv3demo --configdir /gpfs/hps3/emc/global/noscrub/Joe.Schmo/git/global-workflow/parm/config --idate 2017073118 --edate 2017080106 --comrot /gpfs/hps2/ptmp/Joe.Schmo --expdir /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs

SDATE = 2017-07-31 18:00:00
EDATE = 2017-08-01 06:00:00

EDITED:  /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs/fv3demo/config.base as per user input.
DEFAULT: /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs/fv3demo/config.base.default is for reference only.
Please verify and delete the default file before proceeding.

SURGE-slogin1 >

The message about config.base.default means you are free to delete that file if you wish, but it is not necessary to remove it. Your resulting config.base was generated from config.base.default, and the default file is kept for your reference.

What happens if I run setup_expt.py again for an experiment that already exists:

SURGE-slogin1 > ./setup_expt.py --pslot fv3demo --configdir /gpfs/hps3/emc/global/noscrub/Joe.Schmo/git/global-workflow/parm/config --idate 2017073118 --edate 2017080106 --comrot /gpfs/hps2/ptmp/Joe.Schmo --expdir /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs

COMROT already exists in /gpfs/hps2/ptmp/Joe.Schmo/fv3demo

Do you wish to over-write COMROT [y/N]: y

EXPDIR already exists in /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs/fv3demo

Do you wish to over-write EXPDIR [y/N]: y

SDATE = 2017-07-31 18:00:00
EDATE = 2017-08-01 06:00:00

EDITED:  /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs/fv3demo/config.base as per user input.
DEFAULT: /gpfs/hps3/emc/global/noscrub/Joe.Schmo/para_gfs/fv3demo/config.base.default is for reference only.
Please verify and delete the default file before proceeding.

Your COMROT and EXPDIR will be deleted and remade. Be careful with this!

2) Set user and experiment settings

Go to your EXPDIR and check/change the following variables within your config.base now before running the next script:

  • ACCOUNT
  • HOMEDIR
  • STMP
  • PTMP
  • ARCDIR (location on disk for online archive used by verification system)
  • HPSSARCH (YES turns on archival)
  • HPSS_PROJECT (project on HPSS if archiving)
  • ATARDIR (location on HPSS if archiving)

Some of those variables will be found within a machine-specific if-block so make sure to change the correct ones for the machine you'll be running on.

Now is also the time to change any other variables/settings you wish to change in config.base or other configs. Do that now. Once done making changes to the configs in your EXPDIR go back to your clone to run the second setup script.

3) Run workflow generator script (creates ROCOTO xml in EXPDIR)

./setup_workflow.py --expdir $EXPDIR/$PSLOT

Example:

./setup_workflow.py --expdir /some_safe_disk_area/Joe.Schmo/expdir/test

4) Resulting files from setup scripts

You will now have a rocoto xml file in your EXPDIR ($PSLOT.xml) and a crontab file generated for your use. If you do not have a crontab file you may not have had the rocoto module loaded; to fix this, load a rocoto module and rerun the setup_workflow*.py script. Cron is handled differently on WCOSS-Dell (Mars/Venus), so follow the separate WCOSS-Dell cron instructions further below.

Start the run

Make sure a rocoto module is loaded: module load rocoto

If needed check for available rocoto modules on machine: module avail rocoto or module spider rocoto

Start your run from within your EXPDIR:

rocotorun -d $PSLOT.db -w $PSLOT.xml

The first jobs of your run should now be queued or already running (depending on machine traffic). How exciting!

You'll now have a "logs" folder in both your COMROT and EXPDIR. The EXPDIR log folder contains workflow log files (e.g. rocoto command results) and the COMROT log folder will contain logs for each job (previously known as dayfiles).

Set up your experiment cron

HPCs with access to directly edit your crontab files (WCOSS-Cray, Hera, Jet)

crontab -e

or

crontab $PSLOT.crontab

(WARNING: the "crontab $PSLOT.crontab" command will overwrite the existing crontab file on your login node. If you are running multiple crons it is recommended to edit your crontab file with the "crontab -e" command instead.)

Check your crontab setting:

crontab -l

Crontab uses the following format:

*/5 * * * * /path/to/rocotorun -w /path/to/workflow/definition/file -d /path/to/workflow/database/file

HPCs without access to directly edit your crontab files (WCOSS-Dell)

Go to your home cron directory:

cd ~/cron

Open the admin-provided mycrontab file:

vi mycrontab

See the cron example provided in the initial mycrontab file. It is recommended to set up a script that runs your rocotorun commands and have the mycrontab file invoke that script (a sketch follows the example below).

#20 * * * * test -f /gpfs/dell2/ptmp/User.Name/cron/mycronscript-2.ksh && /gpfs/dell2/ptmp/User.Name/cron/mycronscript-2.ksh 1>/gpfs/dell2/ptmp/User.Name/cron/mycronscript-2.log 2>&1
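
A minimal sketch of what such a script might contain (paths and names are illustrative; point them at your own EXPDIR, database, and xml files):

#!/bin/ksh
# Hypothetical mycronscript: advance the workflow for one experiment.
# Use the full path to rocotorun so no module environment is needed under cron.
/path/to/rocotorun -w /path/to/expdir/mytest/mytest.xml -d /path/to/expdir/mytest/mytest.db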

Edit, save, and exit file. New or updated crons will begin the next time the time condition is met.

Monitor your rocoto-based run

Click here to view full rocoto documentation on GitHub:

https://github.com/christopherwharrop/rocoto/wiki/documentation

Use rocoto commands on the command line

Start or continue a run:

rocotorun -d /path/to/workflow/database/file -w /path/to/workflow/xml/file

Check the status of the workflow:

rocotostat -d /path/to/workflow/database/file -w /path/to/workflow/xml/file [-c YYYYMMDDCCmm,[YYYYMMDDCCmm,...]] [-t taskname,[taskname,...]] [-s] [-T]

Note: YYYYMMDDCCmm = year, month, day, cycle hour, minute, where the minute (mm) is currently ’00’ for all cycles.

Check the status of a job:

rocotocheck -d /path/to/workflow/database/file -w /path/to/workflow/xml/file -c YYYYMMDDCCmm -t taskname

Force a task to run (ignores dependencies - USE CAREFULLY!):

rocotoboot -d /path/to/workflow/database/file -w /path/to/workflow/xml/file -c YYYYMMDDCCmm -t taskname

Rerun task(s):

rocotorewind -d /path/to/workflow/database/file -w /path/to/workflow/xml/file -c YYYYMMDDCCmm -t taskname

Several dates and task names may be specified in the same command by repeating the -c and -t options; comma-separated lists are not allowed for these commands.
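
For example, rewinding two tasks in two different cycles in a single command would look like this (task names are illustrative; use rocotostat to see the task names in your experiment):

rocotorewind -d /path/to/workflow/database/file -w /path/to/workflow/xml/file -c 201909090000 -c 201909090600 -t gdasfcst -t gdasanal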

Use rocoto viewer

A GUI was designed to assist with monitoring rocoto experiments. It can be found under the ush/rocoto folder in global-workflow.

Usage:

./rocoto_viewer.py -d /path/to/workflow/database/file -w /path/to/workflow/xml/file

Note 1: The terminal/window must be wide enough to display all experiment information columns; the viewer will complain if it is not.

Note 2: The viewer requires the full path to the database and xml files if you are not in your EXPDIR when you invoke it.

An example screenshot of the viewer is available here: https://vlab.ncep.noaa.gov/redmine/attachments/download/24069/rocoto_viewer_example.png

What the viewer shows:

  • First column: cycle (YYYYMMDDCCmm, YYYY=year, MM=month, DD=day, CC=cycle hour, mm=minute)
  • Second column: task name (a "<" symbol indicates a group/meta-task, click "x" when meta-task is selected to expand/collapse)
  • Third column: job ID from scheduler
  • Fourth column: job state (QUEUED, RUNNING, SUCCEEDED, FAILED, or DEAD)
  • Fifth column: exit code (0 if all ended well)
  • Sixth column: number of tries/attempts to run job (0 when not yet run or just rewound, 1 when run once successfully, 2+ for multiple tries up to max try value where job is considered DEAD)
  • Seventh column: job duration in seconds

How to navigate the viewer:

The rocoto viewer accepts both mouse and keyboard inputs. Press “h” for the help menu and more options.

Available viewer commands:

c = get information on selected job
r = rewind (rerun) selected job, group, or cycle
R = run rocotorun
b = boot (forcibly run) selected job or group
-> = right arrow key, advance viewer forward to next cycle
<- = left arrow key, advance viewer backward to previous cycle
Q = quit/exit viewer

Advanced features:

  • Select multiple tasks at once
    • Press “Enter” on a task to select it, then click on other tasks or use the up/down arrows to move to other tasks and press “Enter” to select them as well.
    • When you next choose “r” for rewinding, the pop-up window will ask if you are sure you want to rewind all of the selected tasks.
  • Rewind an entire group or cycle
    • Group - While the group/metatask is collapsed (<), press “r” to rewind the whole group/metatask.
    • Cycle - Use the up arrow to move the selector up past the first task until the entire left column is highlighted. Press “r” and the entire cycle will be rewound.

View experiment output

The output from your run will be found in the COMROT/ROTDIR you established. This is also where you placed your initial conditions. Within your COMROT you will have the following directory structure:

free-forecast

gfs.YYYYMMDD/CC/               <- contains deterministic long forecast gfs inputs/outputs
logs/                          <- logs for each cycle in the run
vrfyarch/                      <- contains files related to verification and archival

cycled

enkfgdas.YYYYMMDD/CC/mem###/   <- contains EnKF inputs/outputs for each cycle and each member
gdas.YYYYMMDD/CC/              <- contains deterministic gdas inputs/outputs
gfs.YYYYMMDD/CC/               <- contains deterministic long forecast gfs inputs/outputs, available from the first full cycle on depending on chosen gfs long forecast frequency (gfs_cyc)
logs/                          <- logs for each cycle in the run
vrfyarch/                      <- contains files related to verification and archival

Here is an example COMROT for a cycled run as it may look several days in (note the archival steps remove older cycle folders as the run progresses):

-bash-4.2$ ll /scratch1/NCEPDEV/stmp4/Joe.Schmo/comrot/testcyc192
total 88
drwxr-sr-x   4 Joe.Schmo stmp  4096 Oct 22 04:50 enkfgdas.20190529
drwxr-sr-x   4 Joe.Schmo stmp  4096 Oct 22 07:20 enkfgdas.20190530
drwxr-sr-x   6 Joe.Schmo stmp  4096 Oct 22 03:15 gdas.20190529
drwxr-sr-x   4 Joe.Schmo stmp  4096 Oct 22 07:15 gdas.20190530
drwxr-sr-x   6 Joe.Schmo stmp  4096 Oct 22 03:15 gfs.20190529
drwxr-sr-x   4 Joe.Schmo stmp  4096 Oct 22 07:15 gfs.20190530
drwxr-sr-x 120 Joe.Schmo stmp 12288 Oct 22 07:15 logs
drwxr-sr-x  13 Joe.Schmo stmp  4096 Oct 22 07:07 vrfyarch

Common errors, known issues, and their solutions

Error: "ImportError" message when running setup script

Example of error:

$ ./setup_workflow.py --expdir /path/to/your/experiment/directory
Traceback (most recent call last):
  File "./setup_workflow.py", line 32, in <module>
	from collections import OrderedDict
ImportError: cannot import name OrderedDict

Cause: Missing python in your environment

Solution: Load a python module ("module load python") and retry setup script.

Error: curses default colors when running viewer

Example:

$ ./rocoto_viewer.py -d blah.db -w blah.xml
Traceback (most recent call last):
  File "./rocoto_viewer.py", line 2376, in <module>
    curses.wrapper(main)
  File "/contrib/anaconda/anaconda2/4.4.0/lib/python2.7/curses/wrapper.py", line 43, in wrapper
    return func(stdscr, *args, **kwds)
  File "./rocoto_viewer.py", line 1202, in main
    curses.use_default_colors()
_curses.error: use_default_colors() returned ERR

Cause: wrong TERM setting for curses

Solution: set TERM to "xterm" (bash: export TERM=xterm ; csh/tcsh: setenv TERM xterm)

Issue: Directory name change for EnKF folder in COMROT

Issue: The EnKF COMROT folders were renamed during the GFS v15 development process to remove the period between "enkf" and "gdas": enkf.gdas.$PDY → enkfgdas.$PDY

Fix: Older tarballs on HPSS will have the older directory name with the period between 'enkf' and 'gdas'. Make sure to rename the folder to 'enkfgdas.$PDY' after extracting (see the example below). This is only an issue for the initial cycle.
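
For example, for an initial cycle on 20190909 (date is illustrative) the rename would be:

cd /path/to/your/comrot/$PSLOT
mv enkf.gdas.20190909 enkfgdas.20190909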

Error: Git ssh variant setting when running checkout_externals

Error seen on WCOSS-Cray (Luna/Surge).

Example:

$ checkout_externals
Processing externals description file : Externals.cfg
Checking status of externals: nemsfv3gfs, emc_post, ufs_utils, gsi, emc_gfs_wafs, emc_verif-global,
Checking out externals: nemsfv3gfs, ERROR:root:Command '[u'git', u'clone', u'--quiet', u'ssh://vlab.ncep.noaa.gov:29418/NEMSfv3gfs', u'fv3gfs.fd']' returned non-zero exit status 128
ERROR:root:Failed with output:
    fatal: ssh variant 'simple' does not support setting port

ERROR: In directory
    /gpfs/hps3/emc/global/noscrub/Joe.Schmo/git/feature-manage_externals/sorc
Process did not run successfully; returned status 128:
    git clone --quiet ssh://vlab.ncep.noaa.gov:29418/NEMSfv3gfs fv3gfs.fd
See above for output from failed command.

Cause: Git ssh variant 'simple' does not support setting port

Solution: Adjust git config ssh setting:

$ git config --global ssh.variant ssh