diff --git a/doc/src/SUMMARY.md b/doc/src/SUMMARY.md index 0097403..3180188 100644 --- a/doc/src/SUMMARY.md +++ b/doc/src/SUMMARY.md @@ -16,16 +16,16 @@ - [Working with signac projects](guide/python/signac.md) - [Writing action commands in Python](guide/python/actions.md) - [Concepts](guide/concepts/index.md) - - [Best practices](guide/concepts/best-practices.md) - [Process parallelism](guide/concepts/process-parallelism.md) - [Thread parallelism](guide/concepts/thread-parallelism.md) - [Directory status](guide/concepts/status.md) - [JSON pointers](guide/concepts/json-pointers.md) - [Cache files](guide/concepts/cache.md) -- [How-to topics](guide/howto/index.md) +- [How-to](guide/howto/index.md) + - [Best practices](guide/howto/best-practices.md) - [Set the cluster account](guide/howto/account.md) - [Submit the same action to different groups/resources](guide/howto/same.md) - - [Use an action to summarize many directories]() + - [Summarize directory groups with an action](guide/howto/summarize.md) # Reference diff --git a/doc/src/clusters/built-in.md b/doc/src/clusters/built-in.md index 68dc76e..06a3477 100644 --- a/doc/src/clusters/built-in.md +++ b/doc/src/clusters/built-in.md @@ -4,47 +4,50 @@ ## Anvil (Purdue) -[Anvil documentation](https://www.rcac.purdue.edu/knowledge/anvil). - -**Row** automatically selects from the following partitions: +**Row** automatically selects from the following partitions on [Anvil]: * `shared` * `wholenode` * `gpu` Other partitions may be selected manually. -There is no need to set `--mem-per-*` options on Anvil as the cluster automatically +There is no need to set `--mem-per-*` options on [Anvil] as the cluster automatically chooses the largest amount of memory available per core by default. -## Delta (NCSA) +> Note: The whole node partitions **require** that each job submitted request an +> integer multiple of 128 CPU cores. -[Delta documentation](https://docs.ncsa.illinois.edu/systems/delta). +[Anvil]: https://www.rcac.purdue.edu/knowledge/anvil -**Row** automatically selects from the following partitions: +## Delta (NCSA) + +**Row** automatically selects from the following partitions on [Delta]: * `cpu` * `gpuA100x4` Other partitions may be selected manually. -Delta jobs default to a small amount of memory per core. **Row** inserts `--mem-per-cpu` -or `--mem-per-gpu` to select the maximum amount of memory possible that allows full-node -jobs and does not incur extra charges. +[Delta] jobs default to a small amount of memory per core. **Row** inserts +`--mem-per-cpu` or `--mem-per-gpu` to select the maximum amount of memory possible that +allows full-node jobs and does not incur extra charges. -## Great Lakes (University of Michigan) +[Delta]: https://docs.ncsa.illinois.edu/systems/delta -[Great Lakes documentation](https://arc.umich.edu/greatlakes/). +## Great Lakes (University of Michigan) -**Row** automatically selects from the following partitions: +**Row** automatically selects from the following partitions on [Great Lakes]: * `standard` * `gpu_mig40,gpu` * `gpu` Other partitions may be selected manually. -Great Lakes jobs default to a small amount of memory per core. **Row** inserts +[Great Lakes] jobs default to a small amount of memory per core. **Row** inserts `--mem-per-cpu` or `--mem-per-gpu` to select the maximum amount of memory possible that allows full-node jobs and does not incur extra charges. > Note: The `gpu_mig40,gpu` partition is selected only when there is one GPU per job. 
> This is a combination of 2 partitions which decreases queue wait time due to the > larger number of nodes that can run your job. + +[Great Lakes]: https://arc.umich.edu/greatlakes/ diff --git a/doc/src/clusters/cluster.md b/doc/src/clusters/cluster.md index a06885f..92712e3 100644 --- a/doc/src/clusters/cluster.md +++ b/doc/src/clusters/cluster.md @@ -38,10 +38,10 @@ on this cluster. The table **must** have one of the following keys: * `by_environment`: **array** of two strings - Identify the cluster when the environment variable `by_environment[0]` is set and equal to `by_environment[1]`. * `always`: **bool** - Set to `true` to always identify this cluster. When `false`, - this cluster can only be chosen by an explicit `--cluster` option. + this cluster may only be chosen by an explicit `--cluster` option. > Note: The *first* cluster in the list that sets `identify.always = true` will prevent -> any later cluster from being identified. +> any later cluster from being identified (except by explicit `--cluster=name`). ## scheduler @@ -87,16 +87,18 @@ will pass this option to the scheduler. For example SLURM schedulers will set ### cpus_per_node -`cluster.partition.cpus_per_node`: **string** - Number of CPUs per node. When -`cpus_per_node` is not set, **row** will ask the scheduler to schedule only a given -number of tasks. In this case, some schedulers are free to spread tasks among any -number of nodes (for example, shared partitions on Slurm schedulers). +`cluster.partition.cpus_per_node`: **string** - Number of CPUs per node. -When `cpus_per_node` is set, **row** will request the minimal number of nodes needed -to satisfy `n_nodes * cpus_per_node >= total_cpus`. This may result in longer queue -times, but will lead to more stable performance for users. +When `cpus_per_node` is not set, **row** will request `n_processes` tasks. In this case, +some schedulers are free to spread tasks among any number of nodes (for example, shared +partitions on Slurm schedulers). -Set `cpus_per_node` only when all nodes in the partition have the same number of CPUs. +When `cpus_per_node` is set, **row** will **also** request the minimal number of nodes +needed to satisfy `n_nodes * cpus_per_node >= total_cpus`. This may result in longer +queue times, but will lead to more stable performance for users. + +> Note: Set `cpus_per_node` only when all nodes in the partition have the same number +> of CPUs. ### minimum_gpus_per_job @@ -131,7 +133,7 @@ will pass this option to the scheduler. For example SLURM schedulers will set ### gpus_per_node `cluster.partition.gpus_per_node`: **string** - Number of GPUs per node. Like -`cpus_per_node` but used on jobs that request GPUs. +`cpus_per_node` but used when jobs request GPUs. ### prevent_auto_select @@ -140,6 +142,6 @@ automatically selecting this partition. ### account_suffix -`cluster.partition.account_suffix`: **string** - Set to provide an account suffix -when submitting jobs to this partition. Useful when clusters define separate -`aacount-cpu` and `account-gpu` accounts. +`cluster.partition.account_suffix`: **string** - An account suffix when submitting jobs +to this partition. Useful when clusters define separate `account-cpu` and `account-gpu` +accounts. diff --git a/doc/src/clusters/index.md b/doc/src/clusters/index.md index cccc058..9f521ec 100644 --- a/doc/src/clusters/index.md +++ b/doc/src/clusters/index.md @@ -18,13 +18,15 @@ name = "cluster2" ``` User-provided clusters in `$HOME/.config/row/clusters.toml` are placed first in the -array. +array. 
Execute [`row show cluster --all`](../row/show/cluster.md) to see the complete
+cluster configuration.

## Cluster identification

On startup, **row** iterates over the array of clusters in order. If `--cluster` is not
set, **row** checks the `identify` condition in the configuration. If `--cluster` is
-set, **row** checks to see if the name matches.
+set, **row** checks to see if the name matches. **Row** selects the *first* cluster
+that matches.

-> Note: **Row** uses the *first* such match. To override a built-in, your configuration
-> should include a cluster by the same name and `identify` condition.
+> To override a built-in, your configuration should include a cluster by the same name
+> and `identify` condition.
diff --git a/doc/src/developers/contributing.md b/doc/src/developers/contributing.md
index fe8cf03..7fb793d 100644
@@ -2,7 +2,7 @@

Contributions are welcomed via [pull requests on GitHub][github]. Contact the **row**
developers before starting work to ensure it meshes well with the planned development
-direction and standards set for the project.
+direction and follows standards set for the project.

[github]: https://github.com/glotzerlab/row

@@ -17,27 +17,31 @@ assist you in designing flexible interfaces.

Expensive code paths should only execute when requested.

+### Maintain compatibility
+
+New features should be opt-in and *preserve the behavior* of all existing user scripts.
+
## Version control

### Base your work off the correct branch

-- Base all new work on `trunk`.
+Base all bug fixes and new features on `trunk`.

### Propose a minimal set of related changes

-All changes in a pull request should be closely related. Multiple change sets that are
-loosely coupled should be proposed in separate pull requests.
+All changes in a pull request should be *closely related*. Multiple change sets that are
+loosely coupled should be proposed in *separate pull requests*.

### Agree to the Contributor Agreement

-All contributors must agree to the Contributor Agreement before their pull request can
-be merged.
+All contributors must agree to the **Contributor Agreement** before their pull request
+can be merged.

### Set your git identity

Git identifies every commit you make with your name and e-mail. [Set your identity][id]
-to correctly identify your work and set it identically on all systems and accounts where
-you make commits.
+to correctly identify your work and set it *identically on all systems* and accounts
+where you make commits.

[id]: http://www.git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup

@@ -45,12 +49,12 @@ you make commits.

### Use a consistent style

-The **Code style** section of the documentation sets the style guidelines for **row**
-code.
+Follow all guidelines outlined in the [Code style](style.md) section of the
+documentation.

### Document code with comments

-Use **Rust** documentation comments for classes, functions, etc. Also comment complex
+Write Rust documentation comments for traits, functions, etc. Also comment complex
sections of code so that other developers can understand them.

### Compile without warnings

@@ -61,12 +65,12 @@ Your changes should compile without warnings.

### Write unit tests

-Add unit tests for all new functionality.
+Add unit tests for all new functionality and bug fixes.

-### Validity tests
+### Test validity

-The developer should run research-scale simulations using the new functionality and
-ensure that it behaves as intended.
When appropriate, add a new test to `validate.py`.
+Run research-scale simulations using new functionality and ensure that it behaves as
+intended.

## User documentation

@@ -77,8 +81,7 @@ and any important user-facing change in the mdBook documentation.

### Tutorial

-When applicable, update or write a new tutorial.
-
+When applicable, update or write a new tutorial or how-to guide.

### Add developer to the credits
diff --git a/doc/src/developers/style.md b/doc/src/developers/style.md
index 6718f61..ca96b58 100644
@@ -3,7 +3,8 @@

## Rust

**Row's** rust code follows the [Rust style guide][1]. **Row's** [pre-commit][2]
-configuration applies style fixes with `rustfmt` checks for common errors with `clippy`.
+configuration applies style fixes with `rustfmt` and checks for common errors with
+`clippy`.

[1]: https://doc.rust-lang.org/style-guide/index.html
[2]: https://pre-commit.com/

@@ -16,7 +17,7 @@ configuration applies style fixes with `rustfmt` checks for common errors with `

Wrap **Markdown** files at 88 characters wide, except when not possible (e.g. when
formatting a table). Follow layout and design patterns established in existing markdown
-files.
+files. Use reference-style links for long URLs.

## Spelling/grammar
diff --git a/doc/src/developers/testing.md b/doc/src/developers/testing.md
index 4f67fa6..22280db 100644
@@ -8,9 +8,12 @@ cargo test
```
in the source directory to execute the unit and integration tests.

-All tests must be marked either `#[serial]` or `#[parallel]` explicitly. Some serial
-tests set environment variables and/or the current working directory, which may conflict
-with any test that is automatically run concurrently. Check for this with:
+## Writing unit tests
+
+Write tests using standard Rust conventions. All tests must be marked either `#[serial]`
+or `#[parallel]` explicitly. Some serial tests set environment variables and/or the
+current working directory, which may conflict with any test that is automatically run
+concurrently. Check for this with:
```bash
rg --multiline "#\[test\]\n *fn"
```
diff --git a/doc/src/env.md b/doc/src/env.md
index 0222864..ad7ddf1 100644
@@ -1,7 +1,6 @@
# Environment variables

-> Note: Environment variables that influence the execution of **row** are documented in
-> [the command line options](row/index.md).
+## In job scripts

**Row** sets the following environment variables in generated job scripts:

@@ -14,3 +13,18 @@
| `ACTION_PROCESSES_PER_DIRECTORY` | Set to the value of `action.resources.processes_per_directory`. Unset when `processes_per_submission` is set. |
| `ACTION_THREADS_PER_PROCESS` | Set to the value of `action.resources.threads_per_process`. Unset when `threads_per_process` is omitted. |
| `ACTION_GPUS_PER_PROCESS` | Set to the value of `action.resources.gpus_per_process`. Unset when `gpus_per_process` is omitted. |
+
+## Set row options
+
+Set any of these environment variables to provide default values for
+[command line options].
+
+| Environment variable | Option |
+|----------------------|-------------------|
+| `ROW_CLEAR_PROGRESS` | `--clear-progress` |
+| `ROW_CLUSTER` | `--cluster` |
+| `ROW_COLOR` | `--color` |
+| `ROW_IO_THREADS` | `--io-threads` |
+| `ROW_NO_PROGRESS` | `--no-progress` |
+
+[command line options]: row/index.md
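+
+For example, a minimal sketch (the exported value is illustrative; each variable simply
+supplies the default for the listed option):
+```bash
+# Default --cluster to "none" (execute directly, without a scheduler)
+# for every row invocation in this shell session.
+export ROW_CLUSTER=none
+row submit  # now equivalent to: row submit --cluster=none
+```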
diff --git a/doc/src/guide/concepts/best-practices.md b/doc/src/guide/concepts/best-practices.md
deleted file mode 100644
index 2731a94..0000000
--- a/doc/src/guide/concepts/best-practices.md
+++ /dev/null
@@ -1,47 +0,0 @@
-# Best practices
-
-Follow these guidelines to use **row** effectively.
-
-## Exit actions early when products already exist.
-
-There are some cases where **row** may fail to identify when your action completes:
-
-* Software exits with an unrecoverable error.
-* Your job exceeds its walltime and is killed.
-* And many others...
-
-To ensure that your action executes as intended, you should **check for the existence
-of product files** when your action starts and **exit immediately** when they already
-exist. This way, resubmitting an already completed job will not needlessly recompute
-results or overwrite files you intended to keep.
-
-## Write to temporary files and move them to the final product location.
-
-For example, say `products = ["output.dat"]`. Write to `output.dat.in_progress`
-while your calculation executes. Once the action is fully complete, *move*
-`output.dat.in_progress` to `output.dat`.
-
-If you wrote directly to `output.dat`, **row** might identify your computation as
-**complete** right after it starts. This pattern also allows you to *continue* running
-one calculation over several job submissions. Move the output file to its final location
-only after the final submission completes the calculation.
-
-## Group directories whenever possible, but not to an extreme degree.
-
-The **scheduler** does an excellent job handling the queue. However, there is some
-overhead and the scheduler can only process so many jobs at a time. Your cluster may
-even limit how many jobs you are allowed to queue. So please don't submit thousands of
-jobs at a time to your cluster. You can improve your workflow's throughput by grouping
-directories together into a smaller number of jobs.
-
-Group jobs that execute quickly in serial with `processes.per_submission` and
-`walltime.per_directory`. After a given job has waited in the queue, it can process many
-directories before exiting. Limit group sizes so that the total wall time of the job
-remains reasonable.
-
-Group jobs that take a longer time in parallel using MPI partitions,
-`processes.per_directory` and `walltime.per_submission`. Limit the group sizes to a
-relatively small fraction of the cluster (*except on Leadership class machines*).
-Huge parallel jobs may wait a long time in queue before starting. Experiment with the
-`group.maximum_size` value and find a good job size (in number of nodes) that balances
-queue time vs. scheduler overhead.
diff --git a/doc/src/guide/concepts/cache.md b/doc/src/guide/concepts/cache.md
index 1fdbc04..9bcea36 100644
@@ -3,7 +3,7 @@

**Row** stores cache files in `<project root>/.row` to improve performance. In most
usage environments **row** will automatically update the cache and keep it
synchronized with the state of the workflow and workspace. The rest of this document describes
-some scenarios where they cache may not be updated and how you fix the problem.
+some scenarios where the cache may not be updated and how you can fix the problem.

## Directory values

@@ -19,13 +19,13 @@ invalid when:

## Submitted jobs

-**Row** caches the job ID, directory, and cluster name for every job it submits
+**Row** caches the *job ID*, *directory*, and *cluster name* for every job it submits
to a cluster via `row submit`. **Row** will be unaware of any jobs that you manually
submit with `sbatch`.

> You should submit all jobs via:
> ```bash
-> `row submit`
+> row submit
> ```

Copying a project directory (including `.row/`) from one cluster to another (or from
@@ -34,11 +34,11 @@ access the job queue of the first, so all jobs will remain in the cache.
*Submitting jobs on the 2nd cluster will inevitably lead to changes in the submitted
cache on both clusters that cannot be merged.*

-> Before you copy your project directory, wait for all jobs to finish, then execute
+> Wait for all jobs to finish, then execute
> ```bash
> row show status
> ```
-> to update the cache.
+> to update the cache. Now the submitted cache is empty and safe to copy.

## Completed directories

@@ -50,7 +50,7 @@ if:

* *You change products* in `workflow.toml`.
* *You change the name of an action* in `workflow.toml`.

-> To discover new completed directories, execute
+> To discover all completed directories, execute
> ```bash
> row scan
> ```
diff --git a/doc/src/guide/concepts/process-parallelism.md b/doc/src/guide/concepts/process-parallelism.md
index 206c44a..7bbfe01 100644
@@ -13,8 +13,7 @@ processes: e.g. `launcher = ["mpi"]`.
> **processes**.

At this time **MPI** is the only **process** launcher that **row** supports. You can
-configure additional launchers in [`launchers.toml`](../../launchers/index.md) if your
-cluster and application use a different launcher.
+configure additional launchers in [`launchers.toml`](../../launchers/index.md).

Use **MPI** parallelism to launch:
* MPI-enabled applications on one directory (`processes.per_submission = N`,
@@ -27,4 +26,6 @@ Use **MPI** parallelism to launch:
(`processes.per_directory = N`). Instruct your application to *partition* the MPI
communicator.

-TODO: Concrete examples
+TODO: Provide a concrete example using HOOMD
+
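+In the meantime, here is a minimal **mpi4py** sketch of the *partition* pattern. It
+assumes `processes.per_directory` ranks per directory and a command like
+`python actions.py {directories}`; all names are illustrative:
+```python
+import os
+import sys
+
+from mpi4py import MPI
+
+# row sets this to the value of action.resources.processes_per_directory.
+processes_per_directory = int(os.environ["ACTION_PROCESSES_PER_DIRECTORY"])
+directories = sys.argv[1:]  # filled in by the {directories} expansion
+
+world = MPI.COMM_WORLD
+# Ranks 0..P-1 handle the first directory, P..2P-1 the second, and so on.
+index = world.Get_rank() // processes_per_directory
+local = world.Split(color=index, key=world.Get_rank())
+
+directory = directories[index]
+# ... execute the per-directory calculation on the `local` communicator ...
+```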
diff --git a/doc/src/guide/concepts/status.md b/doc/src/guide/concepts/status.md
index 2928b6c..9a387c7 100644
@@ -1,7 +1,8 @@
# Directory status

-For each action, each directory in the workspace that matches the action's
-[include condition](../../workflow/action/group.md#include) has a single status:
+Each directory in the workspace that matches the action's
+[include condition](../../workflow/action/group.md#include) has a *single* status for
+that action:

| Status | Description |
|--------|-------------|
diff --git a/doc/src/guide/concepts/thread-parallelism.md b/doc/src/guide/concepts/thread-parallelism.md
index da1d0a1..a427b47 100644
@@ -24,4 +24,4 @@ provide some way to set the number of threads/processes. Use the environment var
`ACTION_THREADS_PER_PROCESS` to ensure that the number of executed threads matches that
requested.

-TODO: Concrete examples
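+For example, a minimal sketch with the standard-library `multiprocessing` module
+(the work function is a placeholder):
+```python
+import multiprocessing
+import os
+
+# Fall back to 1 when threads_per_process is omitted from the action.
+threads = int(os.environ.get("ACTION_THREADS_PER_PROCESS", "1"))
+
+
+def work(item):
+    return item * item  # placeholder for the real per-item computation
+
+
+if __name__ == "__main__":
+    with multiprocessing.Pool(processes=threads) as pool:
+        print(pool.map(work, range(16)))
+```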
diff --git a/doc/src/guide/howto/account.md b/doc/src/guide/howto/account.md
index d49e1c2..5e514d2 100644
@@ -19,3 +19,7 @@ account "cluster2-account"
submit_options.cluster1.account = "alternate-account"
# Will use the "alternate-account" on cluster1 and "cluster2-account" on cluster2.
```
+
+> Note: NCSA Delta assigns `<account>-cpu` and `<account>-gpu` accounts. Set
+> `submit_options.delta.account = "<account>"`. **Row** will automatically append the
+> `-cpu` or `-gpu` when submitting to the CPU or GPU partitions respectively.
diff --git a/doc/src/guide/howto/best-practices.md b/doc/src/guide/howto/best-practices.md
new file mode 100644
index 0000000..496eeb0
--- /dev/null
+++ b/doc/src/guide/howto/best-practices.md
@@ -0,0 +1,52 @@
+# Best practices
+
+Follow these guidelines to use **row** effectively.
+
+## Exit actions early when products already exist.
+
+There are some cases where **row** may fail to identify when your action completes:
+
+* Software exits with an unrecoverable error.
+* Your job exceeds its walltime and is killed.
+* And many others...
+
+To ensure that your action executes as intended, you should **check for the existence
+of product files** when your action starts and **exit immediately** when they already
+exist. This way, resubmitting an already completed job will not needlessly recompute
+results or overwrite files you intended to keep.
+
+## Write to temporary files and move them to the final product location.
+
+For example, say `products = ["output.dat"]`. Write to `output.dat.in_progress`
+while your calculation executes. Once the action is fully complete, *move*
+`output.dat.in_progress` to `output.dat`.
+
+This pattern also allows you to *continue* running one calculation over several job
+submissions. Move the output file to its final location only after the final submission
+completes the calculation.
+
+> Note: If you wrote directly to `output.dat`, **row** might identify your computation
+> as **complete** right after it starts.
+
+## Group directories whenever possible, but not to an extreme degree.
+
+The **scheduler** can effectively schedule **many** jobs. However, there is some
+overhead. Each job takes a certain amount of time to launch at the start and clean up
+at the end. Additionally, the scheduler can only process so many jobs efficiently. Your
+cluster may even limit how many jobs you are allowed to queue. So please don't submit
+thousands of jobs at a time to your cluster. You can improve your workflow's throughput
+by grouping directories together into a smaller number of jobs.
+
+For actions that execute quickly in serial: Group directories and use
+`processes.per_submission` and `walltime.per_directory`. After a given job has waited in
+the queue, it can process many directories before exiting. Limit group sizes so that the
+total wall time of the job remains reasonable.
+
+For actions that take longer: Group directories and execute the action in parallel using
+[MPI partitions], `processes.per_directory` and `walltime.per_submission`. You should
+typically limit the group sizes to a relatively small fraction of the cluster. Unless
+you are using a Leadership class machine, huge parallel jobs may wait a long time in
+queue before starting. Experiment with the `group.maximum_size` value and find a good
+job size (in number of nodes) that balances queue time vs. scheduler overhead.
+
+[MPI partitions]: ../concepts/process-parallelism.md
diff --git a/doc/src/guide/howto/same.md b/doc/src/guide/howto/same.md
index 8fa067f..16f4fb2 100644
@@ -1,13 +1,13 @@
# Submit the same action to different groups/resources

You can submit the same action to different groups and resources. To do so,
-create multiple elements in the action *array with the same name*. Each must use
+create multiple elements in the action array *with the same name*. Each must use
[`group.include`](../../workflow/action/group.md#include) to select *non-overlapping
subsets*. You can use [`action.from`](../../workflow/action/index.md#from) to copy all
fields from one action and selectively override others.

-For example, this `workflow.toml` uses 4 processors on small systems and 8 on large
-ones.
+For example, this `workflow.toml` uses 4 processors on directories with small *N* and 8
+on those with a large *N*.

```toml
[default.action]
@@ -27,5 +27,4 @@ maximum_size = 32
from = "compute"
resources.processes.per_directory = 8
group.include = [["/N", ">", "4096"]]
-group.maximum_size = 16
```
diff --git a/doc/src/guide/howto/summarize.md b/doc/src/guide/howto/summarize.md
new file mode 100644
index 0000000..7583579
--- /dev/null
+++ b/doc/src/guide/howto/summarize.md
@@ -0,0 +1,31 @@
+# Summarize directory groups with an action
+
+Set [`submit_whole=true`] to ensure that an action is always submitted on the
+*whole* group of included directories. For example, you could use this in an analysis
+action that averages over replicates. Say your directories have values like
+```json
+{
+    "temperature": 1.0,
+    "pressure": 0.3,
+    "replicate": 2
+}
+```
+with many directories at the same *temperature* and *pressure* and different
+values of *replicate*. You could average over all replicates at the same *temperature*
+and *pressure* with an action like this:
+```toml
+[[action]]
+name = "average"
+[action.group]
+sort_by = ["/temperature", "/pressure"]
+split_by_sort_key = true
+submit_whole = true
+```
+
+Actions that summarize output have no clear location to place output files (such as
+plots). Many users will write summary output to the project root.
+You may omit `products` in this case so that you do not need to create empty files in
+each directory. This also makes it easy to rerun the analysis whenever needed as **row**
+will never consider it **complete**.
+
+[`submit_whole=true`]: ../../workflow/action/group.md#submit_whole
diff --git a/doc/src/guide/python/actions.md b/doc/src/guide/python/actions.md
index 4f60b2b..95a843b 100644
@@ -5,7 +5,7 @@ In **row**, actions execute arbitrary **shell commands**. When your action is
that takes directories as arguments. There are many ways you can achieve this goal.

This guide will show you how to structure all of your actions in a single file:
-`actions.py`. This layout is inspired by **row's** predecessor: **signac-flow**
+`actions.py`. This layout is inspired by **row's** predecessor **signac-flow**
and its `project.py`.

> Note: If you are familiar with **signac-flow**, see [migrating from signac-flow][1]

@@ -37,12 +37,12 @@ Execute:
```
to initialize the signac workspace and populate it with directories.

-> Note: If you aren't familiar with **signac**, then go read the [*basic* tutorial][2].
-> Come back to the **row** documentation when you get to the section on *workflows*. Or,
-> for extra credit, reimplement the **signac** tutorial workflow in **row** after you
-> finish reading this guide.
+> Note: If you are not familiar with **signac**, then go read the [*basic* tutorial].
+> Come back to the **row** documentation when you get to the section on *workflows*.
+> For extra credit, reimplement the **signac** tutorial workflow in **row** after you
+> finish reading this guide.

-[2]: https://docs.signac.io/en/latest/tutorial.html#basics
+[*basic* tutorial]: https://docs.signac.io/en/latest/tutorial.html#basics

## Write actions.py

@@ -68,8 +68,7 @@ Next, replace the contents of `workflow.toml` with the corresponding workflow:
{{#include signac-workflow.toml}}
```

-You should be familiar with all of these options from previous tutorials. The main point
-of interest here is that *both* actions have the same **command**, set once by the
+*Both* actions have the same **command**, set once by the
[**default action**](../../workflow/default.md):
```toml
{{#include signac-workflow.toml:5}}
```

`--action $ACTION_NAME` which selects the Python function to call. Here `$ACTION_NAME`
is an [environment variable](../../env.md) that **row** sets in job scripts. The last
arguments are given by `{directories}`. Unlike `{directory}` shown in previous
-tutorials, `{directories}` expands to *ALL* directories in the submitted **group**.
-In this way, `action.py` is executed only once and is free to process the list of
-directories in any way it chooses (e.g. in serial, with
+tutorials, `{directories}` expands to *ALL* directories in the submitted **group**. In
+this way, `actions.py` is executed once and is free to process the list of directories
+in any way it chooses (e.g. in serial, with
[multiprocessing parallelism, multiple threads](../concepts/thread-parallelism.md),
using [MPI parallelism](../concepts/process-parallelism.md), ...).

@@ -112,18 +111,21 @@ Proceed? [Y/n]: y

It worked! `sum` printed the result `285`.

+> Note: If you are on a cluster, use `--cluster=none` or wait for jobs to complete
+> after submitting.
+
## Applying this structure to your workflows

With this structure in place, you can add new **actions** to your workflow following
these steps:

-First, write a function `def action(*jobs)` in `actions.py`.
-Then add:
-```toml
-[[action]]
-name = "action"
-# And other relevant options
-```
-to your `workflow.toml` file.
+1) Write a function `def action(*jobs)` in `actions.py`.
+2) Add:
+   ```toml
+   [[action]]
+   name = "action"
+   # And other relevant options
+   ```
+   to your `workflow.toml` file.

> Note: You may write functions that take only one job `def action(job)` without
> modifying the given implementation of `__main__`. However, you will need to set
diff --git a/doc/src/guide/python/signac-workflow.toml b/doc/src/guide/python/signac-workflow.toml
index 7f23ec0..67e589e 100644
@@ -13,3 +13,4 @@ resources.walltime.per_directory = "00:00:01"
name = "compute_sum"
previous_actions = ["square"]
resources.walltime.per_directory = "00:00:01"
+group.submit_whole = true
diff --git a/doc/src/guide/python/signac.md b/doc/src/guide/python/signac.md
index 07d40ba..afa5382 100644
@@ -12,7 +12,7 @@ project and add the lines:
value_file = "signac_statepoint.json"
```

-That is all.
Now you can use any values in your state points to form **groups**. +Now you can use any values in your state points to form **groups**. > Note: **signac** has a rich command line interface as well. You should consider using > **signac** even if you are not a Python user. diff --git a/doc/src/guide/tutorial/group.md b/doc/src/guide/tutorial/group.md index 483418c..b0f9822 100644 --- a/doc/src/guide/tutorial/group.md +++ b/doc/src/guide/tutorial/group.md @@ -11,7 +11,7 @@ on a **group** of directories. So far, this tutorial has demonstrated small toy examples. In practice, any workflow that you need to execute on a cluster likely has hundreds or thousands of directories - each with different parameters. You could try to encode these parameters into the -directory names, but *please don't*. This quickly becomes unmanageable. Instead, you +directory names, but *please don't* - it quickly becomes unmanageable. Instead, you should include a [JSON](https://www.json.org) file in each directory that identifies its **value**. diff --git a/doc/src/guide/tutorial/hello.md b/doc/src/guide/tutorial/hello.md index bd2e75e..ea92640 100644 --- a/doc/src/guide/tutorial/hello.md +++ b/doc/src/guide/tutorial/hello.md @@ -54,8 +54,8 @@ Submitting 1 job that may cost up to 3 CPU-hours. Proceed? [Y/n]: ``` -The cost is 3 CPU-hours because **action** defaults to 1 CPU-hour per directory. -Later sections in this tutorial will cover resource costs in more detail. +The cost is 3 CPU-hours because **action** defaults to 1 CPU-hour per directory +(later sections in this tutorial will cover resource costs in more detail). `echo "Hello, {directory}!"` is certainly not going to take that long, so confirm with `y` and then press enter. You should then see the action execute: ```plaintext diff --git a/doc/src/guide/tutorial/multiple.md b/doc/src/guide/tutorial/multiple.md index 438e2fa..d2a8512 100644 --- a/doc/src/guide/tutorial/multiple.md +++ b/doc/src/guide/tutorial/multiple.md @@ -63,7 +63,7 @@ Execute: {{#include hello.sh:submit2}} ``` -Go ahead, run `row show status` and see if the output is what you expect. +Run `row show status` and see if the output is what you expect. ## Getting more detailed information diff --git a/doc/src/guide/tutorial/resources.md b/doc/src/guide/tutorial/resources.md index bc3a126..544503d 100644 --- a/doc/src/guide/tutorial/resources.md +++ b/doc/src/guide/tutorial/resources.md @@ -65,7 +65,7 @@ walltime.per_directory = "00:10:00" # Execute MPI parallel calculations -To launch MPI enabled applications, request more than one *process* and request the +To launch MPI enabled applications, request more than one *process* and the `"mpi"` launcher. `launchers = ["mpi"]` will add the appropriate MPI launcher prefix before your command (e.g. `srun --ntasks 16 parallel_application $directory`). diff --git a/doc/src/guide/tutorial/scheduler.md b/doc/src/guide/tutorial/scheduler.md index 4da61e3..8240242 100644 --- a/doc/src/guide/tutorial/scheduler.md +++ b/doc/src/guide/tutorial/scheduler.md @@ -112,7 +112,7 @@ pid 830675's current affinity list: 99 > your **cluster's** documentation to see specific details on how jobs are allocated > to nodes and charged for resource usage. Remember, it is **YOUR RESPONSIBILITY** (not > **row's**) to understand whether `--ntasks=1` costs 1 CPU-hour per hour or more (e.g. -> 128) CPU-hours per hour. If your cluster lacks a *shared* partition, then you need to +> 128 CPU-hours per hour). 
If your cluster lacks a *shared* partition, then you need to
> structure your **actions** and **groups** in such a way to use all the cores you are
> given or else the resources are wasted.
diff --git a/doc/src/guide/tutorial/submit.md b/doc/src/guide/tutorial/submit.md
index 84c45c3..515fe55 100644
@@ -14,8 +14,8 @@ This section explains how to **submit** jobs to the **scheduler** with **row**.
You can skip to the [next heading](#checking-your-job-script) if you are using one of
these clusters.

-If not, then you need to create one or two configuration files that describe your
-cluster and its launchers.
+If not, then you need to create a configuration file that describes your
+cluster. You may also need to define launchers specific to your cluster.

* [`$HOME/.config/row/clusters.toml`](../../clusters/index.md) gives your cluster a
name, instructions on how to identify it, and lists the partitions your cluster
@@ -97,15 +97,15 @@ row submit

> If your cluster does not default to the correct account, you can set it in
> `workflow.toml`:
> ```toml
-> [submit_options]
-> .account = ""
+> [default.action.submit_options.<cluster name>]
+> account = "<account name>"
> ```

### The submitted status

**Row** tracks the **Job IDs** that it submits. Every time you execute `row show status`
(or just about any `row` command), it will execute `squeue` in the background to see
-which jobs are still **submitted** (in any state).
+which jobs are still **submitted**.

Use the `row show` family of commands to query details about submitted jobs. For the
`hello` workflow:
diff --git a/doc/src/launchers/built-in.md b/doc/src/launchers/built-in.md
index 6a90dac..0aae5ea 100644
@@ -6,6 +6,8 @@ You may need to add new configurations for your specific cluster or adjust the `
launcher to match your system. Execute [`row show launchers`](../row/show/launchers.md)
to see the current launcher configuration.

+## Hybrid OpenMP/MPI
+
When using OpenMP/MPI hybrid applications, place `"openmp"` first in the list of
launchers (`launchers = ["openmp", "mpi"]`) to generate the appropriate command:
```bash
diff --git a/doc/src/launchers/index.md b/doc/src/launchers/index.md
index 14715ca..fcd93a9 100644
@@ -2,9 +2,11 @@

**Row** includes [built-in launchers](built-in.md) to enable OpenMP and MPI on the
[built-in clusters](../clusters/built-in.md). You can override these configurations
-and add new launchers in the file `$HOME/.config/row/launchers.toml`. It defines how
-each **launcher** expands into a **command prefix**, with the possibility for specific
-settings on each [**cluster**](../clusters/index.md). For example, an
+and add new launchers in the file `$HOME/.config/row/launchers.toml`.
+
+The launcher configuration defines how each **launcher** expands into a **command
+prefix**, with the possibility for specific settings on each
+[**cluster**](../clusters/index.md). For example, an
[**action**](../workflow/action/index.md) with the configuration:
```toml
[[action]]
diff --git a/doc/src/launchers/launcher.md b/doc/src/launchers/launcher.md
index 3e61fff..b811106 100644
@@ -6,7 +6,7 @@ prefix constructed from this configuration will be:
{launcher.executable} [option1] [option2] ...
```

-See [Built-in launchers](built-in.md) for examples.
+Execute [`row show launchers`](../row/show/launchers.md) to see examples.
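+
+For instance, a sketch of a `workflow.toml` action that requests the built-in `mpi`
+launcher (the action name, command, and process count are illustrative):
+```toml
+[[action]]
+name = "compute"
+command = "parallel_application {directory}"
+launchers = ["mpi"]
+resources.processes.per_directory = 16
+```
+On SLURM clusters the resulting prefix looks something like
+`srun --ntasks 16 parallel_application ...`.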
## executable

@@ -34,8 +34,8 @@ When `launcher.processes` is set, add the following option to the launcher prefi
where `total_processes` is `n_directories * resources.processes.per_directory` or
`resources.processes.per_submission` depending on the resource configuration.

-It is an error when `total_processes > 1` and the action requests *no* launchers that
-set `processes`.
+> Note: **Row** exits with an error when `total_processes > 1` and the action requests
+> *no* launchers that set `processes`.

## threads_per_process
diff --git a/doc/src/row/index.md b/doc/src/row/index.md
index b72e251..2a1d364 100644
@@ -7,14 +7,14 @@ row [OPTIONS] <command>

`<command>` must be one of:

* [`init`](init.md)
-* [`show`](show/index.md)
* [`submit`](submit.md)
+* [`show`](show/index.md)
* [`scan`](scan.md)
* [`clean`](clean.md)
-You should execute only one instance of row at a time for a given project. -Row maintains a cache and concurrent invocations may corrupt it. The +You should execute at most one instance of row at a time for a given +project. Row maintains a cache and concurrent invocations may corrupt it. The scan command is excepted from this rule.
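+
+A sketch of a typical sequence (run these one at a time, per the note above):
+```bash
+row submit --dry-run   # inspect the generated job scripts
+row submit             # submit jobs to the scheduler
+row show status        # check progress; also updates the cache
+```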
diff --git a/doc/src/row/scan.md b/doc/src/row/scan.md
index 7539778..13e4408 100644
@@ -9,7 +9,7 @@ row scan [OPTIONS] [DIRECTORIES]
[products](../workflow/action/index.md#products) and updates the cache of completed
directories accordingly.

-Under normal usage, you should not need to execute `row scan` manually.
+Under normal usage, you should not need to execute `row scan`.
[`row submit`](submit.md) automatically scans the submitted directories after it
executes the action's command.
diff --git a/doc/src/row/show/launchers.md b/doc/src/row/show/launchers.md
index fedd069..0ade229 100644
@@ -9,7 +9,7 @@ Print the [launchers](../../launchers/index.md) defined for the current cluster
cluster given in `--cluster`). The output is TOML formatted.

This includes the user-provided launchers in [`launchers.toml`](../../launchers/index.md)
-and the built-in launchers (or the user-provided overrides).
+and the built-in launchers.

## `[OPTIONS]`
diff --git a/doc/src/row/submit.md b/doc/src/row/submit.md
index 66a58df..8b15404 100644
@@ -8,7 +8,11 @@ row submit [OPTIONS] [DIRECTORIES]
`row submit` submits jobs to the scheduler. First it determines the
[status](../guide/concepts/status.md) of all the given directories for the selected
actions. Then it forms [groups](../workflow/action/group.md) and submits one job for
-each group. Pass `--dry-run` to see the script(s) that will be submitted.
+each group. Pass `--dry-run` to see the script(s) that will be submitted. Execute
+```
+row show directories action --eligible
+```
+to see the specific directory groups that will be submitted.

## `[DIRECTORIES]`
diff --git a/doc/src/signac-flow.md b/doc/src/signac-flow.md
index f362678..19cb1bc 100644
@@ -8,13 +8,14 @@ Concepts:
| flow | row |
|------|-----|
| *job* | *directory* |
+| *cluster job* | *job* |
| *statepoint* | *value* |
| *operation* | [`action`](workflow/action/index.md) in `workflow.toml`|
| *group* | A command may perform multiple steps. |
| *label* | Not implemented. |
| *hooks* | Not implemented. |
| *environments* | [`clusters.toml`](clusters/index.md) |
-| `project.py` | [`workflow.toml`](workflow/index.md) and [`actions.py`](guide/python/actions.md) |
+| `project.py` | [`workflow.toml`](workflow/index.md) combined with [`actions.py`](guide/python/actions.md) |

Commands:
| flow | row |
@@ -22,20 +23,19 @@ Commands:
| `project.py status` | [`row show status`](row/show/status.md) |
| `project.py status --detailed` | [`row show directories <action>`](row/show/directories.md) |
| `project.py run` | [`row submit --cluster=none`](row/submit.md) |
-| `project.py run --parallel` | A command may execute groups in parallel. |
+| `project.py run --parallel` | A command *may* execute [group members][group] in [parallel]. |
| `project.py exec ...` | Execute your action's command in the shell. |
| `project.py submit` | [`row submit`](row/submit.md) |
-| `project.py submit --partition <partition>` | `row submit` automatically selects appropriate partitions. |
+| `project.py submit --partition <partition>` | `row submit` *automatically* selects appropriate partitions. |
| `project.py submit -n <num>` | [`row submit -n <num>`](row/submit.md) |
| `project.py submit --pretend` | [`row submit --dry-run`](row/submit.md) |
-| `project.py submit --bundle <size>` | [`group`](workflow/action/group.md) in `workflow.toml` |
-| `project.py submit --bundle <size> --parallel` | A command may execute groups in parallel. |
+| `project.py submit --bundle <size>` | [`group`][group] in `workflow.toml` |
+| `project.py submit --bundle <size> --parallel` | A command *may* execute [group members][group] in [parallel]. |
| `project.py submit -o <operation>` | [`row submit --action <action>`](row/submit.md) |
| `project.py -j [JOB_ID1] [JOB_ID2] ...` | `row <command> [JOB_ID1] [JOB_ID2] ...` |
| `project.py -j a1234` | `cd workspace; row <command> a1234` |
| `project.py -f <filter>` | `row <command> $(signac find <filter>)` |
-

Conditions:
| flow | row |
|------|-----|
@@ -44,10 +44,10 @@
| precondition: `after` | [`previous_actions`](workflow/action/index.md#previous_actions) |
| precondition: state point comparison | [`include`](workflow/action/group.md#include) |
| precondition: others | Not implemented. |
-| aggregation | [`group`](workflow/action/group.md) in `workflow.toml` |
+| aggregation | [`group`][group] in `workflow.toml` |
| aggregation: `select` | [`include`](workflow/action/group.md#include) |
-| aggregation: `sort_by` | [`sort_by`](workflow/action/group.md#sort_by) |
-| aggregation: `groupby` | `sort_by` and [`split_by_sort_key=true`](workflow/action/group.md#split_by_sort_key) |
+| aggregation: `sort_by` | [`sort_by`] |
+| aggregation: `groupby` | [`sort_by`] and [`split_by_sort_key=true`](workflow/action/group.md#split_by_sort_key) |
| aggregation: `groupsof` | [`maximum_size`](workflow/action/group.md#maximum_size) |

Execution:
@@ -59,3 +59,7 @@
| directives: Launch with MPI | [`launchers`](workflow/action/index.md#launchers) `= ["mpi"]` |
| directives: Launch with OpenMP | [`launchers`](workflow/action/index.md#launchers) `= ["openmp"]` |
| template job script: `script.sh` | [`submit_options`](workflow/action/submit-options.md) in `workflow.toml` |
+
+[group]: workflow/action/group.md
+[parallel]: guide/concepts/thread-parallelism.md
+[`sort_by`]: workflow/action/group.md#sort_by
diff --git a/doc/src/workflow/action/group.md b/doc/src/workflow/action/group.md
index c603327..e9a7a13 100644
@@ -41,7 +41,7 @@ include = [["/map/name", "==", "string"]]
```
Compare by array:
```toml
-include = [["/array", "eqal_to", [1, "string", 14.0]]]
+include = [["/array", "==", [1, "string", 14.0]]]
```

Both operands **must** have the same data type. The JSON pointer must be present in the
@@ -60,8 +60,8 @@ pointers to specific keys in objects.

`action.group.sort_by`: **array** of **strings** - An array of
[JSON pointers](../../guide/concepts/json-pointers.md) to elements of each directory's
-value. **Row** will sort directories matched by `include` by these quantities
-*lexicographically*. For example,
+value. **Row** will sort directories by these quantities *lexicographically*. For
+example,
```toml
action.group.sort_by = ["/a", "/b"]
```
@@ -99,10 +99,9 @@ by `include` are placed in a single group.

`action.group.maximum_size`: **integer** - Maximum size of a group.

-**Row** further splits the groups into smaller groups up to the given `maximum_size`.
-When the number of directories is not evenly divisible by `maximum_size`, **row**
-creates the first **n** groups with `maximum_size` elements and places one remainder
-group at the end.
+Split included directories into groups up to the given `maximum_size`. When the number
+of directories is not evenly divisible by `maximum_size`, **row** creates the first
+**n** groups with `maximum_size` elements and places one remainder group at the end.

For example, with `maximum_size = 2` the directories:
`[dir1, dir2, dir3, dir4, dir5]`
diff --git a/doc/src/workflow/action/index.md b/doc/src/workflow/action/index.md
index 5cf9d6f..03eb218 100644
@@ -26,9 +26,9 @@ walltime.per_submission = "04:00:00"
## name

`action.name`: **string** - The action's name. You must set a name for each
-action, which may be set by [from](#from).
+action. The name may be set by [from](#from).

-> Note: Two (or more) conceptually identical elements in the actions array may have
+> Note: Two or more conceptually identical elements in the actions array *may* have
> the same name. All elements with the same name **must** have identical
> [`products`](#products) and [`previous_actions`](#previous_actions). All elements
> with the same name **must also** select non-intersecting subsets of directories with
@@ -69,7 +69,7 @@ command = "echo Message && python action.py {directory}"

`action.launchers`: **array** of **strings** - The launchers to apply when executing a
command. A launcher is a prefix placed before the command in the submission script. The
-cluster configuration [`clusters.toml`](../../clusters/index.md) defines what launchers
+launcher configuration [`launchers.toml`](../../launchers/index.md) defines what launchers
are available on each cluster and how they are invoked. The example for `action_two`
above (`launchers = ["openmp", "mpi"]`) would expand into something like:
```bash
@@ -114,8 +114,12 @@ Every key in an `[[action]]` table (including sub-keys in `[action.group]`,

3. The default action: `default.action.key[.sub_key]`.

The action will take on the value set in the **first** location that does not omit
-the key. When all 3 locations omit the key, the "when omitted" behavior documented
-takes effect (documented separately for each key).
+the key. When all 3 locations omit the key, the "when omitted" behavior takes effect
+(documented separately for each key).
+
+`from` is a convenient way to [submit the same action to different groups/resources].

> Note: `name` and `command` may be provided by `from` or `action.default` but may not
> be omitted entirely.
+
+[submit the same action to different groups/resources]: ../../guide/howto/same.md
diff --git a/doc/src/workflow/action/resources.md b/doc/src/workflow/action/resources.md
index aa86889..7080eac 100644
@@ -16,8 +16,7 @@ walltime.per_submission = "04:00:00"

`action.resources.processes`: **table** - Set the number of processes this action
will execute on (launched by `mpi` or similarly capable launcher). The table **must**
-have one of two keys: `per_submission` or `per_directory` which both have **integer**
-values.
+have one of two keys: `per_submission` or `per_directory`.

Examples:
```toml
@@ -27,7 +26,7 @@ processes.per_submission = 16
processes.per_directory = 8
```

-When set to `per_submission`, **row** always asks the scheduler to allocate the given
+When set to `per_submission`, **row** asks the scheduler to allocate the given
number of processes for each job.
When set to `per_directory`, **row** requests the given
value multiplied by the number of directories in the submission group. Use
`per_submission` when your action loops over directories and reuses the same processes
@@ -53,10 +52,9 @@ from the scheduler. Most schedulers default to 0 GPUs per process in this case.

## walltime

`action.resources.walltime`: **table** - Set the walltime that this action takes to
-execute. The table **must** have one of two keys: `per_submission` or `per_directory`
-which both have **string** values. Valid walltime strings include `"HH:MM:SS"`, `"D
-days, HH:MM:SS"`, and all other valid `Duration` formats parsed by
-[speedate](https://docs.rs/speedate/latest/speedate/).
+execute. The table **must** have one of two keys: `per_submission` or `per_directory`.
+Valid walltime strings include `"HH:MM:SS"`, `"D days, HH:MM:SS"`, and all other valid
+`Duration` formats parsed by [speedate](https://docs.rs/speedate/latest/speedate/).

Examples:
```toml
@@ -66,12 +64,11 @@ walltime.per_submission = "4 days, 12:00:00"
walltime.per_directory = "00:10:00"
```

-When set to `per_submission`, **row** always asks the scheduler to allocate the given
-walltime for each job. When set to `per_directory`, **row** requests the given value
-multiplied by the number of directories in the submission group. Use `per_submission`
-when your action parallelizes over directories and therefore takes the same amount of
-time independent of the submission group size. Use `per_directory` when your action
-loops over the directories and therefore the walltime scales with the number of
-directories.
+When set to `per_submission`, **row** asks the scheduler to allocate the given walltime
+for each job. When set to `per_directory`, **row** requests the given value multiplied
+by the number of directories in the submission group. Use `per_submission` when your
+action parallelizes over directories and therefore takes the same amount of time
+independent of the submission group size. Use `per_directory` when your action loops
+over the directories and therefore the walltime scales with the number of directories.

When omitted, `walltime` defaults to `per_directory = "01:00:00"`.
diff --git a/doc/src/workflow/action/submit-options.md b/doc/src/workflow/action/submit-options.md
index d6a1031..c1d7a40 100644
@@ -47,10 +47,10 @@ omitted.

## `<cluster name>.partition`

-`action.scubmit_options..partition`: **string** - Force the use of a particular
-partition when submitting jobs to the queue on cluster `. When omitted, **row**
+`action.submit_options.<cluster name>.partition`: **string** - Force the use of a
+particular partition when submitting jobs to the queue on cluster `<cluster name>`.
+When omitted, **row**
will automatically determine the correct partition based on the configuration in
[`clusters.toml`](../../clusters/index.md).

-> Note: You should almost always omit `partition`. Set it *only* when you need a
-> specialty partition that is not automatically selected.
+> Note: You should almost always omit `partition`. Set it *only* when your action
+> **requires** a *specialty* partition that is not automatically selected.
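+
+For example, a sketch that forces a specialty partition on one cluster only (the
+cluster and partition names are illustrative):
+```toml
+[action.submit_options.cluster1]
+partition = "highmem"
+```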