Docs
taehyounpark committed Mar 28, 2024
1 parent 7b43177 commit 8925e7b
Showing 2 changed files with 71 additions and 44 deletions.
64 changes: 32 additions & 32 deletions docs/pages/conceptual.md
A `dataflow` consists of a directed, acyclic graph of tasks performed for each entry.

@image html dataflow.png "The dataflow"

An action is a node belonging to one of three task sub-graphs, each of which is associated with a set of applicable methods.
Actions in each task graph can receive actions from the previous graphs as inputs:

| Action | Description | Method | Purpose | Task graph | Input actions |
| :--- | :-- | :-- | :-- | :-- | :-- |
| `column` | Quantity of interest | `read()` | Read a column. | Computation | -- |
| | | `define()` | Evaluate a column. | | |
| `selection` | Boolean decision | `filter()` | Apply a cut. | Cutflow | `column` |
| | Floating-point decision | `weight()` | Apply a statistical significance. | | |
| `query` | Perform a query | `make()` | Plan a query. | Experiment | `column` & `selection` |
| | | `fill()` | Populate with column value(s). | | |
| | | `book()` | Perform over selected entries. | | |

@section conceptual-lazy Lazy actions

All actions are "lazy", meaning they are not executed unless required.
Accessing the result of a query turns it "eager" and triggers the dataset traversal.
The dataset traversal performs all existing queries of the dataflow up to that point.
The eagerness of actions in each entry is as follows:

1. A query is executed only if its associated selection passes the cut.
2. A selection is evaluated only if all prior cuts in the cutflow have passed.
3. A column is evaluated only if it is needed to determine any of the above.
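As an illustration only (a sketch of the semantics above, not queryosity's internals), the three rules amount to short-circuit evaluation per entry:

```cpp
#include <vector>

// Illustrative sketch of per-entry eagerness -- not queryosity's internals.
struct Entry { double x; };

struct Counters { int column_evals = 0; int query_fills = 0; };

// The cut asks for the column it needs (rule 3); the query runs only for
// entries whose cut has passed (rules 1-2).
inline Counters traverse(const std::vector<Entry>& dataset) {
  Counters c;
  auto col_x = [&](const Entry& e) { ++c.column_evals; return e.x; };
  for (const auto& e : dataset) {
    if (!(col_x(e) > 0)) continue; // failed cut: nothing downstream runs
    ++c.query_fills;               // query executed only for passing entries
  }
  return c;
}
```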

@section conceptual-columns Columns

A `column` contains some data type `T` whose value is updated for each entry.
Columns read from a dataset or defined as constants are *independent*, i.e. their values do not depend on others, whereas other user-defined columns are *dependent* and require existing columns as inputs.
A tower of dependent columns evaluated from more independent ones forms the computation graph:

@image html computation.png "Example computation graph"

@paragraph conceptual-columns-lazy Lazy optimizations
Only the minimum number of computations needed is performed for each entry:
- If and when a column value is computed for an entry, it is cached and never re-computed.
- A column value is not copied when used as an input for dependent columns.
- It *is* copied if a conversion is required.
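These caching rules can be sketched (as an analogy, not the library's actual code) with a per-entry memoized value:

```cpp
#include <optional>

// Analogy for per-entry column caching -- not queryosity code.
struct CachedColumn {
  int evaluations = 0;
  std::optional<double> cache;

  double value() {
    if (!cache) {          // first access in this entry: compute and cache
      ++evaluations;
      cache = expensive();
    }
    return *cache;         // later accesses reuse the cached value
  }
  void next_entry() { cache.reset(); } // the cache only lives for one entry
  static double expensive() { return 42.0; }
};
```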

@section conceptual-selections Selections

A `selection` represents a scalar-valued decision made on an entry:

- A boolean `cut` to determine if a query should be performed for a given entry.
- A series of two or more cuts becomes their intersection, `&&`.
- A floating-point `weight` to assign a statistical significance to the entry.
- A series of two or more weights becomes their product, `*`.
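Under these assumed semantics, compounding reduces to a fold over the chain:

```cpp
#include <vector>

// Sketch of compounding (assumed semantics from the text above): a chain of
// cuts is their intersection, a chain of weights is their product.
inline bool compound_cuts(const std::vector<bool>& cuts) {
  bool all = true;
  for (bool c : cuts) all = all && c; // one failing cut fails the chain
  return all;
}
inline double compound_weights(const std::vector<double>& weights) {
  double w = 1.0;
  for (double wi : weights) w *= wi;  // weights accumulate multiplicatively
  return w;
}
```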

A cutflow can have the following types of connections between selections:

@image html cutflow.png "Example cutflow structure"

- Applying a selection from an existing node, which determines the order in which they are compounded.
- Branching selections by applying more than one selection from a common node.
- Merging two selections, e.g. taking the union/intersection of two cuts.

@paragraph conceptual-selections-lazy Lazy optimizations
Selections constitute a specific type of column; as such, they are subject to the same value-caching and computation behaviour.
The cutflow imposes the following additional rules on them:
- A cut is evaluated only if its previous cut has passed.
- A weight is evaluated only if its associated cut has passed.
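As a rough analogy (not queryosity code), the weight decision is deferred until the cut has passed:

```cpp
// Analogy for the cutflow's lazy rules -- not queryosity code: the weight
// is only computed for entries whose cut chain has passed.
struct WeightedCut {
  int weight_evals = 0;
  // Returns the entry's weight, or 0 for an entry that fails the cut.
  double decide(bool prev_cuts_passed, bool cut, double weight) {
    if (!prev_cuts_passed || !cut) return 0.0; // weight never evaluated
    ++weight_evals;                            // evaluated only on a pass
    return weight;
  }
};
```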

@section conceptual-query Queries

A `query` specifies an output result obtained from counting entries of the dataset.
For multithreaded runs, the user must also define how outputs from individual threads should be merged together to yield a result representative of the full dataset.

- It must be associated with a selection whose cut determines which entries to count.
- (Optional) The result is populated with the weight taken into account.
- How an entry is to be counted to populate the query depends on the user definition, i.e. it is an arbitrary action.
- (Optional) The result is populated based on the values of input columns.
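The contract above can be sketched as follows (an assumed shape for illustration, not the actual queryosity query interface):

```cpp
#include <vector>

// Sketch of a query definition: count weighted column values per entry and
// merge per-thread partial results into one result for the full dataset.
// The names (SumQuery, fill, merge) are illustrative assumptions.
struct SumQuery {
  double total = 0.0;
  // Called for each selected entry, with the selection's weight applied.
  void fill(double value, double weight) { total += value * weight; }
  // Combines thread-local outputs into a representative result.
  static double merge(const std::vector<SumQuery>& parts) {
    double result = 0.0;
    for (const auto& p : parts) result += p.total;
    return result;
  }
};
```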

Two common workflows exist in associating queries with selections:

Two common workflows exist in associating queries with selections:

@section conceptual-variations Systematic variations

A sensitivity analysis studies how changes in a system's inputs affect its output.
In the context of dataset queries, a **systematic variation** constitutes a __change in a column value that affects the outcome of selections and queries__.

Encapsulating the nominal and variations of a column creates a `varied` node, in which each variation is mapped by the name of its associated systematic variation.
A varied node in a dataflow can be treated functionally identically to a non-varied one, with the nominal and all variations propagated through the relevant task graphs implicitly:

- Any column definitions and selections evaluated out of varied input columns will be varied.
- Any queries filled with varied input columns and/or booked at varied selections will be varied.
The propagation proceeds in the following fashion:
- **Lockstep.** If two actions each have a variation of the same name, they are in effect together.
- **Transparent.** If only one action has a given variation, then the nominal is in effect for the other.
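The lockstep+transparent rules can be sketched like so (assumed semantics, with hypothetical names):

```cpp
#include <map>
#include <string>

// Sketch of lockstep+transparent propagation -- not queryosity code.
// An action holds its nominal value plus named variations; asking for a
// variation it does not carry transparently falls back to the nominal.
struct Varied {
  double nominal;
  std::map<std::string, double> variations;
  double operator()(const std::string& name) const {
    auto it = variations.find(name);
    return it != variations.end() ? it->second : nominal;
  }
};

// z = x + y under a named variation: varied inputs move in lockstep.
inline double add_under(const Varied& x, const Varied& y,
                        const std::string& name) {
  return x(name) + y(name);
}
```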

All variations are processed at once in a single dataset traversal, i.e. they do not incur additional runtime overhead other than what is already required to perform the actions themselves.

@image html variation.png "Propagation of systematic variations for z=x+y."
51 changes: 39 additions & 12 deletions docs/pages/guide.md
auto ds_another = df.load(dataset::input<json>(more_data));
// no need to be another double -- whatever else!
auto y = ds_another.read(dataset::column<double>("y"));

// shortcut: implicitly load a dataset and read out all columns at once.
std::ifstream even_more_data("even_more_data.json");
auto [s, v] = df.read(
dataset::input<json>(even_more_data),
Call queryosity::dataflow::define() with the appropriate argument.

@cpp
// -----------------------------------------------------------------------------
// constant
// -----------------------------------------------------------------------------
// their values will not change per-entry

auto zero = df.define(column::constant(0));
auto one = df.define(column::constant(1));
auto two = df.define(column::constant(2));

// -----------------------------------------------------------------------------
// simple expression
// -----------------------------------------------------------------------------
// binary/unary operators between underlying data types

auto three = one + two;
auto v0 = v[zero];

// one += zero; // self-assignment operators are not possible

// -----------------------------------------------------------------------------
// custom expression
// -----------------------------------------------------------------------------
// (C++ function, functor, lambda, etc.) evaluated out of input columns

// pass large values by const reference to prevent expensive copies
auto s_length = df.define(column::expression([](const std::string& txt){return txt.length();}), s);

// Custom definition evaluated out of input columns
// -----------------------------------------------------------------------------
// custom definition
// -----------------------------------------------------------------------------
// the most general & performant way to compute a column

class VectorSelection : public column::definition<>
{
  // ... implement the column evaluation here ...
};
auto v_selected = df.define(column::definition<>(), v);
@endcpp

auto cut_b_or_c = df.filter(cut_b || cut_c);

@section guide-query Making queries

Call queryosity::dataflow::make() with a "plan" specifying the exact definition and constructor arguments of the query.
The plan can then be filled with input columns and booked at a selection to instantiate the query.

@cpp
using h1d = qty::hist::hist<double>;
auto qp_2d = df.make(query::plan<hist2d>(linax(100,0.0,1.0),linax(100,0.0,1.0)));
auto [q_1d, q_2d] = sel.book(qp_1d, qp_2d);
@endcpp

Access the result of a query to turn all actions eager.

@cpp
auto hx_a = q_a.result(); // takes a while -- dataset traversal
auto hxy_sel = q_2d.result(); // instantaneous -- already completed
@endcpp


@section guide-vary Systematic variations

Call queryosity::dataflow::vary() to create varied columns.
There are two ways in which variations can be specified:

1. **Automatic.** Specify a specific type of column to be instantiated along with the nominal+variations. Always ensures the lockstep+transparent propagation of variations.
2. **Manual.** Provide existing instances of columns as the nominal+variations; any column whose output value type is compatible with that of the nominal can be set as a variation.

Both approaches can (and should) be used in a dataflow for full control over the creation & propagation of systematic variations.

@cpp
// automatic -- set and forget
q.has_variation("vary_b"); // false
auto q_nom_res = q.nominal().result();
auto q_varx_res = q["vary_x"].result();
auto q_none_res = q["no"].result();

@endcpp

@see
