From 8925e7bb1d4281506483254dfb3d09a414afdd9c Mon Sep 17 00:00:00 2001
From: taehyounpark
Date: Thu, 28 Mar 2024 18:36:35 -0400
Subject: [PATCH] Docs

---
 docs/pages/conceptual.md | 64 ++++++++++++++++++++--------------------
 docs/pages/guide.md      | 51 ++++++++++++++++++++++++--------
 2 files changed, 71 insertions(+), 44 deletions(-)

diff --git a/docs/pages/conceptual.md b/docs/pages/conceptual.md
index 9248a0f..7990187 100644
--- a/docs/pages/conceptual.md
+++ b/docs/pages/conceptual.md
@@ -7,8 +7,8 @@ A `dataflow` consists of a directed, acyclic graph of tasks performed for each e
 
 @image html dataflow.png "The dataflow"
 
-An action falls into one of three types that reflect the nature of the task and are uniquely associated with a set of applicable methods.
-Each type forms a sub-graphs of tasks, which receive actions of the previous types as inputs:
+An action is a node belonging to one of three task sub-graphs, each of which is associated with a set of applicable methods.
+Actions in each task graph can receive actions from the previous graphs as inputs:
 
 | Action | Description | Methods | Description | Task Graph | Input actions |
 | :--- | :-- | :-- | :-- | :-- | :-- |
@@ -16,44 +16,44 @@ Each type forms a sub-graphs of tasks, which receive actions of the previous typ
 | | | `define()` | Evaluate a column. | | |
 | `selection` | Boolean decision | `filter()` | Apply a cut. | Cutflow | `column` |
 | | Floating-point decision | `weight()` | Apply a statistical significance. | | |
-| `query` | Perform a query | `make()` | Make a query plan. | Experiment | `column` & `selection` |
+| `query` | Perform a query | `make()` | Plan a query. | Experiment | `column` & `selection` |
 | | | `fill()` | Populate with column value(s). | | |
 | | | `book()` | Perform over selected entries. | | |
 
 @section conceptual-lazy Lazy actions
 
 All actions are "lazy", meaning they are not executed unless required.
-Accessing the result of any query, which requires any of its associated actions to be performed over the entire dataset, turns them "eager".
-Once the dataset traversal is underway for any one query, all existing queries for the dataflow up to that point are also processed.
-The eagerness of actions in each entry are as follows:
+Accessing the result of a query turns it "eager" and triggers the dataset traversal.
+The dataset traversal performs all existing queries of the dataflow up to that point.
+The eagerness of actions in each entry is as follows:
 
-1. A query is executed only if its associated selection passes its selection.
-2. A selection is evaluated only if all of its prior selections in the cutflow have passed.
+1. A query is executed only if its associated selection passes the cut.
+2. A selection is evaluated only if all prior cuts in the cutflow have passed.
 3. A column is evaluated only if it is needed to determine any of the above.
 
 @section conceptual-columns Columns
 
-A `column` contains some data type `T` whose value changes, i.e. must be updated, for each entry.
-Columns that are read-in from a dataset or defined as constants are *independent*, i.e. their values do not depend on others.
-A tower of dependent columns evaluated out of others as inputs forms the computation graph:
+A `column` contains some data type `T` whose value is updated for each entry.
+Columns read in from a dataset, as well as user-defined constants, are *independent*, i.e. their values do not depend on others; all other user-defined columns are *dependent* and require existing columns as inputs.
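+For example, using the interface shown in the user guide (the dataflow `df`, the loaded dataset `ds`, and all column names below are purely illustrative), `x` and `one` are independent columns, while `y` is dependent:
+
+@cpp
+// illustrative sketch -- see the user guide for the exact interface
+auto x = ds.read(dataset::column("x"));      // independent: read in from a loaded dataset
+auto one = df.define(column::constant(1.0)); // independent: a constant value
+// dependent: evaluated out of the existing column x
+auto y = df.define(column::expression([](double xval){ return xval + 1.0; }), x);
+@endcpp
+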
+A tower of dependent columns evaluated out of more independent ones forms the computation graph:
 
 @image html computation.png "Example computation graph"
 
-@paragraph conceptual-columns-lazy Lazy optimizations
+Only the minimum number of computations needed is performed for each entry:
 
 - If and when a column value is computed for an entry, it is cached and never re-computed.
 - A column value is not copied when used as an input for dependent columns.
-  - It *is* copied a conversion is required.
+  - It *is* copied if a conversion is required.
 
 @section conceptual-selections Selections
 
-A `selection` is a specific type of scalar-valued columns representing a decision made on an entry:
+A `selection` represents a scalar-valued decision made on an entry:
 
-- A boolean as a `cut` to determine if the entry should be counted in a query or not.
-  - A series of two or more cuts is equivalent to their intersection, `&&`.
-- A floating-point value as a `weight` to assign a statistical significance to the entry.
-  - A series of two or more weights is equivalent to their product, `*`.
+- A boolean `cut` to determine if a query should be performed for a given entry.
+  - A series of two or more cuts becomes their intersection, `&&`.
+- A floating-point `weight` to assign a statistical significance to the entry.
+  - A series of two or more weights becomes their product, `*`.
 
-A cutflow can have from the following types connections between nodes:
+A cutflow can have the following types of connections between selections:
 
 @image html cutflow.png "Example cutflow structure"
 
@@ -61,21 +61,20 @@
 - Branching selections by applying more than one selection from a common node.
 - Merging two selections, e.g. taking the union/intersection of two cuts.
 
-@paragraph conceptual-selections-lazy Lazy optimizations
-- A selection decision is cached for each entry (because they are also columns).
+Selections constitute a specific type of column; as such, they are subject to the same value-caching and computation behaviour.
+In addition, the cutflow imposes the following rules on them:
 - The cut decision is evaluated only if its previous cut has passed.
 - The weight decision is evaluated only if the cut has passed.
 
 @section conceptual-query Queries
 
-A `query` definition specifies an output whose result is obtained from counting entries of the dataset.
+A `query` specifies an output whose result is obtained by counting entries of the dataset.
+For multithreaded runs, the user must also define how outputs from individual threads are merged together to yield a result representative of the full dataset.
 
-- The query definition dictates how an entry is to be counted, i.e. it is an arbitrary action:
-  - (Optional) The result is populated based on values of inputs columns.
-  - In multithreaded runs, it must also merge outputs from individual threads to yield a result representative of the full dataset.
-
-- A query definition must be associated with a selection whose cut determines which entries to count.
+- It must be associated with a selection whose cut determines which entries to count.
   - (Optional) The result is populated with the weight taken into account.
+- How an entry is counted to populate the result is up to the user definition, i.e. it is an arbitrary action.
+  - (Optional) The result is populated based on the values of input columns.
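+
+As an illustrative sketch of how these pieces fit together (the exact syntax is covered in the user guide; the dataflow `df`, the column `x`, and the selection `sel` are assumed to already exist, and the chained calls below are only indicative):
+
+@cpp
+auto qp = df.make(query::plan(linax(100,0.0,1.0))).fill(x); // plan a query and fill it with a column
+auto q = sel.book(qp);                                      // book it at a selection
+auto res = q.result();                                      // performed only once a result is accessed
+@endcpp
+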
 Two common workflows exist in associating queries with selections:
 
@@ -85,12 +84,11 @@
 
 @section conceptual-variations Systematic variations
 
-A sensitivity analysis means to study how changes in the input of a system affect its output. In the context of dataset queries, a **systematic variation** constitutes a __change in a column value that affects the outcome of selections and queries__.
+A sensitivity analysis studies how changes in a system's inputs affect its output.
+In the context of dataset queries, a **systematic variation** constitutes a __change in a column value that affects the outcome of selections and queries__.
 
-In a dataflow, encapsulating the nominal and variations of a column to create a `varied` node in which each variation is mapped by the name of its associated systematic variation.
-A varied node can be treated functionally identical to a non-varied one, with all nominal+variations being propagated underneath:
-Then, the variations are automatically propagated through all relevant task graphs.
-This ensures that the variations are processed in a single dataset traversal, eliminating the runtime overhead associated with repeated runs.
+Encapsulating the nominal and variations of a column creates a `varied` node in which each variation is mapped by the name of its associated systematic variation.
+A varied node in a dataflow can be treated functionally identically to a non-varied one, with all nominal+variations being propagated through all relevant task graphs implicitly:
 
 - Any column definitions and selections evaluated out of varied input columns will be varied.
 - Any queries filled with varied input columns and/or performed at varied selections will be varied.
 
@@ -100,4 +98,6 @@ The propagation proceeds in the following fashion:
 
 - **Lockstep.** If two actions each have a variation of the same name, they are in effect together.
 - **Transparent.** If only one action has a given variation, then the nominal is in effect for the other.
 
+All variations are processed at once in a single dataset traversal, i.e. they incur no additional runtime overhead beyond what is already required to perform the actions themselves.
+
 @image html variation.png "Propagation of systematic variations for z=x+y."
diff --git a/docs/pages/guide.md b/docs/pages/guide.md
index 0da8cc7..923b6fe 100644
--- a/docs/pages/guide.md
+++ b/docs/pages/guide.md
@@ -65,7 +65,7 @@ auto ds_another = df.load(dataset::input(more_data));
 // no need to be another double -- whatever else!
 auto y = ds_another.read(dataset::column("y"));
 
-// syntactical shortcut: implicitly load a dataset and read out all columns at once.
+// shortcut: implicitly load a dataset and read out all columns at once.
 std::ifstream even_more_data("even_more_data.json");
 auto [s, v] = df.read(
   dataset::input(even_more_data),
@@ -88,14 +88,42 @@ auto [s, v] = df.read(
 Call queryosity::dataflow::define() with the appropriate argument.
 @cpp
-// Constant value across all entries
+// -----------------------------------------------------------------------------
+// constant
+// -----------------------------------------------------------------------------
+// their values will not change per-entry
+
 auto zero = df.define(column::constant(0));
+auto one = df.define(column::constant(1));
+auto two = df.define(column::constant(2));
+
+// -----------------------------------------------------------------------------
+// simple expression
+// -----------------------------------------------------------------------------
+// binary/unary operators between underlying data types
+
+auto three = one + two;
+auto v0 = v[zero];
+
+// one += zero; // self-assignment operators are not possible
 
-// Callable (C++ funciton, functor, lambda, etc.) evaluated out of input columns
-// use const& for non-trivial data types to prevent expensive copies
+// -----------------------------------------------------------------------------
+// custom expression
+// -----------------------------------------------------------------------------
+// a callable (C++ function, functor, lambda, etc.) evaluated out of input columns
+
+// pass large values by const reference to prevent expensive copies
 auto s_length = df.define(column::expression([](const std::string& txt){return txt.length();}), s);
 
-// Custom definition evaluated out of input columns
+// -----------------------------------------------------------------------------
+// custom definition
+// -----------------------------------------------------------------------------
+// the most general & performant way to compute a column
+
+class VectorSelection : public column::definition<>
+{
+  // (implementation of the custom definition elided)
+};
 auto v_selected = df.define(column::definition<>(), v);
 @endcpp
@@ -129,8 +157,8 @@ auto cut_b_or_c = df.filter(cut_b || cut_c);
 
 @section guide-query Making queries
 
-Call queryosity::dataflow::make() with a queryosity::query::plan (specifying the query definition and its constructor arguments).
-The plan can be filled with input columns, then booked at a selection to instantiate the query.
+Call queryosity::dataflow::make() with a "plan" specifying the exact definition and constructor arguments of the query.
+Subsequently, the plan can be filled with input columns and booked at a selection to instantiate the query.
 
 @cpp
 using h1d = qty::hist::hist;
@@ -149,10 +177,10 @@
 auto qp_2d = df.make(query::plan(linax(100,0.0,1.0),linax(100,0.0,1.0)))
 
 auto [q_1d, q_2d] = sel.book(qp_1d, qp_2d);
 @endcpp
 
-Accessing the result of any one query triggers the dataset processing for *all* actions.
+Accessing the result of a query turns all actions eager and triggers the dataset traversal.
 
 @cpp
-auto hx_a = q_a.result(); // takes a while -- dataset needs to be processed
+auto hx_a = q_a.result(); // takes a while -- dataset traversal
 auto hxy_sel = q_2d.result(); // instantaneous -- already completed
 @endcpp
 
 @section guide-vary Systematic variations
 
-Call queryosity::dataflow::vary() to create queryosity::lazy::varied columns.
+Call queryosity::dataflow::vary() to create varied columns.
 There are two ways in which variations can be specified:
 
 1. **Automatic.** Specify the type of column to be instantiated along with the nominal+variations. This always ensures the lockstep+transparent propagation of variations.
 2. **Manual.** Provide existing instances of columns to be the nominal+variations; any column whose output value type is compatible with that of the nominal can be set as a variation.
 
-The two approaches can (and should) be used interchangeably with each other for full control over the creation & propagation of systematic variations.
+Both approaches can (and should) be used in a dataflow for full control over the creation & propagation of systematic variations.
 
 @cpp
 // automatic -- set and forget
@@ -229,7 +257,6 @@ q.has_variation("vary_b"); // false
 auto q_nom_res = q.nominal().result();
 auto q_varx_res = q["vary_x"].result();
 auto q_none_res = q["no"].result();
-
 @endcpp
 
 @see