From 8925e7bb1d4281506483254dfb3d09a414afdd9c Mon Sep 17 00:00:00 2001
From: taehyounpark
Date: Thu, 28 Mar 2024 18:36:35 -0400
Subject: [PATCH] Docs

---
 docs/pages/conceptual.md | 64 ++++++++++++++++++++--------------------
 docs/pages/guide.md      | 51 ++++++++++++++++++++++++--------
 2 files changed, 71 insertions(+), 44 deletions(-)

diff --git a/docs/pages/conceptual.md b/docs/pages/conceptual.md
index 9248a0f..7990187 100644
--- a/docs/pages/conceptual.md
+++ b/docs/pages/conceptual.md
@@ -7,8 +7,8 @@ A `dataflow` consists of a directed, acyclic graph of tasks performed for each e
 
 @image html dataflow.png "The dataflow"
 
-An action falls into one of three types that reflect the nature of the task and are uniquely associated with a set of applicable methods.
-Each type forms a sub-graphs of tasks, which receive actions of the previous types as inputs:
+An action is a node belonging to one of three task sub-graphs, each of which is associated with a set of applicable methods.
+Actions in each task graph can receive actions from the previous graphs as inputs:
 
 | Action | Description | Methods | Description | Task Graph | Input actions |
 | :--- | :-- | :-- | :-- | :-- | :-- |
@@ -16,44 +16,44 @@ Each type forms a sub-graphs of tasks, which receive actions of the previous typ
 | | | `define()` | Evaluate a column. | | |
 | `selection` | Boolean decision | `filter()` | Apply a cut. | Cutflow | `column` |
 | | Floating-point decision | `weight()` | Apply a statistical significance. | | |
-| `query` | Perform a query | `make()` | Make a query plan. | Experiment | `column` & `selection` |
+| `query` | Perform a query | `make()` | Plan a query. | Experiment | `column` & `selection` |
 | | | `fill()` | Populate with column value(s). | | |
 | | | `book()` | Perform over selected entries. | | |
 
 @section conceptual-lazy Lazy actions
 
 All actions are "lazy", meaning they are not executed unless required.
-Accessing the result of any query, which requires any of its associated actions to be performed over the entire dataset, turns them "eager".
-Once the dataset traversal is underway for any one query, all existing queries for the dataflow up to that point are also processed.
-The eagerness of actions in each entry are as follows:
+Accessing the result of a query turns it "eager" and triggers the dataset traversal.
+The dataset traversal performs all existing queries of the dataflow up to that point.
+The eagerness of actions in each entry is as follows:
 
-1. A query is executed only if its associated selection passes its selection.
-2. A selection is evaluated only if all of its prior selections in the cutflow have passed.
+1. A query is executed only if its associated selection passes the cut.
+2. A selection is evaluated only if all prior cuts in the cutflow have passed.
 3. A column is evaluated only if it is needed to determine any of the above.
 
 @section conceptual-columns Columns
 
-A `column` contains some data type `T` whose value changes, i.e. must be updated, for each entry.
-Columns that are read-in from a dataset or defined as constants are *independent*, i.e. their values do not depend on others.
-A tower of dependent columns evaluated out of others as inputs forms the computation graph:
+A `column` contains some data type `T` whose value is updated for each entry.
+Columns read in from a dataset, as well as user-defined constants, are *independent*, i.e. their values do not depend on others; all other user-defined columns are *dependent* and require existing columns as inputs.
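+For example, using the interface shown in the user guide (the dataflow `df`, the loaded dataset `ds`, and all column names below are purely illustrative), `x` and `one` are independent columns, while `y` is dependent:
+
+@cpp
+// illustrative sketch -- see the user guide for the exact interface
+auto x = ds.read(dataset::column("x"));      // independent: read in from a loaded dataset
+auto one = df.define(column::constant(1.0)); // independent: a constant value
+// dependent: evaluated out of the existing column x
+auto y = df.define(column::expression([](double xval){ return xval + 1.0; }), x);
+@endcpp
+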
+A tower of dependent columns evaluated out of more independent ones forms the computation graph:
 
 @image html computation.png "Example computation graph"
 
-@paragraph conceptual-columns-lazy Lazy optimizations
+Only the minimum number of computations needed is performed for each entry:
 
 - If and when a column value is computed for an entry, it is cached and never re-computed.
 - A column value is not copied when used as an input for dependent columns.
-  - It *is* copied a conversion is required.
+  - It *is* copied if a conversion is required.
 
 @section conceptual-selections Selections
 
-A `selection` is a specific type of scalar-valued columns representing a decision made on an entry:
+A `selection` represents a scalar-valued decision made on an entry:
 
-- A boolean as a `cut` to determine if the entry should be counted in a query or not.
-  - A series of two or more cuts is equivalent to their intersection, `&&`.
-- A floating-point value as a `weight` to assign a statistical significance to the entry.
-  - A series of two or more weights is equivalent to their product, `*`.
+- A boolean `cut` to determine if a query should be performed for a given entry.
+  - A series of two or more cuts becomes their intersection, `&&`.
+- A floating-point `weight` to assign a statistical significance to the entry.
+  - A series of two or more weights becomes their product, `*`.
 
-A cutflow can have from the following types connections between nodes:
+A cutflow can have the following types of connections between selections:
 
 @image html cutflow.png "Example cutflow structure"
 
@@ -61,21 +61,20 @@
 - Branching selections by applying more than one selection from a common node.
 - Merging two selections, e.g. taking the union/intersection of two cuts.
 
-@paragraph conceptual-selections-lazy Lazy optimizations
-- A selection decision is cached for each entry (because they are also columns).
+Selections constitute a specific type of column; as such, they are subject to the same value-caching and computation behaviour.
+In addition, the cutflow imposes the following rules on them:
 - The cut decision is evaluated only if its previous cut has passed.
 - The weight decision is evaluated only if the cut has passed.
 
 @section conceptual-query Queries
 
-A `query` definition specifies an output whose result is obtained from counting entries of the dataset.
+A `query` specifies an output whose result is obtained by counting entries of the dataset.
+For multithreaded runs, the user must also define how outputs from individual threads are merged together to yield a result representative of the full dataset.
 
-- The query definition dictates how an entry is to be counted, i.e. it is an arbitrary action:
-  - (Optional) The result is populated based on values of inputs columns.
-  - In multithreaded runs, it must also merge outputs from individual threads to yield a result representative of the full dataset.
-
-- A query definition must be associated with a selection whose cut determines which entries to count.
+- It must be associated with a selection whose cut determines which entries to count.
   - (Optional) The result is populated with the weight taken into account.
+- How an entry is counted to populate the result is up to the user definition, i.e. it is an arbitrary action.
+  - (Optional) The result is populated based on the values of input columns.
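+
+As an illustrative sketch of how these pieces fit together (the exact syntax is covered in the user guide; the dataflow `df`, the column `x`, and the selection `sel` are assumed to already exist, and the chained calls below are only indicative):
+
+@cpp
+auto qp = df.make(query::plan(linax(100,0.0,1.0))).fill(x); // plan a query and fill it with a column
+auto q = sel.book(qp);                                      // book it at a selection
+auto res = q.result();                                      // performed only once a result is accessed
+@endcpp
+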
 Two common workflows exist in associating queries with selections:
 
@@ -85,12 +84,11 @@
 
 @section conceptual-variations Systematic variations
 
-A sensitivity analysis means to study how changes in the input of a system affect its output. In the context of dataset queries, a **systematic variation** constitutes a __change in a column value that affects the outcome of selections and queries__.
+A sensitivity analysis studies how changes in a system's inputs affect its output.
+In the context of dataset queries, a **systematic variation** constitutes a __change in a column value that affects the outcome of selections and queries__.
 
-In a dataflow, encapsulating the nominal and variations of a column to create a `varied` node in which each variation is mapped by the name of its associated systematic variation.
-A varied node can be treated functionally identical to a non-varied one, with all nominal+variations being propagated underneath:
-Then, the variations are automatically propagated through all relevant task graphs.
-This ensures that the variations are processed in a single dataset traversal, eliminating the runtime overhead associated with repeated runs.
+Encapsulating the nominal and variations of a column creates a `varied` node in which each variation is mapped by the name of its associated systematic variation.
+A varied node in a dataflow can be treated functionally identically to a non-varied one, with all nominal+variations being propagated through all relevant task graphs implicitly:
 
 - Any column definitions and selections evaluated out of varied input columns will be varied.
 - Any queries filled with varied input columns and/or performed at varied selections will be varied.
 
@@ -100,4 +98,6 @@ The propagation proceeds in the following fashion:
 
 - **Lockstep.** If two actions each have a variation of the same name, they are in effect together.
 - **Transparent.** If only one action has a given variation, then the nominal is in effect for the other.
 
+All variations are processed at once in a single dataset traversal, i.e. they incur no additional runtime overhead beyond what is already required to perform the actions themselves.
+
 @image html variation.png "Propagation of systematic variations for z=x+y."
diff --git a/docs/pages/guide.md b/docs/pages/guide.md
index 0da8cc7..923b6fe 100644
--- a/docs/pages/guide.md
+++ b/docs/pages/guide.md
@@ -65,7 +65,7 @@ auto ds_another = df.load(dataset::input(more_data));
 // no need to be another double -- whatever else!
 auto y = ds_another.read(dataset::column("y"));
 
-// syntactical shortcut: implicitly load a dataset and read out all columns at once.
+// shortcut: implicitly load a dataset and read out all columns at once.
 std::ifstream even_more_data("even_more_data.json");
 auto [s, v] = df.read(
   dataset::input(even_more_data),
@@ -88,14 +88,42 @@ auto [s, v] = df.read(
 Call queryosity::dataflow::define() with the appropriate argument.
 @cpp
-// Constant value across all entries
+// -----------------------------------------------------------------------------
+// constant
+// -----------------------------------------------------------------------------
+// their values will not change per-entry
+
 auto zero = df.define(column::constant(0));
+auto one = df.define(column::constant(1));
+auto two = df.define(column::constant(2));
+
+// -----------------------------------------------------------------------------
+// simple expression
+// -----------------------------------------------------------------------------
+// binary/unary operators between underlying data types
+
+auto three = one + two;
+auto v0 = v[zero];
+
+// one += zero; // self-assignment operators are not possible
 
-// Callable (C++ funciton, functor, lambda, etc.) evaluated out of input columns
-// use const& for non-trivial data types to prevent expensive copies
+// -----------------------------------------------------------------------------
+// custom expression
+// -----------------------------------------------------------------------------
+// a callable (C++ function, functor, lambda, etc.) evaluated out of input columns
+
+// pass large values by const reference to prevent expensive copies
 auto s_length = df.define(column::expression([](const std::string& txt){return txt.length();}), s);
 
-// Custom definition evaluated out of input columns
+// -----------------------------------------------------------------------------
+// custom definition
+// -----------------------------------------------------------------------------
+// the most general & performant way to compute a column
+
+class VectorSelection : public column::definition<>
+{
+  // (implementation of the custom definition elided)
+};
 auto v_selected = df.define(column::definition<>(), v);
 @endcpp
@@ -129,8 +157,8 @@ auto cut_b_or_c = df.filter(cut_b || cut_c);
 
 @section guide-query Making queries
 
-Call queryosity::dataflow::make() with a queryosity::query::plan (specifying the query definition and its constructor arguments).
-The plan can be filled with input columns, then booked at a selection to instantiate the query.
+Call queryosity::dataflow::make() with a "plan" specifying the exact definition and constructor arguments of the query.
+Subsequently, the plan can be filled with input columns and booked at a selection to instantiate the query.
 
 @cpp
 using h1d = qty::hist::hist;
@@ -149,10 +177,10 @@
 auto qp_2d = df.make(query::plan(linax(100,0.0,1.0),linax(100,0.0,1.0)))
 
 auto [q_1d, q_2d] = sel.book(qp_1d, qp_2d);
 @endcpp
 
-Accessing the result of any one query triggers the dataset processing for *all* actions.
+Accessing the result of a query turns all actions eager and triggers the dataset traversal.
 
 @cpp
-auto hx_a = q_a.result(); // takes a while -- dataset needs to be processed
+auto hx_a = q_a.result(); // takes a while -- dataset traversal
 auto hxy_sel = q_2d.result(); // instantaneous -- already completed
 @endcpp
 
 @section guide-vary Systematic variations
 
-Call queryosity::dataflow::vary() to create queryosity::lazy::varied columns.
+Call queryosity::dataflow::vary() to create varied columns.
 There are two ways in which variations can be specified:
 
 1. **Automatic.** Specify the type of column to be instantiated along with the nominal+variations. This always ensures the lockstep+transparent propagation of variations.
 2. **Manual.** Provide existing instances of columns to be the nominal+variations; any column whose output value type is compatible with that of the nominal can be set as a variation.
 
-The two approaches can (and should) be used interchangeably with each other for full control over the creation & propagation of systematic variations.
+Both approaches can (and should) be used in a dataflow for full control over the creation & propagation of systematic variations.
 
 @cpp
 // automatic -- set and forget
@@ -229,7 +257,6 @@ q.has_variation("vary_b"); // false
 auto q_nom_res = q.nominal().result();
 auto q_varx_res = q["vary_x"].result();
 auto q_none_res = q["no"].result();
-
 @endcpp
 
 @see