Commit 429356e

Docs

taehyounpark committed Mar 8, 2024
1 parent 14e516e commit 429356e
Showing 6 changed files with 70 additions and 68 deletions.
10 changes: 5 additions & 5 deletions docs/dataflow/cheatsheet.md
@@ -16,7 +16,7 @@ dataflow df(multithread::enable(/*(1)!*/));

1. Requested number of threads (default: system maximum).

=== "Line-by-line"
=== "Keep dataset"
```cpp title="Read columns"
auto ds = df.load(dataset::input</*(1)!*/>(/*(2)!*/));
auto x = ds.read(dataset::column</*(3)!*/>("x"));
@@ -28,11 +28,11 @@ dataflow df(multithread::enable(/*(1)!*/));
3. $x$ Data type
4. $y$ Data type

=== "Less lines"
=== "Just columns"
```cpp title="Read columns"
auto [x, y] = df.read(
dataset::input</*(1)!*/>(/*(2)!*/),
dataset::column</*(3)!*/>("x")
dataset::column</*(3)!*/>("x"),
dataset::column</*(4)!*/>("y")
);
```
@@ -42,7 +42,7 @@ dataflow df(multithread::enable(/*(1)!*/));
3. $x$ Data type
4. $y$ Data type

=== "Least lines"
=== "One-liner"
```cpp title="Read columns"
auto [x, y] = df.read(
dataset::input</*(1)!*/>(/*(2)!*/),
@@ -78,7 +78,7 @@ auto cut_n_wgt = cut.weight(column::expression(/*(2)!*/), /*(3)!*/);
2. Weight decision expression
3. Input column argument(s)
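
For instance, with concrete columns in place of the placeholders (a sketch: `c` and `w` are the boolean and double columns used in the selection examples of docs/dataflow/selection.md below):

```cpp
auto cut = df.filter(c);        // cut decision taken from an existing boolean column
auto cut_n_wgt = cut.weight(w); // compound a statistical weight on top of the cut
```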

```cpp title="Perform queries"
```cpp title="Make queries"
auto q = df.make(query::plan</*(1)!*/>(/*(2)!*/)).fill(/*(3)!*/).book(/*(4)!*/);
auto q_result = q.result();
```
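
As a concrete sketch of the full chain (assuming the `hist_1d` query implementation and `lin_ax` axis used in docs/dataflow/query.md, plus the columns and selection from above):

```cpp
auto q = df.make(query::plan<hist_1d>(lin_ax(10, 0.0, 1.0))) // histogram query plan
           .fill(x)          // populate with column x at each entry
           .book(cut_n_wgt); // only for entries passing the (weighted) selection
auto h = q.result();         // triggers the single dataset traversal
```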
33 changes: 13 additions & 20 deletions docs/dataflow/column.md
@@ -6,28 +6,25 @@

## Loading the dataset

-First, a dataflow must load an input dataset. The signature is as follows:
-
-```cpp
-auto ds = df.load( dataset::input</*(1)!*/>(/*(2)!*/) );
-```
-
-1. `dataset::reader` implementation
-2. Constructor arguments for implementation.
-
For example, consider the following JSON data:
-```{.cpp .no-copy title="data.json"}
-using json = qty::json;
-
+```{.json .no-copy title="data.json"}
[
{"x": 1, "y": [1.0], "z": "a"},
{"x": 2, "y": [], "z": "b"},
{"x": 3, "y": [2.0,0.5], "z": "c"}
]
```
```cpp
-std::ifstream data_file("data.json");
+std::fstream data_file("data.json", std::ios::out | std::ios::in | std::ios::trunc);
+data_file << "[\n"
+          << "  {\"x\": 1, \"y\": [1.0], \"z\": \"a\"},\n"
+          << "  {\"x\": 2, \"y\": [], \"z\": \"b\"},\n"
+          << "  {\"x\": 3, \"y\": [2.0,0.5], \"z\": \"c\"}\n"
+          << "]";
+data_file.clear();
+data_file.seekg(0, std::ios::beg);

+using json = qty::json;
auto ds = df.load( dataset::input<json>(data_file) );
```
??? abstract "qty::json implementation"
@@ -105,8 +102,6 @@ auto ds = df.load( dataset::input<json>(data_file) );
## Reading columns
The loaded dataset can then read columns contained within it.
```cpp
-auto x = ds.read( dataset::column</*(1)!*/>(/*(2)!*/) );
-```
@@ -119,10 +114,6 @@ auto x = ds.read( dataset::column<int>("x") );
auto y = ds.read( dataset::column<std::vector<float>>("y") );
auto z = ds.read( dataset::column<std::string>("z") );
```
??? abstract "qty::json::entry implementation"

Reading columns from an input dataset can be done in one line:

=== "More concise"

```{ .cpp .no-copy }
@@ -143,6 +134,8 @@ Reading columns from an input dataset can be done in one line:
);
```

??? abstract "qty::json::entry implementation"

## Computing columns

The underlying data type `T` must be:
@@ -179,7 +172,7 @@ auto y0 = y[zero];
- Self-assignment operators (e.g. `+=`) are not supported.

!!! info
-    - No undefined behaviour is invoked from `y0`, even if `y` might be empty in some entries.
+    - No undefined behaviour is invoked with `y0` (yet), even if `y` might be empty in some entries (see the sketch below).
    - Remember: all actions here are lazy, and nothing is actually being computed (yet)!
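
A minimal sketch of the laziness point (`column::constant` as used in the selection docs; the comments describe the promised behaviour, not additional API):

```cpp
auto zero = df.define(column::constant<std::size_t>(0));
auto y0 = y[zero]; // lazy: no element access happens here
// y0 is evaluated for an entry only if an action that uses it
// (e.g. a query fill booked under a suitable selection) runs for that entry
```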

### Expression
10 changes: 5 additions & 5 deletions docs/dataflow/query.md
@@ -1,6 +1,6 @@
:heart: [`Boost.Histogram`](https://www.boost.org/doc/libs/1_84_0/libs/histogram/doc/html/index.html)

-## Plan a query
+## Make a plan

```cpp
auto q = df.make( query::plan</*(1)!*/>(/*(2)!*/) );
@@ -24,7 +24,7 @@ auto q = df.make( query::plan<hist_1d>(lin_ax(10,0.0,1.0)) );
auto q = df.make(/*(1)!*/).fill(/*(2)!*/);
```

-1. See [Plan a query](#plan-a-query).
+1. See [Make a plan](#make-a-plan).
2. Input column(s).

A query can be populated with input columns as many times per entry as desired...
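
For example (a sketch: `x` and `y` are previously-read columns, with `hist_1d`/`lin_ax` as above):

```cpp
auto q = df.make(query::plan<hist_1d>(lin_ax(10, 0.0, 1.0)))
           .fill(x)
           .fill(y); // two fills: each entry contributes both x and y
```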
@@ -79,7 +79,7 @@ The associated selection also informs the query of the statistical weight of each
auto q = df.make(/*(1)!*/).fill(/*(2)!*/).book(/*(3)!*/);
```

-1. See [Plan a query](#create)
+1. See [Make a plan](#make-a-plan)
2. See [Fill with columns](#fill-with-columns)
3. Query is executed over the subset of entries for which the selection cut passes.
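
A sketch of booking the same plan at two selections (the two-argument `book()` and the structured-binding return are assumptions based on the surrounding text):

```cpp
auto [q_a, q_b] = df.make(query::plan<hist_1d>(lin_ax(10, 0.0, 1.0)))
                    .fill(x)
                    .book(cut_a, cut_b); // one filled plan, booked at two selections
```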

@@ -119,12 +119,12 @@ Multiple selections/queries can be booked at a time:
auto q_result = df.make(/*(1)!*/).fill(/*(2)!*/).book(/*(3)!*/).result();
```

-1. See [Plan a query](#plan-a-query).
+1. See [Make a plan](#make-a-plan).
2. See [Fill with columns](#fill-with-columns).
3. See [Book over selections](#book-over-selections).


-More concisely, if the query definition outputs a pointer-type result, the lazy node itself can be treated as the pointer:
+If a query definition outputs a pointer to the result, the query itself can be treated as the pointer:
```cpp
h1x_a.result(); // std::shared_ptr to boost::histogram
h1x_a->at(0); // same as h1x_a.result()->at(0);
31 changes: 12 additions & 19 deletions docs/dataflow/selection.md
@@ -21,18 +21,15 @@ auto weighted = df.weight( column::expression(/*(1)!*/), /*(2)!*/);

<!-- -->

-1. The simplest way to apply a selection is using an existing column.
-2. A syntactic shortcut to first evaluate a column expression is available.
+1. Creating a selection out of an existing column.
+2. A syntactic shortcut using an expression.

## Compounding selections

-Calling a subsequent `filter()` and `weight()` action from an existing selection node compound it on top of the chain.
-Either can be inter-compounded, as in:
+Calling a subsequent selection from an existing selection node compounds them.
+Cuts and weights can be inter-compounded:
```{.cpp .no-copy}
-auto [c, w] = ds.read(
-  dataset::column<bool>("c"),
-  dataset::column<double>("w")
-)
+auto [c, w] = ds.read(dataset::columns<bool,double>("c","w"));

auto cut_n_weighted = df.filter(c).weight(w);
// cut_n_weighted.passed_cut() = c.value() && true;
@@ -41,18 +38,14 @@

## Branching selections

-Multiple selections can be compounded from one common node, i.e. selections can branch out:
+Multiple selections can be compounded from one common node:
```{.cpp .no-copy}
-auto [a, b] = ds.read(
-  dataset::column<bool>("a"),
-  dataset::column<bool>("b"),
-);
-
-auto cut_a = df.filter(a);
-auto cut_ab = a.filter(b);
-auto cut_ac = a.filter(c);
-```
+auto [a, b, c] = ds.read(dataset::columns<bool,bool,bool>("a","b","c"));

+auto cut_a = df.filter(a);     // a
+auto cut_ab = cut_a.filter(b); // a && b
+auto cut_ac = cut_a.filter(c); // a && c
+```

## Merging selections

@@ -61,7 +54,7 @@ Consider two selections within a cutflow. Taking the AND/OR of them is commonly
- AND: Quantifying overlap between two selections.
- OR: Consolidating two not-mutually-exclusive selections into one.
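
One way to express these, sketched with the `column::expression` form of `filter()` shown earlier (`a` and `b` are boolean cut columns as in the branching section; the doc's own worked example follows below):

```{.cpp .no-copy}
auto cut_a_and_b = df.filter(column::expression([](bool p, bool q){ return p && q; }), a, b);
auto cut_a_or_b  = df.filter(column::expression([](bool p, bool q){ return p || q; }), a, b);
```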

-```cpp
+```{.cpp .no-copy}
auto two = df.define( column::constant<unsigned int>(2) );
auto three = df.define( column::constant<unsigned int>(3) );

2 changes: 1 addition & 1 deletion docs/dataflow/vary.md
@@ -92,7 +92,7 @@ Varied columns obtained from either approach can (and should) be used interchangeably
!!! tip

1. All variations are put in *manually* using existing columns as the user sees fit.
-2. The variations propagate in lockstep and transparently otherwise afterwards.
+2. The varied node then propagates through subsequent actions automatically.

Expand Down
52 changes: 34 additions & 18 deletions docs/home/design.md
@@ -1,25 +1,41 @@
`queryosity` provides a "dataflow" model of structured data analysis.
Possible use cases include:
`queryosity` provides a "DataFlow", or row-wise, model of structured data analysis.
Users specify operations as if they are inside for-loop over the entries of the dataset.
The key is to *specify* without executing the operations on a dataset until they are needed, i.e. create *lazy* actions.
All lazy actions in a DataFlow are executed together so that the dataset traversal occurs once.
Each action is executed for a given entry *only if needed* in order maximize CPU usage and efficiency (see also [Lazy actions](../concepts/lazy.md)), which is especially relevant for analysis workflows not limited by I/O bandwidth.
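
A sketch of what this means in code (column/selection/query calls as in the dataflow docs; `hist_1d` and `lin_ax` are assumed query helpers):

```cpp
// nothing is executed while the graph is specified...
auto x = ds.read(dataset::column<double>("x"));
auto cut = df.filter(column::expression([](double v){ return v > 0; }), x);
auto q = df.make(query::plan<hist_1d>(lin_ax(10, 0.0, 1.0))).fill(x).book(cut);
// ...until a result is requested, which traverses the dataset exactly once
auto h = q.result();
```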

-- Data analysis of complex phenomena, e.g. high-energy physics experiments (the author's primary purpose).
-- Data processing pipelines to transform complicated data structures into simpler ones.
-
-Analyzers interact with a row-wise interface; in other words, specify operations as if they were inside the entry-loop of the dataset.
-The keyword here is to *specify* without executing operations that need to be performed on a dataset until they are needed, which are called *lazy* actions.
-All actions specified up in the dataflow are performed together so that the dataset traversal only needs to happen once.
-Inside the multithreaded entry-loop, each action is performed for a given entry *only if needed* in order maximize CPU usage and efficiency; this is especially relevant for analysis workflows not bottle-necked by I/O bandwidth.
+This library has been purposefully developed for high-energy physics experiments.
+It is the author's hope that it can be similarly useful for studying other complex phenomena.

## Why not DataFrames?

-The key distinction between dataflow and DataFrame models is in the layout of the underlying dataset.
-DataFrames typically target a specific layout in which each column can be represented as a numerical or array data type, which also enables vectorized operations on columns.
-This covers a wide majority of data analysis workflows as evident from the widespread use of such libraries; however, it may not be suitable for instances in which columns contain non-trivial and/or highly-nested data.
-Not only are these scenarios are not so uncommon as listed above, trying to shoehorn DataFrame methods to fit their needs are unwieldy at best and could even result in worse performance (single-threaded loops).
-Therefore, `queryosity` foregoes the array-wise approach (along with SIMD) and adopts a row-wise one (return to monke...) in favor of manipulating columns of arbitrary data types.
+Two key distinctions separate the DataFlow from the plethora of DataFrame libraries.

+### Conceptual

+DataFrames can do both array-wise and row-wise operations, but the former is where they shine, whereas the latter is more of a fallback.
+The result is a bloated and complicated API, where careless mixing-and-matching of the two approaches costs user sanity as well as machine performance (see also the [Technical](#technical) point below).

+On the other hand, putting row-wise reasoning at the forefront is the intuitive way for humans to think about tabular datasets (return to monke...).
+This significantly simplifies the DataFlow API.
+It is also distinguished by its syntactical/representational resemblance to the actual computation graph performed inside each entry.
+In other words, my analysis code can *actually* be readable to you, and vice versa!

+### Technical

+DataFrames are optimized for a specific dataset structure in which the values in each column can be organized into a contiguous array.
+Operations on these arrays of primitive data types can offer algorithmic (e.g. linear vs. binary search) as well as hardware (e.g. vectorized operations) speedups.

+(If your dataset fits into a DataFrame, you should definitely use one.)

+While this covers a wide range of data analysis workflows, as is evident from the widespread use of these libraries, it is not suitable for instances in which columns contain non-trivial and/or highly-nested data.
+Such cases are not so uncommon.
+Variations of the DataFrame model that support more general column data types do exist (based on the author's knowledge and understanding):

+- [C++ DataFrame](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html): custom data types up to a certain register size and memory layout (i.e. no pointers!) using custom memory allocators.
+- [Awkward Array](https://awkward-array.org/doc/main/): generalization of the array API to enable key lookups for nested traits while preserving SIMD where possible.

+While these further enhance the universality of DataFrame methods, they are not panaceas as long as there is *some* data out there too non-trivial to be manipulated in such a fashion.
+No such restrictions exist with the DataFlow; that is promised as one of its main features.
+It also does the best it can in terms of performance through lazy actions and multithreading.
