Commit 0f8d1f8

committed
Docs
1 parent 8b87325 commit 0f8d1f8

5 files changed

+864
-716
lines changed

docs/pages/guide.md

Lines changed: 153 additions & 101 deletions
Original file line number | Diff line number | Diff line change
@@ -43,24 +43,31 @@ Call queryosity::dataflow::load() with an input dataset and its constructor argu
4343
The loaded dataset can then read out columns, provided their data types and names.
4444

4545
@cpp
46-
using json = qty::json;
46+
std::ifstream data_json("data.json");
4747

48-
std::ifstream data("data.json");
49-
auto ds = df.load(dataset::input<json>(data));
48+
// load a dataset
49+
using json = qty::json;
50+
auto ds = df.load(dataset::input<json>(data_json));
5051

52+
// read a column
5153
auto x = ds.read(dataset::column<double>("x"));
54+
55+
// shortcut: read multiple columns from a dataset
56+
auto [w, cat] = ds.read(dataset::column<double>("weight"), dataset::column<std::string>("category"));
5257
@endcpp
5358

5459
A dataflow can load multiple datasets, as long as all valid partitions reported by queryosity::dataset::source::partition() have the same number of total entries.
5560
A dataset can report an empty partition, which signals that it relinquishes control to the other datasets.
5661

5762
@cpp
58-
using csv = qty::csv;
59-
6063
std::ifstream data_csv("data.csv");
61-
auto y = df.load(dataset::input<csv>(data_csv)).read(dataset::column<double>("y"));
6264

63-
auto z = x+y;
65+
// another shortcut: load dataset & read column(s) at once
66+
using csv = qty::csv;
67+
auto y = df.read(dataset::input<csv>(data_csv), dataset::column<double>("y"));
68+
69+
// x from json, y from csv
70+
auto z = x + y; // (see next section)
6471
@endcpp
6572

6673
@see
@@ -76,94 +83,116 @@ auto z = x+y;
7683

7784
@section guide-column Computing quantities
7885

79-
Call queryosity::dataflow::define() with the appropriate argument.
86+
New columns can be computed out of existing ones by calling queryosity::dataflow::define() with the appropriate argument, or by applying operators between the underlying data types.
8087

8188
@cpp
82-
// -----------------------------------------------------------------------------
83-
// constant
84-
// -----------------------------------------------------------------------------
85-
// their values will not change per-entry
86-
89+
// constant columns do not change per-entry
8790
auto zero = df.define(column::constant(0));
8891
auto one = df.define(column::constant(1));
8992
auto two = df.define(column::constant(2));
9093

91-
// -----------------------------------------------------------------------------
92-
// simple expression
93-
// -----------------------------------------------------------------------------
94-
// binary/unary operators between underlying data types
95-
94+
// binary/unary operators
9695
auto three = one + two;
97-
auto v0 = v[zero];
96+
auto v_0 = v[zero];
97+
// reminder: actions are *lazy*, i.e. no undefined behaviour (yet)
9898

99-
// one += zero; // self-assignment operators are not possible
99+
// self-assignment operators are not possible
100+
// one += two;
101+
102+
// C++ function, functor, lambda, etc. evaluated out of input columns
103+
// tip: pass large values by const& to prevent copies
104+
auto s_length = df.define(
105+
column::expression([](const std::string &txt) { return txt.length(); }), s);
106+
@endcpp
100107

101-
// -----------------------------------------------------------------------------
102-
// custom expression
103-
// -----------------------------------------------------------------------------
104-
// (C++ funciton, functor, lambda, etc.) evaluated out of input columns
108+
A column can also be defined by a custom implementation, which offers:
105109

106-
// pass large values by const reference to prevent expensive copies
107-
auto s_length = df.define(column::expression([](const std::string& txt){return txt.length();}), s);
110+
- Customization: user-defined constructor arguments and member variables/functions.
111+
- Optimization: each input column is provided as a column::observable<T>, which defers its computation for the entry until column::observable<T>::value() is invoked.
108112

109-
// -----------------------------------------------------------------------------
110-
// custom definition
111-
// -----------------------------------------------------------------------------
112-
// the most general & performant way to compute a column
113+
As an example, consider the following calculation of a factorial via Stirling's approximation:
113114

114-
class VectorSelection : public column::definition<>
115-
{
115+
@cpp
116+
using ull_t = unsigned long long;
117+
118+
// 1. full calculation
119+
auto factorial(ull_t n) {
120+
ull_t result = 1;
121+
while (n > 1)
122+
result *= n--;
123+
return result;
124+
}
125+
// 2. approximation
126+
auto stirling = [](ull_t n) {
127+
return std::round(std::sqrt(2 * M_PI * n) * std::pow(n / std::exp(1), n));
128+
};
116129

130+
// 1. if n is small enough for n! to fit inside ull_t, use full calculation
131+
// 2. if n is large enough, use approximation
132+
auto n = ds.read(dataset::column<ull_t>("n"));
133+
auto n_f_fast = df.define(column::expression(stirling), n);
134+
auto n_f_full = df.define(column::expression(factorial), n);
135+
ull_t n_threshold = 10;
136+
137+
// using expression
138+
auto n_f_slow = df.define(column::expression([n_threshold](ull_t n, ull_t fast, ull_t slow){
139+
return n >= n_threshold ? fast : slow; }), n, n_f_fast, n_f_full);
140+
// time elapsed = t(n) + t(fast) + t(slow)
141+
// :(
142+
143+
// using definition
144+
class Factorial : public column::definition<double(ull_t, double, double)> {
145+
public:
146+
Factorial(ull_t threshold) : m_threshold(threshold) {}
147+
virtual ~Factorial() = default;
148+
double evaluate(column::observable<ull_t> n, column::observable<double> fast,
149+
column::observable<double> full) const override {
150+
return (n.value() >= m_threshold) ? fast.value() : full.value();
151+
}
152+
void adjust_threshold(ull_t threshold) {
153+
m_threshold = threshold;
154+
}
155+
156+
protected:
157+
ull_t m_threshold;
117158
};
118-
auto v_selected = df.define(column::definition<>(), v);
159+
auto n_f_best = df.define(column::definition<Factorial>(n_threshold), n, n_f_fast, n_f_full);
160+
// time elapsed = t(n) + { t(n_f_fast) if n >= 10, t(n_f_full) if n < 10 }
161+
// :)
162+
163+
// advanced: access per-thread instance
164+
dataflow::node::invoke([](Factorial* n_f){n_f->adjust_threshold(20);}, n_f_best);
119165
@endcpp
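
The benefit of deferring an input until its value is requested can be illustrated outside the library. The following is a minimal standalone sketch, not queryosity's implementation: `lazy_value`, `factorial_full`, `factorial_stirling` and `pick` are hypothetical names introduced here, and `lazy_value` only mimics the idea of an observable whose computation runs on first access.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <optional>
#include <utility>

using ull_t = unsigned long long;

// Hypothetical stand-in for a lazily evaluated input.
// It is NOT queryosity's column::observable<T>; it only models the same idea.
template <typename T> class lazy_value {
public:
  explicit lazy_value(std::function<T()> compute)
      : m_compute(std::move(compute)) {}

  // Runs the computation on first access and caches the result afterwards.
  const T &value() const {
    if (!m_cache) {
      m_cache = m_compute();
      ++m_calls;
    }
    return *m_cache;
  }

  // Number of times the underlying computation actually ran (0 or 1).
  int calls() const { return m_calls; }

private:
  std::function<T()> m_compute;
  mutable std::optional<T> m_cache;
  mutable int m_calls = 0;
};

// Full calculation: product n * (n-1) * ... * 2, accumulated in a double.
double factorial_full(ull_t n) {
  double result = 1.0;
  while (n > 1)
    result *= static_cast<double>(n--);
  return result;
}

// Stirling's approximation: n! ~ sqrt(2*pi*n) * (n/e)^n.
double factorial_stirling(ull_t n) {
  const double pi = std::acos(-1.0);
  const double x = static_cast<double>(n);
  return std::sqrt(2.0 * pi * x) * std::pow(x / std::exp(1.0), x);
}

// Only the branch that is selected has its value() invoked,
// so only one of the two computations runs per entry.
double pick(ull_t n, ull_t threshold, const lazy_value<double> &fast,
            const lazy_value<double> &full) {
  return (n >= threshold) ? fast.value() : full.value();
}
```

With `n = 5` and a threshold of 10, `pick` evaluates only the full calculation and never touches the approximation, which is the saving that a plain expression taking all inputs by value cannot offer.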
120166

121167
@see
122168
- queryosity::column::definition (API)
123169
- queryosity::column::definition<Out(Ins...)> (ABC)
170+
- queryosity::column::observable<T> (API)
124171

125172
@section guide-selection Applying selections
126173

127174
Call queryosity::dataflow::filter() or queryosity::dataflow::weight() to initiate a selection in the cutflow, and apply subsequent selections from existing nodes to compound them.
128175

129176
@cpp
130-
// -----------------------------------------------------------------------------
131-
// initiate a cutflow
132-
// -----------------------------------------------------------------------------
177+
auto [w, cat] = ds.read(dataset::column<double>("weight"), dataset::column<std::string>("category"));
178+
auto a = df.define(column::constant<std::string>("a"));
179+
auto b = df.define(column::constant<std::string>("b"));
180+
auto c = df.define(column::constant<std::string>("c"));
133181

134-
// pass all entries, apply a weight
182+
// initiate a cutflow
135183
auto weighted = df.weight(w);
136184

137-
// -----------------------------------------------------------------------------
138-
// compounding
139-
// -----------------------------------------------------------------------------
140185
// cuts and weights can be compounded in any order.
141-
142-
// ignore entry if weight is negative
143186
auto cut = weighted.filter(
144187
column::expression([](double w){return (w>=0);}), w
145188
);
146189

147-
// -----------------------------------------------------------------------------
148-
// branching out
149-
// -----------------------------------------------------------------------------
150190
// applying more than one selection from a node creates a branching point.
151-
152-
auto cat = ds.read<std::string>("cat");
153-
154-
auto a = df.define(column::constant<std::string>("a"));
155-
auto b = df.define(column::constant<std::string>("b"));
156-
auto c = df.define(column::constant<std::string>("c"));
157-
158191
auto cut_a = cut.filter(cat == a);
159192
auto cut_b = cut.filter(cat == b);
160193
auto cut_c = cut.filter(cat == c);
161194

162-
// -----------------------------------------------------------------------------
163-
// merging
164-
// -----------------------------------------------------------------------------
165195
// selections can be merged based on their decision values.
166-
167196
auto cut_a_and_b = df.filter(cut_a && cut_b);
168197
auto cut_b_or_c = df.filter(cut_b || cut_c);
169198
@endcpp
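
As a conceptual aid, compounding can be pictured as each selection carrying a pass/fail decision and a weight per entry. The sketch below is a standalone model under that assumption (a cut combines with a logical AND, a weight multiplies); `selection_result`, `all_entries`, `apply_cut` and `apply_weight` are hypothetical names and are not part of the queryosity API.

```cpp
#include <cassert>

// Hypothetical per-entry outcome of a selection; not a queryosity type.
struct selection_result {
  bool passed;   // cumulative decision of all cuts so far
  double weight; // cumulative product of all weights so far
};

// Starting point of a cutflow: every entry passes with unit weight.
inline selection_result all_entries() { return {true, 1.0}; }

// Compounding a cut: the entry passes only if it passed before AND passes now.
inline selection_result apply_cut(selection_result previous, bool decision) {
  return {previous.passed && decision, previous.weight};
}

// Compounding a weight: the decision is untouched, the weight is multiplied.
inline selection_result apply_weight(selection_result previous, double w) {
  return {previous.passed, previous.weight * w};
}
```

Because AND and multiplication act on separate fields, applying a cut before a weight or a weight before a cut gives the same result in this model, which is consistent with the statement that they can be compounded in any order.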
@@ -178,23 +207,22 @@ using h1d = qty::hist::hist<double>;
178207
using h2d = qty::hist::hist<double,double>;
179208
using linax = qty::hist::axis::regular;
180209

181-
// instantiate a 1d histogram query filled with x over all entries
182-
auto q = df.make(query::plan<h1d>(linax(100,0.0,1.0))).fill(x).book(inclusive);
210+
// plan a 1d/2d histogram query filled with x/(x,y)
211+
auto q_1d = df.make(query::plan<h1d>(linax(100,0.0,1.0))).fill(x);
212+
auto q_2d = df.make(query::plan<h2d>(linax(100,0.0,1.0), linax(100,0.0,1.0))).fill(x,y);
183213

184-
// a plan can book multiple queries at a selection
185-
auto [q_a, q_b] = df.make(query::plan<h1d>(linax(100,0.0,1.0))).fill(x).book(cut_a, cut_b);
214+
// query at multiple selections
215+
auto [q_1d_a, q_1d_b] = q_1d.book(cut_a, cut_b);
186216

187-
// a selection can book multiple queries (of different types)
188-
auto qp_1d = df.make(query::plan<h1d>(linax(100,0.0,1.0))).fill(x);
189-
auto qp_2d = df.make(query::plan<hist2d>(linax(100,0.0,1.0),linax(100,0.0,1.0))).fill(x,y);
190-
auto [q_1d, q_2d] = sel.book(qp_1d, qp_2d);
217+
// multiple (different) queries at a selection
218+
auto [q_1d_c, q_2d_c] = cut_c.book(q_1d, q_2d);
191219
@endcpp
192220

193221
Access the result of a query to turn all actions eager.
194222

195223
@cpp
196-
auto hx_a = q_a.result(); // takes a while -- dataset traversal
197-
auto hxy_sel = q_2d.result(); // instantaneous -- already completed
224+
auto hx_a = q_1d_a.result(); // takes a while
225+
auto hxy_c = q_2d_c.result(); // instantaneous
198226
@endcpp
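
Why the first result takes a while and the second is instantaneous can be sketched with a toy model: all booked queries are filled in a single pass over the data, triggered by the first result that is requested. `toy_flow` below is a hypothetical standalone class written for illustration, not queryosity code, and it assumes that every query is booked before any result is accessed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy model of lazy queries sharing one dataset traversal.
class toy_flow {
public:
  explicit toy_flow(std::vector<double> data) : m_data(std::move(data)) {}

  // Book a query: the sum of entries passing `pred`. Nothing is computed yet.
  std::size_t book(std::function<bool(double)> pred) {
    m_preds.push_back(std::move(pred));
    m_sums.push_back(0.0);
    m_done = false; // a newly booked query invalidates earlier results
    return m_preds.size() - 1;
  }

  // Accessing any result triggers the traversal if it has not run yet.
  double result(std::size_t query) {
    if (!m_done)
      run();
    return m_sums[query];
  }

  // Number of passes over the data performed so far.
  int traversals() const { return m_traversals; }

private:
  // One pass over the data fills every booked query.
  void run() {
    m_sums.assign(m_preds.size(), 0.0);
    for (double x : m_data)
      for (std::size_t i = 0; i < m_preds.size(); ++i)
        if (m_preds[i](x))
          m_sums[i] += x;
    m_done = true;
    ++m_traversals;
  }

  std::vector<double> m_data;
  std::vector<std::function<bool(double)>> m_preds;
  std::vector<double> m_sums;
  bool m_done = false;
  int m_traversals = 0;
};
```

Booking two queries and then asking for both results leaves `traversals()` at 1: the second call returns a value that was already filled during the first.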
199227

200228
@see
@@ -207,69 +235,93 @@ auto hxy_sel = q_2d.result(); // instantaneous -- already completed
207235
Call queryosity::dataflow::vary() to create varied columns.
208236
There are two ways in which variations can be specified:
209237

210-
1. **Automatic.** Specify a specific type of column to be instantiated along with the nominal+variations. Always ensures the lockstep+transparent propagation of variations.
211-
2. **Manual.** Provide existing instances of columns to be nominal+variations; any column whose output value type is compatible to that of the nominal can be set as a variation.
238+
1. **Pre-instantiation.** Provide the nominal argument and a mapping of variation name to alternate arguments.
239+
2. **Post-instantiation.** Provide existing instances of columns to be the nominal and its variations.
240+
- Any column whose data type is compatible with that of the nominal can be set as a variation.
212241

213242
Both approaches can (and should) be used in a dataflow for full control over the creation & propagation of systematic variations.
214243

215244
@cpp
216-
// automatic -- set and forget
245+
// pre-instantiation
217246

218247
// dataset columns are varied by different column names
219248
auto x = ds.vary(
220-
dataset::column("x_nom"),
221-
{"vary_x","x_var"}
222-
);
249+
dataset::column<double>("x_nom"),
250+
{{"shift_x","x_shifted"}, {"smear_x", "x_smeared"}}
251+
);
252+
// type = qty::lazy<column::fixed<double>>::varied
253+
// (column::fixed<double> is the concrete type)
223254

224255
// constants are varied by alternate values
225-
auto yes_or_no = df.vary(
256+
auto yes_or_no = df.vary(
226257
column::constant(true),
227-
{"no", false}
258+
{{"no", false}}
228259
);
229260

230-
// expressions are varied by alternate expression+input columns
231-
auto b = df.vary(
232-
column::expression(),
233-
systematic::variation("vary_b", )
234-
)();
235-
236-
// definitions are varied by alternate constructor arguments+input columns
237-
auto defn = df.vary(
238-
column::definition<>(),
239-
systematic::variation("vary_c", )
240-
)();
261+
// expressions are varied by alternate expression and input columns
262+
auto x_pm_y = df.vary(
263+
column::expression([](double x, float y){return x+y;}),
264+
{
265+
{"minus_y", [](double x, float y){return x-y;}}
266+
}
267+
)(x, y);
241268

242-
// manual method -- man-handle as you see fit
269+
// post-instantiation
243270

244271
// note:
245272
// - different column types (column, constant, definition)
246273
// - different data types (double, int, float)
247274
auto z_nom = ds.read(dataset::column<double>("z"));
248275
auto z_fixed = df.define(column::constant<int>(100.0));
249276
auto z_half = df.define(column::expression([](float z){return z*0.5;}), z_nom);
250-
251-
// as long as their output values are compatible, any set of columns can be used
252277
auto z = systematic::vary(
253278
systematic::nominal(z_nom),
254279
systematic::variation("z_fixed", z_fixed),
255280
systematic::variation("z_half", z_half)
256-
);
281+
);
282+
// qty::lazy<column::valued<double>>::varied
283+
// (column::valued<double> is the common denominator base class)
284+
@endcpp
257285

258-
// systematic variations are propagated through selections...
259-
auto sel = df.filter(yes_or_no);
260286

261-
// ... and queries
262-
auto q = df.make(query::plan<>()).fill(x).book(yes_or_no);
263-
q.has_variation("vary_x"); // true, filled with x_var for all entries
264-
q.has_variation("no"); // true, filled with x_nom for no entries
287+
@cpp
288+
// definitions are varied by alternate constructor arguments+input columns
289+
// since the constructor arguments of an arbitrary class are not known a priori,
290+
// each must be provided on its own, rather than in a mapping.
291+
auto defn = df.vary(
292+
column::definition<>(),
293+
systematic::variation("vary_c", )
294+
)();
295+
@endcpp
296+
297+
The list of variations in a (set of) action(s) can always be checked as they are propagated through the dataflow.
298+
After the dust settles, the nominal and each varied result of a query can be accessed individually.
299+
300+
@cpp
301+
// check variations
302+
x.has_variation("shift_x"); // true
303+
x.has_variation("smear_x"); // true
304+
x.has_variation("no"); // false
305+
306+
yes_or_no.has_variation("shift_x"); // false
307+
yes_or_no.has_variation("smear_x"); // false
308+
yes_or_no.has_variation("no"); // true
309+
310+
systematic::get_variation_names(x, yes_or_no); // {"shift_x", "smear_x", "no"}
311+
312+
// propagation through selection and query
313+
auto sel = df.filter(yes_or_no);
314+
auto q = df.get(column::series(x)).book(sel);
265315

266-
// other variations play no role here.
267-
q.has_variation("vary_b"); // false
316+
q.has_variation("shift_x"); // true
317+
q.has_variation("smear_x"); // true
318+
q.has_variation("no"); // true
268319

269-
// nominal & varied results can be separately accessed
270-
auto q_nom_res = q.nominal().result();
271-
auto q_varx_res = q.["vary_x"].result();
272-
auto q_none_res = q.["no"].result();
320+
// access nominal+variation results
321+
q.nominal().result(); // {x_nom_0, ..., x_nom_N}
322+
q["shift_x"].result(); // {x_shifted_0, ..., x_shifted_N}
323+
q["smear_x"].result(); // {x_smeared_0, ..., x_smeared_N}
324+
q["no"].result(); // {}
273325
@endcpp
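
The propagation rule shown above (a downstream action carries the union of its inputs' variation names, and an input without a given variation contributes its nominal) can be modelled in a few lines. This is a standalone sketch under that assumption; `varied`, `variation_names` and `combine` are hypothetical names and do not correspond to queryosity's `systematic` API.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical container for a nominal value and its named variations.
template <typename T> struct varied {
  T nominal;
  std::map<std::string, T> variations;

  bool has_variation(const std::string &name) const {
    return variations.count(name) > 0;
  }

  // Returns the named variation if present, otherwise falls back to nominal.
  const T &get(const std::string &name) const {
    auto it = variations.find(name);
    return it != variations.end() ? it->second : nominal;
  }
};

// Union of the variation names of two inputs.
template <typename A, typename B>
std::set<std::string> variation_names(const varied<A> &a, const varied<B> &b) {
  std::set<std::string> names;
  for (const auto &kv : a.variations)
    names.insert(kv.first);
  for (const auto &kv : b.variations)
    names.insert(kv.first);
  return names;
}

// Applies f to the nominals and, for each name in the union, to the matching
// variations (using the nominal of whichever input lacks that variation).
template <typename A, typename B, typename F>
auto combine(const varied<A> &a, const varied<B> &b, F f)
    -> varied<decltype(f(a.nominal, b.nominal))> {
  varied<decltype(f(a.nominal, b.nominal))> out{f(a.nominal, b.nominal), {}};
  for (const auto &name : variation_names(a, b))
    out.variations.emplace(name, f(a.get(name), b.get(name)));
  return out;
}
```

Combining an input varied by `"shift_x"` and `"smear_x"` with one varied by `"no"` yields an output that has all three variations, mirroring the `has_variation` checks on the query above.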
274326

275327
@see
