-
Summary
Design Discussion
Extractors
JsonpathExtractor - what we already have in Horreum
RunMetadataExtractor - access columns other than data
LabelValueExtractor - extract from previous label values
Extractor names are not unique for a label, so extractors with the same name can override one another using last in, first out. This allows users to create multiple extractors to support changes to the run data format where known data changes location.

Label Value Creation
The PoC creates all Label Values in Horreum, not in the database (no SQL inserts). Labels are sorted using Kahn's topological sort over the directed acyclic graph of label dependencies to ensure label values are available when a LabelValueExtractor needs them. The PoC still creates a JSON object from all the extracted values, like current Horreum. There are two key SQL operations. The first is the query in io.hyperfoil.tools.exp.horreum.svc.LabelService#calculateExtractedValuesWithIterated:

    with m as (
      select e.name, e.dtype, e.jsonpath, e.foreach, e.column_name,
             lv.data as lv_data, lv.iterated as lv_iterated,
             r.data as run_data, r.metadata as run_metadata
      from extractor e left join label_values lv on e.target_id = lv.label_id, run r
      where e.parent_id = :label_id and (lv.run_id = :run_id or lv.run_id is null) and r.id = :run_id
    ),
    n as (
      select m.name, m.dtype, m.jsonpath, m.foreach, m.lv_iterated, m.lv_data, (case
        when m.dtype = 'JsonpathExtractor' and m.jsonpath is not null then jsonb_path_query_array(m.run_data,m.jsonpath::::jsonpath)
        when m.dtype = 'RunMetadataExtractor' and m.jsonpath is not null and m.column_name = 'metadata' then jsonb_path_query_array(m.run_metadata,m.jsonpath::::jsonpath)
        when m.dtype = 'LabelValueExtractor' and m.jsonpath is not null and m.jsonpath != '' and m.lv_iterated then extract_path_array(m.lv_data,m.jsonpath::::jsonpath)
        when m.dtype = 'LabelValueExtractor' and m.jsonpath is not null and m.jsonpath != '' then jsonb_path_query_array(m.lv_data,m.jsonpath::::jsonpath)
        when m.dtype = 'LabelValueExtractor' and (m.jsonpath is null or m.jsonpath = '') then to_jsonb(ARRAY[m.lv_data])
        else '[]'::::jsonb end) as found
      from m
    )
    select n.name as name,
           (case when jsonb_array_length(n.found) > 1 then n.found else n.found->0 end) as data,
           n.lv_iterated as lv_iterated
    from n

which then uses a custom SQL function:

    CREATE OR REPLACE FUNCTION extract_path_array(_arr jsonb, _path jsonpath)
    RETURNS jsonb
    LANGUAGE sql STABLE PARALLEL SAFE AS
    '
    with bag as (SELECT jsonb_path_query_array(elem,_path) as data FROM jsonb_array_elements(_arr) elem)
    select jsonb_agg((case when jsonb_array_length(data) > 1 then data else data->0 end)) from bag;
    ';;

Label Value Interaction
Horreum currently allows a run to create multiple datasets and thus multiple rows in the dataset table. The PoC replaces this with target_schemas on Labels, so that Label Values are declared to adhere to a given schema. Users can fetch all label_values that implement that schema (io.hyperfoil.tools.exp.horreum.svc.LabelService#getBySchema) or get a map of sub-label values for each label value that implements a schema (io.hyperfoil.tools.exp.horreum.svc.LabelService#labelValues(java.lang.String, long, java.util.List<java.lang.String>, java.util.List<java.lang.String>)). Getting the sub-label maps is the equivalent of getting the ExtractedLabelValues for datasets.
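To make the unwrapping rule shared by the query and by extract_path_array more concrete, here is a minimal Python sketch. It is not Horreum code: `unwrap` and the Python `extract_path_array` are illustrative names, and the `query` callable is a hypothetical stand-in for jsonpath evaluation that returns the list of matches, as jsonb_path_query_array does.

```python
from typing import Any, Callable, List, Optional


def unwrap(found: List[Any]) -> Optional[Any]:
    """Keep the array only when there is more than one match.

    Mirrors: case when jsonb_array_length(found) > 1 then found else found->0 end
    """
    if len(found) > 1:
        return found
    # found->0 on an empty array is NULL, modelled here as None
    return found[0] if found else None


def extract_path_array(arr: List[Any], query: Callable[[Any], List[Any]]) -> Optional[List[Any]]:
    """Run `query` against each entry of an iterated label value and unwrap each result."""
    if not arr:
        return None  # jsonb_agg over zero rows yields NULL
    return [unwrap(query(elem)) for elem in arr]


# Hypothetical iterated label value with two entries.
iterated = [{"name": "a", "times": [1, 2]}, {"name": "b", "times": [3]}]
times = extract_path_array(iterated, lambda entry: entry.get("times", []))
# times == [[1, 2], 3]
```

One consequence visible in the sketch is that an entry with a single match loses its array wrapper, so consumers of the value cannot assume every entry has the same shape.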
-
I think the name |
-
User Workflows
How does a user replace the Datasets view where a single run upload could have multiple rows?
Users create
If a user has two labels with a common target_schema how do they extract label values without duplicating labels?
If Label_A and Label_B both have a
How does the user handle if the run data changes over time?
This depends on the type and complexity of the change. If Label_A used an extractor named UUID to extract from
-
Hey, I think I got the overall approach and how it is intended to work (and to be implemented). What I still do not have fully clear is this sentence
especially
Why do we need to mark a label as "iterating" or "non-iterating" for subsequent labels? IIUC the idea is to encode the result of iterating labels as an array (therefore a single label_value with an array as value, right?). Given that, shouldn't it be the responsibility of the label that uses the previous one to decide how to manage the previous result (i.e., the input, if that is an array)? In that way it can decide to apply a
This way the label should specify how it will treat the source label input, i.e., if the current label itself is a
I can't tell whether I am over-simplifying the approach or missing some use cases that would make this approach unworkable.
Another thing that I think could be useful, at least from my perspective, is a sort of ER diagram showing how data would be mapped into table records. Especially for the scenario where we replace the existing dataset with this new model, a concrete ER diagram showing what the data would look like in the database could be very helpful to have a clear picture of the
-
@willr3 Thanks for writing this up. I agree this would be a great step forward in simplifying the current concepts and reducing the cognitive load on users.
In order to make it simpler for users, should we have a consistent syntax:
If we were to do this, we would need reserved words for labelName, i.e. it cannot be
If we want to reduce the cognitive load, do we want to support or encourage the shortcuts, or make users state explicitly what they want? Being explicit would simplify the code slightly as well.
The current proposal simplifies the reducer function.
-
Problem:
The dataset -> Label -> [Change, Experiment, Report, etc] workflow is detrimental
Proposal:
Change label extractors so that they can process input from either the initial run or from the output of other labels. This gives users the ability to define a pipeline of data processing that is as simple or complex as they need to handle their data and decreases the model complexity inside Horreum.
Labels would have 3 forms of extractors:
runDataExtractor : $.json.path.in.data
This is the extractor we already have and it runs a jsonpath against the run data uploaded by the user.
metadataExtractor : {metadata}:$.json.path.in.metadata
This is an extractor that accesses information in the Horreum run table that is outside the run data uploaded by the user. There is not a well defined way for view_components to access the run’s ID. That is currently handled with custom additional parameters to the render function. This would allow an extractor with {id} to put the run’s id value into a standard method of data access for view_components.
This type of extractor would also codify access to any run metadata using json path extraction.
labelValueExtractor : labelName:$.json.path.in.data
This is the new composite extractor that runs against the output of another Label. If defining the extractor in a single string it is labelName with an optional colon separator followed by an optional jsonpath.
labelName
The simplest would be just the labelName which would pass the label_value directly to the current label’s combination function.
labelName:$.json.path.in.data
This combines the labelName with a jsonpath. It extracts the value found by the jsonpath from the output of the named label (label_value)
labelName[*]
The [*] suffix means we want Horreum to iterate over the values from the named label and run the combination function N times, where N is the number of values from the named label. This is easy to do when the label_value is an array. If the label_value is an object, do we assume the user wants to iterate over its entries, e.g. {key: [key], value: [value]}?
The resulting N values from the N executions of the combination function would be stored as an array in the label_values table but we would need to store an indicator that the array is actually separate values. If this label is passed as input to another label then that other label would automatically trigger N times because from the user perspective the label has N values from the N executions of the previous label.
Note: Consider adding the [*] at the end of the extractor jsonpath to indicate it is an iterating extractor. This allows iteration in one Label rather than having to first extract a value in LabelA then iterate on LabelA’s label_value in LabelB. The downside of this notation is that the suffix is not universal to all jsonpath dialects, so it would not be compatible with all possible persistence options (e.g. jsonpath-ng and jq).
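The iteration behaviour described for labelName[*] can be sketched in Python as follows. This is only an illustration under assumptions: the `{"data": ..., "iterated": ...}` dictionary shape, the `evaluate` function, and the choice to treat a non-array value as a single entry are all hypothetical, and the open question about iterating over objects is deliberately left unanswered.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory shape of a row in label_values.
LabelValue = Dict[str, Any]  # {"data": <json>, "iterated": <bool>}


def evaluate(source: LabelValue, combine: Callable[[Any], Any], iterate: bool) -> LabelValue:
    """Run `combine` once, or once per entry when iteration applies.

    Iteration applies when the extractor uses the [*] suffix (iterate=True)
    or when the source label value was itself produced by iteration.
    """
    if iterate or source["iterated"]:
        data = source["data"]
        # Assumption: a non-array value is treated as a single entry.
        entries = data if isinstance(data, list) else [data]
        return {"data": [combine(entry) for entry in entries], "iterated": True}
    return {"data": combine(source["data"]), "iterated": False}


label_a = {"data": [{"t": 1}, {"t": 2}, {"t": 3}], "iterated": False}
# labelA[*]: the combination function runs three times.
label_b = evaluate(label_a, lambda entry: entry["t"] * 10, iterate=True)
# A plain extractor on labelB still runs three times because labelB is iterated.
label_c = evaluate(label_b, lambda value: value + 1, iterate=False)
# A plain extractor on labelA sees the whole array once.
whole = evaluate(label_a, len, iterate=False)
```

The `iterated` flag is what lets `label_c` distinguish an array of separate values from an ordinary array value such as the one in `label_a`.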
Label States
Labels will be in different states depending on the type and number of extractors they use.
Iterating Extractor with non-iterating extractor(s)
If a label has an iterating labelValueExtractor and other non-iterating extractors then the user needs to tell us if they want the values from the other extractors passed into the combination function every time it runs or only the first time it runs.
I propose we do this with a configuration option (because there are only 2 options) similar to how Grafana allows configuring transforms from a list of selectable transforms. We would create some info-graphics that illustrate how the data is handled and naturally have copy text available to describe it.
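The two choices for a label that mixes one iterating extractor with a non-iterating one could look roughly like this in Python. The function name `combine_with_scalar`, the `scalar_every_time` flag, and passing None when the value is withheld are assumptions made for illustration only.

```python
from typing import Any, Callable, List


def combine_with_scalar(
    iterating: List[Any],
    scalar: Any,
    combine: Callable[[Any, Any], Any],
    scalar_every_time: bool,
) -> List[Any]:
    """Run `combine` once per iterated entry.

    The non-iterating value is passed on every run, or only on the first
    run (None afterwards), depending on the configuration option.
    """
    results = []
    for index, entry in enumerate(iterating):
        other = scalar if (scalar_every_time or index == 0) else None
        results.append(combine(entry, other))
    return results


def pair(entry: Any, host: Any) -> dict:
    # Hypothetical combination function that records both inputs.
    return {"value": entry, "host": host}


every_time = combine_with_scalar([1, 2, 3], "node1", pair, scalar_every_time=True)
first_only = combine_with_scalar([1, 2, 3], "node1", pair, scalar_every_time=False)
```

With `scalar_every_time=True` all three results carry "node1"; with False only the first result does and the remaining two receive None.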
Multiple Iterating Extractors
If a label has more than one iterating labelValueExtractor then the user has to tell us whether they want N x N executions of the combination function (one for each combination of entries) or Nn executions, where Nn is the length of the longest iterating label_value array.
Maximum Length
If they choose the maximum length option, do we ask the user what they want passed into the combination function for the shorter iterating label_value? Do they want no value (null) or do they want Horreum to repeat previous iterations of the value? I would suspect no value, and perhaps that is our default behavior, but we should consider that users may have other needs.
There is a possible condition where the label also has non-iterating extractors that produce a single value. We have to ask what the user wants to do with those values when the iterating labels run the combination function on the second and subsequent iterations.
N x N Combinations
Users could want to run the combination function for every combination of entries from the iterating extractors (the cross product). This seems unlikely but we should support this option.
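For two iterating extractors, the two options can be sketched with Python's standard itertools. The function name `pair_inputs` and the mode strings are hypothetical, and the sketch assumes the "no value (null)" default for the shorter array. Note that when the arrays differ in length the cross product has N x M entries rather than N x N.

```python
from itertools import product, zip_longest
from typing import Any, List, Tuple


def pair_inputs(first: List[Any], second: List[Any], mode: str) -> List[Tuple[Any, Any]]:
    """Build the argument pairs that would be passed to the combination function."""
    if mode == "max_length":
        # One run per index of the longest array; the shorter one is padded with None.
        return list(zip_longest(first, second, fillvalue=None))
    if mode == "product":
        # One run per combination of entries.
        return list(product(first, second))
    raise ValueError(f"unknown mode: {mode!r}")


by_index = pair_inputs([1, 2, 3], ["a", "b"], "max_length")
# [(1, 'a'), (2, 'b'), (3, None)]
all_pairs = pair_inputs([1, 2, 3], ["a", "b"], "product")
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')]
```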
Initial Implementation
This change to label extractors will allow us to replace several redundant concepts within Horreum. Our initial implementation will replace Datasets with Labels where Tests are producing more than one dataset from a run (approximately 10% of tests).
Dataset
The existing transformation_extractor and transformation_function will become the extractor and combination function for a new label (labelA). This new label will be an intermediary label with an array output because existing transformation_functions output an array to indicate multiple datasets.
Horreum would then need a labelB with an iterating labelValueExtractor on labelA to indicate to subsequent labels (labelC…) that the label_value should be treated as distinct entries and labelC should run on each entry in the label_value for labelB.
This change means label_values must store a state indicator to tell subsequent labels that the value contains multiple distinct entries in an array.
UI Changes
Changing the data model will naturally change how we represent data to the user.
Test Dataset
Datasets would no longer exist. Users would instead create “schema views”, which are tables that show a row for each label_value from the test that targets a specific schema based on the label.target_schema. The table would include fixed columns (e.g. the run id where the label_value originated) as well as columns based on any label_values derived from the label_value with the target schema.
Horreum would need to either track the label dependency graph (it is a graph not a tree) to know which labels come from label_values that match the target schema or perform that graph walk when the user wants to add a column to the table view. The graph walk would be based on the label and label_extractor tables, not label_values.
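The graph walk could be sketched as a breadth-first search over the extractor relationships. In this Python illustration, `reads_from` is a hypothetical in-memory stand-in for the label and label_extractor tables that maps each label name to the labels its extractors read from, and the label names are invented.

```python
from collections import deque
from typing import Dict, List, Set


def derived_labels(reads_from: Dict[str, List[str]], root: str) -> Set[str]:
    """Return every label that directly or transitively reads from `root`."""
    # Invert the mapping: source label -> labels that read from it.
    readers: Dict[str, List[str]] = {}
    for label, sources in reads_from.items():
        for source in sources:
            readers.setdefault(source, []).append(label)

    seen: Set[str] = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for reader in readers.get(current, []):
            if reader not in seen:  # a label may be reachable by several paths
                seen.add(reader)
                queue.append(reader)
    return seen


reads_from = {
    "dataset": [],
    "throughput": ["dataset"],
    "latency": ["dataset"],
    "score": ["throughput", "latency"],
    "hostname": [],
}
columns = derived_labels(reads_from, "dataset")
# columns == {"throughput", "latency", "score"}
```

The `seen` set is what handles the "graph, not a tree" case: "score" is reachable through both "throughput" and "latency" but is reported once.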
Implementation Requirements
Extractor Encoding
Users should be able to easily change extractors, and this means changing them between types (runData, labelValue, metadata). I think a good first implementation is to allow users to define the extractors as a string and for Horreum to identify the type from the string. Changing an existing extractor would potentially delete it and create a new extractor if it changes type. A shared table would help change between encodings but isn’t strictly necessary.
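Identifying the extractor type from a single string could work roughly as in the Python sketch below. The `Extractor` dataclass, the `parse_extractor` function, and both regular expressions are assumptions for illustration. The sketch treats a leading `$` as a run data jsonpath and a leading `{column}` as metadata, which implies label names cannot start with those characters or contain a colon.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extractor:
    kind: str                # "runData", "metadata" or "labelValue"
    target: Optional[str]    # column name or label name; None for runData
    jsonpath: Optional[str]  # None when the whole value is passed through
    iterate: bool = False    # True when the [*] suffix follows the label name


# {column} optionally followed by :<jsonpath>
_METADATA = re.compile(r"^\{(?P<column>\w+)\}(?::(?P<path>\$.*))?$")
# labelName, optional [*] suffix, optionally followed by :<jsonpath>
_LABEL = re.compile(r"^(?P<label>[^:$\[\]{}]+?)(?P<iterate>\[\*\])?(?::(?P<path>\$.*))?$")


def parse_extractor(text: str) -> Extractor:
    """Classify an extractor string into one of the three proposed forms."""
    text = text.strip()
    if text.startswith("$"):
        return Extractor("runData", None, text)
    match = _METADATA.match(text)
    if match:
        return Extractor("metadata", match["column"], match["path"])
    match = _LABEL.match(text)
    if match:
        return Extractor("labelValue", match["label"], match["path"], match["iterate"] is not None)
    raise ValueError(f"unrecognised extractor: {text!r}")


examples = [
    parse_extractor("$.json.path.in.data"),
    parse_extractor("{metadata}:$.json.path.in.metadata"),
    parse_extractor("{id}"),
    parse_extractor("labelName"),
    parse_extractor("labelName:$.json.path.in.data"),
    parse_extractor("labelName[*]"),
]
```

Because the type is derived purely from the string, changing `labelName` to `$.some.path` in the UI would change the parsed `kind`, which is the case where an extractor row may need to be deleted and recreated.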
Label Loop Detection
Changing an extractor or Label name could create a dependency loop. It is technically possible for a dependency loop to resolve but the risk of infinite looping means we should detect and prevent loops. I propose we limit labels to uni-directional dependencies with branch and merge options but no loops.
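Loop detection falls out of the same Kahn topological sort used to order labels: if the sort cannot place every label, the remaining labels are in or behind a loop. The Python sketch below assumes a hypothetical `reads_from` mapping from each label name to the labels its extractors read from. It is not the PoC's Java implementation.

```python
from collections import deque
from typing import Dict, List


def sort_labels(reads_from: Dict[str, List[str]]) -> List[str]:
    """Order labels so that each one comes after every label it reads from.

    Raises ValueError when a dependency loop prevents a complete ordering.
    """
    labels = set(reads_from)
    for sources in reads_from.values():
        labels.update(sources)

    # Number of distinct unresolved sources per label.
    pending = {label: len(set(reads_from.get(label, []))) for label in labels}
    readers: Dict[str, List[str]] = {label: [] for label in labels}
    for label, sources in reads_from.items():
        for source in set(sources):
            readers[source].append(label)

    ready = deque(sorted(label for label, count in pending.items() if count == 0))
    ordered: List[str] = []
    while ready:
        current = ready.popleft()
        ordered.append(current)
        for reader in sorted(readers[current]):
            pending[reader] -= 1
            if pending[reader] == 0:
                ready.append(reader)

    if len(ordered) != len(labels):
        blocked = sorted(labels - set(ordered))
        raise ValueError(f"labels blocked by a dependency loop: {blocked}")
    return ordered


order = sort_labels({"labelA": [], "labelB": ["labelA"], "labelC": ["labelB", "labelA"]})
# order == ["labelA", "labelB", "labelC"]
```

Running the same check before saving an extractor or renaming a label would reject the change that introduces the loop, while branch and merge shapes such as labelC reading from both labelA and labelB remain valid.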
Label Value Tracking
Iterating Label Indicator
We plan to store values from iterating Labels in a single entry in label_values by encoding them as an array. We need to differentiate between arrays from iterating Labels and arrays from non-iterating Labels because Labels that depend on iterating Labels will need to create multiple values.
Label Source Indicator
Changing the UI to create tables from label_values based on the label target_schema means we need to be able to identify which label_values are derived from those target_schema label_values. Non-iterating extractors store the label source in the extractor but iterating extractors create a different source index for each subsequent label.
If LabelX depends on N iterating labelValueExtractors that extract from Label_alpha, Label_beta… then Horreum will have to store a reference to the index in the label_values for Label_alpha and Label_beta for each index in the label_values for LabelX. There could be N references for each entry in the label_values for LabelX, where N is the number of extractors for LabelX (if all extractors are iterating labelValueExtractors).
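One possible shape for these per-entry references is sketched below in Python, using the cross-product option for multiple iterating extractors. The function `run_with_sources` and the idea of storing a `{label name: source index}` dictionary next to each value are assumptions for illustration, not a proposed table layout.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def run_with_sources(
    inputs: Dict[str, List[Any]],
    combine: Callable[[Dict[str, Any]], Any],
) -> List[Tuple[Any, Dict[str, int]]]:
    """Run `combine` for every combination of entries and record, for each
    result, the index of the entry it used from every source label."""
    names = list(inputs)
    index_ranges = [range(len(inputs[name])) for name in names]
    results = []
    for indexes in product(*index_ranges):
        args = {name: inputs[name][i] for name, i in zip(names, indexes)}
        results.append((combine(args), dict(zip(names, indexes))))
    return results


label_x = run_with_sources(
    {"Label_alpha": [1, 2], "Label_beta": [10, 20]},
    lambda args: args["Label_alpha"] + args["Label_beta"],
)
# Each entry carries one reference per iterating extractor, for example:
# (21, {"Label_alpha": 0, "Label_beta": 1})
```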