Commit d2859f7

Merge pull request #16 from hawkfish/keyed-unfold
Keyed unfold
2 parents c8b680a + 6b6d1b0 commit d2859f7

7 files changed: +353 -105 lines changed

docs/transforms/reshapes/fold.rst

Lines changed: 58 additions & 29 deletions
@@ -11,7 +11,7 @@ Fold: Rotate many fields into one field
1111
and a field of tags containing the original input field names.
1212

1313
This can be generalised to multiple output fields where the input fields
14-
are broken up into equal-sized groups, each of which is mapped to one of the output fields.
14+
are broken up into equal-sized *groups*, each of which is mapped to one of the output fields.
1515
In the latter case, the tags can be supplied by the caller or generated by concatenating the field names.
1616

1717
.. py:attribute:: pipeline
@@ -44,52 +44,81 @@ Usage
4444
.. code-block:: python
4545
4646
Fold(p, ('Sales 1992', 'Sales 1993', 'Sales 1994',),
47-
('Year', 'Sales',), ('1992', '1993', '1994',))
48-
Fold(p, ('Sales 1992', 'Sales 1993', 'Sales 1994', 'Profit 1992', 'Profit 1993', 'Profit 1994',),
49-
('Year', 'Sales', 'Profit',), ('1992', '1993', '1994',))
47+
('Year', 'Sales',),
48+
('1992', '1993', '1994',))
49+
Fold(p, ('Sales 1992', 'Sales 1993', 'Sales 1994',
50+
'Profit 1992', 'Profit 1993', 'Profit 1994',),
51+
('Year', 'Sales', 'Profit',),
52+
('1992', '1993', '1994',))
5053
5154
Examples
5255
^^^^^^^^
5356

54-
Single Fold
55-
-----------
57+
Single Group
58+
------------
59+
60+
The first Usage example is a case where Sales values have been pivoted by Year,
61+
so that the Sales for each Year is in a separate field.
5662

5763
.. csv-table:: Input
58-
:header: "Key", "Sales 1992", "Sales 1993", "Sales 1994"
64+
:header: "Dept", "Sales 1992", "Sales 1993", "Sales 1994"
5965
:align: left
6066

61-
0, "S-0-1992", "S-0-1993", "S-0-1994"
62-
1, "S-1-1992", "S-1-1993", "S-1-1994"
67+
Home, "S-H-1992", "S-H-1993", "S-H-1994"
68+
Auto, "S-A-1992", "S-A-1993", "S-A-1994"
69+
70+
In order to graph Sales by Department and Year, the table needs a Year field.
71+
``Fold`` takes the list of Sales fields to combine ("fold") as its inputs
72+
and the fields to put them in as the outputs.
73+
The first output field is the "Tags" field, which contains the value used to
74+
identify the original field.
75+
In this example, this is the Year of the field.
76+
77+
After Folding, each Sales value appears in a separate row tagged by Year:
6378

6479
.. csv-table:: Output
65-
:header: "Key", "Year", "Sales"
80+
:header: "Dept", "Year", "Sales"
6681
:align: left
6782

68-
0, 1992, "S-0-1992"
69-
0, 1993, "S-0-1993"
70-
0, 1994, "S-0-1994"
71-
1, 1992, "S-1-1992"
72-
1, 1993, "S-1-1993"
73-
1, 1994, "S-1-1994"
83+
Home, 1992, "S-H-1992"
84+
Home, 1993, "S-H-1993"
85+
Home, 1994, "S-H-1994"
86+
Auto, 1992, "S-A-1992"
87+
Auto, 1993, "S-A-1993"
88+
Auto, 1994, "S-A-1994"
7489

75-
Multiple Folds
76-
--------------
90+
Multiple Groups
91+
---------------
92+
93+
The second Usage example is a related case where multiple measures (Sales and Profit)
94+
have been pivoted by Year so that the Sales and Profits for each Year are in separate fields.
7795

7896
.. csv-table:: Input
79-
:header: "Key", "Sales 1992", "Sales 1993", "Sales 1994", "Profit 1992", "Profit 1993", "Profit 1994"
97+
:header: "Dept", "Sales 1992", "Sales 1993", "Sales 1994", "Profit 1992", "Profit 1993", "Profit 1994"
8098
:align: left
81-
:widths: 1, 8, 8, 8, 8, 8, 8
99+
:widths: 1, 10, 10, 10, 10, 10, 10
100+
101+
Home, "S-H-1992", "S-H-1993", "S-H-1994", "P-H-1992", "P-H-1993", "P-H-1994"
102+
Auto, "S-A-1992", "S-A-1993", "S-A-1994", "P-A-1992", "P-A-1993", "P-A-1994"
103+
104+
In order to do an analysis comparing Sales and Profit by Year,
105+
the table needs to have each record contain the Year, Sales and Profit.
106+
This means that there are two groups that need to be Folded: Sales and Profit,
107+
and the value from each group needs to be tagged by Year.
108+
To express this, each group is listed in order in the *inputs*
109+
and the values are mapped to the corresponding *tag* value and *output* field.
110+
In this example, the Year is again the first *output* field,
111+
and the following *output* fields are the groups in the order given by the *inputs*.
82112

83-
0, "S-0-1992", "S-0-1993", "S-0-1994", "P-0-1992", "P-0-1993", "P-0-1994"
84-
1, "S-1-1992", "S-1-1993", "S-1-1994", "P-1-1992", "P-1-1993", "P-1-1994"
113+
After Folding, each Sales and Profit pair appears in a separate row tagged by Year:
85114

86115
.. csv-table:: Output
87-
:header: "Key", "Year", "Sales", "Profit"
116+
:header: "Dept", "Year", "Sales", "Profit"
88117
:align: left
89118

90-
0, 1992, "S-0-1992", "P-0-1992"
91-
0, 1993, "S-0-1993", "P-0-1993"
92-
0, 1994, "S-0-1994", "P-0-1994"
93-
1, 1992, "S-1-1992", "P-1-1992"
94-
1, 1993, "S-1-1993", "P-1-1993"
95-
1, 1994, "S-1-1994", "P-1-1994"
119+
Home, 1992, "S-H-1992", "P-H-1992"
120+
Home, 1993, "S-H-1993", "P-H-1993"
121+
Home, 1994, "S-H-1994", "P-H-1994"
122+
Auto, 1992, "S-A-1992", "P-A-1992"
123+
Auto, 1993, "S-A-1993", "P-A-1993"
124+
Auto, 1994, "S-A-1994", "P-A-1994"
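
The grouping rule described in the ``Fold`` documentation above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: ``fold_rows`` is a hypothetical helper that works on dictionaries rather than on a pipeline, and it assumes that the first output is the tag field and that the inputs split into one equal-sized group per remaining output.

```python
def fold_rows(rows, inputs, outputs, tags):
    """Illustrative fold over dict records (hypothetical helper, not the library API)."""
    tag_field, *value_fields = outputs
    width = len(tags)  # number of fields in each input group
    if len(inputs) != width * len(value_fields):
        raise ValueError("inputs must split into equal-sized groups")
    # groups[g] lists the input fields that feed value_fields[g]
    groups = [inputs[g * width:(g + 1) * width]
              for g in range(len(value_fields))]
    for row in rows:
        # Fields that are not folded are carried through unchanged
        fixed = {k: v for k, v in row.items() if k not in inputs}
        for i, tag in enumerate(tags):
            out = dict(fixed)
            out[tag_field] = tag
            for field, group in zip(value_fields, groups):
                out[field] = row[group[i]]
            yield out
```

Applied to the "Multiple Groups" input, each department row would yield three output rows, one per tag, each carrying the Sales and Profit values for that Year.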

docs/transforms/reshapes/unfold.rst

Lines changed: 130 additions & 35 deletions
@@ -7,7 +7,8 @@ Unfold: Rotate one field to many
77
88
The ``Unfold`` transform unfolds (pivots) a set of fields.
99
Simple unfolding consists of rotating a single input field into multiple output fields.
10-
This can be generalised to multiple input fields where the output fields are broken up into equal-sized groups,
10+
11+
This can be generalised to multiple input fields where the output fields are broken up into equal-sized *groups*,
1112
and each group is generated from one of the input fields.
1213
``Unfold`` is the inverse of :py:class:`Fold`.
1314

@@ -21,73 +22,167 @@ Unfold: Rotate one field to many
2122

2223
The list of fields to be unfolded.
2324
They will be dropped from the output, so use :py:class:`Copy` to preserve them.
24-
Each input field contains the values for an entire output group.
25+
The first field is the *tag* field and is used to identify which element of the group the row belongs to.
26+
Each subsequent input field contains the values for an entire group.
2527

2628
.. py:attribute:: outputs
2729
:type: tuple(str)
2830

29-
The output fields receiving the unfolded fields.
31+
The output fields receiving the unfolded input fields.
3032
The output fields are broken into equal-sized groups, one per input field.
3133
The number of *outputs* must be an exact multiple of the number of group *inputs* (the *inputs* after the tag field).
3234
They cannot overwrite existing fields, so use :py:class:`Drop` to remove unwanted fields.
3335

34-
Limitations
35-
^^^^^^^^^^^
36-
The current implementation assumes that the unfolded values are contiguous.
37-
That is, all the input rows for a single output row will arrive sequentially and in order.
38-
This is the order generated by :py:class:`Fold`, so it is suggested that for now ``Unfold``
39-
only be used to undo the actions of :py:class:`Fold`.
36+
.. py:attribute:: tags
37+
:type: dict(any,int)
38+
39+
The optional mapping from tag values to group positions.
40+
If not provided, it will be generated sequentially from the values in the first record.
41+
42+
``Unfold`` can rotate data where the output rows are generated from non-consecutive input rows.
43+
To identify output rows, the remaining fields (called the *fixed* fields) are used as a key
44+
for accumulating the values of a row.
45+
When a row is complete, it is output.
46+
47+
Because the rows for an output field can appear at any point,
48+
the *tags* are used to assign fields to output columns.
49+
The first time a tag is seen, it is assigned to the next group position,
50+
so the order of the tags in the first record must match the layout of the groups.
4051

4152
Usage
4253
^^^^^
4354

4455
.. code-block:: python
4556
46-
Unfold(p, ('Year', 'Sales',), ('Sales 1992', 'Sales 1993', 'Sales 1994',))
47-
Unfold(p, ('Year', 'Sales', 'Profit',), ('Sales 1992', 'Sales 1993', 'Sales 1994', 'Profit 1992', 'Profit 1993', 'Profit 1994',))
57+
Unfold(p, ('Year', 'Sales',),
58+
('Sales 1992', 'Sales 1993', 'Sales 1994',))
59+
Unfold(p, ('Year', 'Sales', 'Profit',),
60+
('Sales 1992', 'Sales 1993', 'Sales 1994',
61+
'Profit 1992', 'Profit 1993', 'Profit 1994',))
4862
4963
Examples
5064
^^^^^^^^
5165

52-
Single Fold
53-
-----------
66+
Single Group
67+
------------
68+
69+
The first Usage example is a case where a single measure (Sales) has been tagged by Year,
70+
so that each Sales value is in a separate row:
5471

5572
.. csv-table:: Input
56-
:header: "Key", "Year", "Sales"
73+
:header: "Dept", "Year", "Sales"
5774
:align: left
5875

59-
0, 1992, "S-0-1992"
60-
0, 1993, "S-0-1993"
61-
0, 1994, "S-0-1994"
62-
1, 1992, "S-1-1992"
63-
1, 1993, "S-1-1993"
64-
1, 1994, "S-1-1994"
76+
Home, 1992, "S-H-1992"
77+
Home, 1993, "S-H-1993"
78+
Home, 1994, "S-H-1994"
79+
Auto, 1992, "S-A-1992"
80+
Auto, 1993, "S-A-1993"
81+
Auto, 1994, "S-A-1994"
82+
83+
In order to have all the Sales values for a Dept in a single record,
84+
the table needs to have all the Sales for that Dept rotated into the same row.
85+
``Unfold`` takes the tags and the field containing the values as its inputs
86+
and the fields to rotate them into as the outputs.
87+
88+
The first *input* field is the "Tags" field, which contains the value used to
89+
identify the original row.
90+
In this example, this is the Year of the row.
91+
This tag is used to track which group field an input row belongs to.
92+
The tags are tracked in order, and they must have the same number as the inputs.
93+
94+
After Unfolding, each Sales value appears in a separate field, with the Year in the field name:
6595

6696
.. csv-table:: Output
67-
:header: "Key", "Sales 1992", "Sales 1993", "Sales 1994"
97+
:header: "Dept", "Sales 1992", "Sales 1993", "Sales 1994"
6898
:align: left
6999

70-
0, "S-0-1992", "S-0-1993", "S-0-1994"
71-
1, "S-1-1992", "S-1-1993", "S-1-1994"
100+
Home, "S-H-1992", "S-H-1993", "S-H-1994"
101+
Auto, "S-A-1992", "S-A-1993", "S-A-1994"
102+
103+
Multiple Groups
104+
---------------
72105

73-
Multiple Folds
74-
--------------
106+
The second Usage example is a related case where multiple measures (Sales and Profit)
107+
have been tagged by Year so that the Sales and Profits for each Year are in separate rows.
75108

76109
.. csv-table:: Input
77-
:header: "Key", "Year", "Sales", "Profit"
110+
:header: "Dept", "Year", "Sales", "Profit"
78111
:align: left
79112

80-
0, 1992, "S-0-1992", "P-0-1992"
81-
0, 1993, "S-0-1993", "P-0-1993"
82-
0, 1994, "S-0-1994", "P-0-1994"
83-
1, 1992, "S-1-1992", "P-1-1992"
84-
1, 1993, "S-1-1993", "P-1-1993"
85-
1, 1994, "S-1-1994", "P-1-1994"
113+
Home, 1992, "S-H-1992", "P-H-1992"
114+
Home, 1993, "S-H-1993", "P-H-1993"
115+
Home, 1994, "S-H-1994", "P-H-1994"
116+
Auto, 1992, "S-A-1992", "P-A-1992"
117+
Auto, 1993, "S-A-1993", "P-A-1993"
118+
Auto, 1994, "S-A-1994", "P-A-1994"
119+
120+
In order to have all the Sales and Profit values for a Dept in a single record,
121+
the table needs to have all the Sales and Profit values for that Dept rotated into the same row.
122+
This means that there are two groups that need to be Unfolded: Sales and Profit,
123+
and the value from each group needs to be rotated into the appropriate group field.
124+
125+
To express this, each group is listed in order in the *outputs*
126+
and the *inputs* are mapped to the corresponding *tag* value and *output* field.
127+
In this example, the Year is again the first *input* field,
128+
and the following *input* fields are the groups in the order given by the *outputs*.
129+
130+
After Unfolding, each Sales and Profit value appears in a separate field:
86131

87132
.. csv-table:: Output
88-
:header: "Key", "Sales 1992", "Sales 1993", "Sales 1994", "Profit 1992", "Profit 1993", "Profit 1994"
133+
:header: "Dept", "Sales 1992", "Sales 1993", "Sales 1994", "Profit 1992", "Profit 1993", "Profit 1994"
89134
:align: left
90135
:widths: 1, 8, 8, 8, 8, 8, 8
91136

92-
0, "S-0-1992", "S-0-1993", "S-0-1994", "P-0-1992", "P-0-1993", "P-0-1994"
93-
1, "S-1-1992", "S-1-1993", "S-1-1994", "P-1-1992", "P-1-1993", "P-1-1994"
137+
Home, "S-H-1992", "S-H-1993", "S-H-1994", "P-H-1992", "P-H-1993", "P-H-1994"
138+
Auto, "S-A-1992", "S-A-1993", "S-A-1994", "P-A-1992", "P-A-1993", "P-A-1994"
139+
140+
Interleaved Records
141+
-------------------
142+
143+
Another powerful use case for ``Unfold`` is to assemble records that may be interleaved.
144+
In this example, the values of two fields appear mixed in the file, but identified by output Row and Column:
145+
146+
.. csv-table:: Input
147+
:header: "Row", "Column", "Data"
148+
:align: left
149+
150+
0,0,"#BLENDs"
151+
1,0,5
152+
2,0,6
153+
3,0,7
154+
4,0,8
155+
5,0,9
156+
6,0,10
157+
7,0,"Total"
158+
0,1,"#Queries"
159+
1,1,1
160+
2,1,11
161+
3,1,85
162+
4,1,449
163+
5,1,1511
164+
6,1,9216
165+
7,1,11273
166+
167+
To assemble the rows, we Unfold the Data column into a single group,
168+
using the Column field as the tags to identify the group field:
169+
170+
.. code-block:: python
171+
172+
Unfold(p, ('Column', 'Data',), ('#BLENDs', '#Queries',),
173+
{'BLENDs': 0, '#Queries': 1})
174+
175+
The result is a table of eight rows reassembled from the interleaved records, using the tags to identify the output group field:
176+
177+
.. csv-table:: Output
178+
:header: "Row", "#BLENDs", "#Queries"
179+
:align: left
180+
181+
0,#BLENDs,#Queries
182+
1,5,1
183+
2,6,11
184+
3,7,85
185+
4,8,449
186+
5,9,1511
187+
6,10,9216
188+
7,Total,11273
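
The keyed accumulation described in the ``Unfold`` documentation above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: ``unfold_rows`` is a hypothetical helper over dictionaries, the fixed fields are taken to be every field not named in the inputs, and a tag that is not in the mapping is assigned the next free group position the first time it is seen.

```python
def unfold_rows(rows, inputs, outputs, tags=None):
    """Illustrative keyed unfold over dict records (hypothetical helper, not the library API)."""
    tag_field, *value_fields = inputs
    width = len(outputs) // len(value_fields)  # fields per output group
    tags = dict(tags) if tags is not None else {}
    pending = {}  # fixed-field key -> partially assembled output row
    for row in rows:
        fixed = {k: v for k, v in row.items() if k not in inputs}
        key = tuple(fixed.items())
        tag = row[tag_field]
        if tag not in tags:
            tags[tag] = len(tags)  # first sighting takes the next group position
        pos = tags[tag]
        out = pending.setdefault(key, dict(fixed))
        for g, field in enumerate(value_fields):
            out[outputs[g * width + pos]] = row[field]
        # A row is complete once every output field has been filled
        if len(out) == len(fixed) + len(outputs):
            yield pending.pop(key)
```

In this sketch the "Interleaved Records" input would be passed with the Column values as tag keys, for example ``{0: 0, 1: 1}``. That the Column values arrive as integers is an assumption of the sketch. Each output row is emitted when its second value arrives, so the rows come out in Row order once the Column 1 records are read.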
