A filter plugin for Embulk to filter out columns
- columns: columns to retain (array of hash)
- name: name of column (required)
- src: src column name to be copied (optional, default is
name
) - default: default value used if input is null (optional)
- type: type of the default value (required for
default
) - format: special option for timestamp column, specify the format of the default timestamp (string, default is
default_timestamp_format
) - timezone: special option for timestamp column, specify the timezone of the default timestamp (string, default is
default_timezone
)
- add_columns: columns to add (array of hash)
- name: name of column (required)
- src: src column name to be copied (either of
src
ordefault
is required) - default: value of column (either of
src
ordefault
is required) - type: type of the default value (required for
default
) - format: special option for timestamp column, specify the format of the default timestamp (string, default is
default_timestamp_format
) - timezone: special option for timestamp column, specify the timezone of the default timestamp (string, default is
default_timezone
)
- drop_columns: columns to drop (array of hash)
- name: name of column (required)
- default_timestamp_format: default timestamp format for timestamp columns (string, default is
%Y-%m-%d %H:%M:%S.%N %z
) - default_timezone: default timezone for timestamp columns (string, default is
UTC
)
Say input.csv is as follows:
time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323
filters:
- type: column
columns:
- {name: time, default: "2015-07-13", format: "%Y-%m-%d"}
- {name: id}
- {name: key, default: "foo"}
reduces columns to only time
, id
, and key
columns as:
time,id,key
2015-07-13,0,Vqjht6YE
2015-07-13,1,VmjbjAA0
2015-07-13,2,C40P5H1W
Note that column types are automatically retrieved from input data (inputSchema).
Say input.csv is as follows:
time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323
filters:
- type: column
add_columns:
- {name: d, type: timestamp, default: "2015-07-13", format: "%Y-%m-%d"}
- {name: copy_id, src: id}
add d
column, and copy_id
column which is a copy of id
column as:
time,id,key,score,d,copy_id
2015-07-13,0,Vqjht6YE,1370,2015-07-13,0
2015-07-13,1,VmjbjAA0,3962,2015-07-13,1
2015-07-13,2,C40P5H1W,7323,2015-07,13,2
Say input.csv is as follows:
time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323
filters:
- type: column
drop_columns:
- {name: time}
- {name: id}
drop time
and id
columns as:
key,score
Vqjht6YE,1370
VmjbjAA0,3962
C40P5H1W,7323
For type: json column, you can specify JSONPath for column's name as:
- {name: $.payload.key1}
- {name: "$.payload.array[0]"}
- {name: "$.payload.array[*]"}
- {name: $['payload']['key1.key2']}
EXAMPLE:
Following operators of JSONPath are not supported:
- Multiple properties such as
['name','name']
- Multiple array indexes such as
[1,2]
- Array slice such as
[1:2]
- Filter expression such as
[?(<expression>)]
Note that type: timesatmp
for add_columns
or columns
is not available because Embulk's type: json
cannot have timestamp column inside.
Also note that renameing or copying of json paths by src
option is only partially supported yet. The parent json path must be same like:
- {name: $.payload.foo.dest, src: $.payload.foo.src}
I mean that below example does not work yet ($.payload.foo
and $.payload.bar
)
- {name: $.payload.foo.dest, src: $.payload.bar.src}
Run example:
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
Run test:
$ ./gradlew test
Run test with coverage reports:
$ ./gradlew test jacocoTestReport
open build/reports/jacoco/test/html/index.html
Run checkstyle and findbugs:
$ ./gradlew check
Run only checkstyle:
$ ./gradlew checkstyleMain
$ ./gradlew checkstyleTest
Run only findbugs:
$ ./gradlew findbugsMain
$ ./gradlew findbugsTest
Release gem:
$ ./gradlew gemPush