Move to a new independent repo
mercutiomontague committed Nov 10, 2016
0 parents commit 7d16044
Showing 64 changed files with 5,293 additions and 0 deletions.
3 changes: 3 additions & 0 deletions Gemfile
@@ -0,0 +1,3 @@
gem 'aws-sdk'
gem 'guard'
gem 'guard-aws-s3'
15 changes: 15 additions & 0 deletions README.md
@@ -0,0 +1,15 @@
# narp

narp is a program for scalable transformation of very large data sets. It parses a
DSL and generates a Hive program from it.

# Usage


# Dependencies

- treetop
- aws-sdk

# Changelog

2 changes: 2 additions & 0 deletions Rakefile
@@ -0,0 +1,2 @@
require 'bundler'
Bundler::GemHelper.install_tasks
65 changes: 65 additions & 0 deletions features/basic.feature
@@ -0,0 +1,65 @@
@basic
Feature: Parse the definitions of the basic elements of the Narp language
In order to allow users to specify basic building elements
As a developer
I should be able to run this scenario to prove that the definition is correctly interpreted

@string
Scenario Outline: Providing a string definition
Given an input <input>
When parsed by BasicG
Then I have a String at the root
And the value is <value>

Examples:
| input | value |
| 3"blue" | blueblueblue |
| 'blue\tyellow' | blue yellow |
| 2'blue\tyellow' | blue yellowblue yellow |
| x"73616D706C65" | sample |
| "\x73amp\x6C\x65" | sample |
| 'blue' | blue |
| 'blue #3' | blue #3 |


Scenario Outline: Providing a regular expression definition
Given an input <input>
When parsed by BasicG
Then I have a Regex at the root
And the regexp should match <value> with a value of <match>

Examples:
| input | value | match |
| /\S+/ | blue | b |
| /\l./ | blue | lu |
| /[[:digit:]+]/ | go23bat | 23 |
| /87[[:alpha:]]{2}/ | shax 87code | 87co |
| /\S+\t\S/ | love it | love i |


@current
Scenario Outline: Providing a numeric definition
Given an input <input>
When parsed by BasicG
Then I have a <class> at the root
And the value is <value>

Examples:
| input | class | value |
| 23 | OrdinalLiteral | 23 |
| -23 | IntegerLiteral | -23 |
| 23.33 | FloatLiteral | 23.33 |
| 2,333 | EditedNumeric | 2333 |
| 8,333.3 | EditedNumeric | 8333.3 |




Scenario Outline: Providing an invalid regular expression definition should cause Parse Error
Given an input <input>
Then parsing by BasicG should raise ParseError

Examples:
| input |
| /blue |
| /a(/ |
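
The string examples in `features/basic.feature` imply three rules: an optional leading count repeats the literal, an `x` prefix marks a hex-encoded body, and `\xNN` / `\t` escapes are expanded. The sketch below restates those rules in plain Ruby. It is not narp's implementation (which the scenarios say is a grammar called `BasicG`); the helper name `decode_string_literal` and the assumption that `\t` denotes a tab are illustrative only.

```ruby
# Hypothetical helper, not part of narp: decodes the string literal forms
# listed in the "Providing a string definition" examples.
def decode_string_literal(src)
  m = /\A(?<count>\d+)?(?<hex>x)?(?<quote>["'])(?<body>.*)\k<quote>\z/m.match(src)
  raise ArgumentError, "not a string literal: #{src}" unless m

  body =
    if m[:hex]
      # x"73616D706C65": every two hex digits encode one byte.
      [m[:body]].pack('H*')
    else
      # Expand \xNN escapes, then \t (assumed here to mean a tab character).
      m[:body]
        .gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
        .gsub('\t', "\t")
    end

  # A leading count such as 3"blue" repeats the decoded body.
  body * (m[:count] || 1).to_i
end

puts decode_string_literal('3"blue"')          # prints blueblueblue
puts decode_string_literal('x"73616D706C65"')  # prints sample
```

Only the escapes that appear in the examples are handled; a real grammar would also need to cover embedded quotes such as `''`.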
70 changes: 70 additions & 0 deletions features/condition.feature
@@ -0,0 +1,70 @@
@condition
Feature: Parse the conditions
In order to allow users to affect what gets generated by Narp
As a developer
I should be able to run this scenario to prove that the definition is correctly interpreted

# @current
# Scenario: providing a numeric valued expression
# Given an input /condition my_cond int6 < 10
# And the app has numeric fields int5,int6
# When parsed by ConditionG --v
# Then the condition is called my_cond
# And the prettied expression is "blue" match u

Scenario Outline: providing an arithmetic expression
Given an input /condition <name> <condition>
When parsed by ConditionG
Then the condition is called <name>
And the hql is <hql>

Examples:
| name | condition | hql |
| my_cond | 23+ 79= 102 | 23 + 79 = 102 |
| my_cond | 79 lt 102 - (25+33) | 79 < 102 - (25 + 33) |
| y_cond | 23+71eq94 | 23 + 71 = 94 |
| b_cond | (23+71)* 23=94 | (23 + 71) * 23 = 94 |
| c_cond | (23+71)*( 51+4)=94| (23 + 71) * (51 + 4) = 94 |
| d_cond | (23+71)/( 51-4)+5 ne 94 | (23 + 71) / (51 - 4) + 5 != 94 |
| e_cond | (23+71)/( 51-4)+5 < 94 | (23 + 71) / (51 - 4) + 5 < 94 |
| f_cond | (23+71)/( 51-4)+5 ge 78-(24*93) | (23 + 71) / (51 - 4) + 5 >= 78 - (24 * 93) |
| g_cond | (23+71) ge 78-(24*93) or 5 < 3 | (23 + 71) >= 78 - (24 * 93) OR 5 < 3 |
| g_cond | ((23+71) ge 78 or 5 < 3) | ((23 + 71) >= 78 OR 5 < 3) |
| i_cond | ((23+71) ge 78+15) or (5 < 3) | ((23 + 71) >= 78 + 15) OR (5 < 3) |
| j_cond | (((23+71) ge 78+15) and (5 < 3)) | (((23 + 71) >= 78 + 15) AND (5 < 3)) |


Scenario Outline: providing a character expression
Given an input /condition <name> <condition>
When parsed by ConditionG
Then the condition is called <name>
And the hql is <hql>

Examples:
| name | condition | hql |
| my_cond | 'blue' nc 'green' | LOCATE('green', 'blue') = 0|
| b_cond | 'blue '' goo ' mt 'green' | 'blue '' goo ' = 'green' |
| c_cond | "blue "" goo " ct "green" | LOCATE('green', 'blue "" goo ') > 0 |
| d_cond | "blue" mt /u/ | 'blue' RLIKE 'u' |


@current
Scenario Outline: Providing a character/numeric expression with field references
Given an input /condition <name> <condition>
And an existing app that is reinitialized
And the app has numeric fields <numeric_field_list>
And the app has character fields <character_field_list>
When parsed by ConditionG
Then the condition is called <name>
And the hql is <hql>

Examples:
| name | condition |numeric_field_list | character_field_list | hql |
| b_cond | int6 + 5 > 10 | int6, int9 | [] | lhs_int6 + 5 > 10|
| c_cond | 5"blue" ct "green" or int6 < 10 | int6, int9 | [] | LOCATE('green', 'blueblueblueblueblue') > 0 OR lhs_int6 < 10|
| d_cond | int6 < 10 AND 5"blue" ct "green" | int6, int9 | [] | lhs_int6 < 10 AND LOCATE('green', 'blueblueblueblueblue') > 0 |
| e_cond | 'blue' mt 'green' AnD ch5 mt /ye/ | [] | ch5, ch6 | 'blue' = 'green' AND lhs_ch5 RLIKE 'ye' |
| f_cond | "blue "" goo " mt /ue/ AND ch5 mt /\d+/ | [] | ch6, ch5 | 'blue "" goo ' RLIKE 'ue' AND lhs_ch5 RLIKE '\d+' |
| g_cond | (3"blue" ct "green" aND int6 < 9) or cha5 mt /yE/i | int6, int9 | ch4,cha5,col1 | (LOCATE('green', 'blueblueblue') > 0 AND lhs_int6 < 9) OR LOWER(lhs_cha5) RLIKE 'ye'|
| i_cond | cha5 = " " | int6, int9 | ch4,cha5,col1 | lhs_cha5 = ' ' |
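
The character-expression examples in `features/condition.feature` map the operators `ct`, `nc` and `mt` onto `LOCATE`, `=` and `RLIKE`. A minimal Ruby sketch of that mapping follows. The names `sql_quote` and `character_condition_hql` are hypothetical, the left-hand side is assumed to be an already-rendered HQL expression, and the quote-doubling rule in `sql_quote` is an assumption rather than something the scenarios confirm.

```ruby
# Hypothetical helpers, not part of narp.

# Wrap a Ruby string in single quotes (doubling embedded quotes is an assumption).
def sql_quote(text)
  "'#{text.gsub("'", "''")}'"
end

# lhs: an HQL expression such as "'blue'" or "lhs_ch5"
# op:  one of 'ct' (contains), 'nc' (does not contain), 'mt' (matches)
# rhs: a String literal value, or a Regexp for pattern matches
def character_condition_hql(lhs, op, rhs)
  case op
  when 'ct' then "LOCATE(#{sql_quote(rhs)}, #{lhs}) > 0"
  when 'nc' then "LOCATE(#{sql_quote(rhs)}, #{lhs}) = 0"
  when 'mt'
    return "#{lhs} = #{sql_quote(rhs)}" unless rhs.is_a?(Regexp)

    if rhs.casefold?
      # /yE/i becomes LOWER(...) RLIKE 'ye'. Downcasing the pattern is naive:
      # it would also turn \D into \d, so a real translator needs more care.
      "LOWER(#{lhs}) RLIKE '#{rhs.source.downcase}'"
    else
      "#{lhs} RLIKE '#{rhs.source}'"
    end
  else
    raise ArgumentError, "unsupported operator: #{op}"
  end
end

puts character_condition_hql("'blue'", 'nc', 'green')   # prints LOCATE('green', 'blue') = 0
puts character_condition_hql('lhs_cha5', 'mt', /yE/i)   # prints LOWER(lhs_cha5) RLIKE 'ye'
```

The numeric comparison words in the arithmetic examples (`lt`, `eq`, `ne`, `ge`) translate one-to-one to `<`, `=`, `!=` and `>=`, so a plain lookup table would cover them.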

89 changes: 89 additions & 0 deletions features/derived_field.feature
@@ -0,0 +1,89 @@
@derived_field
Feature: Parse the derived fields
In order to allow users to affect what gets generated by Narp
As a developer
I should be able to run this scenario to prove that the definition is correctly interpreted

# Scenario: providing a character expression
# Given an existing app that is reinitialized
# And the app has numeric fields fn1, fn2, fn3
# And the app has character fields fc1, fc2
# And the app has conditions cond2, cond5
# And an input /derivedfield calc1 fn1 + 92.5
# When parsed by DerivedFieldG --verbose

@current
Scenario Outline: providing a character/numeric expression
Given an input /derivedfield <expression>
And an existing app that is reinitialized
And the app has numeric fields fn1, fn2, fn3
And the app has character fields fc1, fc2
When parsed by DerivedFieldG
Then the column expression is <column_expression>
And the sequence is <sequence>

Examples:
| expression | column_expression | sequence |
| calc1 fc1 | lhs_fc1 AS calc1 | null |
| calc2 fc2 23 compress ascii | TRIM(CAST(lhs_fc2 AS VARCHAR(23))) AS calc2 | ascii |
| calc2 fc2 character 28 compress ascii | TRIM(CAST(lhs_fc2 AS VARCHAR(28))) AS calc2 | ascii |
| calc2 fc2 54 character compress ascii | TRIM(CAST(lhs_fc2 AS VARCHAR(54))) AS calc2 | ascii |
| calc3 23,000 en 6 compress | TRIM(CAST(23000 AS VARCHAR(6))) AS calc3 | null |
| calc4 23 uinteger 8 | CAST(23 AS VARCHAR(8)) AS calc4 | null |
| calc5 92.5 float 4 | CAST(92.5 AS VARCHAR(4)) AS calc5 | null |
| calc5 fn1 + 92.5 float 4 | CAST(lhs_fn1 + 92.5 AS VARCHAR(4)) AS calc5 | null |
| calc6 13,292.5 extract /(\d+).+(\d+)/ '#1k' compress | TRIM(CONCAT('', REGEXP_EXTRACT(13292.5, '(\\\\\d+).+(\\\\\d+)', 1), 'k')) AS calc6 | null |
| calc7 29,333.53 en 10 4/1 | CONCAT(CAST(SPLIT(29333.53, '\\.')[0] AS VARCHAR(4)), '.', CAST(SPLIT(29333.53, '\\.')[1] AS VARCHAR(1))) AS calc7 | null |
| calc8 29,333.53 En 10 4 | CAST(SPLIT(29333.53, '\\.')[0] AS VARCHAR(4)) AS calc8 | null |


Scenario Outline: providing a character regex
Given an input /derivedfield <name> <expression>
And an existing app that is reinitialized
And the app has numeric fields fn1, fn2, fn3
And the app has character fields fc1, fc2
When parsed by DerivedFieldG
Then the name is <name>
And the column expression is <column_expression>

Examples:
| name | expression | column_expression |
| calc1 | 'bluecheese' extract /(.+)cheese/i 'cheese: #1' truncate | RTRIM(CONCAT('cheese: ', REGEXP_EXTRACT(LOWER('bluecheese'), '(.+)cheese', 1))) AS calc1 |
| calc2 | fc1 extract /(.+)chee(.+)/i 'cheese: #2; then #1' | CONCAT('cheese: ', REGEXP_EXTRACT(LOWER(lhs_fc1), '(.+)chee(.+)', 2), '; then ', REGEXP_EXTRACT(LOWER(lhs_fc1), '(.+)chee(.+)', 1)) AS calc2 |

Scenario Outline: providing an if expression
Given an input /derivedfield <name> <expression>
And an existing app that is reinitialized
And the app has numeric fields fn1, fn2, fn3
And the app has character fields fc1, fc2
And the app has conditions cond2, cond5, cond7
When parsed by DerivedFieldG
Then the name is <name>
And the column expression is <column_expression>

Examples:
| name | expression | column_expression |
| calc1 | if cond2 then 25.3 else 56 | CASE WHEN _cond2_ THEN 25.3 ELSE 56 END AS calc1 |
| calc2 | if cond2 then if cond5 then 22 + fn1 else 23 else 56 | CASE WHEN _cond2_ THEN CASE WHEN _cond5_ THEN 22 + lhs_fn1 ELSE 23 END ELSE 56 END AS calc2 |
| calc3 | if cond2 then if cond5 then 22 + fn1 else if cond7 then 15+fn3 else 0 else 56 | CASE WHEN _cond2_ THEN CASE WHEN _cond5_ THEN 22 + lhs_fn1 ELSE CASE WHEN _cond7_ THEN 15 + lhs_fn3 ELSE 0 END END ELSE 56 END AS calc3 |


Scenario Outline: A derived expression referencing another derived_expression
Given an input /derivedfield <name> <expression>
And an existing app that is reinitialized
And the app has numeric fields fn1, fn2, fn3
And the app has character fields fc1, fc2
And the app has derived fields fd1, fd2
And the app has conditions cond2, cond5, cond7
When parsed by DerivedFieldG
Then the name is <name>
And the column expression is <column_expression>

Examples:
| name | expression | column_expression |
| calc2 | fd1 + 25.3 + 56 + fd2 | (_fd1_) + 25.3 + 56 + (_fd2_) AS calc2 |
| calc3 | fn1 + 25.3 + fd2 * 19 | lhs_fn1 + 25.3 + (_fd2_) * 19 AS calc3 |
| calc4 | if cond2 then fd1 + 25.3 else 56 + fd2 | CASE WHEN _cond2_ THEN (_fd1_) + 25.3 ELSE 56 + (_fd2_) END AS calc4 |
| calc5 | if cond2 then fn1 / 25.3 else 56 + fd2 | CASE WHEN _cond2_ THEN lhs_fn1 / 25.3 ELSE 56 + (_fd2_) END AS calc5 |
| calc6 | fd1 compress | TRIM((_fd1_)) AS calc6 |
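
The "providing an if expression" examples in `features/derived_field.feature` show nested `if ... then ... else` forms becoming nested `CASE WHEN ... END` expressions, with condition names wrapped in underscores. The recursion can be sketched in a few lines of Ruby. `IfExpr` and `to_hql` are hypothetical names for illustration, and the branches are assumed to be either another `IfExpr` or an already-rendered HQL fragment.

```ruby
# Hypothetical node type, not part of narp: one if/then/else expression.
IfExpr = Struct.new(:condition, :then_branch, :else_branch)

# Render a node as HQL. Leaves (numbers or pre-rendered strings) are emitted
# as-is; IfExpr nodes recurse into both branches.
def to_hql(node)
  return node.to_s unless node.is_a?(IfExpr)

  "CASE WHEN _#{node.condition}_ " \
    "THEN #{to_hql(node.then_branch)} " \
    "ELSE #{to_hql(node.else_branch)} END"
end

inner = IfExpr.new('cond5', '22 + lhs_fn1', 23)
outer = IfExpr.new('cond2', inner, 56)
puts "#{to_hql(outer)} AS calc2"
```

For the `calc2` row this prints the same column expression the scenario expects. The `_cond2_` placeholders are copied from the examples; how narp later substitutes them is not shown in this commit excerpt.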

132 changes: 132 additions & 0 deletions features/fields.feature
@@ -0,0 +1,132 @@
@fields
Feature: Parse the definition for fields
In order to allow users to specify fields in an input file
As a developer
I should be able to run this scenario to prove that the definition is correctly interpreted

Scenario Outline: providing a name, a fixed position and a data type
Given an input /fields <name> <position> <data_type> <length>
When parsed by FieldsG
And I am examining the 1st field
Then the name is <name>
And the starting byte position is <byte_position> and offset is <offset> bits
And the datatype is <data_type> <default_data_type>
And it is <dt_length> bytes long

Examples:
| name | position | data_type |length |byte_position | offset | dt_length | default_data_type |
| my_col | 23 | character | 15 | 23 | null | 15 | |
| yourcol | 15B3 | integer | | 15 | 3 | null | |
| his_col | 82B9 | float | | 82 | 9 | null | |
| her_col | 82B9 | | | 82 | 9 | null | character |


Scenario Outline: providing a name and a supported datetime datatype
Given an input /fields my_col 92B3 <data_type> <format>
When parsed by FieldsG
And I am examining the 1st field
Then I have a Field at the root
And the datatype is <data_type>
And it has these datetime pieces <pieces>

Examples:

| data_type | format |pieces |
| datetime | year | year |
| datetime | year/mon | year,mon |
| datetime | yy-mm0-dd0 hh0:mi0:se0 | yy,mm0,dd0,hh0,mi0,se0 |
| datetime | yy-mm-dd0 hh0:mi0:se0 | yy,mm,dd0,hh0,mi0,se0 |
| datetime | yy-mnth | yy,mnth |
| datetime | yy-mnth-ddth | yy,mnth,ddth |
| datetime | yy-mnth-dd0 hh0:mi0:se0 | yy,mnth,dd0,hh0,mi0,se0 |
| datetime | yy-mnth-dd hh0:mi0:se0 | yy,mnth,dd,hh0,mi0,se0 |
| datetime | yy-mnth-day hh:mi0:se0 | yy,mnth,day,hh,mi0,se0 |
| datetime | yy-mnth-day hr:mi:se0 | yy,mnth,day,hr,mi,se0 |
| datetime | yy-mnth-day hr:mi:se | yy,mnth,day,hr,mi,se |

Scenario Outline: providing a name and a delimited position
Given an input /fields my_col <start> <stop> <format>
When parsed by FieldsG
And I am examining the 1st field
Then the start field is <start_field> with byte offset of <start_offset>
And the stop field is <stop_field> with byte offset of <stop_offset>
And the datatype is <data_type> <default_data_type>
And it has these datetime pieces <pieces>

Examples:
| start | stop | format |start_field | start_offset| stop_field | stop_offset | data_type | pieces | default_data_type |
| 23:1 | | integer | 23 | 1 | null | null |integer | [] | |
| 41: | | | 41 | null | null | null | | [] | character |
| 41: | -72: | integer | 41 | null | 72 | null |integer | [] | |
| 41: | - 72:15 | integer | 41 | null | 72 | 15 |integer | [] | |
| 83:3 | - 22:0 | integer | 83 | 3 | 22 | 0 |integer | [] | |
| 83:3 | -92:0 | character | 83 | 3 | 92 | 0 | character | [] | |
| 83:3 | | datetime yy/mm-dd | 83 | 3 | null | null | datetime | yy,mm,dd | |
| 96:1 | -101: | datetime yy/mm-dd0 hh | 96 | 1 | 101 | null | datetime | yy,mm,dd0,hh | |


Scenario: providing a name and a precision
Given an input /fields my_col 14:1 float 4 /1
When parsed by FieldsG
And I am examining the 1st field
Then I have a Field at the root
And I have 1 Field
And the name is my_col
And the start field is 14 with byte offset of 1
And the datatype is float
And the precision is 5 and the scale is 1

Scenario Outline: providing a name and a sequence
Given an input /fields my_col 14:1 character <collation>
And the app has collations <collation_list>
When parsed by FieldsG
And I am examining the 1st field
Then the collation is <collation>

Examples:
| collation | collation_list |
| ascii | [] |
| myascii | yourascii,myascii |


Scenario Outline: providing a name and an unsupported format
Given an input /fields my_col 29b9 <format>
When parsed by FieldsG
And I am examining the 1st field
Then the datatype should raise ArgumentError

Examples:
| format |
| lz |
| lp |
| tp |
| zd |
| ls |
| ts |
| an |
| pd |

@current
Scenario: Parsing two fields
Given an input /fields My_col 91B3 your_col 25
When parsed by FieldsG
And I am examining the 1st field
Then the name is My_col
And the starting byte position is 91 and offset is 3 bits
And the datatype is character
And I am examining the 2nd field
Then the name is your_col
And the starting byte position is 25 and offset is null bits
And the datatype is character

Scenario Outline: Things that should cause a parse error
Given an input /fields <input>
And the app has collations <collation_list>
Then parsing by FieldsG should raise ParseError

Examples:
| input | collation_list | message |
| my_col 14:1 character myascii | blueshoose | referencing unknown collation |
| my_col 14:1 character myascii | [] | referencing unknown collation |

