Move to a new independent repo

mercutiomontague · Nov 10, 2016 · commit 7d16044

Showing 64 changed files with 5,293 additions and 0 deletions.
diff --git a/Gemfile b/Gemfile
@@ -0,0 +1,3 @@
+gem 'aws-sdk'
+gem 'guard'
+gem 'guard-aws-s3'
diff --git a/README.md b/README.md
@@ -0,0 +1,15 @@
+# narp
+
+narp is a program for scalable transformation of very large data sets. It processes a
+DSL and generates a Hive program from it.
+
+# Usage
+
+
+# Dependencies
+
+ - treetop 
+ - aws-sdk
+
+# Changelog
+
diff --git a/Rakefile b/Rakefile
@@ -0,0 +1,2 @@
+require 'bundler'
+Bundler::GemHelper.install_tasks
diff --git a/features/basic.feature b/features/basic.feature
@@ -0,0 +1,65 @@
+@basic
+Feature: Parse the definition of basic elements of the Narp language
+  In order to allow users to specify basic building elements
+  As a developer
+  I should be able to run this scenario to prove that the definition is correctly interpreted
+
+  @string
+  Scenario Outline: Providing a string definition
+    Given an input <input>
+    When parsed by BasicG 
+    Then I have a String at the root
+    And the value is <value>
+
+    Examples:
+      | input               | value             |
+      | 3"blue"           | blueblueblue      |
+      | 'blue\tyellow'    | blue	yellow      |
+      | 2'blue\tyellow'   | blue	yellowblue	yellow      |
+      | x"73616D706C65"   | sample            |
+      | "\x73amp\x6C\x65" | sample						|
+      | 'blue'            | blue              |
+      | 'blue #3'           | blue #3           |
+
+
+  Scenario Outline: Providing a regular expression definition
+    Given an input <input>
+    When parsed by BasicG 
+    Then I have a Regex at the root
+    And the regexp should match <value> with a value of <match>
+
+    Examples:
+      | input               | value             | match     |
+      | /\S+/               | blue              | b         |
+      | /\l./               | blue              | lu        |
+      | /[[:digit:]+]/      | go23bat           | 23        |
+      | /87[[:alpha:]]{2}/  | shax 87code       | 87co      |
+      | /\S+\t\S/            | love	it          | love	i   |
+
+
+  @current
+  Scenario Outline: Providing a numeric definition
+    Given an input <input>
+    When parsed by BasicG 
+    Then I have a <class> at the root
+    And the value is <value>
+
+    Examples:
+      | input   | class             | value    |
+      | 23      | OrdinalLiteral    | 23         |
+      | -23     | IntegerLiteral    | -23         |
+      | 23.33   | FloatLiteral      | 23.33        |
+      | 2,333   | EditedNumeric     | 2333        |
+      | 8,333.3 | EditedNumeric     | 8333.3        |
+
+
+
+
+  Scenario Outline: Providing an invalid regular expression definition should cause Parse Error
+    Given an input <input>
+    Then parsing by BasicG should raise ParseError 
+
+    Examples:
+      | input               | 
+      | /blue               | 
+      | /a(/              | 
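Review note on features/basic.feature: the string examples imply three rules: an optional leading repeat count, an optional `x` prefix for hex-encoded bodies, and `\t` / `\xNN` escapes inside either quote style. The Ruby sketch below restates those rules in executable form. It is only an illustration inferred from the Examples table: `expand_string_literal` is a hypothetical helper, not narp's Treetop grammar (BasicG), and it ignores doubled-quote escaping.

```ruby
# Hypothetical helper (not part of narp): evaluates a string literal the way
# the "Providing a string definition" examples suggest.
def expand_string_literal(src)
  # Optional repeat count, optional hex marker, then a quoted body.
  m = /\A(?<count>\d+)?(?<hex>x)?(?<quote>["'])(?<body>.*)\k<quote>\z/m.match(src)
  raise ArgumentError, "not a string literal: #{src.inspect}" unless m

  body =
    if m[:hex]
      # x"73616D706C65": the body is a run of hex digit pairs.
      [m[:body]].pack('H*').force_encoding(Encoding::UTF_8)
    else
      m[:body]
        .gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr } # \x73 becomes "s"
        .gsub('\t', "\t")                                    # \t becomes a tab
    end

  # A missing count means the body appears once.
  body * (m[:count] || 1).to_i
end
```

For example, `expand_string_literal('3"blue"')` yields `blueblueblue`, matching the first row of the table.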
diff --git a/features/condition.feature b/features/condition.feature
@@ -0,0 +1,70 @@
+@condition
+Feature: Parse the conditions
+  In order to allow users to affect what gets generated by Narp
+  As a developer
+  I should be able to run this scenario to prove that the definition is correctly interpreted
+
+  # @current
+  # Scenario: providing a numeric valued expression
+  #   Given an input /condition my_cond int6 < 10
+	# 	And the app has numeric fields int5,int6
+  #   When parsed by ConditionG --v
+  #   	Then the condition is called my_cond 
+  #     And the prettied expression is "blue" match u
+
+  Scenario Outline: providing an arithmetic expression 
+    Given an input /condition <name> <condition>
+    When parsed by ConditionG 
+   	Then the condition is called <name> 
+   	And the hql is <hql>
+
+  	Examples:
+  	|  name		  | condition 				|  hql |
+  	| my_cond	  | 23+ 79= 102			 	|  	23 + 79 = 102 |
+  	| my_cond	  | 79 lt 102 - (25+33)			 	| 79 < 102 - (25 + 33) |
+    | y_cond	  | 23+71eq94			 	  |  	23 + 71 = 94 |
+    | b_cond	  | (23+71)* 23=94		|  	(23 + 71) * 23 = 94 |
+    | c_cond	  | (23+71)*( 51+4)=94|  	(23 + 71) * (51 + 4) = 94 |
+    | d_cond	  | (23+71)/( 51-4)+5 ne 94 |  	(23 + 71) / (51 - 4) + 5 != 94  |
+    | e_cond	  | (23+71)/( 51-4)+5 < 94  |  	(23 + 71) / (51 - 4) + 5 < 94   |
+    | f_cond	  | (23+71)/( 51-4)+5 ge 78-(24*93) |  	(23 + 71) / (51 - 4) + 5 >= 78 - (24 * 93)  |
+    | g_cond	  | (23+71) ge 78-(24*93) or 5 < 3 |  	(23 + 71) >= 78 - (24 * 93) OR 5 < 3 |
+    | g_cond	  | ((23+71) ge 78 or 5 < 3) |  	((23 + 71) >= 78 OR 5 < 3) |
+    | i_cond	  | ((23+71) ge 78+15) or (5 < 3) |  	((23 + 71) >= 78 + 15) OR (5 < 3) |
+    | j_cond	  | (((23+71) ge 78+15) and (5 < 3)) |  	(((23 + 71) >= 78 + 15) AND (5 < 3)) |
+
+
+  Scenario Outline: providing a character expression 
+    Given an input /condition <name> <condition>
+    When parsed by ConditionG 
+   	Then the condition is called <name> 
+    And the hql is <hql>
+
+  	Examples:
+  	|  name		  | condition 				              |  hql |
+    | my_cond	  | 'blue' nc 'green'			 	        | LOCATE('green', 'blue') = 0|
+    | b_cond	  | 'blue '' goo ' mt 'green'			 	| 'blue '' goo ' = 'green' |
+  	| c_cond	  | "blue "" goo " ct "green"			 	| LOCATE('green', 'blue "" goo ') > 0 |
+    | d_cond	  | "blue" mt /u/			 	            | 'blue' RLIKE 'u' |
+
+
+  @current
+  Scenario Outline: Providing a character/numeric expression with field references
+    Given an input /condition <name> <condition>
+    And an existing app that is reinitialized
+		And the app has numeric fields <numeric_field_list>
+		And the app has character fields <character_field_list>
+    When parsed by ConditionG 
+   	Then the condition is called <name> 
+    And the hql is <hql>
+
+  	Examples:
+  	|  name		  | condition 				                  |numeric_field_list  | character_field_list |  hql |
+  	| b_cond	  | int6 + 5 > 10                       |	int6, int9         | []                   | lhs_int6 + 5 > 10|
+  	| c_cond	  | 5"blue" ct "green" or int6 < 10    |	int6, int9         | []                   | LOCATE('green', 'blueblueblueblueblue') > 0 OR lhs_int6 < 10|
+    | d_cond	  | int6 < 10 AND 5"blue" ct "green"    |	int6, int9         | []                   | lhs_int6 < 10 AND LOCATE('green', 'blueblueblueblueblue') > 0 |
+    | e_cond	  | 'blue' mt 'green' AnD ch5 mt /ye/	  | []                 | ch5, ch6             | 'blue' = 'green' AND lhs_ch5 RLIKE 'ye' |
+  	| f_cond	  | "blue "" goo " mt /ue/ AND ch5 mt /\d+/		| []           | ch6, ch5             | 'blue "" goo ' RLIKE 'ue' AND lhs_ch5 RLIKE '\d+' |
+  	| g_cond	  | (3"blue" ct "green" aND int6 < 9) or cha5 mt /yE/i   |	int6, int9         | ch4,cha5,col1         | (LOCATE('green', 'blueblueblue') > 0 AND lhs_int6 < 9) OR LOWER(lhs_cha5) RLIKE 'ye'|
+    | i_cond	  | cha5 = "     "                      |	int6, int9         | ch4,cha5,col1         | lhs_cha5 = '     ' |
+
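Review note on features/condition.feature: the character scenarios suggest a small mapping from Narp operators to HQL. `ct` (contains) and `nc` (not contains) become `LOCATE` comparisons with the operands swapped, while `mt` (match) becomes `=` for a string operand and `RLIKE` for a regex operand. The Ruby sketch below encodes only that mapping. `character_condition_hql` is a hypothetical name, the operands are assumed to be already rendered as HQL text, and regex flags such as the trailing `i` in `/yE/i` are not handled.

```ruby
# Hypothetical helper (not part of narp): renders one character comparison as
# HQL, following the expected values in the Examples tables.
def character_condition_hql(lhs, op, rhs)
  case op
  when 'ct' then "LOCATE(#{rhs}, #{lhs}) > 0" # lhs contains rhs
  when 'nc' then "LOCATE(#{rhs}, #{lhs}) = 0" # lhs does not contain rhs
  when 'mt'
    # A /pattern/ operand turns the match into RLIKE; anything else is equality.
    if (m = %r{\A/(.*)/\z}.match(rhs))
      "#{lhs} RLIKE '#{m[1]}'"
    else
      "#{lhs} = #{rhs}"
    end
  else
    raise ArgumentError, "unsupported operator: #{op}"
  end
end
```

Calling `character_condition_hql("'blue'", 'nc', "'green'")` gives `LOCATE('green', 'blue') = 0`, the value expected for `my_cond`.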
diff --git a/features/derived_field.feature b/features/derived_field.feature
@@ -0,0 +1,89 @@
+@derived_field
+Feature: Parse the derived fields 
+  In order to allow users to affect what gets generated by Narp
+  As a developer
+  I should be able to run this scenario to prove that the definition is correctly interpreted
+
+  # Scenario: providing an character expression 
+  #   Given an existing app that is reinitialized
+	# 	And the app has numeric fields fn1, fn2, fn3
+	# 	And the app has character fields fc1, fc2
+  #   And the app has conditions cond2, cond5
+  #   And an input /derivedfield calc1 fn1 + 92.5 
+  #   When parsed by DerivedFieldG --verbose
+
+  @current
+  Scenario Outline: providing a character/numeric expression 
+    Given an input /derivedfield <expression>
+    And an existing app that is reinitialized
+		And the app has numeric fields fn1, fn2, fn3
+		And the app has character fields fc1, fc2
+    When parsed by DerivedFieldG 
+    Then the column expression is <column_expression>
+    And the sequence is <sequence> 
+
+    Examples:
+      | expression                                   | column_expression            							| sequence 	| 
+      | calc1 fc1                                    | lhs_fc1 AS calc1   														| null			|
+      | calc2 fc2 23 compress ascii                  | TRIM(CAST(lhs_fc2 AS VARCHAR(23))) AS calc2   	| ascii 		|
+      | calc2 fc2 character 28 compress ascii        | TRIM(CAST(lhs_fc2 AS VARCHAR(28))) AS calc2   	| ascii 		|
+      | calc2 fc2 54 character compress ascii        | TRIM(CAST(lhs_fc2 AS VARCHAR(54))) AS calc2   	| ascii 		|
+      | calc3 23,000 en 6 compress                   | TRIM(CAST(23000 AS VARCHAR(6))) AS calc3   | null      |
+      | calc4 23 uinteger 8                          | CAST(23 AS VARCHAR(8)) AS calc4            | null      |
+      | calc5 92.5 float 4                           | CAST(92.5 AS VARCHAR(4)) AS calc5          | null      |
+      | calc5 fn1 + 92.5 float 4                     | CAST(lhs_fn1 + 92.5 AS VARCHAR(4)) AS calc5          | null      |
+      | calc6 13,292.5 extract /(\d+).+(\d+)/ '#1k' compress   | TRIM(CONCAT('', REGEXP_EXTRACT(13292.5, '(\\\\\d+).+(\\\\\d+)', 1), 'k')) AS calc6 | null |
+      | calc7 29,333.53 en 10 4/1 | CONCAT(CAST(SPLIT(29333.53, '\\.')[0] AS VARCHAR(4)), '.', CAST(SPLIT(29333.53, '\\.')[1] AS VARCHAR(1))) AS calc7 | null |
+      | calc8 29,333.53 En 10 4 | CAST(SPLIT(29333.53, '\\.')[0] AS VARCHAR(4)) AS calc8 | null |
+
+
+  Scenario Outline: providing a character regex 
+    Given an input /derivedfield <name> <expression>
+    And an existing app that is reinitialized
+    And the app has numeric fields fn1, fn2, fn3
+		And the app has character fields fc1, fc2
+    When parsed by DerivedFieldG
+   	Then the name is <name> 
+    And the column expression is <column_expression>
+
+    Examples:
+      | name      | expression                                      | column_expression                                                                       |
+      | calc1     | 'bluecheese' extract /(.+)cheese/i 'cheese: #1' truncate | RTRIM(CONCAT('cheese: ', REGEXP_EXTRACT(LOWER('bluecheese'), '(.+)cheese', 1))) AS calc1 |
+      | calc2     | fc1 extract /(.+)chee(.+)/i 'cheese: #2; then #1' | CONCAT('cheese: ', REGEXP_EXTRACT(LOWER(lhs_fc1), '(.+)chee(.+)', 2), '; then ', REGEXP_EXTRACT(LOWER(lhs_fc1), '(.+)chee(.+)', 1)) AS calc2 |
+
+  Scenario Outline: providing an if expression 
+    Given an input /derivedfield <name> <expression>
+    And an existing app that is reinitialized
+		And the app has numeric fields fn1, fn2, fn3
+		And the app has character fields fc1, fc2
+    And the app has conditions cond2, cond5, cond7
+    When parsed by DerivedFieldG 
+   	Then the name is <name> 
+    And the column expression is <column_expression>
+
+    Examples:
+      | name      | expression                         | column_expression                                              |
+      | calc1     | if cond2 then 25.3 else 56         | CASE WHEN _cond2_ THEN 25.3 ELSE 56 END AS calc1   |
+      | calc2     | if cond2 then if cond5 then 22 + fn1 else 23 else 56         | CASE WHEN _cond2_ THEN CASE WHEN _cond5_ THEN 22 + lhs_fn1 ELSE 23 END ELSE 56 END AS calc2   |
+      | calc3     | if cond2 then if cond5 then 22 + fn1 else if cond7 then 15+fn3 else 0 else 56         | CASE WHEN _cond2_ THEN CASE WHEN _cond5_ THEN 22 + lhs_fn1 ELSE CASE WHEN _cond7_ THEN 15 + lhs_fn3 ELSE 0 END END ELSE 56 END AS calc3   |
+
+
+  Scenario Outline: A derived expression referencing another derived_expression
+    Given an input /derivedfield <name> <expression>
+    And an existing app that is reinitialized
+		And the app has numeric fields fn1, fn2, fn3
+		And the app has character fields fc1, fc2
+		And the app has derived fields fd1, fd2
+    And the app has conditions cond2, cond5, cond7
+    When parsed by DerivedFieldG 
+   	Then the name is <name> 
+    And the column expression is <column_expression>
+
+    Examples:
+      | name      | expression                                  | column_expression                                                                 |
+      | calc2     | fd1 + 25.3 + 56 + fd2                       |  (_fd1_) + 25.3 + 56 + (_fd2_) AS calc2    |
+      | calc3     | fn1 + 25.3 + fd2 * 19                       |  lhs_fn1 + 25.3 + (_fd2_) * 19 AS calc3    |
+      | calc4     | if cond2 then fd1 + 25.3 else 56 + fd2      | CASE WHEN _cond2_ THEN (_fd1_) + 25.3 ELSE 56 + (_fd2_) END AS calc4    |
+      | calc5     | if cond2 then fn1 / 25.3 else 56 + fd2      | CASE WHEN _cond2_ THEN lhs_fn1 / 25.3 ELSE 56 + (_fd2_) END AS calc5    |
+      | calc6     | fd1 compress | TRIM((_fd1_)) AS calc6 |
+
diff --git a/features/fields.feature b/features/fields.feature
@@ -0,0 +1,132 @@
+@fields
+Feature: Parse the definition for fields
+  In order to allow users to specify fields in an input file
+  As a developer
+  I should be able to run this scenario to prove that the definition is correctly interpreted
+
+    Scenario Outline: providing a name, a fixed position and a data type
+    Given an input /fields <name> <position> <data_type> <length>
+    When parsed by FieldsG
+    And I am examining the 1st field
+    Then the name is <name> 
+    	And the starting byte position is <byte_position> and offset is <offset> bits
+    	And the datatype is <data_type> <default_data_type>
+    	And it is <dt_length> bytes long
+
+		Examples:
+		|  name		| position | data_type  |length |byte_position | offset | dt_length | default_data_type |
+		| my_col	| 23			 | character  | 15 		|		23				 | null 	| 15  			|                   |
+		| yourcol	| 15B3		 | integer  	| 			|	15					 | 3 	 	 	| null			|                   | 
+		| his_col | 82B9		 | float   		| 			|		82				 | 9			| null			|                   |
+		| her_col | 82B9		 |   		      | 			|		82				 | 9			| null      | character         |
+
+
+    Scenario Outline: providing a name and a supported datetime datatype 
+    Given an input /fields my_col 92B3 <data_type> <format>
+    	When parsed by FieldsG 
+      And I am examining the 1st field
+    	Then I have a Field at the root
+    	  And the datatype is <data_type> 
+			  And it has these datetime pieces <pieces>
+
+ 		Examples:
+
+ 		| data_type | format									|pieces 									|
+ 		| datetime 	| year                    | year                    |
+ 		| datetime 	| year/mon                | year,mon                |
+ 		| datetime 	| yy-mm0-dd0 hh0:mi0:se0  | yy,mm0,dd0,hh0,mi0,se0 	|
+ 		| datetime 	| yy-mm-dd0 hh0:mi0:se0   | yy,mm,dd0,hh0,mi0,se0 	|
+ 		| datetime 	| yy-mnth									| yy,mnth									|
+ 		| datetime 	| yy-mnth-ddth            | yy,mnth,ddth            |
+ 		| datetime 	| yy-mnth-dd0 hh0:mi0:se0 | yy,mnth,dd0,hh0,mi0,se0 |
+ 		| datetime 	| yy-mnth-dd hh0:mi0:se0  | yy,mnth,dd,hh0,mi0,se0  |
+ 		| datetime 	| yy-mnth-day hh:mi0:se0  | yy,mnth,day,hh,mi0,se0  |
+ 		| datetime 	| yy-mnth-day hr:mi:se0   | yy,mnth,day,hr,mi,se0   |
+ 		| datetime 	| yy-mnth-day hr:mi:se    | yy,mnth,day,hr,mi,se    |
+
+    Scenario Outline: providing a name and a delimited position
+    Given an input /fields my_col <start> <stop> <format>
+    When parsed by FieldsG 
+    And I am examining the 1st field
+    Then the start field is <start_field> with byte offset of <start_offset> 
+      And the stop field is <stop_field> with byte offset of <stop_offset> 
+   	  And the datatype is <data_type> <default_data_type>
+		  And it has these datetime pieces <pieces>
+
+	Examples: 
+	| start | stop 			| format 		|start_field 	| start_offset| stop_field | stop_offset | data_type | pieces 			| default_data_type |
+  | 23:1  |      			| integer 	| 23          |		1					| null			 | null				 |integer		 | []  					|                   |
+	| 41:  	|      			|  	        | 41          |		null			| null			 | null				 | 	         | []						| character         |
+  | 41:  	| -72: 			| integer 	| 41          |		null			| 72			 	 | null				 |integer 	 | []						|                   |
+  | 41:  	| - 72:15  	| integer 	| 41          |		null			| 72			 	 | 15				 	 |integer 	 | []						|                   |
+	| 83:3 	| -   22:0  | integer 	| 83          |		3					| 22			 	 | 0				 	 |integer 	 | []						|                   |
+	| 83:3 	| -92:0 		| character	| 83          |		3					| 92			 	 | 0			 		 | character | []						|                   |
+	| 83:3 	| 	| datetime yy/mm-dd	| 83          |		3					| null			 | null			 	 | datetime  | yy,mm,dd   	|                   |
+	| 96:1 	| -101:	| datetime yy/mm-dd0 hh	| 96  |		1				 	| 101			 	 | null 			| datetime	 | yy,mm,dd0,hh	|                   |
+
+
+    Scenario: providing a name and a precision
+    Given an input /fields my_col 14:1 float 4 /1
+    When parsed by FieldsG 
+      And I am examining the 1st field
+    Then I have a Field at the root
+    	And I have 1 Field 
+    	And the name is my_col
+    	And the start field is 14 with byte offset of 1
+    	And the datatype is float 
+			And the precision is 5 and the scale is 1
+
+    Scenario Outline: providing a name and a sequence 
+    Given an input /fields my_col 14:1 character <collation> 
+		  And the app has collations <collation_list>
+    When parsed by FieldsG 
+      And I am examining the 1st field
+    Then the collation is <collation> 
+
+		Examples:
+		| collation 					| collation_list 				|
+		| ascii								| []										|
+		| myascii							| yourascii,myascii 		|
+
+
+  Scenario Outline: providing a name and an unsupported format
+    Given an input /fields my_col 29b9 <format> 
+    When parsed by FieldsG 
+      And I am examining the 1st field
+    Then the datatype should raise ArgumentError
+
+		Examples:
+		| format 	|
+		| lz			|
+		| lp			| 
+		| tp 			| 
+		| zd			| 
+		| ls			| 
+		| ts			| 
+		| an			| 
+		| pd			| 
+
+  @current
+  Scenario: Parsing two fields
+    Given an input /fields My_col 91B3 your_col 25 
+    When parsed by FieldsG 
+    And I am examining the 1st field
+    Then the name is My_col
+    	And the starting byte position is 91 and offset is 3 bits
+    	And the datatype is character
+    And I am examining the 2nd field
+    Then the name is your_col
+    	And the starting byte position is 25 and offset is null bits
+    	And the datatype is character
+
+  Scenario Outline: Things that should cause a parse error 
+    Given an input /fields <input> 
+		And the app has collations <collation_list>
+    Then parsing by FieldsG should raise ParseError 
+
+		Examples:
+		| input 																		| collation_list | message 														|
+		|	my_col 14:1 character myascii							| blueshoose     | referencing unknown collation			|
+		|	my_col 14:1 character myascii							| []     | referencing unknown collation			|
+
+
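Review note on features/fields.feature: the fixed-position scenarios read a token such as `15B3` as byte position 15 with a 3-bit offset, and a bare `23` as byte position 23 with no offset. The Ruby sketch below shows that reading. `parse_fixed_position` is a hypothetical helper rather than the FieldsG grammar, and accepting a lowercase `b` is an assumption based on the `29b9` input used in the unsupported-format scenario.

```ruby
# Hypothetical helper (not part of narp): splits a fixed-position token into
# a byte position and an optional bit offset.
def parse_fixed_position(token)
  m = /\A(\d+)(?:[Bb](\d+))?\z/.match(token)
  raise ArgumentError, "not a fixed position: #{token.inspect}" unless m

  # bit_offset stays nil when the token carries no "B<n>" suffix.
  { byte: m[1].to_i, bit_offset: m[2]&.to_i }
end
```

`parse_fixed_position('15B3')` returns `{ byte: 15, bit_offset: 3 }`, in line with the `yourcol` row.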