Tsv export #1

Open: wants to merge 37 commits into base: main

Commits
df8b0a1
add export_table and bigquery_export
bramboomen Apr 15, 2024
a1242b9
add export functionality by providing a query file instead of table name
bramboomen Apr 15, 2024
b3f66a5
add export_database function
bramboomen Apr 15, 2024
e040a04
add echo & verbose messages to export_table/database scripts
bramboomen Apr 15, 2024
80d1386
fix bugs after testing export_table/database
bramboomen Apr 15, 2024
1e5fa8d
Add load_bigquery_table function
bramboomen Apr 16, 2024
0a52790
Move changes for bigquery upload to separate branch
bramboomen Apr 16, 2024
b3a8e71
Increment etl-tooling version number
bramboomen Apr 16, 2024
47bc830
Increment etl-tooling version number
bramboomen Apr 16, 2024
86b4fa6
fixes for export table
bramboomen Apr 25, 2024
608c02e
Add error logging to export table
bramboomen Apr 25, 2024
31c3a5d
Add function csv_analyze_data
bramboomen Apr 26, 2024
1256945
Add CsvAnalyzer exe
bramboomen Apr 26, 2024
73b68f7
Bugfixes
bramboomen Apr 26, 2024
362340a
table_export: Output types to subfolder
bramboomen Apr 26, 2024
ed3e1c7
add export_table and bigquery_export
bramboomen Apr 15, 2024
26f9c03
add export functionality by providing a query file instead of table name
bramboomen Apr 15, 2024
a811a2c
add export_database function
bramboomen Apr 15, 2024
0d56602
add echo & verbose messages to export_table/database scripts
bramboomen Apr 15, 2024
b403287
fix bugs after testing export_table/database
bramboomen Apr 15, 2024
3b5b0d5
Move changes for bigquery upload to separate branch
bramboomen Apr 16, 2024
193b584
fixes for export table
bramboomen Apr 25, 2024
606ceac
Add error logging to export table
bramboomen Apr 25, 2024
d368b2d
Add function csv_analyze_data
bramboomen Apr 26, 2024
f9d9dc7
Add CsvAnalyzer exe
bramboomen Apr 26, 2024
f0c2471
Bugfixes
bramboomen Apr 26, 2024
b4fc74b
table_export: Output types to subfolder
bramboomen Apr 26, 2024
d4e9b0f
Add load_bigquery_table function
bramboomen Apr 16, 2024
0511ad6
generate_bigquery_schema: sanitize data_types before conversion
bramboomen Apr 26, 2024
ca5f9c6
add bq_exe to executables.bat
bramboomen Apr 26, 2024
e24a4bb
csv_analyze_data: fix :check_variables
bramboomen Apr 26, 2024
f5bcae7
export_table: fix types_file path
bramboomen Apr 26, 2024
fa211e8
Merge branch 'bigquery_upload' into tsv_export
bramboomen Sep 24, 2024
bc0938c
Fix merge error
bramboomen Sep 24, 2024
61535ff
Update export table/database
bramboomen Sep 27, 2024
3ae16b1
Change Mssql to Bigquery data types
bramboomen Sep 27, 2024
8581ac3
Improve BQ logging
bramboomen Sep 27, 2024
22 changes: 19 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
# CWTS ETL tooling
-Version: 8.0.0
+Version: 8.1.0

## Description

@@ -25,7 +25,7 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
| `archive_pipeline` | v1.0.0 |
| `aws_download_folder` | v1.0.0 |
| `bcp_data` | v1.0.2 |
-| `check_errors` | v0.3.2 |
+| `check_errors` | v0.3.3 |
| `classification_create_classification` | v1.0.0 |
| `classification_create_labeling` | v1.0.0 |
| `classification_create_vosviewer_maps` | v1.0.0 |
@@ -36,14 +36,17 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
| `credentials` | dev |
| `curl_download_file` | v1.3.0 |
| `executables` | v1.1.1 |
| `export_database` | dev |
| `export_table` | dev |
| `extract_noun_phrases` | v1.0.0 |
-| `folder` | v1.0.6 |
+| `folder` | v1.0.7 |
| `get_datetime` | v1.0.0 |
| `generate_database_documentation` | v0.1.0 |
| `grant_access_cwts_group` | v2.0.0 |
| `json_analyze_data` | v1.0.0 |
| `json_parse_data` | v1.1.1 |
| `load_database` | v1.0.0 |
| `load_bigquery_table` | dev |
| `log_runtime` | v0.0.1 |
| `notify` | v1.0.0 |
| `notify_errors` | v0.1.0 |
@@ -90,6 +93,8 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
- Add wait.bat :sleep_subprocess

### check_errors
- v0.3.4
- `%export_log_folder%` added for export_table function
- v0.3.3
- `%backup_log_folder%` added for backup-tooling
- v0.3.2
@@ -165,6 +170,10 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
- v1.1.0
- rename `%read_data_exe%` to `%readdata_exe%`

### export_database

### export_table

### extract_noun_phrases

- v1.0.0
@@ -173,6 +182,11 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable

### folder

- v1.0.8
- add `%bigquery_log_folder%`
- v1.0.7
- add `%export_data_folder%`
- add `%export_log_folder%`
- v1.0.6
- add `%publicationclassification_log_folder%`
- add `%publicationclassificationlabeling_log_folder%`
@@ -209,6 +223,8 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
- v1.0.0
- The value of `%erase_previous%` should be set to `erase_previous` instead of `true`

### load_bigquery_table

### load_database

- v1.0.0
2 changes: 2 additions & 0 deletions functions/check_errors.bat
@@ -35,9 +35,11 @@ set error_string=---------------------------------------------------------------

call :check_errors "Backup" "%backup_log_folder%" error
call :check_errors "BCP" "%bcp_log_folder%" error
call :check_errors "Bigquery" "%bigquery_log_folder%" error
call :check_errors "Classification" "%classification_log_folder%" error
call :check_errors "Documentatie Generator" "%database_documentatie_generator_log_folder%" error
call :check_errors "Download" "%download_log_folder%" error
call :check_errors "Export" "%export_log_folder%" error
call :check_errors "Json Parser" "%json_parser_log_folder%" error
call :check_errors "LargeFileSplitter" "%large_file_splitter_log_folder%" error
call :check_errors "NPExtractorDB" "%noun_phrase_extractor_log_folder%" error
70 changes: 70 additions & 0 deletions functions/csv_analyze_data.bat
@@ -0,0 +1,70 @@
@echo off
:: =======================================================================================
:: Main
::: Use csv_analyzer_exe to analyze a csv file.
::: As this script is usually started as an asynchronous process using `start`,
::: it sends a signal when the process has finished.

:: Global variables
::: csv_analyzer_sample_lines: number of csv lines to use for type detection (optional)
::: csv_analyzer_output_columns: select expression for the columns written to the types file (optional)

:: Input variables
::: 1. input_file: path of the csv file to analyze
::: 2. output_file: path of the types file to write

:: Executables
::: csv_analyzer_exe
:: =======================================================================================
setlocal

set input_file=%~1
set output_file=%~2
set output_folder=%~dp2

call :check_variables 2 %*

echo analyze csv file: %input_file%
%csv_analyzer_exe% ^
--input_file "%input_file%" ^
--output_file "%output_file%" ^
%csv_analyzer_sample_lines_arg% ^
%csv_analyzer_output_columns_arg%

:: Send signal to waiting processes
call %functions_folder%\wait.bat :send %~f0

endlocal
goto:eof
:: =======================================================================================


:: =======================================================================================
:check_variables
:: =======================================================================================

:: Set functions_folder to location of this script
set functions_folder=%~dp0
:: Set programs_folder relative to the location of this script
set programs_folder=%~dp0\..\programs

:: Get executable paths
call %programs_folder%\executables.bat

:: Check number of input variables
call %functions_folder%\variable.bat :check_parameters %*

:: Validate input variables
call %functions_folder%\variable.bat :check_file input_file
call %functions_folder%\variable.bat :check_variable output_file
call %functions_folder%\variable.bat :create_folder output_folder

if defined csv_analyzer_sample_lines (
set csv_analyzer_sample_lines_arg=--sample_size %csv_analyzer_sample_lines%
)
if defined csv_analyzer_output_columns (
set csv_analyzer_output_columns_arg=--output_columns "%csv_analyzer_output_columns%"
)

goto:eof
:: =======================================================================================
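
The optional-argument handling in `:check_variables` above can be illustrated with a small sketch. It is written in Python purely for illustration (the tooling itself is batch). `build_csv_analyzer_args` is a hypothetical helper that is not part of ETL-tooling, and the sketch assumes that an empty variable counts as undefined, as it does for `if defined` in batch.

```python
import os


def build_csv_analyzer_args(input_file, output_file, env=None):
    """Return the argument list that would be passed to csv_analyzer_exe.

    Hypothetical helper: it only mirrors the flag names visible in
    csv_analyze_data.bat (--input_file, --output_file, --sample_size,
    --output_columns) and does not run anything.
    """
    env = os.environ if env is None else env
    args = ["--input_file", input_file, "--output_file", output_file]

    # Optional flags are added only when the global variable is set,
    # mirroring the `if defined` checks in :check_variables.
    sample_lines = env.get("csv_analyzer_sample_lines")
    if sample_lines:
        args += ["--sample_size", sample_lines]

    output_columns = env.get("csv_analyzer_output_columns")
    if output_columns:
        args += ["--output_columns", output_columns]

    return args


if __name__ == "__main__":
    print(build_csv_analyzer_args(
        "publication.tsv",
        "publication_types.tsv",
        {"csv_analyzer_sample_lines": "1000"},
    ))
```

When neither global variable is set, only the two required arguments are passed, which matches the batch script expanding the undefined `*_arg` variables to nothing.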
101 changes: 101 additions & 0 deletions functions/export_database.bat
@@ -0,0 +1,101 @@
:: =======================================================================================
:: Main
::: Export a database from sql server to tsv files

::: First run all scripts in export_sql_folder using the file name as table name
::: Then run the default export for all tables in the database for which there is
::: no exported tsv file yet.

:: global variables
::: server
::: export_table_include_header
::: export_table_include_types

:: input variables
::: 1. db_name: name of the database to export
::: 2. export_sql_folder: folder with sql files named after tables;
::: each file contains the query used to export that table
::: 3. output_folder: folder where the output files should be placed
::: 4. log_folder: log folder for this function
:: =======================================================================================
setlocal

set db_name=%~1
set export_sql_folder=%~2
set output_folder=%~3
set log_folder=%~4

call :check_variables 4 %*

set sqlcmd_exe=sqlcmd -S %server% -d %db_name% -E -m 1 -y0

echo Export database %db_name%

set "table_query=select table_name from information_schema.tables where table_schema = 'dbo' order by table_name"
call %sqlcmd_exe% -Q "set nocount on; %table_query%" -o "%output_folder%\table_export.conf"
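:: First export every table that has a custom query file in export_sql_folder
if exist "%export_sql_folder%" (
for /f "delims=" %%f in ('dir /b /ON "%export_sql_folder%\*.sql"') do (
call :export_table "%export_sql_folder%\%%f"
)
)
:: Then run the default export for the remaining tables
for /f "usebackq" %%t in ("%output_folder%\table_export.conf") do (
call :export_table "%%t"
)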

endlocal
goto:eof
:: =======================================================================================


:: =======================================================================================
:export_table
::: export the table if the exported table does not already exist
:: =======================================================================================
set table_or_file=%~1

if exist "%table_or_file%" (
for %%f in ("%table_or_file%") do set table_name=%%~nf
) else (
set table_name=%table_or_file%
)
set output_file=%output_folder%\%table_name%.tsv

if not exist "%output_file%" (
call %functions_folder%\export_table.bat ^
"%db_name%" ^
"%table_or_file%" ^
"%output_folder%" ^
"%log_folder%"
)
goto:eof
:: =======================================================================================


:: =======================================================================================
:check_variables
:: =======================================================================================

:: set functions_folder to location of this script
set functions_folder=%~dp0
:: set programs_folder relative to the location of this script
set programs_folder=%~dp0\..\programs

:: get executable paths
call %programs_folder%\executables.bat

:: check number of input parameters
call %functions_folder%\variable.bat :check_parameters %*

:: validate global variables
call %functions_folder%\variable.bat :check_variable server

:: validate input variables
call %functions_folder%\variable.bat :check_variable db_name
call %functions_folder%\variable.bat :check_variable export_sql_folder
call %functions_folder%\variable.bat :create_folder output_folder
call %functions_folder%\variable.bat :create_folder log_folder
call %functions_folder%\variable.bat :default_variable export_table_include_header false
call %functions_folder%\variable.bat :default_variable export_table_include_types false

goto:eof
:: =======================================================================================
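
The two-pass order described in the header of `export_database.bat` (custom query files first, then the default export for every table that has no tsv file yet) can be sketched as follows. This is an illustrative Python model, not the batch implementation: `plan_database_export` is a hypothetical name, the case-insensitive sort only approximates `dir /b /ON`, and names are compared case-insensitively on the assumption that the output folder is on a Windows file system.

```python
import ntpath


def plan_database_export(sql_files, table_names, existing_tsv=()):
    """Return (table_name, source) pairs in the order they would be exported.

    sql_files:    paths of query files found in export_sql_folder
    table_names:  table names returned by the information_schema query
    existing_tsv: table names that already have a tsv file in output_folder
    """
    done = {name.lower() for name in existing_tsv}
    plan = []

    # Pass 1: custom query files, sorted by name; the file name
    # without its extension is used as the table name.
    for sql_file in sorted(sql_files, key=str.lower):
        table = ntpath.splitext(ntpath.basename(sql_file))[0]
        if table.lower() not in done:
            plan.append((table, sql_file))
            done.add(table.lower())

    # Pass 2: default export for every table that has no tsv file yet,
    # which also skips the tables exported in pass 1.
    for table in table_names:
        if table.lower() not in done:
            plan.append((table, table))
            done.add(table.lower())

    return plan


if __name__ == "__main__":
    for table, source in plan_database_export(
        [r"C:\sql\pub.sql"], ["author", "pub", "source"], existing_tsv=["author"]
    ):
        print(table, "<-", source)
```

Because a table exported from a query file produces its tsv file in the first pass, the second pass leaves it alone, so a custom query always takes precedence over the default export.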
120 changes: 120 additions & 0 deletions functions/export_table.bat
@@ -0,0 +1,120 @@
:: =======================================================================================
:: Main
::: Export a table from sql server to a tsv file

:: global variables
::: server
::: export_table_include_header
::: export_table_include_types

:: input variables
::: 1. db_name: name of the database to query
::: 2. table_or_file: name of the table to export
::: or a query file with the sql statement that outputs a single table
::: 3. output_folder: folder where the output files should be placed
::: 4. log_folder: log folder for this function
:: =======================================================================================
setlocal

set db_name=%~1
set table_or_file=%~2
set output_folder=%~3
set log_folder=%~4

call :check_variables 4 %*

echo Export table %db_name%..%table_name% (%table_query_file%)

call %powershell_7_exe% "& %functions_folder%\export_table\export_table.ps1" ^
"-server %server%" ^
"-db_name %db_name%" ^
"-table_name %table_name%" ^
"-input_file %table_query_file%" ^
"-output_file %output_file%" ^
"-log_folder %log_folder%" ^
"%no_header_arg%" ^
"%verbose_arg%"

if "%export_table_include_types%" == "true" (
call %powershell_7_exe% "& %functions_folder%\export_table\export_table.ps1" ^
"-server %server%" ^
"-db_name %db_name%" ^
"-table_name %table_name%" ^
"-input_file %functions_folder%\export_table\export_types.sql" ^
"-output_file %types_file%" ^
"-log_folder %log_folder%" ^
"%verbose_arg%"
)
if "%export_table_include_types%" == "analyze" (
set csv_analyzer_output_columns='%table_name%' as table_name, column_name, mssql_type as data_type, max_length, is_nullable
call %functions_folder%\csv_analyze_data.bat ^
"%output_file%" ^
"%types_file%"
)

endlocal
goto:eof
:: =======================================================================================


:: =======================================================================================
:check_variables
:: =======================================================================================

:: set functions_folder to location of this script
set functions_folder=%~dp0
:: set programs_folder relative to the location of this script
set programs_folder=%~dp0\..\programs

:: get executable paths
call %programs_folder%\executables.bat

:: check number of input parameters
call %functions_folder%\variable.bat :check_parameters %*

:: validate global variables
call %functions_folder%\variable.bat :check_variable server

:: validate input variables
call %functions_folder%\variable.bat :check_variable db_name
call %functions_folder%\variable.bat :check_variable table_or_file
call %functions_folder%\variable.bat :create_folder output_folder
call %functions_folder%\variable.bat :create_folder log_folder
call %functions_folder%\variable.bat :default_variable export_table_include_header false
call %functions_folder%\variable.bat :default_variable export_table_include_types false

set types_output_folder=%output_folder%\types
if "%export_table_include_types%" == "true" (
call %functions_folder%\variable.bat :create_folder types_output_folder
)

if exist "%table_or_file%" (
rem Export by sql script mode
set table_query_file=%table_or_file%
for %%f in ("%table_or_file%") do set table_name=%%~nf
if "%export_table_include_types%" == "true" (
set export_table_include_types=analyze
)
) else (
rem Export table mode
set table_name=%table_or_file%
set table_query_file=%functions_folder%\export_table\export_table.sql
)
set output_file=%output_folder%\%table_name%.tsv
set types_file=%types_output_folder%\%table_name%_types.tsv

if "%verbose%" == "true" (
set verbose_arg=-Verbose
)
if "%export_table_include_header%" == "false" (
set no_header_arg=-NoHeader
)


call %functions_folder%\variable.bat :check_variable table_name
call %functions_folder%\variable.bat :check_file table_query_file
call %functions_folder%\variable.bat :check_variable output_file
call %functions_folder%\variable.bat :check_variable types_file

goto:eof
:: =======================================================================================
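
The mode selection in `:check_variables` of `export_table.bat` can be summarised in a short sketch. It is a Python illustration under stated assumptions, not the batch code itself: `resolve_export` and its `exists` parameter are hypothetical, `ntpath` is used so that Windows paths are handled the same way on any platform, and `exists` stands in for the batch `if exist` test.

```python
import ntpath
import os


def resolve_export(table_or_file, output_folder, functions_folder,
                   include_types="false", exists=os.path.exists):
    """Derive table name, query file and output paths for one export."""
    if exists(table_or_file):
        # Script mode: the argument is a query file and the table name
        # is the file name without its extension.
        query_file = table_or_file
        table_name = ntpath.splitext(ntpath.basename(table_or_file))[0]
        # Column types of an arbitrary query cannot be looked up per table,
        # so "true" is switched to analyzing the exported tsv file.
        if include_types == "true":
            include_types = "analyze"
    else:
        # Table mode: the argument is a table name and the default query is used.
        table_name = table_or_file
        query_file = ntpath.join(functions_folder, "export_table", "export_table.sql")

    return {
        "table_name": table_name,
        "query_file": query_file,
        "output_file": ntpath.join(output_folder, table_name + ".tsv"),
        "types_file": ntpath.join(output_folder, "types", table_name + "_types.tsv"),
        "include_types": include_types,
    }


if __name__ == "__main__":
    print(resolve_export("pub", r"C:\out", r"C:\etl\functions", "true",
                         exists=lambda path: False))
```

Passing `exists` as a parameter keeps the sketch free of file system access, so both modes can be exercised without creating files.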