A customizable workflow to auto-archive large datasets to DME object storage. This standalone application can be installed and run on any Windows/Linux client machine from which the data to be archived is accessible. The collections and data sets to be archived are uploaded to the NCI Data Management Environment (DME) at a scheduled interval or as a one-time archival. Fault tolerance and multi-threading are built in to achieve reliability and high throughput. The workflow can be customized to derive the DME archival path and to extract the collection and data-object metadata supplied to DME, based on the folder structure, the file naming convention, and/or user-provided mapping data.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See the Deployment section for notes on how to deploy the project on a client machine.
Java 8, Maven, Git
The following steps describe how to get a development environment running.
To check out the project from git, do:
$ git clone https://github.com/CBIIT/HPC_DME_APIs
$ git clone https://github.com/CBIIT/dme-archival-workflow
Download ojdbc6 from the group com.oracle.database.jdbc (version 11.2.0.4) and install it into your local Maven repository.
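Once the jar is downloaded, the maven-install-plugin can install it; a minimal sketch, assuming the driver file is saved as ojdbc6.jar in the current directory:

$ mvn install:file -DgroupId=com.oracle.database.jdbc -DartifactId=ojdbc6 \
    -Dversion=11.2.0.4 -Dpackaging=jar -Dfile=ojdbc6.jar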
Navigate to the project and build:
$ cd HPC_DME_APIs/src
$ mvn -pl "hpc-server/hpc-domain-types,hpc-server/hpc-common,hpc-server/hpc-dto" clean install
$ cd dme-archival-workflow
$ mvn clean install -DskipTests
Run the following script to generate a token:
$ sh dme-sync-generate-token.sh
You will be prompted for your username and password and which environment you will be running the workflow against.
env=[dev|uat|prod|prod2|prod3|prod4|prod_bp]
Which prod configuration to pick depends on the DME API server you are connecting to.
The application uses an Oracle database:
To access the database, use a tool such as DataGrip.
In src/main/resources/application.properties, update the following property with the correct password:
spring.datasource.password=<updated here>
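For context, this property sits alongside the other standard Spring Boot datasource keys; a minimal sketch with placeholder values (the real values belong in src/main/resources/application.properties):

# Placeholder values for illustration only
spring.datasource.url=jdbc:oracle:thin:@<db-host>:1521/<service-name>
spring.datasource.username=<db-user>
spring.datasource.password=<updated here>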
Run the application locally:
Check whether the dme-sync-<version>.jar entry in dme-sync.sh matches the version built. If not, update it.
$ sh dme-sync.sh
To verify that the application is running, open the browser and access the default app page (Work in progress):
http://localhost:8888/home
These instructions cover how to deploy the application on a client machine. Before installing it on the remote machine, application.properties must be configured properly; see the Customizing the Workflow section.
Upload the jar, the bash scripts, and the configuration file (application.properties) to the client machine:
$ scp target/dme-sync-<version>.jar user@remotemachine:~/
$ scp dme-sync-generate-token.sh user@remotemachine:~/
$ scp dme-sync.sh user@remotemachine:~/
$ scp application.properties user@remotemachine:~/
Log into the remote machine and start the application:
$ ssh user@remotemachine
$ sh dme-sync-generate-token.sh
$ sh dme-sync.sh
On your local machine, perform the following to tunnel local port 8887 to port 8888 on the remote machine:
$ ssh -L 8887:localhost:8888 user@remotemachine
Now the default app page can be accessed from your local browser:
http://localhost:8887/home
The following properties can be set in the application.properties file:

- dmesync.source.base.dir=<dir>
  - The app will scan the specified directory.
- dmesync.source.base.dir.folders=<comma separated list of folders>
  - If specified, the app will only scan these folders under the base dir.
- dmesync.work.base.dir=<dir>
  - The app will use this as a working directory for tar/untar and compression.
- dmesync.destination.base.dir=<collection>
  - The base collection path in DME.
- dmesync.doc.name=[hitif|cmm|default]
  - The doc name used by the custom business logic to build the DME path, the collection metadata, and the metadata for the data object.
  - Default: default
- dmesync.noscan.rerun=[true|false]
  - If true, instead of scanning for files under the base dir, it will reprocess files from the database.
  - Default: false
- dmesync.tar=[true|false]
  - If true, it will tar the collection specified by dmesync.preprocess.depth from dmesync.source.base.dir.
  - Default: false
- dmesync.tar.exclude.folder=[folder1,folder2]
  - If specified, folders matching the folder names in the comma-separated list will be excluded from the tar ball.
- dmesync.untar=[true|false]
  - If true, it will untar the collection specified by dmesync.preprocess.depth from dmesync.source.base.dir.
  - Default: false
- dmesync.compress=[true|false]
  - If true, the data object will be compressed prior to upload.
  - Default: false
- dmesync.preprocess.depth=[1,2,...|-1]
  - If dmesync.tar or dmesync.untar is true, dmesync.preprocess.depth determines the collection to tar or untar.
  - If -1 is specified with the tar option, it will tar the leaf folders only.
  - For example, with dmesync.source.base.dir = /home/user/instrument:

        instrument              # Depth 0
        ├── pi_a                # Depth 1
        │   ├── project1        # Depth 2
        │   │   ├── sample1     # Depth 3
        │   │   └── sample2     # Depth 3
        │   └── project2        # Depth 2
        └── pi_b                # Depth 1
            ├── project1        # Depth 2
            │   ├── sample1     # Depth 3
            │   └── sample2     # Depth 3
            └── project2        # Depth 2
- dmesync.exclude.pattern=[glob]
  - If specified, objects matching the pattern will be excluded.
  - Multiple comma-separated patterns can be specified.
  - If both an include and an exclude pattern apply to a file/folder, the file/folder will be excluded, since exclusion takes precedence.
  - For example: **/pi_b/** will exclude any file/folder that has a pi_b folder in its path.
- dmesync.include.pattern=[glob]
  - If specified, only objects matching the pattern will be included.
  - Multiple comma-separated patterns can be specified.
  - For example: **/pi_a/** will include files/folders that have a pi_a folder in their path. instrument/**/project1/** will include files/folders that start with the instrument folder directly under the base dir and have a project1 folder after any parent folder.
- dmesync.dryrun=[true|false]
  - If true, only records the files to be processed in the local DB without running the workflow.
  - Default: false
- dmesync.cleanup=[true|false]
  - If true, the tar file created under dmesync.work.base.dir will be removed upon successful upload.
  - Default: false
- dmesync.verify.prev.upload=[none|local]
  - If none, it does not check whether a file has previously been uploaded.
  - If local, it will check the local DB and skip files that have previously been uploaded.
  - Default: none
- dmesync.cron.expression=[cron expression]
  - This expression is not used if the dmesync.run.once.and.shutdown flag is true.
  - For example:

        0 0/5 * * * ?   // every 5 minutes
        0 0 0 1 1 ?     // Jan 1st of the year

    Format: Sec Min Hour Day Mon Day-of-week (SUN-SUN: 0-7)
- dmesync.run.once.and.shutdown=[true|false]
  - If true, the application will shut down once the run has completed, regardless of any failures.
  - Also see dmesync.run.once.run_id.
  - Default: false
- dmesync.run.once.run_id=<run id>
  - If dmesync.run.once.and.shutdown=true, the user must supply a unique run id for the run.
  - Recommended run id format: Run_YYYYMMDDHHMISS
- dmesync.last.modified.days=[1,2,...]
  - If specified, a file/folder whose modified date falls within the specified number of days will not be archived.
- dmesync.replace.modified.files=[true|false]
  - If true, the system will compare the modified date against the last upload and re-upload the file if it has been modified.
- dmesync.tar.file.exist=<filename>
  - If specified, it will check whether a file with the specified name exists before the tar operation is performed.
- dmesync.tar.file.exist.ext=<ext>
  - If specified, it will check whether a file with the specified extension exists before the tar operation is performed.
- dmesync.file.exist.under.basedir=[true|false]
  - Checks whether the marker file specified in dmesync.tar.file.exist.ext is directly under the base directory.
- dmesync.file.exist.under.basedir.depth=[1,2,...]
  - If specified, it will check for the file at the specified depth under the base dir.
- dmesync.admin.emails=<comma separated email addresses>
  - Once a run completes, the run result will be emailed to these addresses.
- dmesync.additional.metadata.excel=<file path to the metadata file>
  - If specified, the application will load the custom metadata Excel file supplied by the user.
- spring.main.web-environment=[true|false]
  - If true, enables the web environment.
Optionally, override the system defaults for concurrent file processing with the following parameters.

- Number of threads to process files concurrently:

      spring.jms.listener.concurrency=<min number of threads>
      spring.jms.listener.max-concurrency=<max number of threads>

- Number of threads to upload multi-part upload file parts concurrently:

      dmesync.multipart.threadpoolsize=<number of threads>
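Putting several of these properties together, a hypothetical application.properties for a nightly tar-and-upload run might look like the sketch below; every path, pattern, address, and the schedule are illustrative placeholders, not defaults:

# Illustrative values only
dmesync.source.base.dir=/data/instrument
dmesync.destination.base.dir=/Example_Archive
dmesync.work.base.dir=/data/work
dmesync.doc.name=default
# Tar each depth-2 (project-level) folder before upload
dmesync.tar=true
dmesync.preprocess.depth=2
# Exclude everything under any pi_b folder
dmesync.exclude.pattern=**/pi_b/**
# Run nightly at 1 AM
dmesync.cron.expression=0 0 1 * * ?
# Remove the tar files from the work dir after a successful upload
dmesync.cleanup=true
dmesync.admin.emails=admin@example.com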
The DOC-specific metadata entries and collection name mapping can be inserted into the mapping tables. Please see the sample data.sql provided for supplying custom metadata mapping and collection name mapping. Refer to the Required mapping for customized DOC section for further configuration of existing DOCs.
collection_name_mapping:

id | collection_type | map_key | map_value |
---|---|---|---|
1 | PI | pi_a | PI_A_NAME |
2 | PI | pi_b | PI_B_NAME |
metadata_mapping:

id | collection_type | collection_name | map_key | map_value |
---|---|---|---|---|
1 | PI | PI_A_NAME | full_name | Jane Doe |
2 | PI | PI_A_NAME | email | someemail@address |
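To load such rows, data.sql can hold plain INSERT statements; a minimal sketch, assuming the table and column names match the headers shown above:

INSERT INTO collection_name_mapping (id, collection_type, map_key, map_value)
VALUES (1, 'PI', 'pi_a', 'PI_A_NAME');
INSERT INTO metadata_mapping (id, collection_type, collection_name, map_key, map_value)
VALUES (1, 'PI', 'PI_A_NAME', 'full_name', 'Jane Doe');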
This feature is only available if you are running the workflow with the GROUP_ADMIN role. Once the collections/data objects are archived, if a user permission (READ, WRITE, OWN) or a bookmark needs to be added, it can be specified in the following table. The entries can be loaded by including them in the same data.sql file used for the metadata entries.
id | path | user_id | permission | create_bookmark | created | error |
---|---|---|---|---|---|---|
1 | /DME/path1 | usera | READ | Y | "Y" if created | Error if any |
2 | /DME/path2 | userb | WRITE | N | "Y" if created | Error if any |
In the example table above, for id 1, READ permission will be granted to usera on /DME/path1, and a bookmark called "path1" will be created for usera pointing to /DME/path1. For id 2, WRITE permission will be granted to userb on /DME/path2, and no bookmark will be created.
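As a sketch, such a row could also be seeded from data.sql; the table name permission_bookmark below is an assumption, so check the provided data.sql for the actual name:

-- Table name is an assumption; columns follow the headers above
INSERT INTO permission_bookmark (id, path, user_id, permission, create_bookmark)
VALUES (1, '/DME/path1', 'usera', 'READ', 'Y');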
From the web interface, you can export any run into an Excel (xlsx) file. The file will be generated in your application log directory and emailed to the address specified in the application.properties file. If the runId is not specified, the latest run will be exported.
http://localhost:8888/export
http://localhost:8888/export/{runId}
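For example, an export for a specific run can also be triggered from the command line (the run id below is illustrative):

$ curl http://localhost:8888/export/Run_20240101120000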
- Spring Boot - Framework used
- Maven - Dependency Management
For license details for this project, see the LICENSE.txt file.
The following information is required in the collection_name_mapping.

- PI collection_type, key, and value (the key is the user folder; the value is the PI collection in DME to map to)
- User collection_type, key, and value (the key is the user folder; the value is the User collection in DME to map to)
The following information is required in the metadata_mapping.

- For PI collection_type:
  - collection_type: PI
  - pi_name
  - pi_email
  - institute
  - lab
  - branch
- For User collection_type:
  - collection_type: User
  - name
  - branch
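As a sketch, the required PI entries above could be seeded from data.sql like so (the collection name and values are placeholders, and the remaining keys follow the same pattern):

INSERT INTO metadata_mapping (id, collection_type, collection_name, map_key, map_value)
VALUES (1, 'PI', 'PI_A_NAME', 'pi_name', 'Jane Doe');
INSERT INTO metadata_mapping (id, collection_type, collection_name, map_key, map_value)
VALUES (2, 'PI', 'PI_A_NAME', 'pi_email', 'someemail@address');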
The following information is required in the metadata_mapping.

- For PI_Lab collection_type:
  - collection_type: PI_Lab
  - pi_name
  - affiliation
  - pi_id
- For Project collection_type:
  - collection_type: Project
  - project_name
  - project_number
  - start_date
  - method
  - description
  - publications