Shariq edited this page Apr 15, 2015 · 14 revisions

Overview & Process

The Python-based ingestion script is an overhaul of the original BASH scripts, which were used to generate and execute the commands that send data to the queues EDEX pulls from.

When run, the script first creates a locally stored log file specific to the task. The file name includes the task name and a date-time stamp. The log records the exact arguments passed on the command line and every action the script attempts and completes.
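The log setup described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function names, the `from_csv` task name used in the usage note, the timestamp format, and the logger name are all assumptions made for illustration.

```python
import logging
import os
from datetime import datetime


def build_log_path(task_name, log_dir=".", now=None):
    """Return a log file path that embeds the task name and a timestamp."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    return os.path.join(log_dir, "{0}_{1}.log".format(task_name, stamp))


def create_task_logger(task_name, argv, log_dir="."):
    """Create a file logger for one task and record the command-line arguments."""
    path = build_log_path(task_name, log_dir)
    logger = logging.getLogger("ingest.{0}".format(task_name))
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Log exactly what was passed on the command line.
    logger.info("Arguments: %s", " ".join(argv))
    return logger, path
```

Calling `create_task_logger("from_csv", sys.argv)` at startup would give every later step a single logger to report attempted and completed actions to.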

When the script is run, it creates an Ingestor object; this object in turn creates a ServiceManager object, which is linked to the Ingestor. The Ingestor is responsible for parsing the input data and for generating and executing ingestsender commands. The ServiceManager is responsible for monitoring the EDEX system and associated services (such as Postgres and QPIDD) and recovering from crashes if necessary.
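The relationship between the two objects can be sketched as below. Only the class names `Ingestor` and `ServiceManager` come from this page; every attribute and method name, the service list, and the placeholder health check are assumptions for illustration, and the recovery logic is deliberately left out.

```python
class ServiceManager(object):
    """Monitors the services the ingestion depends on."""

    def __init__(self, ingestor, services=("edex", "postgres", "qpidd")):
        self.ingestor = ingestor        # link back to the owning Ingestor
        self.services = list(services)  # assumed service identifiers

    def is_running(self, service):
        # Placeholder: a real check would query the service's status.
        return True

    def all_running(self):
        return all(self.is_running(s) for s in self.services)


class Ingestor(object):
    """Holds the command queue and owns a ServiceManager."""

    def __init__(self):
        self.queue = []
        self.failed = []
        self.service_manager = ServiceManager(self)

    def run(self, execute):
        """Execute queued commands, checking the services before each one."""
        sent = 0
        while self.queue:
            if not self.service_manager.all_running():
                break  # the real script would attempt recovery here
            execute(self.queue.pop(0))
            sent += 1
        return sent
```

Passing the `execute` callable in from outside keeps the sketch testable without a live EDEX system; the actual script runs ingestsender commands at that point.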

When the ServiceManager is created, the EDEX log files are processed using zgrep to create new log files that only contain "Finished" messages. These log files are used during the ingestion process to determine whether or not a file has already been ingested.
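The script itself uses zgrep for this step. For illustration only, the same filtering can be sketched in pure Python with the standard `gzip` module; the function name, the file names, and the assumption that each relevant log line contains the literal text "Finished" are hypothetical.

```python
import gzip


def extract_finished(src_path, dest_path, marker="Finished"):
    """Copy only the lines containing `marker` from a log into a new file.

    Like zgrep, this reads both gzip-compressed and plain log files.
    Returns the number of lines kept.
    """
    opener = gzip.open if src_path.endswith(".gz") else open
    needle = marker.encode("ascii")
    kept = 0
    with opener(src_path, "rb") as src, open(dest_path, "wb") as dest:
        for line in src:
            if needle in line:
                dest.write(line)
                kept += 1
    return kept
```

The reduced log is much smaller than the original, which keeps the later "already ingested?" lookups cheap.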

Once the EDEX logs have been processed, the script begins executing the task specified on the command line.

Tasks

The most common task is to ingest data specified by a CSV file. These are the steps it follows:

  1. The Ingestor parses the CSV and loads all of the parameters on each line into its own Ingestor queue.
  2. This queue loading function finds the files that match the file mask for each line. If no files are found, the parameter set is treated as failed and tracked in a separate queue.
  3. The files are filtered by applying age and date filters (if specified).
  4. The files are checked against the processed EDEX logs using a two-tiered search. First the file mask is checked against the logs; if no matches are found, all files matching the mask are queued. If any matches are found, each file is checked individually against those matches.
  5. The Ingestor writes out the ingestsender commands to a commands log file.
  6. The Ingestor begins iterating through its own queue to perform the ingestion.
  7. Before each command is executed, the Ingestor's ServiceManager checks to see if all of the EDEX services are running.
  8. After the Ingestor's queue is emptied out, it writes out a new CSV file with all of the failed parameter sets.
  9. A message is sent to the notification email address stating that the ingestion script has finished sending commands to the QPID queues.
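Steps 1, 2, 4, and 8 above can be sketched as follows (Python 3). This is an illustration under stated assumptions, not the script's code: the CSV column name `file_mask`, the function names, and the idea that a "Finished" log line contains the full path of the ingested file are all hypothetical, as is the choice to skip files that already appear in the logs.

```python
import csv
import fnmatch
import glob


def load_queue(csv_path):
    """Steps 1-2: one parameter set per CSV row; rows with no files fail."""
    queue, failed = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            files = sorted(glob.glob(row["file_mask"]))  # assumed column name
            if files:
                queue.append((row, files))
            else:
                failed.append(row)
    return queue, failed


def files_to_ingest(file_mask, files, finished_lines):
    """Step 4: two-tiered search against the processed EDEX log lines."""
    # Tier 1: does any log line match the file mask at all?
    pattern = "*" + file_mask + "*"
    hits = [ln for ln in finished_lines if fnmatch.fnmatchcase(ln, pattern)]
    if not hits:
        return list(files)  # nothing seen before: queue every file
    # Tier 2: check each file only against the lines that matched the mask.
    return [f for f in files if not any(f in ln for ln in hits)]


def write_failed(failed, out_path, fieldnames):
    """Step 8: write the failed parameter sets to a new CSV file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(failed)
```

The point of the first tier is that a single pass over the log by mask is far cheaper than one lookup per file; the per-file check only runs when the mask has been seen before.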

TODO

Everything except config.yml parsing and email notifications is in a single file, which is starting to get cumbersome. The different parts of the application need to be broken out into separate modules for better organization.
