Development Reference Notes

Core assumptions:

  • The user will use at most 8 nodes (this number was chosen because it is the most manageable for now).
  • Data being sent into the pipeline is pre-cleaned, has no headers, and there is no categorical data.
  • The target variable is the last column of the data file (index position list[-1]).
  • The data sets being sent into the pipeline are in tabular CSV format (a loading sketch based on these assumptions follows this list).
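
For reference, a minimal sketch of loading a data set that satisfies these assumptions (the file name and variable names are hypothetical):

# Headerless, tabular CSV with purely numerical features; the target variable
# sits in the last column. "example.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("data/example.csv", header=None)
features = df.iloc[:, :-1]   # every column except the last
target = df.iloc[:, -1]      # last column holds the target variable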

Preprocessing Stage Reference:

Manager Node:

  • Manager takes in the macro XML file (a parsing sketch follows this list).
    • Parses the macro file using BeautifulSoup.
    • Converts the macro into a job control dictionary split into three sections: train, attack, and clean.
  • After building the job control dictionary, splits it into its three control sections: train, attack, and clean.
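
A rough sketch of this step. The tag names <train>, <attack>, and <clean> are assumptions; only the three-section split is documented above:

from bs4 import BeautifulSoup

def process_macro(macro_path):
    # Parse the macro XML file; the "xml" parser requires lxml to be installed.
    with open(macro_path) as fin:
        soup = BeautifulSoup(fin.read(), "xml")

    # Split the macro into its three job control sections. The tag names
    # <train>, <attack>, and <clean> are assumed here.
    return {section: soup.find(section) for section in ("train", "attack", "clean")}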

Worker Nodes:

  • Waits for the greenlight from the manager node before beginning (a worker-side sketch follows this list).
    • 0 means no-go: abort.
    • 1 means go-ahead: do not abort.
  • Log that greenlight was received from the manager node.
  • Create directory data/.logs/$worker-# for storing log files.
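
A rough sketch of this handshake, assuming the nodes communicate over MPI via mpi4py (the notes above do not name the communication layer):

import logging
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Receive the greenlight broadcast by the manager (rank 0): 0 = no-go/abort, 1 = go-ahead.
greenlight = comm.bcast(None, root=0)

# Per-worker log directory, e.g. data/.logs/worker-1.
log_dir = os.path.join("data", ".logs", f"worker-{rank}")
os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(filename=os.path.join(log_dir, "preprocess.log"), level=logging.INFO)
logging.info("Greenlight %d received from the manager node.", greenlight)

if greenlight == 0:
    raise SystemExit("Manager signaled abort.")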

Training Stage Reference:

Core notes:

  • Format of task directive:
["dataset_name", "dataset_path", "model_name", "algorithm", {"algorithm": "parameter_as_string"}, "path_to_algorithm_plugin", "data_manip_name", "datamanip_tag", {"datamanip_parameters": "as_a_string"}]
  • Positional values of the task list (an illustrative unpacking appears after this list):
    • 0: dataset_name
    • 1: dataset_path
    • 2: model_name
    • 3: algorithm
    • 4: algorithm_parameters
    • 5: absolute_path_to_algorithm_plugin
    • 6: data_manipulation_name
    • 7: data_manipulation_tag
    • 8: data_manipulation_parameters
  • If there are more nodes than directives, numpy will send an empty list [] to the extra worker nodes.
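
An illustrative unpacking of one task directive into named fields, following the positional layout above (the directive values shown are hypothetical):

# A hypothetical directive; indices 0-8 follow the positional layout above.
directive = [
    "iris", "data/iris/iris.csv", "iris-svc", "svc",
    {"svc": "kernel='rbf', C=1.0"}, "/plugins/svc.py",
    "pca", "pca-2", {"n_components": "2"},
]

(dataset_name, dataset_path, model_name, algorithm, algorithm_params,
 plugin_path, manip_name, manip_tag, manip_params) = directive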

Manager Node:

  • Broadcast to the worker nodes whether the training stage is being skipped or run (a sketch of the full flow follows this list).
    • 0 means do not skip; continue with the training stage.
    • 1 means skip the training stage and go to the attack stage.
  • Take the train dictionary and unwrap it into a more easily processed dictionary.
  • Verify the data set can be located. If the pipeline cannot find the data set, raise a FileNotFoundError exception.
  • Create the directory path data/$dataset_name/models.
  • Send unwrapped dictionary to scattershot.generate_train() to create directive list for worker nodes.
  • Send directive list to scattershot.slice() to slice up directive list amongst available worker nodes.
  • Send task list to workers using scattershot.delegate().
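
A rough sketch of this flow. scattershot.generate_train(), scattershot.slice(), and scattershot.delegate() are named above, but the signatures used here (and the mpi4py communicator) are assumptions:

import os

import scattershot  # the pipeline's own delegation helpers

def training_stage(comm, train_dict, num_workers):
    # 0 = proceed with the training stage, 1 = skip ahead to the attack stage.
    comm.bcast(0, root=0)

    # train_dict is assumed to already be unwrapped into a flat, per-data-set form.
    for dataset_name, job in train_dict.items():
        dataset_path = job["dataset_path"]  # hypothetical key name

        # Verify the data set can be located before delegating any work.
        if not os.path.isfile(dataset_path):
            raise FileNotFoundError(f"Cannot find data set at {dataset_path}")

        # Directory the trained models will be written to.
        os.makedirs(os.path.join("data", dataset_name, "models"), exist_ok=True)

    # Build the directive list, slice it amongst the workers, and send it out.
    directives = scattershot.generate_train(train_dict)
    sliced = scattershot.slice(directives, num_workers)
    scattershot.delegate(comm, sliced)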

Worker Nodes:

  • Log whether or not it is performing the training stage.
  • Log task list received from the manager node.
  • If the task list received from the manager is empty, return status 1 to the manager. This informs the manager that the worker is good to move on to the next stage.
  • Once the task list is received, perform the user-specified manipulations on the data. The currently available manipulation options are the following:
    • XGBoost
    • Random Forest
    • Principal Component Analysis
    • Candlestick
    • None <- This option is reserved for users who just want to tune the hyperparameters.
  • Save a copy of the manipulation if the user so desires. This option can be removed from custom installations if storage space is limited.
  • Create the parameter dictionary to send to the plugin (a worker-side sketch that assembles this dictionary follows). The parameter dictionary will look like the following:
{
    "dataset_name": "/path/to/dataset",
    "model_name": "mymodelsname",
    "dataframe": pandas.DataFrame,
    "model_params": {},
    "save_path": "/path/to/save/model/to/",
    "log_path": "path/to/log/model/statistics",
    "manip_info": ("manip_name", "manip_tag")
}
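
A rough sketch of the worker-side training step. Only the directive layout and the parameter dictionary shape come from the notes above; the manipulation stub, the plugin's train() entry point, and the exact save/log paths are assumptions:

import importlib.util

import pandas as pd

def apply_manipulation(dataframe, manip_name, manip_params):
    # Placeholder for the user-specified manipulation (XGBoost, Random Forest,
    # Principal Component Analysis, Candlestick, or None); elided here.
    return dataframe

def handle_task_list(comm, task_list):
    if not task_list:
        # Empty task list: return status 1 so the manager knows this worker
        # is ready to move on to the next stage.
        comm.send(1, dest=0)
        return

    for directive in task_list:
        (dataset_name, dataset_path, model_name, algorithm, algorithm_params,
         plugin_path, manip_name, manip_tag, manip_params) = directive

        dataframe = pd.read_csv(dataset_path, header=None)
        dataframe = apply_manipulation(dataframe, manip_name, manip_params)

        # Parameter dictionary handed off to the algorithm plugin.
        params = {
            "dataset_name": dataset_path,
            "model_name": model_name,
            "dataframe": dataframe,
            "model_params": algorithm_params,            # {"algorithm": "parameter_as_string"}
            "save_path": f"data/{dataset_name}/models/",  # assumed layout
            "log_path": f"data/.logs/{model_name}",       # assumed layout
            "manip_info": (manip_name, manip_tag),
        }

        # Load the plugin from its absolute path and hand it the parameters.
        spec = importlib.util.spec_from_file_location("algorithm_plugin", plugin_path)
        plugin = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(plugin)
        plugin.train(params)  # the train() entry point is an assumption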

Attack Stage Reference:

Currently focusing on developing the training stage, but I have a general idea of how this will work.

Clean Stage Reference:

Once models start getting trained and data starts coming back, I can start the prototyping of this stage!