Development Reference Notes

Core assumptions:

  • The user will use at most 8 nodes (this number was chosen because it is the most manageable for now).
  • Data being sent into the pipeline is pre-cleaned, has no headers, and there is no categorical data.
  • The target variable is the last column of the data file (index position list[-1]).
  • The data sets being sent into the pipeline are in tabular CSV format (a loading sketch based on these assumptions follows this list).
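
For reference, a minimal sketch of loading a data set that satisfies these assumptions (the file name and variable names are hypothetical):

# Headerless, tabular CSV with purely numerical features; the target variable
# sits in the last column. "example.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("data/example.csv", header=None)
features = df.iloc[:, :-1]   # every column except the last
target = df.iloc[:, -1]      # last column holds the target variable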

Preprocessing Stage Reference:

Manager Node:

  • Manager takes in the macro XML file (a parsing sketch follows this list).
    • Parses the macro file using BeautifulSoup.
    • Converts the macro into a job control dictionary split into three sections: train, attack, and clean.
  • After building the job control dictionary, splits it into its three control sections: train, attack, and clean.
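
A rough sketch of this step. The tag names <train>, <attack>, and <clean> are assumptions; only the three-section split is documented above:

from bs4 import BeautifulSoup

def process_macro(macro_path):
    # Parse the macro XML file; the "xml" parser requires lxml to be installed.
    with open(macro_path) as fin:
        soup = BeautifulSoup(fin.read(), "xml")

    # Split the macro into its three job control sections. The tag names
    # <train>, <attack>, and <clean> are assumed here.
    return {section: soup.find(section) for section in ("train", "attack", "clean")}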

Worker Nodes:

  • Waits for the greenlight from the manager node before beginning (a worker-side sketch follows this list).
    • 0 means no-go: abort.
    • 1 means go-ahead: do not abort.
  • Log that greenlight was received from the manager node.
  • Create directory data/.logs/$worker-# for storing log files.
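
A rough sketch of this handshake, assuming the nodes communicate over MPI via mpi4py (the notes above do not name the communication layer):

import logging
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Receive the greenlight broadcast by the manager (rank 0): 0 = no-go/abort, 1 = go-ahead.
greenlight = comm.bcast(None, root=0)

# Per-worker log directory, e.g. data/.logs/worker-1.
log_dir = os.path.join("data", ".logs", f"worker-{rank}")
os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(filename=os.path.join(log_dir, "preprocess.log"), level=logging.INFO)
logging.info("Greenlight %d received from the manager node.", greenlight)

if greenlight == 0:
    raise SystemExit("Manager signaled abort.")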

Training Stage Reference:

Core notes:

  • Format of task directive:
["dataset_name", "dataset_path", "model_name", "algorithm", {"algorithm": "parameter_as_string"}, "path_to_algorithm_plugin", "data_manip_name", "datamanip_tag", {"datamanip_parameters": "as_a_string"}]
  • Positional values of the task list (an illustrative unpacking appears after this list):
    • 0: dataset_name
    • 1: dataset_path
    • 2: model_name
    • 3: algorithm
    • 4: algorithm_parameters
    • 5: absolute_path_to_algorithm_plugin
    • 6: data_manipulation_name
    • 7: data_manipulation_tag
    • 8: data_manipulation_parameters
  • If there are more nodes than directives, numpy will send an empty list [] to the extra worker nodes.
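
An illustrative unpacking of one task directive into named fields, following the positional layout above (the directive values shown are hypothetical):

# A hypothetical directive; indices 0-8 follow the positional layout above.
directive = [
    "iris", "data/iris/iris.csv", "iris-svc", "svc",
    {"svc": "kernel='rbf', C=1.0"}, "/plugins/svc.py",
    "pca", "pca-2", {"n_components": "2"},
]

(dataset_name, dataset_path, model_name, algorithm, algorithm_params,
 plugin_path, manip_name, manip_tag, manip_params) = directive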

Manager Node:

  • Broadcast to the worker nodes whether the training stage is being skipped or run (a sketch of the full flow follows this list).
    • 0 means do not skip; continue with the training stage.
    • 1 means skip the training stage and go to the attack stage.
  • Take the train dictionary and unwrap it into a more easily processed dictionary.
  • Verify the data set can be located. If the pipeline cannot find the data set, raise a FileNotFoundError exception.
  • Create the directory path data/$dataset_name/models.
  • Send unwrapped dictionary to scattershot.generate_train() to create directive list for worker nodes.
  • Send directive list to scattershot.slice() to slice up directive list amongst available worker nodes.
  • Send task list to workers using scattershot.delegate().
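
A rough sketch of this flow. scattershot.generate_train(), scattershot.slice(), and scattershot.delegate() are named above, but the signatures used here (and the mpi4py communicator) are assumptions:

import os

import scattershot  # the pipeline's own delegation helpers

def training_stage(comm, train_dict, num_workers):
    # 0 = proceed with the training stage, 1 = skip ahead to the attack stage.
    comm.bcast(0, root=0)

    # train_dict is assumed to already be unwrapped into a flat, per-data-set form.
    for dataset_name, job in train_dict.items():
        dataset_path = job["dataset_path"]  # hypothetical key name

        # Verify the data set can be located before delegating any work.
        if not os.path.isfile(dataset_path):
            raise FileNotFoundError(f"Cannot find data set at {dataset_path}")

        # Directory the trained models will be written to.
        os.makedirs(os.path.join("data", dataset_name, "models"), exist_ok=True)

    # Build the directive list, slice it amongst the workers, and send it out.
    directives = scattershot.generate_train(train_dict)
    sliced = scattershot.slice(directives, num_workers)
    scattershot.delegate(comm, sliced)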

Worker Nodes:

  • Log whether or not it is performing the training stage.
  • Log task list received from the manager node.
  • If the task list received from the manager is empty, return status 1 to the manager. This informs the manager that the worker is good to move on to the next stage.
  • Once the task list is received, perform the user-specified manipulations on the data. The currently available manipulation options are the following:
    • XGBoost
    • Random Forest
    • Principal Component Analysis
    • Candlestick
    • None <- This option is reserved for users who just want to tune the hyperparameters.
  • Save a copy of the manipulation if the user so desires. This option can be removed from custom installations if storage space is limited.
  • Create the parameter dictionary to send to the plugin (a worker-side sketch that assembles this dictionary follows). The parameter dictionary will look like the following:
{
    "dataset_name": "/path/to/dataset",
    "model_name": "mymodelsname",
    "dataframe": pandas.DataFrame,
    "model_params": {},
    "save_path": "/path/to/save/model/to/",
    "log_path": "path/to/log/model/statistics",
    "manip_info": ("manip_name", "manip_tag")
}
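
A rough sketch of the worker-side training step. Only the directive layout and the parameter dictionary shape come from the notes above; the manipulation stub, the plugin's train() entry point, and the exact save/log paths are assumptions:

import importlib.util

import pandas as pd

def apply_manipulation(dataframe, manip_name, manip_params):
    # Placeholder for the user-specified manipulation (XGBoost, Random Forest,
    # Principal Component Analysis, Candlestick, or None); elided here.
    return dataframe

def handle_task_list(comm, task_list):
    if not task_list:
        # Empty task list: return status 1 so the manager knows this worker
        # is ready to move on to the next stage.
        comm.send(1, dest=0)
        return

    for directive in task_list:
        (dataset_name, dataset_path, model_name, algorithm, algorithm_params,
         plugin_path, manip_name, manip_tag, manip_params) = directive

        dataframe = pd.read_csv(dataset_path, header=None)
        dataframe = apply_manipulation(dataframe, manip_name, manip_params)

        # Parameter dictionary handed off to the algorithm plugin.
        params = {
            "dataset_name": dataset_path,
            "model_name": model_name,
            "dataframe": dataframe,
            "model_params": algorithm_params,            # {"algorithm": "parameter_as_string"}
            "save_path": f"data/{dataset_name}/models/",  # assumed layout
            "log_path": f"data/.logs/{model_name}",       # assumed layout
            "manip_info": (manip_name, manip_tag),
        }

        # Load the plugin from its absolute path and hand it the parameters.
        spec = importlib.util.spec_from_file_location("algorithm_plugin", plugin_path)
        plugin = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(plugin)
        plugin.train(params)  # the train() entry point is an assumption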

Attack Stage Reference:

Currently focusing on developing the training stage, but I have a general idea of how this will work.

Clean Stage Reference:

Once models start getting trained and data starts coming back, I can start the prototyping of this stage!