Development Reference Notes
NucciTheBoss edited this page Jun 30, 2021
- User is only going to use up to 8 nodes (this number was chosen since it is the most manageable for right now).
- Data being sent into the pipeline is pre-cleaned, has no headers, and there is no categorical data.
- The target variable holds the index position of `list[-1]` in the data file.
- The data sets being sent into the pipeline are in tabular CSV format.
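The assumptions above can be sketched as a small loading step: a pre-cleaned, headerless CSV where the target variable is the last column (`list[-1]`). The file contents here are hypothetical placeholders.

```python
# Sketch of loading a pre-cleaned, headerless CSV where the target is the
# last column of each row (list[-1]). Data values are hypothetical.
import io

import pandas as pd

# Hypothetical 3-feature data set with the target in the last column.
csv_data = io.StringIO("1.0,2.0,3.0,0\n4.0,5.0,6.0,1\n")

# header=None because the pipeline assumes the data has no header row.
df = pd.read_csv(csv_data, header=None)

features = df.iloc[:, :-1]   # everything except the last column
target = df.iloc[:, -1]      # last column, i.e. list[-1] of each row

print(features.shape, target.tolist())
```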
- Manager takes in a macro XML file.
  - Parses the macro file using `BeautifulSoup`.
  - Converts the macro into a dictionary split into three sections: train, attack, and clean.
- After getting the job control dictionary, splits it into the three control sections: train, attack, and clean.
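A minimal sketch of the parse-and-split step above, using `BeautifulSoup`. The tag names (`<macro>`, `<train>`, `<attack>`, `<clean>`, `<job>`) and job contents are hypothetical; the real macro schema is defined by Jespipe.

```python
# Parse a (hypothetical) macro XML file with BeautifulSoup and split it
# into the three control sections: train, attack, and clean.
from bs4 import BeautifulSoup

macro_xml = """
<macro>
  <train><job>train-job-1</job></train>
  <attack><job>attack-job-1</job></attack>
  <clean><job>clean-job-1</job></clean>
</macro>
"""

soup = BeautifulSoup(macro_xml, "html.parser")

# Build a dictionary keyed by control section.
control = {
    section: [job.get_text() for job in soup.find(section).find_all("job")]
    for section in ("train", "attack", "clean")
}

print(control)
```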
- Waits for the greenlight to begin from the manager node.
  - 0 means no-go and abort.
  - 1 means go-ahead and no abort.
- Log that the greenlight was received from the manager node.
- Create directory `data/.logs/$worker-#` for storing log files.
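The worker-side greenlight check and log-directory setup can be sketched as follows. The rank number and base path are hypothetical; in the real pipeline the greenlight value arrives from the manager node.

```python
# Sketch of the worker-side greenlight protocol: 0 = no-go/abort,
# 1 = go-ahead, then create data/.logs/worker-# for log files.
import os
import tempfile

def handle_greenlight(greenlight: int, rank: int, base_dir: str) -> str:
    """Abort on 0; on 1, create the worker's log directory and return it."""
    if greenlight == 0:
        raise SystemExit("No-go received from manager; aborting.")
    log_dir = os.path.join(base_dir, "data", ".logs", f"worker-{rank}")
    os.makedirs(log_dir, exist_ok=True)
    return log_dir

base = tempfile.mkdtemp()
path = handle_greenlight(1, rank=3, base_dir=base)
print(path)
```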
- Format of a task directive: `["dataset_name", "dataset_path", "model_name", "algorithm", {"algorithm": "parameter_as_string"}, "path_to_algorithm_plugin", "data_manip_name", "datamanip_tag", {"datamanip_parameters": "as_a_string"}]`
- Positional values of the task list:
  - 0: `dataset_name`
  - 1: `dataset_path`
  - 2: `model_name`
  - 3: `algorithm`
  - 4: `algorithm_parameters`
  - 5: `absolute_path_to_algorithm_plugin`
  - 6: `data_manipulation_name`
  - 7: `data_manipulation_tag`
  - 8: `data_manipulation_parameters`
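The positional layout above can be illustrated by unpacking one directive. All of the values below are hypothetical placeholders, not real pipeline data.

```python
# Sketch of unpacking a task directive by the positional indices above.
directive = [
    "mnist",                       # 0: dataset_name
    "data/mnist.csv",              # 1: dataset_path
    "mnist-lstm",                  # 2: model_name
    "lstm",                        # 3: algorithm
    {"lstm": "epochs=10"},         # 4: algorithm_parameters
    "/plugins/lstm.py",            # 5: absolute_path_to_algorithm_plugin
    "pca",                         # 6: data_manipulation_name
    "pca-tag",                     # 7: data_manipulation_tag
    {"n_components": "3"},         # 8: data_manipulation_parameters
]

(dataset_name, dataset_path, model_name, algorithm, algo_params,
 plugin_path, manip_name, manip_tag, manip_params) = directive

print(dataset_name, manip_tag)
```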
- If there are more nodes than directives, `numpy` will send an empty list `[]` to the extra worker nodes.
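A sketch of that slicing behavior with `numpy.array_split`: when there are more workers than directives, the extra workers receive an empty list. The directive contents are hypothetical.

```python
# Split a directive list among worker nodes with numpy.array_split;
# workers beyond the number of directives get an empty list.
import numpy as np

directives = [["job-a"], ["job-b"]]   # hypothetical directives
num_workers = 4

slices = [
    chunk.tolist()
    for chunk in np.array_split(np.array(directives, dtype=object), num_workers)
]

print(slices)
```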
- Broadcast out to the worker nodes whether to skip or proceed with the training stage.
- 0 means no skip and continue with the training stage.
- 1 means skip training stage and go to attack stage.
- Take the train dictionary and unwrap it into a more easily processable dictionary.
- Verify the data set is callable. If the pipeline cannot find the data set, raise a `FileNotFoundError` exception.
- Create the directory path `data/$dataset_name/models`.
- Send the unwrapped dictionary to `scattershot.generate_train()` to create the directive list for the worker nodes.
- Send the directive list to `scattershot.slice()` to slice up the directive list amongst the available worker nodes.
- Send the task list to the workers using `scattershot.delegate()`.
- Log whether or not it is performing the training stage.
- Log the task list received from the manager node.
- If the task list received from the manager is empty, return status `1` to the manager. This will inform the manager that the worker is good to move on to the next stage.
- Once the task list is received, perform user-specified manipulations on the data. The currently available manipulation options are the following:
- XGBoost
- Random Forest
  - Principal Component Analysis
  - Candlestick
  - `None` <- This option is reserved for users who just want to tune the hyperparameters.
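As a rough sketch of one manipulation option, here is Principal Component Analysis implemented with plain numpy (the real pipeline's manipulation plugins may use a different library); the data values are hypothetical.

```python
# Minimal PCA sketch: project data onto its top principal components
# using the SVD of the mean-centered data matrix.
import numpy as np

def pca_reduce(data: np.ndarray, n_components: int) -> np.ndarray:
    """Project data onto its top n_components principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))          # 10 samples, 4 features
reduced = pca_reduce(X, n_components=2)
print(reduced.shape)
```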
- Save a copy of the manipulation if the user so desires. Can remove this option from custom installations if storage space is limited.
- Create parameter dictionary to send to the plugin. The parameter dictionary will look like the following:
  ```python
  {
      "dataset_name": "/path/to/dataset",
      "model_name": "mymodelsname",
      "dataframe": pandas.DataFrame,
      "model_params": {},
      "save_path": "/path/to/save/model/to/",
      "log_path": "path/to/log/model/statistics",
      "manip_info": ("manip_name", "manip_tag")
  }
  ```
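Assembling that parameter dictionary might look like the following sketch; the helper name and all path/name values are hypothetical placeholders, mirroring the shape shown above.

```python
# Sketch of building the plugin parameter dictionary described above.
import pandas as pd

def build_plugin_params(dataset_name, model_name, dataframe, model_params,
                        save_path, log_path, manip_info):
    """Bundle everything a plugin needs into one dictionary."""
    return {
        "dataset_name": dataset_name,
        "model_name": model_name,
        "dataframe": dataframe,       # a pandas.DataFrame
        "model_params": model_params,
        "save_path": save_path,
        "log_path": log_path,
        "manip_info": manip_info,     # ("manip_name", "manip_tag")
    }

params = build_plugin_params(
    "/path/to/dataset", "mymodelsname",
    pd.DataFrame([[1, 2], [3, 4]]),
    {}, "/path/to/save/model/to/", "path/to/log/model/statistics",
    ("pca", "pca-tag"),
)
print(sorted(params))
```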
Currently focusing on developing the training stage, but I have a general idea of how this will work.
Once models start getting trained and data starts coming back, I can start the prototyping of this stage!
Jespipe-v0.0.1 Wiki - Copyright © 2021 - Jason C. Nucciarone, Eric Inae, Sheila Alemany