- To train a model as a batch job.
- To run the batch training on a schedule.
If you need to train your machine learning model regularly, the batch training pattern applies. In this pattern, you define the training as a batch job and configure its trigger and schedule in a job management server, which then executes the job. One of the simplest ways to do this is with Linux crontab; a managed scheduling service in the cloud or a dedicated job management server are also options.
The pattern is one of the most common architectures for training a model offline. The workflow looks like this:
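As a minimal sketch of the crontab approach, the script below shows an entrypoint that cron could invoke. The crontab line, the file paths, and the `train` function are hypothetical placeholders rather than part of the pattern. The returned exit code lets cron, or any other job manager, tell a successful run from a failed one.

```python
# Hypothetical crontab entry (every day at 02:00; all paths are assumptions):
#   0 2 * * * /usr/bin/python3 /opt/ml/train_job.py >> /var/log/train_job.log 2>&1
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("train_job")


def train() -> None:
    """Placeholder for the real training logic (assumed for illustration)."""
    logger.info("training the model")


def main(job=train) -> int:
    """Run the batch job once and return a process exit code."""
    try:
        job()
    except Exception:
        # Keep the traceback in the log so a failed run can be troubleshot later.
        logger.exception("batch training failed")
        return 1
    logger.info("batch training finished")
    return 0


# When saved as a script, end the file with: raise SystemExit(main())
```

A non-zero exit code is a common convention for signalling failure, which most job management tools can act on.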
1. Retrieve data from the DWH (data cleansing may be needed)
2. Preprocess the data
3. Train the model
4. Evaluate the model
5. Build the model into the prediction server
6. Store the model and server, and record the evaluation
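The six steps of the workflow can be sketched as a small sequential pipeline. This is an illustrative outline only: `run_pipeline` and every step function are hypothetical stand-ins that operate on toy data, where a real job would call the DWH, the training library, and the storage backend.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Step = Callable[[Context], None]


def run_pipeline(steps: List[Tuple[str, Step]]) -> Context:
    """Run the steps in order; each step reads from and writes to a shared context."""
    context: Context = {"completed": []}
    for name, step in steps:
        step(context)
        context["completed"].append(name)  # useful for resuming or troubleshooting
    return context


# Toy stand-ins for the six workflow steps (assumed for illustration).
def retrieve(ctx: Context) -> None:
    ctx["raw"] = [1.0, 2.0, None, 4.0]

def preprocess(ctx: Context) -> None:
    ctx["features"] = [x for x in ctx["raw"] if x is not None]

def train(ctx: Context) -> None:
    ctx["model"] = {"mean": sum(ctx["features"]) / len(ctx["features"])}

def evaluate(ctx: Context) -> None:
    ctx["metrics"] = {"n_samples": len(ctx["features"])}

def build(ctx: Context) -> None:
    ctx["server_build"] = {"model": ctx["model"]}

def store(ctx: Context) -> None:
    ctx["stored"] = True


STEPS = [("retrieve", retrieve), ("preprocess", preprocess), ("train", train),
         ("evaluate", evaluate), ("build", build), ("store", store)]

result = run_pipeline(STEPS)
```

Recording which steps completed makes it easier to see where a failed run stopped.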
You may need to design error handling for each step according to your service level objective.
If you need to update the prediction model every day or on every batch, which implies a high service level, you need a retry policy for errors or an alert to the operations team. If the model does not have to stay up to date, it may be enough to alert or record the error and rerun the job later.
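A retry policy with a final alert might look like the sketch below. The function name, the default attempt count and delay, and the `alert` callback are all assumptions for illustration; in practice the alert would go to whatever channel the operating team uses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(job: Callable[[], T],
                   max_attempts: int = 3,
                   delay_seconds: float = 60.0,
                   alert: Callable[[str], None] = print,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run job, retrying on any exception; alert and re-raise after the last attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return job()
        except Exception as exc:
            if attempt >= max_attempts:
                alert(f"job failed after {attempt} attempts: {exc!r}")
                raise
            sleep(delay_seconds)  # fixed delay; a backoff schedule is another option
```

For a lower service level, setting `max_attempts=1` reduces this to "alert and fail", leaving the rerun to a later manual or scheduled trigger.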
It is recommended to record failed jobs so that you can rerun them or troubleshoot from the log. In the workflow above, the DWH may contain irregular or unexpected data, such as integers stored as characters or values outside the valid range, which requires anomaly filtering or data cleansing. It is difficult to rerun the job while the anomalous data remains, so you may need to filter the data beforehand or exclude it manually.
In steps 2 to 4, the model evaluation may turn out not to be good enough for production use. In that case, you may need to reconsider the preprocessing or the training hyperparameters and tune them to fit the current dataset.
In steps 5 and 6, a failure is likely a system issue, such as a build error or a recording error. You may need to review the system components (servers, storage, databases, network, middleware, and so on) to find the root cause and a mitigation.
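The kind of filtering described here could be sketched as follows. The function name, the column name, and the value range are hypothetical; the idea is to separate rows that parse as in-range integers from rows that do not, and to keep the rejected rows for inspection instead of silently dropping them.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_valid_rows(rows: Iterable[Row], column: str,
                     min_value: int, max_value: int) -> Tuple[List[Row], List[Row]]:
    """Split rows into (valid, rejected) based on one integer column."""
    valid: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        try:
            # Accepts 42 and "42"; raises for "abc", None, or a missing column.
            value = int(row[column])
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
            continue
        if min_value <= value <= max_value:
            valid.append({**row, column: value})  # normalize to a real int
        else:
            rejected.append(row)
    return valid, rejected


rows = [{"age": 30}, {"age": "41"}, {"age": "abc"}, {"age": -5}, {"age": None}, {}]
valid, rejected = split_valid_rows(rows, "age", 0, 120)
```

Writing the rejected rows to a log or a side table gives the operating team something concrete to review before rerunning the job.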
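One way to catch a model that is not good enough before it reaches production is a simple quality gate between evaluation and build. The metric names and thresholds below are made-up examples; appropriate values depend on the model and the service.

```python
from typing import Dict


def passes_quality_gate(metrics: Dict[str, float],
                        minimums: Dict[str, float]) -> bool:
    """Return True only if every required metric is present and meets its minimum."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in minimums.items()
    )


# Hypothetical thresholds for illustration.
MINIMUMS = {"accuracy": 0.90, "recall": 0.80}

good = passes_quality_gate({"accuracy": 0.93, "recall": 0.81}, MINIMUMS)
bad = passes_quality_gate({"accuracy": 0.93, "recall": 0.70}, MINIMUMS)
```

When the gate fails, the job can stop before the build step and record the metrics, so the preprocessing or hyperparameters can be revisited.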
- Retrain and update the model regularly.
- Need to consider job errors.
- It is difficult to build a fully automated workflow.
- Job management method or software.
- Error handling.