Matthieu Di Mercurio edited this page Apr 6, 2023 · 3 revisions

Data prep

Data collection is always key to building an accurate model. Data quality wins over fancy modelling, so we should make sure to craft a representative dataset before building complex models. ML can also help build and clean a dataset: a simple example is to train an image classifier, then identify the images with the lowest confidence scores and manually check whether mislabelled images from other classes made it into the dataset.

Exploratory data analysis can be done directly in Python in Jupyter notebooks, or by dumping the data into a SQL database to simplify processing and visualization. In the future, we might want to set up an ML database with dbt for data transformation and a simple visualization tool.

So far we haven't had to manage many versions of training datasets and pre-processing steps, so we're using a couple of files with good naming conventions. Managing data sources might become an issue in the future, though. We'll see how far we can stretch the naming-convention method coupled with storage in Google Cloud Storage. If we start hitting limitations, we can evaluate Git LFS (limited to 2 GB of storage on GitHub) or DVC. If the data is stored in a SQL database, dbt's new versioning feature and Python models might be an option. One simple trick used in evmxo was to create a single large dataset but only train on a fraction of it. The fraction was saved as a model hyperparameter to make it easy to understand what was used and how to reproduce the results.
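The fraction-as-hyperparameter trick can be sketched as follows (the parameter name and value here are hypothetical, not the actual evmxo settings):

```python
# Record the training fraction alongside the other hyperparameters so it is
# saved with the model and runs are easy to reproduce.
hparams = {"train_fraction": 0.25}  # hypothetical value

def training_subset(dataset, hparams):
    """Return the slice of the full dataset actually used for training."""
    n = int(len(dataset) * hparams["train_fraction"])
    return dataset[:n]

data = list(range(1000))                 # stand-in for the full dataset
subset = training_subset(data, hparams)  # first 250 rows
```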

evmxo

Exploratory analysis was mostly done in notebooks. We set up a Postgres DB (evmx1) and loaded some data, but it was not used beyond some simple analysis.

With evmxo, we treated the sequence of requests as one big string and tried to predict the next sequence of tokens. Many tokenization methods exist for text and have been tested on natural language (HuggingFace, Pytorch), but we needed to create our own tokenizer. What we tried:

  1. Creating tokens with a concatenation of contract address and key (simplified example: 'A:123K:456' 'A:123K:457')
  2. Creating tokens from contract address and key ('A:123' 'K:456' 'A:123' 'K:457')
  3. Creating tokens from contract address, key prefix and key suffix ('A:123' 'KP:45' 'KS:6' 'K:7'). This method did not repeat tokens that were already present: for example, when two consecutive keys have the same address and prefix, only the suffix is included in the string.
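A minimal sketch of tokenization method #3 (the prefix length and token formats are illustrative assumptions, not the exact evmxo implementation):

```python
def tokenize(accesses, prefix_len=2):
    """Method #3: emit address / key-prefix / key-suffix tokens,
    omitting address and prefix tokens repeated from the previous access."""
    tokens = []
    prev_addr = prev_prefix = None
    for addr, key in accesses:
        prefix, suffix = key[:prefix_len], key[prefix_len:]
        if addr != prev_addr:
            tokens.append(f"A:{addr}")
            prev_addr, prev_prefix = addr, None
        if prefix != prev_prefix:
            tokens.append(f"KP:{prefix}")
            tokens.append(f"KS:{suffix}")
            prev_prefix = prefix
        else:
            tokens.append(f"K:{suffix}")  # same address and prefix: suffix only
    return tokens

# Reproduces the example above: consecutive keys 456 and 457 on address 123.
tokenize([("123", "456"), ("123", "457")])
# → ['A:123', 'KP:45', 'KS:6', 'K:7']
```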

#3 seemed to give the most accurate results but would have to be tested in conditions closer to production to confirm that it is indeed the best method. A potential issue with #3 is that an early failure can make the entire sequence fail: if the first predicted token is a wrong address but the rest of the sequence is predicted accurately, the whole sequence is wrong even though accuracy as measured during training remains high.

Training

The best approach to iterating on model training is to set up an initial benchmark with a model that is quick to train, then iterate on it by adding more data and testing different models and hyperparameters. The steps will vary between projects, but in general the best process is to go in this order:

  • Improving the data set
  • Testing different models
  • Hyperparameter tuning
  • Increasing dataset size

Steps are ordered based on the potential impact they can have on model performance. Increasing the size of the dataset is often best left as the last step since it requires more processing time.

Jupyter notebooks are the best way to interactively run and keep track of different experiments. Even though a few options exist to diff notebooks, managing the version history of experiments is not as easy as it is for plain code. Model training depends on the data source, preprocessing steps, model used and hyperparameters; keeping all of this in a single notebook without worrying about history makes it easier. In evmxo, each new experiment was saved as a new notebook but only the main model was pushed to GitHub.

Important note for future development: GitHub released Jupyter Notebook rich diffs in preview in March 2023. Guide to enable it

A couple of potential challenges:

  • Environment management: a few libraries are required to train the model. This could be managed through Poetry.
  • Compute power: even though MacBook Pros have great CPUs, the performance improvement from switching to their GPUs is not significant. Training large-scale models on large datasets would require Nvidia GPUs running in the cloud.

Kubeflow

Kubeflow would be an option to address the issues above and could make deployment easier. It enables running notebooks on GCP with multiple users and environments. The resources available to a notebook can also be configured to get access to more GPUs. The Getting Started guide has a few great intro videos about Kubeflow, explaining each component and how to deploy them.

evmxo

Similar to data preparation, we used a quick and simple approach to train the model. The prediction was modelled following language-modelling methods, trying to complete a sequence of tokens.

The best performance was achieved by adapting the LSTM from chapter 12 of the fast.ai book. The model yields a good balance between accuracy (>80%) and inference time (<10ms).
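As a rough sketch of this kind of next-token LSTM (the layer sizes are hypothetical, not the actual evmxo hyperparameters):

```python
import torch
import torch.nn as nn

class SeqPredictor(nn.Module):
    """Next-token LSTM over custom tokens; sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=64, hidden=128, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # embeddings trained from scratch
        self.lstm = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return self.head(out)  # next-token logits at each position

model = SeqPredictor(vocab_size=1000)
logits = model(torch.randint(0, 1000, (1, 16)))  # (batch, seq) -> (1, 16, 1000)
```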

Embeddings are built from scratch since the tokens are not standard language tokens. Training an actual language model would have been easier, as we could have re-used existing models: HuggingFace offers a large library of pre-trained models that can be used as-is or fine-tuned to a specific use case.

Transformers would have been good models for this task, but their inference time is much longer, so they would not have been viable in prod. Inference on a simple proof of concept inspired by HuggingFace's tutorial was taking more than 100ms.

Good resources to learn more about NLP:

  • Fast.ai NLP course: chapters 7 and 8 on translation, seq2seq and transformers
  • Pytorch NLP tutorials

Deployment / Serving

Pytorch models can be deployed in many ways. This document summarizes the ones that we might want to use in the future.

Pytorch optimizations

  1. Make sure to set the model in evaluation mode with eval()
  2. Run the prediction in inference mode (preferred to no_grad)
  3. Use the Pytorch 2 nightly build (at the time of writing) and compile the model
  4. Call model(data) or the forward function directly instead of wrapper functions like predict
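The optimizations above can be sketched as follows (the model here is a stand-in, not the actual evmxo model):

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the trained model
model.eval()                   # 1. evaluation mode (disables dropout etc.)

# 3. On Pytorch 2, the model can additionally be compiled:
#    model = torch.compile(model)

x = torch.randn(4, 8)
with torch.inference_mode():   # 2. preferred to no_grad for inference
    y = model(x)               # 4. call the model directly, no predict() wrapper
```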

Performance benchmarks

TorchServe

TorchServe is probably the easiest way to deploy a model. Once trained, the model's state_dict can be saved along with its vocabulary and architecture. The torch-model-archiver CLI tool can be used to create an archive of the model containing all the information required to serve it. An example of model archiving can be found in this notebook. TorchServe then loads the model and serves it as a REST API. A custom handler can be created if additional pre- or post-processing is required; see evmxHandler.py.
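The archiving and serving steps look roughly like this (all file and model names below are hypothetical placeholders):

```shell
# Package the model definition, weights and vocabulary into a .mar archive.
torch-model-archiver \
  --model-name evmx \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model_state.pt \
  --extra-files vocab.json \
  --handler evmxHandler.py \
  --export-path model_store

# Serve the archive as a REST API.
torchserve --start --model-store model_store --models evmx=evmx.mar
```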

Golang packages

A few packages exist to train and serve models directly in Go. Gorgonia seems to be the most active. Compared to Pytorch, it has a smaller community and fewer available examples, and it requires writing neural nets at a lower level, with more code and a better understanding of how neural nets actually work. Gotch and gotorch are options to use the C++ Pytorch bindings in Go and serve models created in Pytorch, though this seems to be a convoluted approach and requires a few extra steps that would slow down development. Since Pytorch is itself written in C++, inference would not be significantly slower when called from Go.

ONNX

From Pytorch ONNX doc:

Open Neural Network eXchange (ONNX) is an open standard format for representing machine learning models. The torch.onnx module can export PyTorch models to ONNX. The model can then be consumed by any of the many runtimes that support ONNX.

On evmxo, ONNX did not bring significant savings in inference time, likely due to the small size of the model and the small batch size used.

CI/CD

To be investigated: how to set up a pipeline to re-train models as new data becomes available. The current model is fairly simple, so data collection, preparation, training and deployment could all be contained in a Python job running on a schedule, but we might want to move this to an actual orchestrator in the future (e.g. a Kubeflow pipeline).
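The scheduled job would only need to chain the four stages; every function below is a hypothetical placeholder for the real step:

```python
# Hypothetical scheduled retraining job: each step is a placeholder.
def collect():
    return list(range(100))           # pull fresh data

def prepare(raw):
    return [x / 100 for x in raw]     # preprocessing

def train(data):
    return {"weight": sum(data)}      # fit the model

def deploy(model):
    return model is not None          # push the artifact to serving

def run_pipeline():
    """One scheduled run: collect -> prepare -> train -> deploy."""
    return deploy(train(prepare(collect())))
```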

TorchServe allows deploying different versions of a model side by side to do canary rollouts or A/B testing.

Monitoring: Tracking performance over time

Not investigated yet, but this will be important once we start running models in production. Data drift can easily go unnoticed, so it would be important to have some alerting.

Next steps

Not prioritized yet

  • Set up a Postgres DB with dbt and a visualization tool
  • Test Git LFS and DVC
  • Test Jupyter Notebook rich diff
  • Set up Kubeflow
  • Evaluate options for CI/CD
  • Evaluate options for performance monitoring