
Machine learning [how to build a working model from scratch]


I built a custom DevGPT to write most of the code I'm currently publishing on GitHub, and I'm using it to build a model suitable for domain safety ranking from scratch.

I will use the rank score to filter predicted-safe FQDNs out of the release blacklist, to provide additional accuracy and reduce false positives.

Since I can feed the machine learning pipeline with fresh data anytime (a dataset with millions of blacklisted and whitelisted domains and subdomains, aka FQDNs), I planned to build a model that predicts a badness score for newly submitted FQDNs. The rank score is in the 1-100 range, where 1 means really safe and 100 means really bad.

I started by using a subset of the entire dataset (25,000 good + 25,000 bad items instead of millions of them).

After that I built a simple ensemble pipeline to find the most accurate method for training and inference.
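As a first rough cut, the sampling step might look like this (a minimal sketch; whitelist.txt and the exact loading logic are my assumptions, only blacklist.txt is referenced later on this page):

    import random

    # One FQDN per line in each source file.
    with open("whitelist.txt") as f:
        good = [line.strip() for line in f if line.strip()]
    with open("blacklist.txt") as f:
        bad = [line.strip() for line in f if line.strip()]

    # Balanced 25,000 + 25,000 subset for the first experiments.
    random.seed(42)
    X = random.sample(good, 25000) + random.sample(bad, 25000)
    y = [0] * 25000 + [1] * 25000  # 0 = safe, 1 = bad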

I tested the most popular and easiest-to-implement methods in this context: RandomForest, GradientBoosting, ExtraTrees, LogisticRegression, and SVC:

    from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    classifiers = {
        "RandomForest": RandomForestClassifier(random_state=42),
        "GradientBoosting": GradientBoostingClassifier(random_state=42),
        "ExtraTrees": ExtraTreesClassifier(random_state=42),
        "LogisticRegression": LogisticRegression(random_state=42, max_iter=2000),
        "SVC": SVC(probability=True, random_state=42)
    }

Let's describe all those methods one by one:

  1. RandomForest Classifier
  • Type: Ensemble Learning Method
  • Description: RandomForest is a type of ensemble learning method that constructs a multitude of decision trees during training. For classification tasks, it outputs the class that is the mode of the classes of individual trees.
    • Strengths:
      • Handles both numerical and categorical data well.
      • Robust to overfitting as it averages the results of many decision trees.
      • Good performance in a wide range of problems.
    • Weaknesses:
      • Can be less interpretable compared to a single decision tree.
      • Performance may degrade with very noisy data.
  2. GradientBoosting Classifier
  • Type: Ensemble Learning Method
  • Description: GradientBoosting builds an additive model in a forward stage-wise fashion, like other boosting methods, but generalizes them by allowing optimization of an arbitrary differentiable loss function.
    • Strengths:
      • Often provides predictive accuracy that is hard to beat on tabular data.
      • Lots of flexibility as it can optimize different loss functions and provides several hyperparameter tuning options.
    • Weaknesses:
      • Can overfit if the number of trees is too large.
      • Sensitive to noisy data and outliers.
      • Requires careful tuning of parameters and may take longer to train.
  3. ExtraTrees Classifier
  • Type: Ensemble Learning Method
  • Description: ExtraTrees (Extremely Randomized Trees) Classifier fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.
    • Strengths:
      • Reduces variance more effectively than RandomForest by using random thresholds for each feature rather than searching for the best possible thresholds.
      • Typically faster to train than RandomForest.
    • Weaknesses:
      • Like RandomForest, can be less interpretable.
      • Might not perform well on data with strong linear relationships.
  4. Logistic Regression
  • Type: Regression-based Classifier
  • Description: Despite its name, Logistic Regression is used for binary classification problems. It models the probability of a default class (e.g., class labeled '1').
    • Strengths:
      • Simple, efficient, and easy to implement.
      • Performs well with linearly separable classes.
      • Outputs probabilities, which can be a useful feature.
    • Weaknesses:
      • Assumes linearity between dependent and independent variables.
      • Can struggle with complex relationships in data.
      • Vulnerable to overfitting when the data is high-dimensional.
  5. SVC (Support Vector Classifier)
  • Type: Kernel-based Classifier
  • Description: SVC is a powerful, versatile machine learning algorithm capable of linear or nonlinear classification (the wider SVM family also covers regression and outlier detection). It is often cited as one of the best out-of-the-box classifiers.
    • Strengths:
      • Effective in high-dimensional spaces.
      • Versatile as different kernel functions can be specified for the decision function.
    • Weaknesses:
      • Can be inefficient on large datasets.
      • Requires careful tuning of parameters and selection of the kernel.
      • The choice of kernel and regularization can have a large impact on the performance of the algorithm.

The best method is chosen by performing a randomized search (RandomizedSearchCV) instead of an exhaustive grid search:

        # One search per candidate, looping over classifiers.items():
        random_search = RandomizedSearchCV(clf, params[name], n_iter=20, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1, random_state=42)
        random_search.fit(X_res, y_res)

using the following parameter distributions (I'm running this project on a Dell R620 server: 48 cores, 128 GB RAM, no GPU):

    from scipy.stats import randint as sp_randint, uniform

    params = {
        "RandomForest": {'n_estimators': sp_randint(100, 500), 'max_depth': sp_randint(10, 50), 'min_samples_split': sp_randint(2, 11)},
        "GradientBoosting": {'n_estimators': sp_randint(100, 300), 'learning_rate': uniform(0.01, 0.2), 'max_depth': sp_randint(3, 10)},
        "ExtraTrees": {'n_estimators': sp_randint(100, 500), 'max_depth': sp_randint(10, 50), 'min_samples_split': sp_randint(2, 11)},
        "LogisticRegression": {'C': uniform(0.01, 100), 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
        "SVC": {'C': uniform(0.1, 10), 'kernel': ['linear', 'rbf', 'poly']}
    }

to find the most suitable approach.

I then focused on the elected approach to increase accuracy.
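Putting the search loop together, the election could be tracked like this (a sketch; comparing the cross-validated f1_macro of each search is my assumption about how the winner is picked):

    from sklearn.model_selection import RandomizedSearchCV

    best_score, best_model, best_params = -1.0, None, None
    for name, clf in classifiers.items():
        search = RandomizedSearchCV(clf, params[name], n_iter=20, cv=5,
                                    scoring='f1_macro', verbose=1, n_jobs=-1,
                                    random_state=42)
        search.fit(X_res, y_res)
        print(f"{name}: f1_macro={search.best_score_:.4f}")
        if search.best_score_ > best_score:
            best_score = search.best_score_
            best_model = search.best_estimator_
            best_params = search.best_params_

    print(f"Best model: {best_model}, with parameters: {best_params}")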

The best model, after more than 12 hours of training and evaluation:

Best model: SVC(C=7.319987722668247, probability=True, random_state=42), with parameters: {'C': 7.319987722668247, 'kernel': 'rbf'}

Let's train the model with this approach.
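A minimal sketch of this training stage (the character n-gram TF-IDF vectorizer is an assumption; the page doesn't show the actual feature extraction):

    from imblearn.over_sampling import SMOTE
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    print("Loading and preprocessing data...")
    # X = list of FQDN strings, y = 0/1 labels, as prepared earlier.

    print("Vectorizing domains...")
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X_vec = vectorizer.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_vec, y, test_size=0.2, random_state=42, stratify=y)

    print("Applying SMOTE for class imbalance...")
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    print("Training the SVC model...")
    model = SVC(C=7.319987722668247, kernel="rbf", probability=True,
                random_state=42, verbose=True)  # verbose=True prints the libsvm log
    model.fit(X_res, y_res)

The actual run printed: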

Loading and preprocessing data...
Vectorizing domains...
Applying SMOTE for class imbalance...
Training the SVC model...
......................................*...............*
optimization finished, #iter = 53354
obj = -10549.506943, rho = -0.657859
nSV = 27740, nBSV = 87
Total nSV = 27740

Here's what each part of the message means:

  • optimization finished, #iter = 53354 This shows that the optimization process within the SVM training has finished. #iter = 53354 indicates the number of iterations the algorithm took to converge. In this case, it went through 53,354 iterations.

  • obj = -10549.506943 This is the final value of the (dual) objective function the SVM was minimizing. SVM training minimizes this objective, which in this case settled at approximately -10549.51.

  • rho = -0.657859 rho is the bias (intercept) term of the SVM decision function: libsvm classifies a new point x by the sign of sum_i(alpha_i * y_i * K(x_i, x)) - rho, so rho shifts the position of the decision boundary.

  • nSV = 27740, nBSV = 87 nSV stands for the number of support vectors: the training points that lie closest to the decision surface (hyperplane) and are critical to defining its position and orientation. Here the model keeps 27,740 support vectors. nBSV stands for the number of bounded support vectors: those whose dual coefficient has reached the upper bound C, meaning they sit inside the margin or on the wrong side of it (the free, unbounded support vectors are the ones lying exactly on the margin). There are 87 of them in this model.

  • Total nSV = 27740 This reaffirms the total number of support vectors used in the model.

What this means for my model

  • The large number of iterations and support vectors suggests that the model is dealing with a complex, high-dimensional dataset; keeping such a large share of the training points as support vectors can also be an early hint of overfitting.
  • The successful completion of training is good news, but it's crucial to evaluate the model's performance on a hold-out test set to understand how well it generalizes to unseen data.

Now that the model is trained, I should test it with my test dataset to evaluate its performance metrics (like accuracy, precision, recall, F1-score). These metrics will give a better understanding of how well the model is likely to perform in a real-world scenario.
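That check takes only a few lines with scikit-learn (a sketch, reusing the held-out split from the training stage above):

    from sklearn.metrics import accuracy_score, classification_report

    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    # Precision, recall and F1-score per class:
    print(classification_report(y_test, y_pred, target_names=["safe", "bad"]))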

The model scores are good, so it's time to run inference and score all blacklisted domains, saving the results to a file. Reading this file will give me a clear picture of how the model performs against the real-world blacklist dataset (3.5M FQDNs).

At first I passed some FQDNs via a simple array, then I switched to the blacklist.txt file. Inference speed was around 30 domains per second, so I optimized the code to use all available cores. After further optimizations the inference speed is now quite good and usable in real-world applications:

Processed all domains - 2037.37 domains/sec
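The parallel scoring stage could be sketched like this (the chunked multiprocessing layout, the scores.txt file name, and the mapping of predict_proba to the 1-100 scale are my assumptions):

    from multiprocessing import Pool

    def score_chunk(domains):
        # Map P(bad) in [0, 1] to the 1-100 badness scale.
        probs = model.predict_proba(vectorizer.transform(domains))[:, 1]
        return [f"{d}:{max(1, round(p * 100))}" for d, p in zip(domains, probs)]

    with open("blacklist.txt") as f:
        domains = [line.strip() for line in f if line.strip()]

    chunks = [domains[i:i + 10000] for i in range(0, len(domains), 10000)]
    # Workers inherit model/vectorizer via fork; one worker per available core.
    with Pool() as pool, open("scores.txt", "w") as out:
        for lines in pool.imap(score_chunk, chunks):
            out.write("\n".join(lines) + "\n")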

Here is a sample of the created scores file:

aviator-ui-v1-128424389.eu-west-1.elb.cryptohosting.eu:98
746-ynl-087.mktoweb.com:98
rollandmeds.com:60
atdmt.com.73879.9623.302br.net:99
facedokgroup.000webhostapp.com:99
however-maintain.shop:99
btwvjqtvha.duckdns.org:99
kustomkutslandscaping.com:96
eu.bitcoin.com:99
tesla-res.ddns.net:98

While the output looks good, some FQDNs are certainly ranked safer than others, and I want to spot the safest ones (within the scored blacklist subset) using simple stats scripts. Extracting some stats from the created file can help me investigate how well the model works in a real-world scenario: low scores in the file will reveal ranking errors and can surface additional information and relations useful for a more robust ranking model or workflow.
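The extraction itself can be a few lines of Python (scores.txt is the assumed file name from the inference step):

    # Split off every FQDN the model considers safe-ish (score < 51).
    total = low = 0
    with open("scores.txt") as f, open("scores_under_51.txt", "w") as out:
        for line in f:
            total += 1
            domain, score = line.rsplit(":", 1)
            if int(score) < 51:
                low += 1
                out.write(line)
    print(f"Total domains processed: {total}")
    print(f"Domains with score under 51: {low}")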

Different approaches can be followed now, so I need to think and dig a little to avoid wasting time in the next stages of the process.

In the meantime I extracted all scores under 51 to a separate file and, as expected, a roughly 10% prediction error was revealed, even though the model's reported accuracy metrics are pretty insane (close to 100% across all metrics):

Total domains processed: 3653229
Domains with score under 51: 370301

I now want to run the model over a long list of safe domains, saving the results to a new file; then I will iterate over the prediction errors in the opposite direction (saving scores from 51 to 100) to have both sides of the prediction issues.

...to be continued