Crafting static and dynamic models for data exfiltration detection via DNS traffic analysis. The static model is trained on batch data (Static_dataset.csv), while the dynamic model is evaluated on Kafka_dataset.csv, replayed as a continuous data stream through a local Kafka server. Rigorous analysis, feature engineering, and model training were conducted. This implementation is part of an AI for Cyber Security Master's assignment at the University of Ottawa, 2023.
- Required libraries: scikit-learn, pandas, matplotlib, xgboost, and a Kafka client (e.g., kafka-python).
- Execute cells in a Jupyter Notebook environment.
- The uploaded code has been executed and tested successfully within the Google Colab environment.
The task is to enhance data exfiltration detection through DNS traffic analysis, framed as binary classification (1 = attack, 0 = benign). The dataset contains the following features:
- 'timestamp': The time at which the data was recorded.
- 'FQDN_count': The count of fully qualified domain names.
- 'subdomain_length': The length of the subdomain.
- 'upper': The count of uppercase characters.
- 'lower': The count of lowercase characters.
- 'numeric': The count of numeric characters.
- 'entropy': Entropy value.
- 'special': The count of special characters.
- 'labels': The count of labels.
- 'labels_max': Maximum count of labels.
- 'labels_average': Average count of labels.
- 'longest_word': The longest word in the subdomain.
- 'sld': Second-level domain.
- 'len': Length of the subdomain.
- 'subdomain': The subdomain.
- 'Target Attack': The target attack label, where 1 indicates an attack and 0 indicates no attack.
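A minimal sketch of the initial load and inspection (pandas; the file and column names come from the overview above):

```python
import pandas as pd

# Load the batch dataset used for the static model
df = pd.read_csv("Static_dataset.csv")

# Structural overview: shape, dtypes, missing values, and class balance
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df["Target Attack"].value_counts(normalize=True))
```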
**Static Model**
Data Analysis:
- Loaded and explored "Static_dataset.csv".
- Utilized various statistical tools and visualizations to understand feature distributions, identify class imbalance, and assess the characteristics of numerical and categorical variables.
- Employed histograms, QQ plots, and boxplots for a comprehensive analysis of numerical features (see the sketch after this list).
- Examined the count of attack and non-attack cases for categorical features through count plots.
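A sketch of the three numerical-feature views, using 'entropy' as the example column (any numerical feature from the list above works the same way):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

col = "entropy"  # repeat for each numerical feature of interest
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: overall distribution shape
axes[0].hist(df[col].dropna(), bins=50)
axes[0].set_title(f"Histogram of {col}")

# QQ plot: visual check against a normal distribution
stats.probplot(df[col].dropna(), dist="norm", plot=axes[1])
axes[1].set_title(f"QQ plot of {col}")

# Boxplot per class: separability and outliers at a glance
df.boxplot(column=col, by="Target Attack", ax=axes[2])

plt.tight_layout()
plt.show()
```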
Feature Engineering and Data Cleaning:
- Analyzed the dataset for string variables and performed necessary transformations.
- Addressed missing values, removed duplicate rows, and dropped unnecessary features.
- Applied embedding techniques to encode categorical variables while maintaining interpretability (see the sketch after this list).
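The exact encoding scheme is not spelled out here; a sketch using pandas.factorize, which keeps a reversible value-to-index mapping (the string columns and the dropped column are assumptions based on the feature list):

```python
import pandas as pd  # continues from the loaded df above

# Remove duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# Drop features judged unnecessary for modelling (assumed here: timestamp)
df = df.drop(columns=["timestamp"])

# Encode string-valued columns as integer indices; `uniques` holds the
# reverse mapping, so encoded values stay interpretable
for col in ["longest_word", "sld", "subdomain"]:
    codes, uniques = pd.factorize(df[col])
    df[col] = codes
```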
Feature Filtering/Selection:
- Compared feature-selection methods by downstream model F1-score; Mutual Information performed best (see Performance Evaluation below). A sketch follows.
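A scikit-learn sketch of Mutual Information scoring, keeping the top k features (k = 8 matches the selected subset reported later):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = df.drop(columns=["Target Attack"])
y = df["Target Attack"]

# Score each feature by its mutual information with the label
mi_scores = pd.Series(
    mutual_info_classif(X, y, random_state=42), index=X.columns
).sort_values(ascending=False)

# Keep the top-k features
selected_features = mi_scores.head(8).index.tolist()
print(mi_scores)
```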
Model Selection:
- Split the data into train and test sets.
- Applied normalization using StandardScaler.
- Chose three machine learning models for evaluation: Random Forest, Logistic Regression, and XGBoost.
- Configured each model with default parameters (see the sketch after this list).
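A sketch of the split, scaling, and three models (the 80/20 split ratio, seed, and convergence settings are assumptions; hyperparameters are otherwise left at their defaults):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X[selected_features], y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```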
Performance Evaluation:
- Used the F1-score to select the best feature-selection method and model.
- The best F1-score was achieved using Mutual Information with the Random Forest model, yielding 8 selected features:
- selected_features = ['FQDN_count', 'entropy', 'labels', 'labels_average', 'longest_word', 'lower', 'sld', 'special']
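Scoring each fitted model with the F1-score could then look like:

```python
from sklearn.metrics import f1_score

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"{name}: F1 = {f1_score(y_test, y_pred):.4f}")
```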
Hyperparameter Tuning & Model Evaluation: using selected_features from Mutual Information (a tuning sketch follows the list).
- Best hyperparameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
- Best hyperparameters for Logistic Regression: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
- Best hyperparameters for XGBoost (Extreme Gradient Boosting): {'colsample_bytree': 0.8, 'learning_rate': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0}
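The search method is not stated; a GridSearchCV sketch for the Random Forest, with a grid that includes the reported best values and F1 as the scoring metric (both the grid and the 5-fold CV are assumptions):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, scoring="f1", cv=5, n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best hyperparameters for Random Forest:", search.best_params_)
```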
Champion Static Model:
- The tuned Random Forest was selected as the champion model.
- Saved the champion model for the dynamic phase (a pickle sketch follows).
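Persisting the champion with pickle (the file name is an assumption):

```python
import pickle

champion_model = search.best_estimator_
with open("champion_rf_model.pkl", "wb") as f:  # assumed file name
    pickle.dump(champion_model, f)
```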
**Dynamic Model**
Kafka Consumer Setup:
- Created a Kafka consumer instance for 'ml-raw-dns' topic, connecting to a Kafka broker on 'localhost:9092'.
- Configured the consumer to start from the earliest offset and use manual offset committing.
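A kafka-python sketch matching this setup (the group id is an assumption; manual offset committing requires one):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ml-raw-dns",                        # topic
    bootstrap_servers="localhost:9092",  # local Kafka broker
    auto_offset_reset="earliest",        # start from the earliest offset
    enable_auto_commit=False,            # commit offsets manually
    group_id="dns-exfil-consumer",       # assumed group id
)
```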
Data Retrieval and Adjustment:
- Implemented a function to retrieve 1000 records from the Kafka consumer.
- Utilized the retrieved data to create a DataFrame with predefined columns.
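A sketch of the retrieval function, assuming each Kafka message is one comma-separated record whose fields follow the column list above:

```python
import pandas as pd

COLUMNS = [
    "timestamp", "FQDN_count", "subdomain_length", "upper", "lower",
    "numeric", "entropy", "special", "labels", "labels_max",
    "labels_average", "longest_word", "sld", "len", "subdomain",
    "Target Attack",
]

def fetch_batch(consumer, batch_size=1000):
    """Read batch_size messages from Kafka and return them as a DataFrame."""
    rows = []
    for message in consumer:
        rows.append(message.value.decode("utf-8").split(","))
        if len(rows) >= batch_size:
            break
    consumer.commit()  # manual offset commit once the batch is complete
    return pd.DataFrame(rows, columns=COLUMNS)
```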
Data Cleaning (as in the static model):
- Defined functions for adjusting and cleaning data, including converting categorical values to numerical indices.
- Dropped unnecessary columns and converted the DataFrame to a consistent data type.
Model Loading and Retraining:
- Loaded a pre-trained Random Forest model from a pickle file.
- Initialized both static and dynamic models with the loaded model.
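Loading the pickled champion and initializing both models; a deepcopy keeps later retraining of the dynamic model from touching the static baseline (file name as assumed earlier):

```python
import copy
import pickle

with open("champion_rf_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

static_model = loaded_model                  # frozen baseline, never retrained
dynamic_model = copy.deepcopy(loaded_model)  # independent copy, may be retrained
```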
Dynamic Model Evaluation and Retraining:
- Simulated continuous data processing over 199 iterations.
- Evaluated the dynamic model's F1 score without retraining for each iteration.
- Retrained the dynamic model if its F1 score fell below 0.80 and updated it with new training data.
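The evaluate-then-retrain loop could be sketched as follows; the 0.80 threshold and 199 iterations come from the description, while clean_batch() is a hypothetical stand-in for the cleaning steps above:

```python
from sklearn.metrics import f1_score

F1_THRESHOLD = 0.80
dynamic_scores, static_scores = [], []

for i in range(199):
    batch = clean_batch(fetch_batch(consumer))  # hypothetical cleaning helper
    X_batch = batch[selected_features]
    y_batch = batch["Target Attack"]

    # Score both models on the incoming batch before any retraining
    dyn_f1 = f1_score(y_batch, dynamic_model.predict(X_batch))
    sta_f1 = f1_score(y_batch, static_model.predict(X_batch))
    dynamic_scores.append(dyn_f1)
    static_scores.append(sta_f1)

    # Retrain only the dynamic model, and only when it degrades
    if dyn_f1 < F1_THRESHOLD:
        dynamic_model.fit(X_batch, y_batch)
```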
Static Model Evaluation:
- Evaluated the F1 score of the static model for each iteration without retraining.
Performance Comparison Visualization:
- Plotted the per-iteration F1-scores of the static and dynamic models to compare their performance (sketch below).
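A matplotlib sketch of the comparison plot:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(dynamic_scores, label="Dynamic model (retrained on degradation)")
plt.plot(static_scores, label="Static model (frozen)")
plt.axhline(0.80, linestyle="--", color="gray", label="Retraining threshold")
plt.xlabel("Iteration")
plt.ylabel("F1-score")
plt.title("Static vs. dynamic model F1 over the simulated stream")
plt.legend()
plt.show()
```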