- Anaconda: Ensure that Anaconda is installed on your computer. Download from Anaconda.
  Check the installation by running the following command in the terminal: `conda`
  Note: On Windows, run this and all other conda commands from Anaconda Prompt.
- Visual Studio Code: Ensure that Visual Studio Code is installed on your computer. Download from VS Code.
- Creating a Conda Environment
  Open your terminal (or Anaconda Prompt on Windows) and run the following command to create a new Conda environment. Replace env_name with your preferred name for the environment: `conda create --name env_name python`
- Activating the Environment
  Activate the newly created environment by running: `conda activate env_name`
- Installing Packages 🛠️
  With the environment from the previous step activated, install the required packages: `pip install pandas plotly scikit-learn dash numpy seaborn matplotlib tqdm`
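As a quick sanity check after installation, the sketch below reports which interpreter is running and whether each package can be imported. It uses only the standard library. The names `REQUIRED` and `missing_packages` are illustrative and not part of the project. Note that scikit-learn is imported as `sklearn`.

```python
import importlib.util
import sys

# Import names for the packages installed above (scikit-learn imports as "sklearn").
REQUIRED = ["pandas", "plotly", "sklearn", "dash", "numpy", "seaborn", "matplotlib", "tqdm"]


def missing_packages(names):
    """Return the import names from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # sys.prefix should point inside the Conda environment you activated.
    print("Interpreter prefix:", sys.prefix)
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the printed prefix does not end with your environment name, the environment is probably not active in that terminal.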
- Open the Project
  - Open the project directory in Visual Studio Code by selecting Open Folder from the File menu. Make sure you are in the root directory of the project source code, which contains the README.md file.
- Selecting the Python Interpreter/Kernel
  Ensure that VS Code uses the Python interpreter from the Conda environment you just created:
  - Open a Python or Notebook file. Click the Python version in the status bar at the bottom, or press Ctrl+Shift+P (Cmd+Shift+P on macOS) and search for "Python: Select Interpreter".
  - Choose the interpreter from the "env_name" environment.
- Run this command: `python project_final.py`
  - The dashboard starts on localhost, port 8050. Open http://localhost:8050/ in your browser to view the dashboard.
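If the page does not load, it can help to check from a second terminal whether anything is answering on that port. The following is a minimal standard-library sketch. The helper name `dashboard_is_up` is hypothetical, and the default URL is simply the address given above.

```python
import urllib.error
import urllib.request


def dashboard_is_up(url="http://localhost:8050/", timeout=2.0):
    """Return True if an HTTP server answers at `url` with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not running".
        return False


if __name__ == "__main__":
    print("Dashboard reachable:", dashboard_is_up())
```

A result of `False` usually means the script is not running, has not finished starting, or is bound to a different port.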
- Running Jupyter Notebooks
  To run a Jupyter Notebook, open the .ipynb file, then execute cells individually or run the entire notebook using the play button.
-
The UNSW-NB15 dataset is a comprehensive and modern computer network security dataset released in 2015 by the University of New South Wales. It contains realistic normal and abnormal network activities that are vital for research in network intrusion detection. The dataset is widely used for developing and testing machine learning models for cybersecurity applications.
- Total Records: 2,540,044 instances.
- Features: 49 features, including 47 non-target features and 2 target features (label, attack_cat).
- Attack Categories: 10 classes in total, consisting of Normal traffic plus nine attack types (Generic, Exploits, Fuzzers, DoS, Reconnaissance, Backdoor, Shellcode, Worms, and Analysis).
- Data Types: The dataset includes basic, content, time, and general-purpose features derived from network flow records.
- Class Imbalance: The dataset shows a significant imbalance, with 87% of instances belonging to the "Normal" category.
- Class Overlap: There is a considerable overlap between some attack classes, making accurate classification challenging.
The dataset is publicly available for download from the official UNSW website: https://research.unsw.edu.au/projects/unsw-nb15-dataset
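To make the class-imbalance figure listed above concrete, here is a small sketch of how a per-class share can be computed from a list of labels. The labels are toy values, not rows from UNSW-NB15, and `class_shares` is an illustrative helper rather than project code.

```python
from collections import Counter


def class_shares(labels):
    """Map each class label to its fraction of all records, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Toy labels only: 87 "Normal" rows out of 100 mirrors the share quoted above,
# but the split between attack types here is made up for illustration.
toy_labels = ["Normal"] * 87 + ["Generic"] * 8 + ["Exploits"] * 5
shares = class_shares(toy_labels)
print(shares)
```

With the real data loaded into a pandas DataFrame `df` (an assumed variable name), `df["attack_cat"].value_counts(normalize=True)` gives the same kind of summary.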
- The bar charts highlight significant class imbalances in the UNSW-NB15 dataset, with the "Normal" class dominating at 87%.
- The Mahalanobis Distance Heatmap effectively showcases the distances between class centroids after applying different scaling techniques, helping to visualize class separations and overlaps.
- The heatmap revealed that min-max scaling improved separation between classes, reducing overlap, while standard scaling still exhibited some overlap.
- The Elastic Net Algorithm identified 25 important features, significantly improving visualization clarity.
- Random Forest selected 35 features, though it included some less relevant ones, making Elastic Net the preferred method.
- PCA: Visualized class overlaps in 2D and 3D, revealing significant overlaps in attack classes.
- t-SNE: Provided finer details, effectively showing the clustering of similar instances in the dataset.
- LDA: Revealed cluttered attack class distributions, confirming the presence of overlaps.
- K-means Intercluster Distance Map: Demonstrated that certain attack classes (e.g., "Fuzzers") completely overlap with "Normal," indicating poor separation between classes.
- The dashboard integrates all visualizations, allowing for interactive exploration of class imbalances, overlaps, and feature importance.
- This comprehensive visual tool aids in understanding the dataset's challenges before model development.
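The centroid-distance idea behind the Mahalanobis heatmap mentioned above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the project's implementation. The data is synthetic, the class names are borrowed only as labels, and using the covariance of the whole feature matrix is one possible choice (the source does not say which covariance the project uses).

```python
import numpy as np


def centroid_mahalanobis(X, y, class_a, class_b):
    """Mahalanobis distance between the centroids of two classes.

    Assumption: the covariance is estimated from the whole feature matrix X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diff = X[y == class_a].mean(axis=0) - X[y == class_b].mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # pinv tolerates a singular covariance matrix (e.g. constant features).
    cov_inv = np.linalg.pinv(cov)
    return float(np.sqrt(diff @ cov_inv @ diff))


# Synthetic 2-D data: one class close to "Normal" and one far from it.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[0.5, 0.5], scale=1.0, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(200, 2)),
])
y = np.array(["Normal"] * 200 + ["Fuzzers"] * 200 + ["Worms"] * 200)

d_near = centroid_mahalanobis(X, y, "Normal", "Fuzzers")
d_far = centroid_mahalanobis(X, y, "Normal", "Worms")
print(f"Normal vs Fuzzers: {d_near:.2f}, Normal vs Worms: {d_far:.2f}")
```

Computing this distance for every pair of classes yields the matrix that a heatmap would display. Small values correspond to class pairs whose centroids sit close together.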
The "NetViz" project follows a structured approach to visualize and analyze network intrusions using the UNSW-NB15 dataset. The methodology consists of several key steps:
- Data Collection: The UNSW-NB15 dataset was obtained and consolidated from multiple CSV files into a single dataset for analysis.
- Data Cleaning: Removed missing values and unnecessary columns, ensuring the dataset was clean and ready for processing.
- Label Encoding: Nominal features were converted into numerical values using label encoding, except for the target column (`attack_cat`).
- Feature Scaling: Applied various scaling techniques, including min-max scaling, robust scaling, and standard scaling, to normalize the dataset. This step was critical for ensuring consistency across feature values.
- Elastic Net Algorithm: Used to select the most important features by minimizing the impact of redundant and irrelevant features. This method helped to reduce complexity and improve visualization clarity.
- Random Forest Algorithm: Employed as an alternative feature selection method, but found to be less effective due to the inclusion of unnecessary features.
- Principal Component Analysis (PCA): Implemented to reduce the dimensionality of the dataset and visualize the data in 2D and 3D. PCA helped in identifying significant overlaps among different attack classes.
- Linear Discriminant Analysis (LDA): Applied to enhance the separation between different classes, particularly in 2D and 3D visualizations, though it revealed some cluttered distributions.
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Used for capturing finer details in the data, showing how similar instances cluster together.
- Mahalanobis Distance Heatmap: Created to visualize the distances between class centroids after applying different scaling techniques. This helped in identifying class separations and overlaps.
- K-means Clustering: Performed to analyze the inter-cluster distances, revealing significant overlaps between certain attack classes, particularly between "Fuzzers" and "Normal."
- Interactive Dashboard: Developed using Plotly Dash to integrate all visualizations into a single interactive platform. The dashboard provides comprehensive insights into class imbalances, feature importance, and cluster distributions.
- Analysis of Results: The visualizations and analyses were evaluated to uncover critical issues such as class imbalances and overlaps, which are essential for improving network intrusion detection models.
- Final Insights: Conclusions were drawn based on the effectiveness of the visualizations and the identified challenges in the dataset.
This methodology ensures a thorough exploration of the dataset, providing valuable insights that contribute to the development of more robust network intrusion detection systems.
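Three of the preprocessing steps listed above (label encoding, min-max scaling, and PCA) can be sketched in a few lines. The version below uses only NumPy so that each step is visible. The project installs scikit-learn, which provides `LabelEncoder`, `MinMaxScaler`, and `PCA` for the same purposes, but the source does not show which calls it actually makes. The function names and the toy records are illustrative assumptions.

```python
import numpy as np


def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    lookup = {value: code for code, value in enumerate(classes)}
    return np.array([lookup[v] for v in values]), classes


def min_max_scale(X):
    """Rescale each column to [0, 1]; constant columns become 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range


def pca_project(X, n_components=2):
    """Project X onto its leading principal components via SVD of centered data."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Tiny made-up flow records: protocol, duration, bytes. Not UNSW-NB15 data.
proto = ["tcp", "udp", "tcp", "icmp"]
numeric = [[0.5, 1200], [0.1, 300], [2.0, 9000], [0.05, 64]]

codes, classes = label_encode(proto)
features = np.column_stack([codes, numeric])
scaled = min_max_scale(features)
embedding = pca_project(scaled, n_components=2)
print(classes, embedding.shape)
```

The resulting two-column `embedding` is what a 2D scatter plot would draw, with points colored by class to reveal overlaps.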
- Charan Gajjala Chenchu
- Divija Kalluri
This project is licensed under the MIT License - see the License file for details.
- Stahnke, J., Dörk, M., Müller, B., & Thom, A. (2016). Probing Projections: Interaction Techniques for Interpreting Arrangements and Errors of Dimensionality Reductions. IEEE Transactions on Visualization and Computer Graphics, 22(1), 629–638. https://doi.org/10.1109/TVCG.2015.2467717
- Janarthanan, T., & Zargari, S. (2017). Feature selection in UNSW-NB15 and KDDCUP'99 datasets. IEEE Xplore, June 2017. https://ieeexplore.ieee.org/abstract/document/8001537
- Kanimozhi, V., & Jacob, P. (2019). UNSW-NB15 dataset feature selection and network intrusion detection using deep learning. International Journal of Recent Technology and Engineering, 7, 443–446.
- van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
- Zoghi, Z., & Serpen, G. (2021). UNSW-NB15 Computer Security Dataset: Analysis through Visualization.
- Moustafa, N., & Slay, J. (2015). UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). Military Communications and Information Systems Conference (MilCIS), 2015. https://research.unsw.edu.au/projects/unsw-nb15-dataset
- Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90–95.
- McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51–56.
- Oliphant, T. E. (2006). A Guide to NumPy. Trelgol Publishing.
- Seaborn: Statistical Data Visualization. (n.d.). Retrieved from https://seaborn.pydata.org
- TQDM: A Fast, Extensible Progress Bar for Python and CLI. (n.d.). Retrieved from https://tqdm.github.io