
Exploratory Data Analysis (EDA) on the Android Malware Detection dataset (Kaggle) to uncover patterns in network traffic from benign and malicious Android apps. Includes statistical analysis, visualizations, and an interactive dashboard using Plotly & Dash. IP geolocation is enhanced using GeoLite2-City.mmdb.


ananyapattaje/Android_Malware_Dataset_Analysis


🪼 Project Overview

This project performs detailed Exploratory Data Analysis (EDA) on the Android Malware Detection Dataset obtained from Kaggle. The dataset contains network traffic features extracted from both benign and malicious Android applications. The primary goal is to understand data characteristics, highlight potential patterns for malware detection, and visualize the findings.


🪼 Dataset Source

Kaggle: Android Malware Detection Dataset
GeoLite2-City.mmdb: used for IP geolocation; downloaded from the MaxMind GeoLite2 database.


🪼 How to Run

  1. 🐳 Clone the Repository

    git clone <repo-url>
    cd <repo-folder>
  2. 🐳 Create a Virtual Environment

    python -m venv venv
    source venv/bin/activate   # Windows: venv\Scripts\activate
  3. 🐳 Install Dependencies

    pip install -r requirements.txt
  4. 🐳 Launch Jupyter Notebook

    jupyter notebook
  5. 🐳 Open the file:

    notebooks/Android_Malware_EDA.ipynb
    

🪼 Key Analysis Performed in Notebook

🐳 Data Cleaning

  • Removed duplicates and null values.
  • Dropped 37+ irrelevant or noisy columns.
  • Standardized timestamps and column names.
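The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper `clean_flows`, the column names, and the sample rows are all assumptions made up for the example, and the real list of dropped columns lives in the notebook.

```python
import pandas as pd


def clean_flows(df: pd.DataFrame, drop_cols: list[str]) -> pd.DataFrame:
    """Hypothetical helper applying the cleaning steps to a raw flow table."""
    out = df.copy()
    # Standardize column names: trim stray whitespace, lower-case, snake_case.
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    # Drop columns judged irrelevant (the list is supplied by the caller).
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Parse timestamps; unparseable values become NaT and are removed below.
    if "timestamp" in out.columns:
        out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    # Remove rows with nulls, then exact duplicate rows.
    return out.dropna().drop_duplicates().reset_index(drop=True)


# Made-up sample rows; column names are assumptions, not the dataset's schema.
raw = pd.DataFrame({
    " Flow Duration": [100, 100, 250, None],
    " Timestamp": [
        "2017-06-01 10:00:00", "2017-06-01 10:00:00",
        "2017-06-01 10:05:00", "2017-06-01 10:06:00",
    ],
    "Flow ID": ["a", "a", "b", "c"],
    "Label": ["Benign", "Benign", "Adware", "Adware"],
})
cleaned = clean_flows(raw, drop_cols=["flow_id"])
print(cleaned)
```

Renaming columns first means the drop list and the timestamp check can use one consistent naming scheme, whatever spacing the raw CSV headers have.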

🐳 Statistical Exploration

  • Summarized dataset distribution with .describe().
  • Identified skewed and outlier-heavy features (e.g., Flow Duration).
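One possible way to flag skewed, outlier-heavy features is to pair `.describe()` with per-column skewness and a 1.5 × IQR outlier rule. The sketch below assumes that rule; the notebook may use a different criterion. The helper `skew_and_outliers`, the column names, and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd


def skew_and_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column skewness and share of rows outside the 1.5*IQR fences."""
    num = df.select_dtypes(include="number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the quantile Series with the DataFrame's columns.
    outside = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
    report = pd.DataFrame({"skew": num.skew(), "outlier_share": outside.mean()})
    return report.sort_values("skew", ascending=False)


# Synthetic stand-in data: one heavy-tailed column, one roughly symmetric one.
rng = np.random.default_rng(0)
flows = pd.DataFrame({
    "flow_duration": rng.lognormal(mean=10, sigma=2, size=1000),
    "packet_size": rng.normal(loc=500, scale=50, size=1000),
})
print(flows.describe())
report = skew_and_outliers(flows)
print(report)
```

Columns at the top of the report are the ones where a log scale or robust statistics are likely to be more informative than the raw mean.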

🐳 Visualization Highlights

  • Histograms for key features like Flow Duration.
  • Correlation Heatmaps for feature relationships.
  • Protocol Distribution Pie Charts.
  • Label Distribution bar plots showing malware family imbalance.
  • Time-based Trends of attacks.
  • Geo-Distribution of IP addresses (cities/countries).
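The tables behind several of these charts (protocol shares, label counts, hourly trends, correlations) could be prepared along these lines before being handed to a plotting library. This is a sketch under assumptions: `chart_tables`, the column names, and the label values are hypothetical, and plotting and geolocation are left out. The protocol numbers are the standard IANA assignments.

```python
import pandas as pd

# IANA protocol numbers mapped to the names shown in a protocol chart.
PROTOCOL_NAMES = {0: "HOPOPT", 6: "TCP", 17: "UDP"}


def chart_tables(df: pd.DataFrame) -> dict:
    """Build the aggregate tables that the charts would be drawn from."""
    protocol = df["protocol"].map(PROTOCOL_NAMES).fillna("other")
    return {
        # Fraction of flows per protocol (pie chart input).
        "protocol_share": protocol.value_counts(normalize=True),
        # Number of flows per label (bar plot input).
        "label_counts": df["label"].value_counts(),
        # Flows per hour, sorted chronologically (time-trend input).
        "flows_per_hour": df.groupby(df["timestamp"].dt.floor("60min")).size(),
        # Pairwise Pearson correlation of numeric columns (heatmap input).
        "correlation": df.select_dtypes(include="number").corr(),
    }


# Made-up sample rows; column names and labels are assumptions.
flows = pd.DataFrame({
    "protocol": [6, 6, 6, 17, 0],
    "label": ["Benign", "Adware", "Adware", "Adware", "Scareware"],
    "flow_duration": [120, 4500, 300, 80, 10],
    "timestamp": pd.to_datetime([
        "2017-06-01 10:05", "2017-06-01 10:40", "2017-06-01 11:15",
        "2017-06-01 11:20", "2017-06-01 11:59",
    ]),
})
tables = chart_tables(flows)
for name, table in tables.items():
    print(name, table, sep="\n")
```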

🐳 Key Insights

| Insight | Observation |
| --- | --- |
| Data Quality | Cleaned, no missing/duplicate records |
| Class Imbalance | Malware families are imbalanced |
| Flow Duration | Skewed with significant outliers |
| Protocol Usage | TCP is the dominant protocol |
| Feature Reduction | ~37 irrelevant features removed |

🪼 Requirements

pandas
numpy
matplotlib
seaborn
plotly
dash
geopandas
scikit-learn

Dashboard Overview

[Six dashboard screenshots]


🪼 Author

Ananya P S

