This project performs detailed Exploratory Data Analysis (EDA) on the Android Malware Detection Dataset obtained from Kaggle. The dataset contains network traffic features extracted from both benign and malicious Android applications. The primary goal is to understand data characteristics, highlight potential patterns for malware detection, and visualize the findings.
Kaggle: Android Malware Detection Dataset
GeoLite2-City.mmdb: used for IP geolocation; downloaded from the MaxMind GeoLite2 database.
- 🐳 Clone the Repository
  `git clone <repo-url>`
  `cd <repo-folder>`
- 🐳 Create a Virtual Environment
  `python -m venv venv`
  `source venv/bin/activate` (Windows: `venv\Scripts\activate`)
- 🐳 Install Dependencies
  `pip install -r requirements.txt`
- 🐳 Launch Jupyter Notebook
  `jupyter notebook`
- 🐳 Open the file:
  `notebooks/Android_Malware_EDA.ipynb`
- Removed duplicates and null values.
- Dropped 37+ irrelevant or noisy columns.
- Standardized timestamps and column names.
- Summarized dataset distribution with `.describe()`.
- Identified skewed and outlier-heavy features (e.g., Flow Duration).
- Histograms for key features like Flow Duration.
- Correlation Heatmaps for feature relationships.
- Protocol Distribution Pie Charts.
- Label Distribution bar plots showing malware family imbalance.
- Time-based Trends of attacks.
- Geo-Distribution of IP addresses (cities/countries).
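
The cleaning and summary steps listed above could be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the helper `clean_flows`, the normalised column names (`flow_duration`, `flow_id`, `timestamp`), and the sample rows are all assumptions made for the example.

```python
import pandas as pd


def clean_flows(df: pd.DataFrame, drop_cols: list) -> pd.DataFrame:
    """Hypothetical helper: normalise names, drop columns, parse timestamps, dedupe."""
    out = df.copy()
    # Standardise column names: trim whitespace, lower-case, spaces -> underscores.
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    # Drop columns judged irrelevant (the list is supplied by the caller).
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Parse timestamps; values that cannot be parsed become NaT.
    if "timestamp" in out.columns:
        out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    # Remove rows with nulls (including NaT), then exact duplicates.
    return out.dropna().drop_duplicates().reset_index(drop=True)


# Toy stand-in for the Kaggle CSV (assumed columns, invented values).
raw = pd.DataFrame(
    {
        " Flow Duration": [10, 10, 250, None],
        "Protocol": [6, 6, 17, 6],
        "Timestamp": [
            "2020-01-01 10:00:00",
            "2020-01-01 10:00:00",
            "2020-01-01 10:05:00",
            "2020-01-01 10:06:00",
        ],
        "Flow ID": ["a", "a", "b", "c"],
        "Label": ["Benign", "Benign", "Adware", "Adware"],
    }
)

clean = clean_flows(raw, drop_cols=["flow_id"])
# Summary statistics for the numeric columns only.
summary = clean.select_dtypes("number").describe()
print(summary)
```

Normalising the column names before dropping columns means the drop list can be written in one consistent style even if the raw headers carry stray whitespace.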
| Insight | Observation |
|---|---|
| Data Quality | Cleaned, no missing/duplicate records |
| Class Imbalance | Malware families are imbalanced |
| Flow Duration | Skewed with significant outliers |
| Protocol Usage | TCP is the dominant protocol |
| Feature Reduction | ~37 irrelevant features removed |
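
The observations in the table could be checked with a few lines of pandas. The sketch below is illustrative only: the function `summarize_insights`, the column names, and the toy values are assumptions, and the 1.5 × IQR rule is just one common way to flag outliers (the notebook may use a different criterion). It also assumes the protocol column holds IANA protocol numbers (6 for TCP, 17 for UDP).

```python
import pandas as pd

# IANA protocol numbers; assumes the dataset stores protocols numerically.
PROTOCOL_NAMES = {6: "TCP", 17: "UDP"}


def summarize_insights(df, label_col="label", duration_col="flow_duration",
                       protocol_col="protocol"):
    """Hypothetical helper computing the quantities behind the insights table."""
    # Class imbalance: ratio of the largest class to the smallest.
    counts = df[label_col].value_counts()
    imbalance_ratio = counts.max() / counts.min()

    # Skewness of the duration feature (positive = long right tail).
    skewness = df[duration_col].skew()

    # Outliers by the 1.5 * IQR rule.
    q1, q3 = df[duration_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (df[duration_col] < q1 - 1.5 * iqr) | (df[duration_col] > q3 + 1.5 * iqr)

    # Share of each protocol; unmapped numbers are grouped as "other".
    protocol_share = (
        df[protocol_col].map(PROTOCOL_NAMES).fillna("other").value_counts(normalize=True)
    )

    return {
        "imbalance_ratio": float(imbalance_ratio),
        "skewness": float(skewness),
        "n_outliers": int(is_outlier.sum()),
        "dominant_protocol": protocol_share.idxmax(),
    }


# Toy data (invented values) standing in for the cleaned dataset.
toy = pd.DataFrame(
    {
        "flow_duration": [5, 7, 6, 8, 5, 900],
        "protocol": [6, 6, 6, 6, 17, 6],
        "label": ["Adware", "Adware", "Adware", "Adware", "Benign", "Benign"],
    }
)
result = summarize_insights(toy)
print(result)
```

An imbalance ratio well above 1 and a large positive skewness would correspond to the "Class Imbalance" and "Flow Duration" rows of the table.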
pandas
numpy
matplotlib
seaborn
plotly
dash
geopandas
scikit-learn

Ananya P S





