Machine Learning python project for clustering Android malware.
vladd-bit/ml-malware-clustering
SAMPLE TEST RESULT GRAPHS ARE AVAILABLE IN THE test_results FOLDER; they are there mainly as an example of output.

The project code is structured as follows:

├── malwareClustering.py    <-- MAIN FILE, also defines the configuration of the storage directories etc.
├── README.txt
├── src                     <-- contains all main project code: clustering, analysis, plotting, etc.
│   ├── apkFeature.py       <-- simple object made for storing APK data
│   ├── algorithms          <-- contains the code for both unsupervised and supervised learning methods
│   │   ├── supervised.py
│   │   └── unsupervised.py
│   ├── config.py           <-- configuration file for the application
│   ├── dataAnalyzer.py     <-- used to perform one-hot encoding
│   ├── dataClusterer.py    <-- contains the clustering process
│   ├── dataProcessor.py    <-- extracts data from an APK or from an existing feature vector of strings
│   ├── errorLog.py         <-- for outputting logs and errors
│   ├── fileReader.py       <-- reads all files necessary for the project
│   └── util.py             <-- evaluation score computation occurs here
└── test_results            <-- contains test result graphs from various scenarios

In order to run the project software, the following steps must be done:

1 - install python 3.7
2 - install scikit-learn (imported as sklearn), androguard, pandas, matplotlib, numpy via pip

Or execute the following Linux command for installation:

sudo apt-get install python3.7 python3.7-dev python3-pip && python3.7 -m pip install numpy pandas scikit-learn matplotlib androguard

TO CONFIGURE THE LOCATION OF THE DREBIN DATASET, PLEASE CHANGE THE FOLLOWING LINES INSIDE malwareClustering.py:

input_drebin_directory = '/home/user/Projects/drebin/'                          <-- DREBIN APK ZIP FILE LOCATIONS
output_directory = '/home/user/Projects/output_dir/'                            <-- WHERE ALL OUTPUT IS STORED
feature_vector_directory = '/home/user/Projects/drebin/feature_vectors/*'       <-- NEEDS TO HAVE THE FEATURES OF THE APKS FROM THE DREBIN DATASET
labeled_apk_csv_file_path = '/home/user/Projects/drebin/sha256_family.csv'      <-- DREBIN LABELS CSV FILE PATH

To run the project, run the following command inside the project directory (where malwareClustering.py is). It displays a list of the available parameters:

python3.7 malwareClustering.py --help

To run clustering with the full dataset, please run the following command:

python3.7 malwareClustering.py -nc adaptive -m extract_existing_feature analyze_data clustering cache_all plot_data

The results are stored in the output_dir/stats/ folder.

##############################################################################

Below are the parameter commands available:

-h, --help
    show this help message and exit

-p DREBIN_PATTERN, --drebin_pattern DREBIN_PATTERN
    folder names (if any) of where the drebin APKs are stored, used as a pattern for searching, e.g. drebin- (optional)

-odir OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
    output directory for cluster statistics and data

-idir INPUT_DIRECTORY, --input_directory INPUT_DIRECTORY
    input directory for the DREBIN APKs (zip containing the manifest files only)

-dcsv DREBIN_CSV_LOCATION, --drebin_csv_location DREBIN_CSV_LOCATION
    location of the drebin csv that contains the labels of all APK samples

-f FEATURE_VECTOR_DIRECTORY, --feature_vector_directory FEATURE_VECTOR_DIRECTORY
    feature vector directory

-m, --operation_mode {extract_apk_feature,extract_existing_feature,learning,analyze_data,clustering,cache_all,load_cache,plot_data,visualize_dataset,threshold_dataset,recompute_stats,include_benign}
    operation mode: extract data from APKs, apply machine learning on existing feature vectors, or simply process data from APKs (feature vectors); several modes can be passed at once, as in the full-dataset command above; default: none

-ls LOAD_PREVIOUS_STATS, --load_previous_stats LOAD_PREVIOUS_STATS
    load previous stats from file

-dhs [DOWNSAMPLE_THRESHOLD [DOWNSAMPLE_THRESHOLD ...]], --downsample_threshold [DOWNSAMPLE_THRESHOLD [DOWNSAMPLE_THRESHOLD ...]]
    only consider X samples from each of the thresholded classes; must be used in conjunction with the -ths argument: -ths X 1, where 1 means enable downsampling

-ths [THRESHOLD_SAMPLES [THRESHOLD_SAMPLES ...]], --threshold_samples [THRESHOLD_SAMPLES [THRESHOLD_SAMPLES ...]]
    only consider malware classes that have more than X samples; the classes can be downsampled to the X samples passed if a second parameter is added: 1 or 0

-nc NUM_CLUSTERS, --num_clusters NUM_CLUSTERS
    number of clusters to be used in unsupervised methods; a range (n, m) with n, m >= 2 is written as e.g. -nc "2|166", which will iteratively cluster from 2 to 166 clusters (will take some time); -nc 23 44 66 will cluster only with 23, 44 and 66 clusters

-c [CLUSTERING_METHOD [CLUSTERING_METHOD ...]], --clustering_method [CLUSTERING_METHOD [CLUSTERING_METHOD ...]]
    clustering method type (e.g. All, k-means, dbscan, etc.)

##############################################################################
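
The three forms of the -nc value mentioned above (adaptive, an "n|m" range, or a plain list of counts) could be normalised roughly as in the sketch below. This is an illustration only: parse_num_clusters is a hypothetical helper, not a function taken from this project, and it assumes that the values arrive as a list of strings and that an "n|m" range includes both ends.

```python
from typing import List, Union


def parse_num_clusters(tokens: List[str]) -> Union[str, List[int]]:
    """Hypothetical helper: turn -nc tokens into 'adaptive' or a list of cluster counts."""
    # A single "adaptive" token is passed through unchanged.
    if len(tokens) == 1 and tokens[0].lower() == "adaptive":
        return "adaptive"

    counts: List[int] = []
    for token in tokens:
        if "|" in token:
            # Assumption: "n|m" means every cluster count from n to m, both included.
            low_text, high_text = token.split("|", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"range start exceeds range end: {token!r}")
            counts.extend(range(low, high + 1))
        else:
            # A plain integer such as "23" selects exactly that cluster count.
            counts.append(int(token))

    if not counts:
        raise ValueError("no cluster counts given")
    if any(count < 2 for count in counts):
        raise ValueError("every cluster count must be >= 2")
    return counts


if __name__ == "__main__":
    print(parse_num_clusters(["adaptive"]))
    print(parse_num_clusters(["2|6"]))
    print(parse_num_clusters(["23", "44", "66"]))
```

Rejecting counts below 2 mirrors the n, m >= 2 constraint stated for -nc; how the project itself validates this value is not shown in this README.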