xtract-sampler

ML code to sample a file based on cheap, easily attainable features of that file.

Training a model using a .csv:

python xtract_sampler_main.py --mode train --classifier ex1 --feature ex2 --label_csv ex3

ex1 should be rf, svc, or logit for a random forest, support vector classification, or logistic regression model.
ex2 should be head, rand, or randhead to set the features as bytes from the head of the file, random bytes, or a mixture of both.
ex3 is the path to a .csv file with the file path, file size, and file label for files to train on.
Additional --head_bytes and --rand_bytes parameters can be passed to specify the number of bytes to take from the file (the default is 512 bytes if these parameters aren't passed).
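
To make the three feature types concrete, here is a minimal sketch of how head, rand, and randhead byte features could be extracted. It is not the repository's implementation: the function names, the zero-padding of short files, and the seeded sampling with replacement are all assumptions made for illustration.

```python
import os
import random


def head_features(path, head_bytes=512):
    """Return the first `head_bytes` bytes of the file as a list of ints (0-255)."""
    with open(path, "rb") as f:
        data = f.read(head_bytes)
    # Assumption: pad short files with zeros so every vector has the same length.
    return list(data) + [0] * (head_bytes - len(data))


def rand_features(path, rand_bytes=512, seed=0):
    """Return `rand_bytes` bytes read from random offsets in the file."""
    size = os.path.getsize(path)
    if size == 0:
        # Assumption: an empty file yields an all-zero vector.
        return [0] * rand_bytes
    rng = random.Random(seed)  # seeded so repeated runs sample the same offsets
    offsets = sorted(rng.randrange(size) for _ in range(rand_bytes))
    features = []
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            features.append(f.read(1)[0])
    return features


def randhead_features(path, head_bytes=512, rand_bytes=512, seed=0):
    """Concatenate head bytes and random bytes into one feature vector."""
    return head_features(path, head_bytes) + rand_features(path, rand_bytes, seed)
```

Because the vector length depends on the byte counts, a model trained with one --head_bytes or --rand_bytes value can only be fed vectors built with the same value.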

Predicting with a trained model:

python xtract_sampler_main.py --mode predict --trained_classifier ex1 --feature ex2 --predict_file ex3

ex1 is the path to a trained classifier, trained using the training mode of xtract_sampler_main.py.
ex2 is the type of feature that ex1 was trained on (head, rand, randhead).
- Note: If a --head_bytes or --rand_bytes value was passed during training, the same value must be passed during prediction.
ex3 is the path to the file to predict on.
- Alternatively, to predict on a directory, use --dirname ex3 instead of --predict_file ex3.

Two-phase automated training allows users to generate labels and save features for multiple directories before training on those features and labels.

1. python xtract_sampler_main.py --mode labels_features --dirname ex1 --features_outfile ex2 --csv_outfile ex3 --features ex4
- ex1 is the directory to generate labels from and to grab features from.
- ex2 is the name/path of the .pkl file to write file features to.
- ex3 is the name/path of the .csv file to write labels to.
- ex4 should be head, rand, or randhead to set the features as bytes from the head of the file, random bytes, or a mixture of both.
- Additional --head_bytes and --rand_bytes parameters can be passed to specify the number of bytes to take from the file (the default is 512 bytes if these parameters aren't passed).
2. Repeat step 1 with as many directories as you want. However, --features_outfile and --features must always be the same. Additionally, if --head_bytes or --rand_bytes is passed, they must stay the same too.
3. python xtract_sampler_main.py --mode train --classifier ex1 --features ex2 --features_outfile ex3
- ex1 should be rf, svc, or logit for a random forest, support vector classification, or logistic regression model.
- ex2 should be head, rand, or randhead for the features to be bytes from the head of the file, random bytes, or a mixture of both.
- ex3 is the name/path of the .pkl file passed to --features_outfile in steps 1 and 2.
  - Note: If a --head_bytes or --rand_bytes value was passed during steps 1 and 2, the same value must be passed here.
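
The two-phase workflow above can be scripted so that the shared options cannot drift between directories. The sketch below only builds the command lines, using the flags exactly as they appear in the commands above. The helper names (build_two_phase_commands, run_all) are hypothetical, and passing one .csv path per directory is an assumption, since the README does not say whether --csv_outfile may be reused.

```python
import subprocess

SCRIPT = ["python", "xtract_sampler_main.py"]


def _byte_args(head_bytes, rand_bytes):
    """Return optional --head_bytes/--rand_bytes arguments, omitted when None."""
    args = []
    if head_bytes is not None:
        args += ["--head_bytes", str(head_bytes)]
    if rand_bytes is not None:
        args += ["--rand_bytes", str(rand_bytes)]
    return args


def build_two_phase_commands(dir_csv_pairs, features_outfile, features,
                             classifier, head_bytes=None, rand_bytes=None):
    """Build one labels_features command per directory, then one train command.

    dir_csv_pairs is a list of (dirname, csv_outfile) tuples (an assumption).
    features_outfile, features, and the byte counts are shared by every
    command, as the workflow requires.
    """
    extra = _byte_args(head_bytes, rand_bytes)
    commands = []
    for dirname, csv_outfile in dir_csv_pairs:
        commands.append(SCRIPT + [
            "--mode", "labels_features",
            "--dirname", dirname,
            "--features_outfile", features_outfile,
            "--csv_outfile", csv_outfile,
            "--features", features,
        ] + extra)
    commands.append(SCRIPT + [
        "--mode", "train",
        "--classifier", classifier,
        "--features", features,
        "--features_outfile", features_outfile,
    ] + extra)
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, run_all(build_two_phase_commands([("dirA", "a.csv"), ("dirB", "b.csv")], "feats.pkl", "head", "rf")) would run two labeling passes followed by one training pass, assuming it is executed from the repository root.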

Models created in training mode are saved as classifier-feature-date.pkl, where classifier and feature are the values passed on the command line and date is the current date. Training a model also creates a .json file named classifier-feature-date.json that contains training times and accuracy results for the trained model. To change the model name, pass --model_name ex1, where ex1 is the name of the file to save the model to.
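
As an illustration of the naming scheme, the following sketch derives both default file names from the classifier and feature values. The README does not specify the date format, so the ISO format (YYYY-MM-DD) used here is an assumption, and default_output_names is a hypothetical helper rather than a function from this repository.

```python
from datetime import date


def default_output_names(classifier, feature, today=None):
    """Return (model_file, results_file) following classifier-feature-date."""
    today = today or date.today()
    # Assumption: the date is rendered in ISO format, e.g. 2020-01-15.
    stem = f"{classifier}-{feature}-{today.isoformat()}"
    return f"{stem}.pkl", f"{stem}.json"


print(default_output_names("rf", "head", date(2020, 1, 15)))
# prints ('rf-head-2020-01-15.pkl', 'rf-head-2020-01-15.json')
```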
Predictions from the prediction mode will be saved under the name sampler_results.json. To change this, pass --results_file ex1 where ex1 is the name of the file to save prediction results.
