This is my attempt at the KDD Cup 1999 challenge using Python, Scikit-learn, and Spark. The dataset for this data mining competition can be found here.
You can find the complete description of the task here.
Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.
A connection is a sequence of TCP packets starting and ending at some well defined times, during which data flows between a source IP address and a target IP address under some well defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
- probing: surveillance and other probing, e.g., port scanning.
It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.
Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. Based on this idea, we will experiment with different machine learning approaches.
We will start by working on the reduced dataset (the 10 percent subset provided). There we will do some exploratory data analysis using Pandas. Then we will build a classifier using Scikit-learn. Our classifier will just classify entries into normal or attack. By doing so, we can generalise the model to new attack types.
However, in our final approach we want to use clustering and anomaly detection. We want our model to work well with unknown attack types and also to give an approximation of the closest known attack type. Initially we will do the clustering using Scikit-learn again and see if we can beat our previous classifier.
Finally, we will use Spark to implement the clustering approach on the complete dataset, which contains around 5 million interactions.
import pandas
from time import time
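# The 41 feature names of the KDD Cup 99 data, followed by the label column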
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
kdd_data_10percent = pandas.read_csv("/nfs/data/KDD99/kddcup.data_10_percent", header=None, names = col_names)
kdd_data_10percent.describe()
duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 494021.000000 | 4.940210e+05 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | ... | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 |
mean | 47.979302 | 3.025610e+03 | 868.532425 | 0.000045 | 0.006433 | 0.000014 | 0.034519 | 0.000152 | 0.148247 | 0.010212 | ... | 232.470778 | 188.665670 | 0.753780 | 0.030906 | 0.601935 | 0.006684 | 0.176754 | 0.176443 | 0.058118 | 0.057412 |
std | 707.746472 | 9.882181e+05 | 33040.001252 | 0.006673 | 0.134805 | 0.005510 | 0.782103 | 0.015520 | 0.355345 | 1.798326 | ... | 64.745380 | 106.040437 | 0.410781 | 0.109259 | 0.481309 | 0.042133 | 0.380593 | 0.380919 | 0.230590 | 0.230140 |
min | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 4.500000e+01 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 46.000000 | 0.410000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 5.200000e+02 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 1.032000e+03 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 0.040000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 58329.000000 | 6.933756e+08 | 5155468.000000 | 1.000000 | 3.000000 | 3.000000 | 30.000000 | 5.000000 | 1.000000 | 884.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 38 columns
Now we have our data loaded into a Pandas data frame. In order to get familiar with our data, let's have a look at how the labels are distributed.
kdd_data_10percent['label'].value_counts()
smurf. 280790
neptune. 107201
normal. 97278
back. 2203
satan. 1589
ipsweep. 1247
portsweep. 1040
warezclient. 1020
teardrop. 979
pod. 264
nmap. 231
guess_passwd. 53
buffer_overflow. 30
land. 21
warezmaster. 20
imap. 12
rootkit. 10
loadmodule. 9
ftp_write. 8
multihop. 7
phf. 4
perl. 3
spy. 2
dtype: int64
Initially, we will use all the numeric features. We need to do something with our categorical variables (protocol_type, service, and flag); for now, we will simply not include them in the training features.
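Should we want to include them later, a one-hot encoding along these lines would work (just a sketch; it is not used in what follows):
# One-hot encode the three categorical columns into indicator features
categorical_features = pandas.get_dummies(kdd_data_10percent[['protocol_type','service','flag']])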
num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 494021.000000 | 4.940210e+05 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | ... | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 |
mean | 47.979302 | 3.025610e+03 | 868.532425 | 0.000045 | 0.006433 | 0.000014 | 0.034519 | 0.000152 | 0.148247 | 0.010212 | ... | 232.470778 | 188.665670 | 0.753780 | 0.030906 | 0.601935 | 0.006684 | 0.176754 | 0.176443 | 0.058118 | 0.057412 |
std | 707.746472 | 9.882181e+05 | 33040.001252 | 0.006673 | 0.134805 | 0.005510 | 0.782103 | 0.015520 | 0.355345 | 1.798326 | ... | 64.745380 | 106.040437 | 0.410781 | 0.109259 | 0.481309 | 0.042133 | 0.380593 | 0.380919 | 0.230590 | 0.230140 |
min | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 4.500000e+01 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 46.000000 | 0.410000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 5.200000e+02 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 1.032000e+03 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 0.040000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 58329.000000 | 6.933756e+08 | 5155468.000000 | 1.000000 | 3.000000 | 3.000000 | 30.000000 | 5.000000 | 1.000000 | 884.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 38 columns
As we mentioned, we are going to reduce the outputs to just normal and attack.
from sklearn.neighbors import KNeighborsClassifier
labels = kdd_data_10percent['label'].copy()
labels[labels!='normal.'] = 'attack.'
labels.value_counts()
attack. 396743
normal. 97278
dtype: int64
We are going to use a lot of distance-based methods here. In order to avoid distances along some features dominating the others, we need to scale all of them.
from sklearn.preprocessing import MinMaxScaler
features = features.apply(lambda x: MinMaxScaler().fit_transform(x))
features.describe()
duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 494021.000000 | 4.940210e+05 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | ... | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 | 494021.000000 |
mean | 0.000823 | 4.363595e-06 | 0.000168 | 0.000045 | 0.002144 | 0.000005 | 0.001151 | 0.000030 | 0.148247 | 0.000012 | ... | 0.911650 | 0.739865 | 0.753780 | 0.030906 | 0.601935 | 0.006684 | 0.176754 | 0.176443 | 0.058118 | 0.057412 |
std | 0.012134 | 1.425228e-03 | 0.006409 | 0.006673 | 0.044935 | 0.001837 | 0.026070 | 0.003104 | 0.355345 | 0.002034 | ... | 0.253903 | 0.415845 | 0.410781 | 0.109259 | 0.481309 | 0.042133 | 0.380593 | 0.380919 | 0.230590 | 0.230140 |
min | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 6.489989e-08 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.180392 | 0.410000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 7.499542e-07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 1.488371e-06 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 0.040000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 1.000000 | 1.000000e+00 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 38 columns
By using Principal Component Analysis, we can reduce the dimensionality of our data and plot it in a two-dimensional space. PCA will capture the directions of maximum variance, minimising the information loss.
# TODO
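A minimal sketch of what that visualisation could look like, assuming matplotlib is available (sampling, colours, and marker sizes are arbitrary choices):
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Project the scaled features onto the first two principal components
pca = PCA(n_components=2)
features_2d = pca.fit_transform(features)
# Colour points by our binary normal/attack labels
for label, colour in [('normal.', 'blue'), ('attack.', 'red')]:
    mask = (labels == label).values
    plt.scatter(features_2d[mask, 0], features_2d[mask, 1],
                c=colour, label=label, s=1, alpha=0.3)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()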
Following the idea that new attack types will be similar to known types, let's start by trying a k-nearest neighbours classifier. We must avoid brute-force comparisons in the N×d space at all costs: with N the number of samples in our data (more than 400K) and d the number of features (38), we would end up with an unfeasibly slow modelling process. For this reason we pass algorithm='ball_tree'. For more on kNN performance, check [here](http://scikit-learn.org/stable/modules/neighbors.html#choice-of-nearest-neighbors-algorithm).
clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'ball_tree', leaf_size=500)
t0 = time()
clf.fit(features,labels)
tt = time()-t0
print "Classifier trained in {} seconds".format(round(tt,3))
Classifier trained in 2405.17 seconds
Now let's try the classifier with the testing data. First we need to load the labelled test data. We will also sample 10 percent of the entries. For that, we will take advantage of the train_test_split function in sklearn.
kdd_data_corrected = pandas.read_csv("/nfs/data/KDD99/corrected", header=None, names = col_names)
kdd_data_corrected['label'].value_counts()
smurf. 164091
normal. 60593
neptune. 58001
snmpgetattack. 7741
mailbomb. 5000
guess_passwd. 4367
snmpguess. 2406
satan. 1633
warezmaster. 1602
back. 1098
mscan. 1053
apache2. 794
processtable. 759
saint. 736
portsweep. 354
ipsweep. 306
httptunnel. 158
pod. 87
nmap. 84
buffer_overflow. 22
multihop. 18
named. 17
sendmail. 17
ps. 16
xterm. 13
rootkit. 13
teardrop. 12
xlock. 9
land. 9
xsnoop. 4
ftp_write. 3
sqlattack. 2
loadmodule. 2
worm. 2
perl. 2
phf. 2
udpstorm. 2
imap. 1
dtype: int64
We can see that we have new attack labels. In any case, we will convert all of them to the attack. label.
kdd_data_corrected.loc[kdd_data_corrected['label'] != 'normal.', 'label'] = 'attack.'
kdd_data_corrected['label'].value_counts()
attack. 250436
normal. 60593
dtype: int64
Again we select features and scale.
from sklearn.cross_validation import train_test_split
kdd_data_corrected[num_features] = kdd_data_corrected[num_features].astype(float)
kdd_data_corrected[num_features] = kdd_data_corrected[num_features].apply(lambda x: MinMaxScaler().fit_transform(x))
kdd_data_corrected[num_features]
duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
1 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
2 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
3 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
4 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
5 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
6 | 0.000000 | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.039216 | 0.011765 | 0.30 | 0.30 | 0.30 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
7 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.992157 | 0.99 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
8 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
9 | 0.000000 | 0.000004 | 0.000036 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.278431 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0 | 0.00 | 0.00 |
10 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
11 | 0.000000 | 0.000004 | 0.000050 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.011765 | 1.000000 | 1.00 | 0.00 | 0.33 | 0.07 | 0.33 | 0 | 0.00 | 0.00 |
12 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
13 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.988235 | 0.99 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
14 | 0.000017 | 0.000050 | 0.000063 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.211765 | 0.152941 | 0.72 | 0.11 | 0.02 | 0.00 | 0.02 | 0 | 0.09 | 0.13 |
15 | 0.000000 | 0.000005 | 0.002650 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.694118 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0 | 0.00 | 0.00 |
16 | 0.000000 | 0.000005 | 0.000681 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.733333 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0 | 0.00 | 0.00 |
17 | 0.000000 | 0.000005 | 0.000145 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.768627 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0 | 0.00 | 0.00 |
18 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
19 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
20 | 0.000000 | 0.000004 | 0.001775 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.227451 | 1.000000 | 1.00 | 0.00 | 0.02 | 0.05 | 0.00 | 0 | 0.00 | 0.00 |
21 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.992157 | 0.99 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
22 | 0.000000 | 0.000004 | 0.000036 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
23 | 0.000000 | 0.000004 | 0.001699 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
24 | 0.000000 | 0.000004 | 0.003760 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
25 | 0.000000 | 0.000012 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.301961 | 0.129412 | 0.34 | 0.08 | 0.34 | 0.06 | 0.00 | 0 | 0.00 | 0.00 |
26 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
27 | 0.000000 | 0.000560 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.360784 | 0.172549 | 0.43 | 0.07 | 0.43 | 0.05 | 0.00 | 0 | 0.00 | 0.00 |
28 | 0.000000 | 0.000133 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.403922 | 0.211765 | 0.49 | 0.06 | 0.49 | 0.04 | 0.00 | 0 | 0.00 | 0.00 |
29 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
310999 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311000 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 0.996078 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311001 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311002 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311003 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311004 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311005 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311006 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311007 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311008 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311009 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311010 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311011 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311012 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311013 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311014 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311015 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311016 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311017 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311018 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311019 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311020 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311021 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311022 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311023 | 0.000000 | 0.000002 | 0.000020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311024 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311025 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311026 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311027 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311028 | 0.000000 | 0.000002 | 0.000028 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.000000 | 1.000000 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
311029 rows × 38 columns
Now we can sample 10 percent of the test data (after we scale it). Although train_test_split also gives us training data, we don't need it in this case.
features_train, features_test, labels_train, labels_test = train_test_split(
kdd_data_corrected[num_features],
kdd_data_corrected['label'],
test_size=0.1,
random_state=42)
Now, let's make predictions using our classifier and the test data. kNN classifiers are slow compared to other methods due to all the comparisons required to make each prediction.
t0 = time()
pred = clf.predict(features_test)
tt = time() - t0
print "Predicted in {} seconds".format(round(tt,3))
Predicted in 902.673 seconds
That took a lot of time. Actually, the more training data we use with a kNN classifier, the slower it gets to predict, since it needs to compare the new data with all the stored points. Definitely we want some centroid-based classifier if we plan to use it in real-time detection.
And finally, calculate the accuracy using the test labels.
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print "Accuracy is {}.".format(round(acc,4))
Accuracy is 0.8202.
So finally, let's try our anomaly detection approach on the reduced dataset. We will start by doing k-means clustering. Once we have the cluster centers, we will use them to determine the labels of the (unlabeled) test data.
Based on the assumption that new attack types will resemble old ones, we will be able to detect them. Moreover, anything that falls too far from every cluster will be considered anomalous, and therefore a possible attack.
from sklearn.cluster import KMeans
k = 30
km = KMeans(n_clusters = k)
t0 = time()
km.fit(features)
tt = time()-t0
print "Clustered in {} seconds".format(round(tt,3))
Clustered in 331.394 seconds
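As an aside, the "falls too far from every cluster" criterion above could be sketched as follows, using distances to the fitted centres; the 99th-percentile threshold is an arbitrary assumption, not something tuned here:
import numpy as np
# Distance from each interaction to its closest fitted cluster centre
distances = km.transform(features).min(axis=1)
# Flag the most distant 1 percent as anomalous (threshold is an assumption)
threshold = np.percentile(distances, 99)
anomalous = features[distances > threshold]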
Now we can check cluster sizes.
pandas.Series(km.labels_).value_counts()
0 262807
1 48555
15 38319
5 24427
3 20528
12 19508
28 17879
8 11524
26 9162
7 4941
17 4272
10 4215
25 3996
13 3513
2 2845
29 1951
11 1686
4 1640
24 1557
9 1341
14 1239
16 1230
6 1201
21 1116
18 1020
19 970
23 949
27 775
20 680
22 175
dtype: int64
Get labels for each cluster. Here, we go back to using the complete set of labels.
labels = kdd_data_10percent['label']
label_names = map(
lambda x: pandas.Series([labels[i] for i in range(len(km.labels_)) if km.labels_[i]==x]),
range(k))
Print labels for each cluster.
for i in range(k):
print "Cluster {} labels:".format(i)
print label_names[i].value_counts()
print
Cluster 0 labels:
smurf. 262805
normal. 2
dtype: int64
Cluster 1 labels:
neptune. 48551
portsweep. 4
dtype: int64
Cluster 2 labels:
normal. 2845
dtype: int64
Cluster 3 labels:
neptune. 20456
portsweep. 58
satan. 10
normal. 4
dtype: int64
Cluster 4 labels:
normal. 1510
pod. 60
smurf. 36
satan. 22
teardrop. 7
rootkit. 3
spy. 1
nmap. 1
dtype: int64
Cluster 5 labels:
normal. 23165
back. 1258
phf. 3
satan. 1
dtype: int64
Cluster 6 labels:
normal. 1201
dtype: int64
Cluster 7 labels:
normal. 4873
warezclient. 52
rootkit. 4
satan. 4
perl. 3
ipsweep. 2
loadmodule. 1
spy. 1
imap. 1
dtype: int64
Cluster 8 labels:
normal. 11424
smurf. 98
nmap. 2
dtype: int64
Cluster 9 labels:
normal. 1230
satan. 109
teardrop. 1
portsweep. 1
dtype: int64
Cluster 10 labels:
normal. 4153
guess_passwd. 49
back. 7
ipsweep. 3
portsweep. 2
neptune. 1
dtype: int64
Cluster 11 labels:
normal. 1177
ipsweep. 339
pod. 94
nmap. 22
warezmaster. 18
satan. 12
imap. 8
smurf. 3
land. 3
ftp_write. 2
rootkit. 2
guess_passwd. 2
multihop. 2
loadmodule. 1
portsweep. 1
dtype: int64
Cluster 12 labels:
normal. 19508
dtype: int64
Cluster 13 labels:
normal. 3500
back. 12
warezclient. 1
dtype: int64
Cluster 14 labels:
warezclient. 661
normal. 537
buffer_overflow. 22
ftp_write. 6
back. 5
loadmodule. 5
multihop. 2
rootkit. 1
dtype: int64
Cluster 15 labels:
neptune. 38189
nmap. 103
portsweep. 19
normal. 6
land. 1
guess_passwd. 1
dtype: int64
Cluster 16 labels:
satan. 1222
portsweep. 8
dtype: int64
Cluster 17 labels:
normal. 4237
satan. 26
portsweep. 9
dtype: int64
Cluster 18 labels:
portsweep. 934
ipsweep. 83
normal. 3
dtype: int64
Cluster 19 labels:
teardrop. 970
dtype: int64
Cluster 20 labels:
normal. 370
warezclient. 306
multihop. 2
warezmaster. 2
dtype: int64
Cluster 21 labels:
ipsweep. 813
normal. 118
nmap. 99
pod. 71
land. 14
multihop. 1
dtype: int64
Cluster 22 labels:
satan. 172
portsweep. 3
dtype: int64
Cluster 23 labels:
normal. 949
dtype: int64
Cluster 24 labels:
normal. 1497
pod. 29
smurf. 7
ipsweep. 7
satan. 6
nmap. 4
imap. 3
neptune. 2
teardrop. 1
portsweep. 1
dtype: int64
Cluster 25 labels:
normal. 3521
back. 464
buffer_overflow. 8
loadmodule. 2
guess_passwd. 1
dtype: int64
Cluster 26 labels:
normal. 9091
back. 71
dtype: int64
Cluster 27 labels:
normal. 760
pod. 10
land. 3
neptune. 2
dtype: int64
Cluster 28 labels:
smurf. 17841
normal. 33
satan. 5
dtype: int64
Cluster 29 labels:
normal. 1564
back. 386
phf. 1
dtype: int64
We can see that, in most clusters, there is a dominant label. It would be interesting to go cluster by cluster and analyse majority labels, or how labels are split between different clusters (some with more dominance than others). All that would help us understand each type of attack! This is also a benefit of using a clustering-based approach.
- Get dominant labels.
- Analyse cluster centers, especially for heterogeneous clusters containing normal. interactions. This will reveal conflicting interactions.
We can now predict using our test data.
t0 = time()
pred = km.predict(kdd_data_corrected[num_features])
tt = time() - t0
print "Assigned clusters in {} seconds".format(round(tt,3))
Assigned clusters in 0.693 seconds
We can see that the assignment process is much faster than the prediction process with our kNN. But we still need to assign labels.
# TODO: get mayority label for each cluster assignment (we have labels from the previous step)
# TODO: check these labels with those in the corrected test data in order to calculate accuracy
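A possible sketch for those two TODOs, reusing the label_names computed above; the binary collapse mirrors what we did with the training labels (a sketch, not a tuned solution):
# Majority training label per cluster
majority_label = [label_names[i].value_counts().index[0] for i in range(k)]
# Map test cluster assignments to labels and collapse to normal/attack
pred_labels = [majority_label[c] for c in pred]
pred_binary = ['normal.' if l == 'normal.' else 'attack.' for l in pred_labels]
acc = accuracy_score(kdd_data_corrected['label'], pred_binary)
print "Accuracy is {}.".format(round(acc,4))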
The script KDDCup99.py runs through a series of steps to perform k-means clustering over the complete dataset using PySpark, trying different K values in order to find the best one.
The clustering results are stored in a CSV file. This file is very convenient for visualisation purposes. It would be very hard to cluster and visualise the results of the complete dataset using Scikit-learn.
The following chart depicts the first two principal components for the clustering results.
Remember that we have up to 24 different labels in our complete dataset, yet we have generated up to 80 different clusters. As a result, some of the clusters appear very close in the first principal component. This is due to the variability of interactions within a given type of attack (or label).
# TODO: follow the same approach for label assignment in the test data as before
Note: in the following we have used a 7-node Spark cluster, with 512MB of RAM and 2 cores per node.
In order to show how we use Spark to do k-means clustering on our dataset, let's perform a single clustering run here with the complete dataset, for a K value of 80, which proved to be particularly good.
# Some imports we will use
from collections import OrderedDict
from time import time
First we need to load the data, using the complete dataset file stored in NFS.
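# sc is the SparkContext, provided automatically by the PySpark shell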
data_file = "/nfs/data/KDD99/kddcup.data"
raw_data = sc.textFile(data_file)
As a warm-up, let's count the number of interactions by label.
# count by all different labels and print them decreasingly
print "Counting all different labels"
labels = raw_data.map(lambda line: line.strip().split(",")[-1])
t0 = time()
label_counts = labels.countByValue()
tt = time()-t0
sorted_labels = OrderedDict(sorted(label_counts.items(), key=lambda t: t[1], reverse=True))
for label, count in sorted_labels.items():
print label, count
print "Counted in {} seconds".format(round(tt,3))
Counting all different labels
smurf. 2807886
neptune. 1072017
normal. 972781
satan. 15892
ipsweep. 12481
portsweep. 10413
nmap. 2316
back. 2203
warezclient. 1020
teardrop. 979
pod. 264
guess_passwd. 53
buffer_overflow. 30
land. 21
warezmaster. 20
imap. 12
rootkit. 10
loadmodule. 9
ftp_write. 8
multihop. 7
phf. 4
perl. 3
spy. 2
Counted in 9.12 seconds
Now we prepare the data for clustering input. The data contains non-numeric features, and we want to exclude them, since k-means works just with numeric features. These are the three columns following duration (protocol_type, service, and flag), plus the last column in each data row, which is the label.
In order to do that, we define a function that we apply to the RDD as a Spark transformation by using map. Remember that we can apply as many transformations as we want without Spark starting any processing. It is only when we trigger an action that all the transformations are actually applied.
from numpy import array

def parse_interaction(line):
    """
    Parses a network data interaction into a (label, numeric features) pair,
    dropping the three categorical columns.
    """
    line_split = line.split(",")
    clean_line_split = [line_split[0]]+line_split[4:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))
parsed_data = raw_data.map(parse_interaction)
parsed_data_values = parsed_data.values().cache()
Additionally, we have used cache in order to keep the results at hand once they are calculated by the first action.
We will also standardise our data, as we have done so far when performing distance-based methods (this time with a standard scaler rather than min-max scaling).
from pyspark.mllib.feature import StandardScaler
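# StandardScaler(withMean=True, withStd=True) centres each feature and scales it to unit variance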
standardizer = StandardScaler(True, True)
t0 = time()
standardizer_model = standardizer.fit(parsed_data_values)
tt = time() - t0
standardized_data_values = standardizer_model.transform(parsed_data_values)
print "Data standardized in {} seconds".format(round(tt,3))
Data standardized in 9.54 seconds
We can now perform k-means clustering.
from pyspark.mllib.clustering import KMeans
t0 = time()
clusters = KMeans.train(standardized_data_values, 80,
maxIterations=10, runs=5,
initializationMode="random")
tt = time() - t0
print "Data clustered in {} seconds".format(round(tt,3))
Data clustered in 137.496 seconds
Once we have our clusters, we can use them to label test data and test accuracy.
# TODO
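A minimal sketch of that labelling step, reusing parse_interaction and the fitted standardizer_model on the corrected test file (the same NFS path as in the Scikit-learn section); mapping each cluster index to its majority training label, as we did before, would complete the accuracy calculation:
test_raw_data = sc.textFile("/nfs/data/KDD99/corrected")
test_parsed_values = test_raw_data.map(parse_interaction).values()
test_standardized = standardizer_model.transform(test_parsed_values)
# Assign each standardised test interaction to its closest cluster centre
test_assignments = test_standardized.map(lambda datum: clusters.predict(datum))
print test_assignments.take(10)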
This repository contains a variety of content; some developed by Jose A. Dianes, and some from third-parties. The third-party content is distributed under the license provided by those parties.
The content developed by Jose A. Dianes is distributed under the following license:
Copyright 2016 Jose A Dianes
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.