Run the cell below to generate the road map (do not modify it)
In [1]:
%%javascript
var kernel = IPython.notebook.kernel;var thename = window.document.getElementById("notebook_name").innerHTML;var command = "THE_NOTEBOOK = " + "'"+thename+"'";kernel.execute(command);command="os.environ['THE_NOTEBOOK'] = THE_NOTEBOOK";kernel.execute(command);var cell = IPython.notebook.get_cell(2);cell.execute();IPython.notebook.get_cell(3).focus_cell();var x = $('.code_cell');$(x[1]).children('.input').hide();
In [2]:
outputdir = "/tmp/tools/"
!mkdir -p $outputdir
!wget "https://www.dropbox.com/s/4g0pigmro4vo1b4/menutemplate?dl=0" -O /tmp/tools/menutemplate >> /tmp/toollog 2>&1
!wget "https://www.dropbox.com/s/3flttpzhsja8td7/construct_menu.py?dl=0" -O /tmp/tools/construct_menu.py >> /tmp/toollog 2>&1
!python /tmp/tools/construct_menu.py "{THE_NOTEBOOK}.ipynb" {outputdir}
from IPython.core.display import HTML
output_file_name = outputdir + THE_NOTEBOOK.replace(" ", "").replace("[", "").replace("]", "") + ".ipynb.html"
with open(output_file_name) as fp:
html = fp.read()
HTML(html)
Out[2]: [HTML road-map menu rendered here]
Anomaly Detection in Network Traffic with K-means clustering
We can categorize machine learning algorithms into two main groups: supervised learning and unsupervised learning. Supervised learning algorithms need the target value of many previously seen examples in order to predict unknown values for new data. In contrast, unsupervised learning algorithms explore data that has no target attribute, looking for intrinsic structure in it.
Clustering is a technique for finding groups of similar items in data, called clusters. Clustering is usually considered an unsupervised learning task, since no class values denoting an a priori grouping of the data instances are given.
In this notebook, we will use K-means, a very well known clustering algorithm, to detect anomalous network connections based on statistics about each of them. For a thorough overview of K-means clustering from a research perspective, have a look at this wonderful tutorial.
Goals
We expect students to:
Learn (or revise) and understand the K-means algorithm
Implement a simple K-means algorithm
Use K-means to detect anomalies in network connection data
Steps
In section 1, we give an overview of K-means and then implement a simple version of it.
In section 2, we build models with and without categorical features.
Finally, in the last section, we will use our models to detect unusual connections.
1. K-means
1.1. Introduction
Clustering is a typical and well-known type of unsupervised learning. Clustering algorithms try to find natural groupings in data. Similar data points (according to some notion of similarity) are considered in the same group. We call these groups clusters.
K-means clustering is a simple and widely used clustering algorithm. Given a value of $k$, it tries to build $k$ clusters from the samples in the dataset; $k$ is therefore a hyperparameter of the model. The right value of $k$ is not easy to determine, as it highly depends on the data set and on the way the data is featurized.
To measure the similarity between any two data points, K-means requires the definition of a distance function between data points. What is a distance? It is a value that indicates how close two data points are in their space. In particular, when data points lie in a $d$-dimensional space, the Euclidean distance is a good choice of distance function, and is supported by MLlib.
In K-means, a cluster is a group of points with a representative entity called a centroid. A centroid is also a point in the data space: the center of all the points that make up the cluster, defined as their arithmetic mean. In general, when working with K-means, each data sample is represented as a $d$-dimensional numeric vector, for which it is easy to define an appropriate distance function. As a consequence, in some applications the original data must be transformed into a different representation to fit the requirements of K-means.
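To make these two ingredients concrete, here is a minimal NumPy sketch (added for illustration, not part of the original notebook) that computes the centroid of a small cluster as the arithmetic mean of its members and the Euclidean distance from one point to that centroid.
import numpy as np

# three points in a 3-dimensional space, forming one cluster
cluster = np.array([[1.0, 2.0, 0.0],
                    [3.0, 2.0, 2.0],
                    [2.0, 5.0, 1.0]])

# the centroid is the arithmetic mean of the cluster members, computed per dimension
centroid = cluster.mean(axis=0)               # array([2., 3., 1.])

# Euclidean distance between the first point and the centroid
dist = np.linalg.norm(cluster[0] - centroid)  # sqrt(3) ~ 1.73
print(centroid, dist)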
1.2. How does it work?
Given $k$, the K-means algorithm works as follows:
1. Randomly choose $k$ data points (seeds) to be the initial centroids
2. Assign each data point to the closest centroid
3. Re-compute (update) the centroids using the current cluster memberships
4. If a convergence criterion is not met, go to step 2
We can also terminate the algorithm when it reaches an iteration budget, which yields an approximate result. From the pseudo-code above, we can see that K-means results can be sensitive to the order in which data samples in the data set are explored. A sensible practice is to run the analysis several times, randomizing the order of the objects; then average the cluster centres of those runs and use the averaged centres as the initial ones for one final run of the analysis. A common off-the-shelf variant of this idea is sketched below.
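As an aside (not part of the original notebook), a widely used way to mitigate this sensitivity is to run K-means several times with different random initializations and keep one run; scikit-learn's KMeans does this through its n_init parameter, although it keeps the run with the lowest within-cluster sum of squares (inertia) rather than averaging the centres as suggested above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init=10 runs K-means ten times from different random initializations
# and keeps the run with the lowest inertia (within-cluster sum of squares)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.inertia_)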
1.3. Illustrative example
One of the best ways to study an algorithm is to try to implement it. In this section, we will implement a simple K-means algorithm step by step.
Question 1
Question 1.1
Complete the function below to calculate the Euclidean distance between any two points in a $d$-dimensional data space.
In [20]:
import numpy as np
# calculate distance between two d-dimensional points
def euclidean_distance(p1, p2):
    return np.sqrt(np.sum([(c1 - c2)**2 for c1, c2 in zip(p1, p2)]))
# test our function
assert (round(euclidean_distance([1,2,3] , [10,18,12]), 2) == 20.45), "Function's wrong"
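As a side note (added here, not from the original notebook), an equivalent and fully vectorized form relies on NumPy directly:
def euclidean_distance_vec(p1, p2):
    # same result as the loop-based version above, computed with a single vectorized call
    return np.linalg.norm(np.asarray(p1) - np.asarray(p2))

assert round(euclidean_distance_vec([1, 2, 3], [10, 18, 12]), 2) == 20.45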
Question 1.2
Given a data point and the current set of centroids, complete the function below to find the index of the closest centroid for that data point.
In [21]:
def find_closest_centroid(datapoint, centroids):
    # find the index of the closest centroid of the given data point
    return min(enumerate(centroids), key=lambda x: euclidean_distance(datapoint, x[1]))[0]
assert(find_closest_centroid( [1,1,1], [ [2,1,2], [1,2,1], [3,1,2] ] ) == 1), "Function's wrong"
Question 1.3
Write a function to randomize $k$ initial centroids.
In [22]:
np.random.seed(22324)
# randomize initial centroids: pick k distinct data points at random
def randomize_centroids(data, k):
    random_indices = np.arange(len(data))
    np.random.shuffle(random_indices)
    return [data[i] for i in random_indices[:k]]
assert(len(
randomize_centroids(
np.array([
np.array([2,1,2]),
np.array([1,2,1]),
np.array([3,1,2])
]),
2)) == 2), "Wrong function"
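A common alternative (added for illustration, not in the original notebook) is to sample the indices directly, which avoids shuffling the whole index array:
def randomize_centroids_choice(data, k):
    # sample k distinct row indices without replacement
    indices = np.random.choice(len(data), size=k, replace=False)
    return [data[i] for i in indices]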
Question 1.4
Write a function check_converge to check the stopping criteria of the algorithm.
In [23]:
MAX_ITERATIONS = 10

# return True if the clusters have converged, otherwise return False
def check_converge(centroids, old_centroids, num_iterations, threshold=0):
    # stop if the iteration budget is exhausted
    if num_iterations > MAX_ITERATIONS:
        return True
    # stop if the centroids did not move (or moved by at most the threshold)
    distances = np.array([euclidean_distance(c, o) for c, o in zip(centroids, old_centroids)])
    if (distances <= threshold).all():
        return True
    return False
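A quick sanity check of the two stopping conditions (added here, not in the original notebook):
# converged: the centroids did not move at all
assert check_converge([[0, 0], [1, 1]], [[0, 0], [1, 1]], num_iterations=1)
# not converged yet: a centroid moved and the iteration budget is not exhausted
assert not check_converge([[0, 0], [2, 2]], [[0, 0], [1, 1]], num_iterations=1)
# iteration budget exhausted
assert check_converge([[0, 0], [2, 2]], [[0, 0], [1, 1]], num_iterations=MAX_ITERATIONS + 1)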
Question 1.5
Write a function update_centroids to update the positions of the current centroids based on the positions of their members.
In [24]:
# centroids: a list of centers
# clusters: a list of k elements; the i-th element is the list of data points assigned to the i-th center
def update_centroids(centroids, clusters):
    assert(len(centroids) == len(clusters))
    for i, cluster in enumerate(clusters):
        # keep the old centroid when its cluster is empty, otherwise move it to the mean of its members
        if len(cluster) > 0:
            centroids[i] = np.mean(cluster, axis=0)
    return centroids
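A small check of the update step (added for illustration): with two clusters of two points each, the centroids should move to the per-cluster means.
cents = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
membs = [[np.array([0.0, 0.0]), np.array([2.0, 0.0])],
         [np.array([4.0, 4.0]), np.array([6.0, 6.0])]]
new_cents = update_centroids(cents, membs)
assert np.allclose(new_cents[0], [1.0, 0.0]) and np.allclose(new_cents[1], [5.0, 5.0])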
Question 1.6
Complete the K-means algorithm skeleton below using the functions you wrote above.
In [25]:
# data : set of data points
# k : number of clusters
# centroids: initial list of centroids
def kmeans(data, k=2, centroids=None):
    data = np.array(data)
    # randomize the centroids if they are not given
    if centroids is None:
        centroids = randomize_centroids(data, k)
    iterations = 0
    while True:
        iterations += 1
        # init empty clusters
        clusters = [[] for i in range(k)]
        # assign each data point to the closest centroid
        for datapoint in data:
            # find the closest center of this data point
            centroid_idx = find_closest_centroid(datapoint, centroids)
            # assign the data point to that cluster
            clusters[centroid_idx].append(datapoint)
        # keep the current positions of the centroids before updating them
        old_centroids = centroids[:]
        # update centroids
        centroids = update_centroids(centroids, clusters)
        # stop when the convergence criteria are met
        if check_converge(centroids, old_centroids, iterations):
            break
    return centroids
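Before moving to the Iris data, here is a tiny smoke test (added for illustration) on four points that clearly form two groups; with k = 2 the returned centroids should land near the two groups.
toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_centers = kmeans(toy, k=2)
print(toy_centers)  # expected: one centroid near [0, 0.5] and one near [10, 10.5]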
Next, we will test our algorithm on Fisher's Iris dataset, and plot the resulting clusters in 3D.
Question 1.7
The code below can be used to test your algorithm with three different datasets: Iris, Moon and Blob. Run your algorithm to cluster the data points in these datasets, plot the results and discuss them. Do you think that our algorithm works well? Why?
In [26]:
# the source code in this cell is inspired by
# https://gist.github.com/bbarrilleaux/9841297
%matplotlib inline
from sklearn import datasets, cluster
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# load data
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
# do the clustering
centers = kmeans(X_iris, k=3)
labels = [find_closest_centroid(p, centers) for p in X_iris]
#plot the clusters in color
fig = plt.figure(1, figsize=(8, 8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
plt.cla()
ax.scatter(X_iris[:, 3], X_iris[:, 0], X_iris[:, 2], c=labels)
# moon
# np.random.seed(0)
# X, y = datasets.make_moons(2000, noise=0.2)
# blob
# np.random.seed(0)
# X, y = datasets.make_blobs(n_samples=2000, centers=3, n_features=20, random_state=0)
# centers = kmeans(X, k=3)
# labels = [find_closest_centroid(p, centers) for p in X]
# fig = plt.figure(1, figsize=(8, 8))
# plt.clf()
# plt.scatter(X[:,0], X[:,1], s=40, c=labels, cmap=plt.cm.Spectral)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
# Alternatively, we could use the scikit-learn implementation of K-means:
# centers = cluster.KMeans(n_clusters=3)
# centers.fit(X_iris)
# labels = centers.labels_
/opt/conda/lib/python3.4/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self._edgecolors == str('face'):
2. Network intrusion use case
In [27]:
import os
import sys
import re
import time
import itertools
from operator import add
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
# from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.mllib.clustering import KMeans, KMeansModel
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd
import numpy as np

input_path = "/datasets/k-means/kddcup.data"
raw_data = sc.textFile(input_path, 12)
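To get a feel for the data (an added illustration, not part of the original notebook): in the KDD Cup 1999 dataset each line is a comma-separated list of connection features, with the connection label (e.g. normal. or an attack type) in the last field. We can peek at a couple of raw records and count the connections.
# inspect a couple of raw records and count the total number of connections
for line in raw_data.take(2):
    print(line)
print("number of connections:", raw_data.count())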