Anomaly Detection

Outlier Detection, unsupervised anomaly detection

  • The training data contains outliers, i.e., observations that lie far from the others. Outlier detection estimators therefore try to fit the regions where the training data is most concentrated, ignoring the deviant observations
  • The outliers/anomalies cannot form a dense cluster, since the available estimators assume that they are located in low-density regions

Novelty Detection, semi-supervised anomaly detection

  • The training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty
  • The novelties/anomalies can form a dense cluster as long as they lie in a low-density region of the training data, which is considered normal in this context (the sketch below contrasts the two workflows)
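
In scikit-learn, the two settings map onto two usage patterns. Below is a minimal, self-contained sketch contrasting them, using LocalOutlierFactor (which supports both modes) on made-up data; the array names here are placeholders, not from this notebook:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))          # regular observations
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])   # new points to screen

# Outlier detection: fit and label the same (possibly polluted) data in one step
labels_train = LocalOutlierFactor(novelty=False).fit_predict(X_train)

# Novelty detection: fit on clean data, then label only new observations
detector = LocalOutlierFactor(novelty=True).fit(X_train)
labels_new = detector.predict(X_new)         # 1 = inlier, -1 = novelty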

1. Outlier Detection

  • Does not require a clean data set representing the population of regular observations
  • The trained model can be used to evaluate the training dataset itself
In [2]:
import numpy as np
# two independent Gaussian features, the second with smaller spread
m1 = np.random.normal(size=1000)
m2 = np.random.normal(scale=0.5, size=1000)
X = np.vstack([m1, m2]).T               # shape (1000, 2)
X_test = np.array([[-3, 1], [0, 0]])    # two points to screen later
In [5]:
import warnings
warnings.filterwarnings('ignore')
In [6]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[:, 0], X[:, 1], "yo")             # training data: yellow circles
plt.plot(X_test[:, 0], X_test[:, 1], "bs")   # test points: blue squares
Out[6]:
[<matplotlib.lines.Line2D at 0x7f964a749690>]

EllipticEnvelope

  • A covariance-estimation method
  • Assumes the data is Gaussian and learns an ellipse around it
  • predict() returns 1 for inliers and -1 for outliers
In [11]:
from sklearn.covariance import EllipticEnvelope
# contamination controls the expected proportion of outliers in the data
clf = EllipticEnvelope(random_state=0, contamination=0.05)
clf.fit(X)
labels = clf.predict(X)
In [14]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=labels, marker='o', cmap="Paired")
Out[14]:
<matplotlib.collections.PathCollection at 0x7f964c97db90>

Fast-MCD (minimum covariance determinant)

  • Useful for outlier detection, in particular to clean up a dataset
  • Assumes the normal instances are generated from a single Gaussian distribution
  • Gives a more robust estimate of the elliptic envelope
In [98]:
from sklearn.covariance import MinCovDet
clf = MinCovDet(random_state=0)
clf.fit(X)
Out[98]:
MinCovDet(assume_centered=False, random_state=0, store_precision=True,
          support_fraction=None)
In [101]:
contamination = 4  # assume 4% of the samples are outliers
# dist_ holds the (squared) Mahalanobis distances of the training samples under the robust fit
threshold = np.percentile(clf.dist_, 100 - contamination)
anomalies = X[clf.dist_ > threshold]
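
To score unseen points under the same robust fit, MinCovDet inherits mahalanobis(), which returns the squared Mahalanobis distances under the fitted covariance. A minimal sketch reusing the threshold above:

# squared Mahalanobis distances of the two test points under the robust fit
d_test = clf.mahalanobis(X_test)
# flag test points whose distance exceeds the training threshold
test_labels = np.where(d_test > threshold, -1, 1)
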
In [103]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1])
ax.scatter(anomalies[:, 0], anomalies[:, 1], marker='x')
Out[103]:
<matplotlib.collections.PathCollection at 0x7f964ed40d10>

PCA

  • Use inverse_transform() to map the reduced features back to the original dimension
  • Compare the reconstructed data with the original data
  • Outliers have larger reconstruction errors (see the sketch below)
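
A minimal sketch of this reconstruction-error approach on the data above; the 1-component projection and the 96th-percentile cutoff are illustrative choices, not from this notebook:

from sklearn.decomposition import PCA

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)                 # project to 1 dimension
X_recovered = pca.inverse_transform(X_reduced)   # map back to 2 dimensions
# squared reconstruction error per sample
errors = np.sum((X - X_recovered) ** 2, axis=1)
# flag the ~4% of samples that are hardest to reconstruct
anomalies = X[errors > np.percentile(errors, 96)]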

Isolation Forest

  • Efficient, especially on high-dimensional datasets
  • The path length, averaged over a forest of such random trees, is a measure of normality and serves as the decision function
  • Anomalies have shorter average paths: the shorter the path, the more likely the instance is an anomaly
  • Steps (see the toy sketch after this list)
    • Build a random forest
    • Each decision tree is grown randomly: at each node, pick a feature at random, then pick a random threshold between that feature's min and max values
    • The dataset gradually gets chopped into pieces until every instance ends up isolated from the others
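
A toy sketch of the isolation mechanism (not sklearn's implementation), reusing X and X_test from above: split on a random feature and threshold, keep the half containing the point, and count the splits until the point is alone. Extreme points tend to be isolated in fewer splits:

def isolation_depth(X, point, rng, depth=0):
    if len(X) <= 1:
        return depth                       # point is isolated
    f = rng.randint(X.shape[1])            # random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return depth                       # cannot split further
    t = rng.uniform(lo, hi)                # random threshold
    side = X[X[:, f] < t] if point[f] < t else X[X[:, f] >= t]
    return isolation_depth(side, point, rng, depth + 1)

# average over several random trees; the extreme point [-3, 1] should
# need fewer splits than the central point [0, 0]
depth_out = np.mean([isolation_depth(X, X_test[0], np.random.RandomState(s)) for s in range(25)])
depth_in = np.mean([isolation_depth(X, X_test[1], np.random.RandomState(s)) for s in range(25)])
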
In [17]:
from sklearn.ensemble import IsolationForest
# contamination controls the expected proportion of outliers;
# warm_start=True allows adding more trees on a later fit
clf = IsolationForest(n_estimators=100, warm_start=True, contamination=0.05)
clf.fit(X)
labels = clf.predict(X)
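
Beyond the hard labels, the fitted forest also exposes a continuous anomaly score. A small sketch (the 5% cutoff mirrors the contamination above):

scores = clf.decision_function(X)   # positive for inliers, negative for outliers
# equivalently, flag the lowest-scoring 5% of samples by hand
manual_labels = np.where(scores < np.percentile(scores, 5), -1, 1)
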
In [18]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=labels, marker='o', cmap="Paired")
Out[18]:
<matplotlib.collections.PathCollection at 0x7f964d080e90>

Local Outlier Factor

  • Compares the density of instances around a given instance to the density around its neighbors
  • Locality is given by the k nearest neighbors, whose distances are used to estimate the local density
  • Samples with a substantially lower density than their neighbors are considered outliers
In [25]:
from sklearn.neighbors import LocalOutlierFactor
# contamination controls the expected proportion of outliers
lof = LocalOutlierFactor(novelty=False, n_neighbors=10, contamination=0.05)
labels = lof.fit_predict(X)
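
The fitted estimator also stores the negated local outlier factor of each training sample in negative_outlier_factor_ (close to -1 for inliers, much smaller for outliers). A brief sketch of using it directly; the 95th-percentile cutoff is an illustrative assumption:

lof_scores = -lof.negative_outlier_factor_   # LOF itself: ~1 for inliers, >> 1 for outliers
anomalies = X[lof_scores > np.percentile(lof_scores, 95)]
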
In [26]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=labels, marker='o', cmap="Paired")
Out[26]:
<matplotlib.collections.PathCollection at 0x7f964d36af10>

2. Novelty Detection

  • Needs a clean data set representing the population of regular observations
  • The trained model cannot be used to evaluate the training dataset; it should only score new observations

One-class SVM

  • There is only one class of instances; the model tries to separate them from the origin in a high-dimensional feature space
  • Does not scale to large datasets
In [35]:
from sklearn.svm import OneClassSVM
clf = OneClassSVM(gamma='auto').fit(X)   # gamma='auto' means 1 / n_features
labels = clf.predict(X_test)
In [40]:
labels, X_test
Out[40]:
(array([-1,  1]),
 array([[-3,  1],
        [ 0,  0]]))
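
For a continuous score rather than hard labels, OneClassSVM also provides decision_function(), the signed distance to the learned boundary (positive for inliers, negative for outliers). A quick sketch:

scores = clf.decision_function(X_test)   # negative score -> predicted outlier
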
In [44]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)

ax.scatter(X[:, 0], X[:, 1])
ax.scatter(X_test[:, 0], X_test[:, 1], c=labels, marker='x', s=100, cmap="Paired")
Out[44]:
<matplotlib.collections.PathCollection at 0x7f964df30a10>

Local Outlier Factor

In [45]:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(novelty=True)
lof.fit(X)
labels = lof.predict(X_test)
In [46]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)

ax.scatter(X[:, 0], X[:, 1])
ax.scatter(X_test[:, 0], X_test[:, 1], c=labels, marker='x', s=100, cmap="Paired")
Out[46]:
<matplotlib.collections.PathCollection at 0x7f964df96910>