Anomaly Detection

Outlier Detection, unsupervised anomaly detection

  • The training data contains outliers, i.e., observations that lie far from the others. Outlier detection estimators therefore try to fit the regions where the training data is most concentrated, ignoring the deviant observations
  • The outliers/anomalies cannot form a dense cluster, since the available estimators assume that they are located in low-density regions

Novelty Detection, semi-supervised anomaly detection

  • The training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty
  • The novelties/anomalies can form a dense cluster as long as they lie in a low-density region of the training data, which is considered normal in this context (the sketch below contrasts the two workflows)
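
In scikit-learn, the two settings map onto two usage patterns. Below is a minimal, self-contained sketch contrasting them, using LocalOutlierFactor (which supports both modes) on made-up data; the array names here are placeholders, not from this notebook:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))          # regular observations
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])   # new points to screen

# Outlier detection: fit and label the same (possibly polluted) data in one step
labels_train = LocalOutlierFactor(novelty=False).fit_predict(X_train)

# Novelty detection: fit on clean data, then label only new observations
detector = LocalOutlierFactor(novelty=True).fit(X_train)
labels_new = detector.predict(X_new)         # 1 = inlier, -1 = novelty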

1. Outlier Detection

  • Does not require a clean data set representing the population of regular observations
  • The trained model can be used to evaluate the training dataset itself
In [2]:
import numpy as np
# two independent Gaussian features, the second with smaller spread
m1 = np.random.normal(size=1000)
m2 = np.random.normal(scale=0.5, size=1000)
X = np.vstack([m1, m2]).T               # shape (1000, 2)
X_test = np.array([[-3, 1], [0, 0]])    # two points to screen later
In [5]:
import warnings
warnings.filterwarnings('ignore')
In [6]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[:, 0], X[:, 1], "yo")             # training data: yellow circles
plt.plot(X_test[:, 0], X_test[:, 1], "bs")   # test points: blue squares
Out[6]:
[<matplotlib.lines.Line2D at 0x7f964a749690>]

EllipticEnvelope

  • A covariance-estimation method
  • Assumes the data is Gaussian and learns an ellipse around it
  • predict() returns 1 for inliers and -1 for outliers
In [11]:
from sklearn.covariance import EllipticEnvelope
# contamination controls the expected proportion of outliers in the data
clf = EllipticEnvelope(random_state=0, contamination=0.05)
clf.fit(X)
labels = clf.predict(X)
In [14]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=labels, marker='o', cmap="Paired")
Out[14]:
<matplotlib.collections.PathCollection at 0x7f964c97db90>

Fast-MCD (minimum covariance determinant)

  • Useful for outlier detection, in particular to clean up a dataset
  • Assumes the normal instances are generated from a single Gaussian distribution
  • Gives a more robust estimate of the elliptic envelope
In [98]:
from sklearn.covariance import MinCovDet
clf = MinCovDet(random_state=0)
clf.fit(X)
Out[98]:
MinCovDet(assume_centered=False, random_state=0, store_precision=True,
          support_fraction=None)
In [101]:
contamination = 4  # assume 4% of the samples are outliers
# dist_ holds the (squared) Mahalanobis distances of the training samples under the robust fit
threshold = np.percentile(clf.dist_, 100 - contamination)
anomalies = X[clf.dist_ > threshold]
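
To score unseen points under the same robust fit, MinCovDet inherits mahalanobis(), which returns the squared Mahalanobis distances under the fitted covariance. A minimal sketch reusing the threshold above:

# squared Mahalanobis distances of the two test points under the robust fit
d_test = clf.mahalanobis(X_test)
# flag test points whose distance exceeds the training threshold
test_labels = np.where(d_test > threshold, -1, 1)
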
In [103]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1])
ax.scatter(anomalies[:, 0], anomalies[:, 1], marker='x')
Out[103]:
<matplotlib.collections.PathCollection at 0x7f964ed40d10>

PCA

  • Use inverse_transform() to map the reduced features back to the original dimension
  • Compare the reconstructed data with the original data
  • Outliers have larger reconstruction errors (see the sketch below)
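
A minimal sketch of this reconstruction-error approach on the data above; the 1-component projection and the 96th-percentile cutoff are illustrative choices, not from this notebook:

from sklearn.decomposition import PCA

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)                 # project to 1 dimension
X_recovered = pca.inverse_transform(X_reduced)   # map back to 2 dimensions
# squared reconstruction error per sample
errors = np.sum((X - X_recovered) ** 2, axis=1)
# flag the ~4% of samples that are hardest to reconstruct
anomalies = X[errors > np.percentile(errors, 96)]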

Isolation Forest

  • Efficient, especially on high-dimensional datasets
  • The path length, averaged over a forest of such random trees, is a measure of normality and serves as the decision function
  • Anomalies have shorter average paths: the shorter the path, the more likely the instance is an anomaly
  • Steps (see the toy sketch after this list)
    • Build a random forest
    • Each decision tree is grown randomly: at each node, pick a feature at random, then pick a random threshold between that feature's min and max values
    • The dataset gradually gets chopped into pieces until every instance ends up isolated from the others
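
A toy sketch of the isolation mechanism (not sklearn's implementation), reusing X and X_test from above: split on a random feature and threshold, keep the half containing the point, and count the splits until the point is alone. Extreme points tend to be isolated in fewer splits:

def isolation_depth(X, point, rng, depth=0):
    if len(X) <= 1:
        return depth                       # point is isolated
    f = rng.randint(X.shape[1])            # random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return depth                       # cannot split further
    t = rng.uniform(lo, hi)                # random threshold
    side = X[X[:, f] < t] if point[f] < t else X[X[:, f] >= t]
    return isolation_depth(side, point, rng, depth + 1)

# average over several random trees; the extreme point [-3, 1] should
# need fewer splits than the central point [0, 0]
depth_out = np.mean([isolation_depth(X, X_test[0], np.random.RandomState(s)) for s in range(25)])
depth_in = np.mean([isolation_depth(X, X_test[1], np.random.RandomState(s)) for s in range(25)])
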
In [17]:
from sklearn.ensemble import IsolationForest
# contamination controls the expected proportion of outliers;
# warm_start=True allows adding more trees on a later fit
clf = IsolationForest(n_estimators=100, warm_start=True, contamination=0.05)
clf.fit(X)
labels = clf.predict(X)
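
Beyond the hard labels, the fitted forest also exposes a continuous anomaly score. A small sketch (the 5% cutoff mirrors the contamination above):

scores = clf.decision_function(X)   # positive for inliers, negative for outliers
# equivalently, flag the lowest-scoring 5% of samples by hand
manual_labels = np.where(scores < np.percentile(scores, 5), -1, 1)
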
In [18]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=labels, marker='o', cmap="Paired")
Out[18]:
<matplotlib.collections.PathCollection at 0x7f964d080e90>

Local Outlier Factor

  • Compares the density of instances around a given instance to the density around its neighbors
  • Locality is given by the k nearest neighbors, whose distances are used to estimate the local density
  • Samples with a substantially lower density than their neighbors are considered outliers
In [25]:
from sklearn.neighbors import LocalOutlierFactor
# contamination controls the expected proportion of outliers
lof = LocalOutlierFactor(novelty=False, n_neighbors=10, contamination=0.05)
labels = lof.fit_predict(X)
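
The fitted estimator also stores the negated local outlier factor of each training sample in negative_outlier_factor_ (close to -1 for inliers, much smaller for outliers). A brief sketch of using it directly; the 95th-percentile cutoff is an illustrative assumption:

lof_scores = -lof.negative_outlier_factor_   # LOF itself: ~1 for inliers, >> 1 for outliers
anomalies = X[lof_scores > np.percentile(lof_scores, 95)]
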
In [26]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=labels, marker='o', cmap="Paired")
Out[26]:
<matplotlib.collections.PathCollection at 0x7f964d36af10>

2. Novelty Detection

  • Needs a clean data set representing the population of regular observations
  • The trained model cannot be used to evaluate the training dataset; it should only score new observations

One-class SVM

  • There is only one class of instances; the model tries to separate them from the origin in a high-dimensional feature space
  • Does not scale to large datasets
In [35]:
from sklearn.svm import OneClassSVM
clf = OneClassSVM(gamma='auto').fit(X)   # gamma='auto' means 1 / n_features
labels = clf.predict(X_test)
In [40]:
labels, X_test
Out[40]:
(array([-1,  1]),
 array([[-3,  1],
        [ 0,  0]]))
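
For a continuous score rather than hard labels, OneClassSVM also provides decision_function(), the signed distance to the learned boundary (positive for inliers, negative for outliers). A quick sketch:

scores = clf.decision_function(X_test)   # negative score -> predicted outlier
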
In [44]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)

ax.scatter(X[:, 0], X[:, 1])
ax.scatter(X_test[:, 0], X_test[:, 1], c=labels, marker='x', s=100, cmap="Paired")
Out[44]:
<matplotlib.collections.PathCollection at 0x7f964df30a10>

Local Outlier Factor

In [45]:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(novelty=True)
lof.fit(X)
labels = lof.predict(X_test)
In [46]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)

ax.scatter(X[:, 0], X[:, 1])
ax.scatter(X_test[:, 0], X_test[:, 1], c=labels, marker='x', s=100, cmap="Paired")
Out[46]:
<matplotlib.collections.PathCollection at 0x7f964df96910>