Gaussian Mixtures

Gaussian Mixture Model (GMM)

  • Uses the Expectation-Maximization (EM) algorithm, which is similar to the K-Means algorithm
  • Steps
    • Initialize the cluster parameters randomly
      • k clusters
      • $\mu$, mean
      • $\Sigma$, covariance matrix
      • $\phi$, categorical distribution in clusters
    • Expectation step, assign instances to clusters
      • For each instance, estimate the probability (responsibility) that it belongs to each cluster, given the current parameters
    • Maximization step, update the clusters
      • find the cluster centers ($\mu^1$ to $\mu^k$)
      • find the cluster sizes, shapes, and orientations ($\Sigma^1$ to $\Sigma^k$)
      • find the relative weights ($\phi^1$ to $\phi^k$)
      • Each cluster is updated using all the instances in the dataset, with each instance weighted by the estimated probability that it belongs to that cluster (a minimal NumPy sketch of both steps follows this list)
  • covariance_type (the four settings are compared in a cell after the basic fit below)
    • full, default, each cluster can take on any shape, size, and orientation
    • spherical, all clusters must be spherical, but they can have different diameters
    • diag, clusters can take on any ellipsoidal shape of any size, but the ellipsoid's axes must be parallel to the coordinate axes (i.e. the covariance matrices must be diagonal)
    • tied, all clusters must have the same ellipsoidal shape, size, and orientation
  • Pros
    • The fastest algorithm for learning mixture models
    • Does not bias the means towards zero or bias the cluster sizes towards specific structures that might or might not apply
    • Generative model: new instances can be sampled from the trained model
  • Cons
    • When there are too few points per mixture component, estimating the covariance matrices becomes difficult
    • Always uses all the components it has access to, so held-out data or information-theoretic criteria are needed to decide how many components to use in the absence of external cues
    • EM may converge to poor solutions, so it needs to be run several times and only the best result kept (the default n_init is 1)
    • EM can struggle to converge to the optimal solution when there are many dimensions or many clusters
    • Does not scale to a large number of features
  • Complexity (m instances, n features, k clusters)
    • spherical or diag, $O(kmn)$
    • tied or full, $O(kmn^2 + kn^3)$
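
As a rough illustration of the two EM steps above, here is a minimal NumPy sketch on synthetic 2-D data (assuming full covariances; illustrative only, not Scikit-Learn's implementation):
In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [3, 3])])
k, m, n = 2, *X_demo.shape

# random initialization of the parameters phi, mu, Sigma
phi = np.full(k, 1 / k)
mu = X_demo[rng.choice(m, k, replace=False)]
Sigma = np.array([np.eye(n)] * k)

for _ in range(50):
    # Expectation step: responsibilities r[i, j] = P(instance i belongs to cluster j)
    likelihood = np.column_stack([
        phi[j] * multivariate_normal.pdf(X_demo, mu[j], Sigma[j]) for j in range(k)
    ])
    r = likelihood / likelihood.sum(axis=1, keepdims=True)

    # Maximization step: update phi, mu, Sigma using all instances,
    # each weighted by its responsibility
    Nj = r.sum(axis=0)
    phi = Nj / m
    mu = (r.T @ X_demo) / Nj[:, None]
    for j in range(k):
        diff = X_demo - mu[j]
        Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j]

print(np.round(phi, 2), np.round(mu, 2))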
In [2]:
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
data.target_names
Out[2]:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
In [3]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
In [22]:
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3, random_state=42)
model.fit(X)
y_pred = model.predict(X) # hard clustering
y_pred_prob = model.predict_proba(X) # soft clustering
In [ ]:
model.weights_ # phi
model.means_ # mu
model.covariances_ # covariance matrices
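A quick way to see how covariance_type constrains the clusters (listed above) is to fit one model per setting and inspect the shape of covariances_; the loop below is a small sketch using the iris features already in X:
In [ ]:
# compare covariance_type settings: the shape of covariances_ reflects each constraint
for cov_type in ("full", "tied", "diag", "spherical"):
    gm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42).fit(X)
    print(cov_type, gm.covariances_.shape)  # (k, n, n), (n, n), (k, n), (k,)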
In [13]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[13]:
<matplotlib.legend.Legend at 0x7fc1ca71dd90>
In [26]:
# sample new instances
X_new, y_new = model.sample(100)
In [ ]:
# log value of the probability density function for each sample
model.score_samples(X_new)
In [27]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X_new[y_new==0, 2], X_new[y_new==0, 3], "yo", label="Cluster 1")
plt.plot(X_new[y_new==1, 2], X_new[y_new==1, 3], "bs", label="Cluster 2")
plt.plot(X_new[y_new==2, 2], X_new[y_new==2, 3], "g^", label="Cluster 3")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[27]:
<matplotlib.legend.Legend at 0x7fc1caa70290>

Anomaly Detection Using Gaussian Mixtures

In [34]:
import numpy as np
contamination = 4  # 4% of the samples are assumed to be outliers
densities = model.score_samples(X)
density_threshold = np.percentile(densities, contamination)
anomalies = X[densities < density_threshold]
In [35]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.plot(anomalies[:, 2], anomalies[:, 3], 'rx', label="Outlier")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[35]:
<matplotlib.legend.Legend at 0x7fc1cab7d350>

Selecting the Number of Clusters

  • Bayesian Information Criterion (BIC) $$BIC = \log(m)\,p - 2\log(\hat{L})$$
  • Akaike Information Criterion (AIC) $$AIC = 2p - 2\log(\hat{L})$$
  • $m$, the number of instances
  • $p$, the number of parameters learned by the model
  • $\hat{L}$, the maximized value of the likelihood function of the model

  • Choose the k with the lowest BIC and AIC
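
The formulas can be checked against Scikit-Learn's bic()/aic() methods. The sketch below counts parameters assuming the default covariance_type="full" ($kn$ means, $kn(n+1)/2$ covariance terms, $k-1$ free weights):
In [ ]:
# sanity-check the BIC/AIC formulas against Scikit-Learn
# (parameter count assumes covariance_type="full", the default)
m, n = X.shape
k = 3
gm = GaussianMixture(n_components=k, random_state=42).fit(X)
p = k * n + k * n * (n + 1) // 2 + (k - 1)  # means + full covariances + free weights
log_likelihood = m * gm.score(X)            # score() returns the mean log-likelihood per sample
print(np.log(m) * p - 2 * log_likelihood, gm.bic(X))  # BIC
print(2 * p - 2 * log_likelihood, gm.aic(X))          # AIC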

In [46]:
ks = range(1, 10)
bics = []
aics = []

for k in ks:
    model = GaussianMixture(n_components=k, random_state=42)
    model.fit(X)
    bics.append(model.bic(X))
    aics.append(model.aic(X))
In [47]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(ks, bics, "yo-", label="BIC")
plt.plot(ks, aics, "bs-", label="AIC")
plt.xlabel("k", fontsize=14)
plt.ylabel("Information Criterion", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[47]:
<matplotlib.legend.Legend at 0x7fc1cabe8790>

Bayesian Gaussian Mixture Models

  • Able to search for the optimal number of clusters
    • Set the number of clusters to a value greater than the expected optimal number
    • The model automatically gives unnecessary clusters weights equal or close to zero
  • weight_concentration_prior (its effect is compared in a cell after the basic fit below)
    • A low concentration puts most of the weight on a few clusters, so few clusters end up active
    • A high concentration spreads the weight more evenly, so more clusters end up active
  • Pros
    • Avoids the singularities often found in expectation-maximization solutions
    • Does not change much with changes to the parameters, leading to more stability and less tuning
    • Has fewer pathological special cases
  • Cons
    • Introduces some subtle biases into the model
    • The extra parametrization necessary for variational inference makes inference slower
    • Needs an extra hyperparameter that might need experimental tuning via cross-validation
In [71]:
from sklearn.mixture import BayesianGaussianMixture
model = BayesianGaussianMixture(n_components=10, n_init=10)
model.fit(X)
np.round(model.weights_, 2)
Out[71]:
array([0.2 , 0.33, 0.41, 0.01, 0.04, 0.  , 0.  , 0.  , 0.  , 0.  ])
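The weights above use the default weight_concentration_prior. As a rough comparison (the prior values 0.01 and 10000 are just illustrative choices), a low prior concentrates the weight on fewer clusters and a high prior spreads it over more:
In [ ]:
# compare a low and a high weight_concentration_prior (illustrative values)
for prior in (0.01, 10000):
    bgm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=prior,
                                  n_init=10, random_state=42)
    bgm.fit(X)
    print(prior, np.round(bgm.weights_, 2))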
In [72]:
y_pred = model.predict(X)
In [73]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[73]:
<matplotlib.legend.Legend at 0x7fc1cb314a50>