Gaussian Mixtures

Gaussian Mixture Model (GMM)

  • Uses the Expectation-Maximization (EM) algorithm, which is similar to the K-Means algorithm
  • Steps
    • Initialize the cluster parameters randomly
      • k clusters
      • $\mu$, mean
      • $\Sigma$, covariance matrix
      • $\phi$, categorical distribution in clusters
    • Expectation step, assign instances to clusters
      • For each instance, estimate the probability (responsibility) that it belongs to each cluster, given the current parameters
    • Maximization step, update the clusters
      • find the cluster centers ($\mu^1$ to $\mu^k$)
      • find the cluster sizes, shapes, and orientations ($\Sigma^1$ to $\Sigma^k$)
      • find the relative weights ($\phi^1$ to $\phi^k$)
      • Each cluster is updated using all the instances in the dataset, with each instance weighted by the estimated probability that it belongs to that cluster (a minimal NumPy sketch of both steps follows this list)
  • covariance_type (the four settings are compared in a cell after the basic fit below)
    • full, default, each cluster can take on any shape, size, and orientation
    • spherical, all clusters must be spherical, but they can have different diameters
    • diag, clusters can take on any ellipsoidal shape of any size, but the ellipsoid's axes must be parallel to the coordinate axes (i.e. the covariance matrices must be diagonal)
    • tied, all clusters must have the same ellipsoidal shape, size, and orientation
  • Pros
    • The fastest algorithm for learning mixture models
    • Does not bias the means towards zero or bias the cluster sizes towards specific structures that might or might not apply
    • Generative model: new instances can be sampled from the trained model
  • Cons
    • When there are too few points per mixture component, estimating the covariance matrices becomes difficult
    • Always uses all the components it has access to, so held-out data or information-theoretic criteria are needed to decide how many components to use in the absence of external cues
    • EM may converge to poor solutions, so it needs to be run several times and only the best result kept (the default n_init is 1)
    • EM can struggle to converge to the optimal solution when there are many dimensions or many clusters
    • Does not scale to a large number of features
  • Complexity (m instances, n features, k clusters)
    • spherical or diag, $O(kmn)$
    • tied or full, $O(kmn^2 + kn^3)$
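
As a rough illustration of the two EM steps above, here is a minimal NumPy sketch on synthetic 2-D data (assuming full covariances; illustrative only, not Scikit-Learn's implementation):
In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [3, 3])])
k, m, n = 2, *X_demo.shape

# random initialization of the parameters phi, mu, Sigma
phi = np.full(k, 1 / k)
mu = X_demo[rng.choice(m, k, replace=False)]
Sigma = np.array([np.eye(n)] * k)

for _ in range(50):
    # Expectation step: responsibilities r[i, j] = P(instance i belongs to cluster j)
    likelihood = np.column_stack([
        phi[j] * multivariate_normal.pdf(X_demo, mu[j], Sigma[j]) for j in range(k)
    ])
    r = likelihood / likelihood.sum(axis=1, keepdims=True)

    # Maximization step: update phi, mu, Sigma using all instances,
    # each weighted by its responsibility
    Nj = r.sum(axis=0)
    phi = Nj / m
    mu = (r.T @ X_demo) / Nj[:, None]
    for j in range(k):
        diff = X_demo - mu[j]
        Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j]

print(np.round(phi, 2), np.round(mu, 2))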
In [2]:
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
data.target_names
Out[2]:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
In [3]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
In [22]:
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3, random_state=42)
model.fit(X)
y_pred = model.predict(X) # hard clustering
y_pred_prob = model.predict_proba(X) # soft clustering
In [ ]:
model.weights_ # phi
model.means_ # mu
model.covariances_ # covariance matrices
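A quick way to see how covariance_type constrains the clusters (listed above) is to fit one model per setting and inspect the shape of covariances_; the loop below is a small sketch using the iris features already in X:
In [ ]:
# compare covariance_type settings: the shape of covariances_ reflects each constraint
for cov_type in ("full", "tied", "diag", "spherical"):
    gm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42).fit(X)
    print(cov_type, gm.covariances_.shape)  # (k, n, n), (n, n), (k, n), (k,)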
In [13]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[13]:
<matplotlib.legend.Legend at 0x7fc1ca71dd90>
In [26]:
# sample new instances
X_new, y_new = model.sample(100)
In [ ]:
# log value of the probability density function for each sample
model.score_samples(X_new)
In [27]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X_new[y_new==0, 2], X_new[y_new==0, 3], "yo", label="Cluster 1")
plt.plot(X_new[y_new==1, 2], X_new[y_new==1, 3], "bs", label="Cluster 2")
plt.plot(X_new[y_new==2, 2], X_new[y_new==2, 3], "g^", label="Cluster 3")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[27]:
<matplotlib.legend.Legend at 0x7fc1caa70290>

Anomaly Detection Using Gaussian Mixtures

In [34]:
import numpy as np
contamination = 4  # 4% of the samples are assumed to be outliers
densities = model.score_samples(X)
density_threshold = np.percentile(densities, contamination)
anomalies = X[densities < density_threshold]
In [35]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.plot(anomalies[:, 2], anomalies[:, 3], 'rx', label="Outlier")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[35]:
<matplotlib.legend.Legend at 0x7fc1cab7d350>

Selecting the Number of Clusters

  • Bayesian Information Criterion (BIC) $$BIC = \log(m)\,p - 2\log(\hat{L})$$
  • Akaike Information Criterion (AIC) $$AIC = 2p - 2\log(\hat{L})$$
  • $m$, the number of instances
  • $p$, the number of parameters learned by the model
  • $\hat{L}$, the maximized value of the likelihood function of the model

  • Choose the k with the lowest BIC and AIC
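
The formulas can be checked against Scikit-Learn's bic()/aic() methods. The sketch below counts parameters assuming the default covariance_type="full" ($kn$ means, $kn(n+1)/2$ covariance terms, $k-1$ free weights):
In [ ]:
# sanity-check the BIC/AIC formulas against Scikit-Learn
# (parameter count assumes covariance_type="full", the default)
m, n = X.shape
k = 3
gm = GaussianMixture(n_components=k, random_state=42).fit(X)
p = k * n + k * n * (n + 1) // 2 + (k - 1)  # means + full covariances + free weights
log_likelihood = m * gm.score(X)            # score() returns the mean log-likelihood per sample
print(np.log(m) * p - 2 * log_likelihood, gm.bic(X))  # BIC
print(2 * p - 2 * log_likelihood, gm.aic(X))          # AIC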

In [46]:
ks = range(1, 10)
bics = []
aics = []

for k in ks:
    model = GaussianMixture(n_components=k, random_state=42)
    model.fit(X)
    bics.append(model.bic(X))
    aics.append(model.aic(X))
In [47]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(ks, bics, "yo-", label="BIC")
plt.plot(ks, aics, "bs-", label="AIC")
plt.xlabel("k", fontsize=14)
plt.ylabel("Information Criterion", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[47]:
<matplotlib.legend.Legend at 0x7fc1cabe8790>

Bayesian Gaussian Mixture Models

  • Able to search for the optimal number of clusters
    • Set the number of clusters to a value greater than the expected optimal number
    • The model automatically gives unnecessary clusters weights equal or close to zero
  • weight_concentration_prior (its effect is compared in a cell after the basic fit below)
    • A low concentration puts most of the weight on a few clusters, so few clusters end up active
    • A high concentration spreads the weight more evenly, so more clusters end up active
  • Pros
    • Avoids the singularities often found in expectation-maximization solutions
    • Does not change much with changes to the parameters, leading to more stability and less tuning
    • Has fewer pathological special cases
  • Cons
    • Introduces some subtle biases into the model
    • The extra parametrization necessary for variational inference makes inference slower
    • Needs an extra hyperparameter that might need experimental tuning via cross-validation
In [71]:
from sklearn.mixture import BayesianGaussianMixture
model = BayesianGaussianMixture(n_components=10, n_init=10)
model.fit(X)
np.round(model.weights_, 2)
Out[71]:
array([0.2 , 0.33, 0.41, 0.01, 0.04, 0.  , 0.  , 0.  , 0.  , 0.  ])
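The weights above use the default weight_concentration_prior. As a rough comparison (the prior values 0.01 and 10000 are just illustrative choices), a low prior concentrates the weight on fewer clusters and a high prior spreads it over more:
In [ ]:
# compare a low and a high weight_concentration_prior (illustrative values)
for prior in (0.01, 10000):
    bgm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=prior,
                                  n_init=10, random_state=42)
    bgm.fit(X)
    print(prior, np.round(bgm.weights_, 2))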
In [72]:
y_pred = model.predict(X)
In [73]:
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 3.5))

plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
Out[73]:
<matplotlib.legend.Legend at 0x7fc1cb314a50>