Dimensionality Reduction

The Curse of Dimensionality

  • Most points in a high-dimensional hypercube lie very close to the border (see the quick simulation after this list)
  • High-dimensional datasets are at risk of being very sparse
  • The more dimensions the training set has, the greater the risk of overfitting it
  • Solutions
    • Increase the size of the training set (often impractical: the number of instances needed grows exponentially with the number of dimensions)
    • Reduce the number of dimensions
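A quick simulation (not part of the original notes) makes the border effect concrete: as the number of dimensions grows, almost every uniformly sampled point falls within a thin shell near the hypercube's surface.
In [ ]:
# Fraction of uniform points within 0.01 of the border of a d-dimensional unit hypercube
import numpy as np
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((10000, d))
    near_border = np.any((points < 0.01) | (points > 0.99), axis=1)
    print(d, near_border.mean())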

Pros

  • Speed up training
  • Data visualization
  • May filter out some noise and unnecessary details and result in higher performance

Cons

  • In general, causes some information loss, so models may perform slightly worse

Major Approaches of Reducing Dimension

  • Projection
    • Project data perpendicularly onto a lower-dimensional subspace
    • PCA
  • Manifold
    • Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold
    • LLE

1. PCA

  • Project the data onto the lower-dimensional hyperplane that preserves the maximum variance
In [9]:
import numpy as np
In [10]:
data = np.array([[45,37,42,35], [38,31,26,28], [10,15,17,21]]).T # 4*3 matrix (m=4 samples, n=3 features)
print(data)
[[45 38 10]
 [37 31 15]
 [42 26 17]
 [35 28 21]]

Center the Data

In [19]:
avg = np.average(data, axis=0)
#std = np.std(data, axis=0)
data = data - avg   # center each feature at zero mean
#data = data/std    # optionally scale each feature to unit variance

Calculate Covariance Matrix

  • Coursera, $\Sigma = \frac{1}{m}\sum_{i=1}^{m}(X^{i})(X^{i})^{T}$
  • Numpy, $\Sigma = \frac{1}{m-1}\sum_{i=1}^{m}(X^{i})(X^{i})^{T}$
  • $X^{i}$ is an n*1 vector (one centered sample)
  • The covariance matrix $\Sigma$ is an n*n matrix
In [12]:
covariance = data.T.dot(data)/(data.shape[0]-1)
print(covariance)
[[ 20.91666667  13.25       -16.75      ]
 [ 13.25        27.58333333 -20.75      ]
 [-16.75       -20.75        20.91666667]]
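
As a quick check (assuming the centered `data` and the `covariance` computed above are still in memory), `np.cov` uses the same 1/(m-1) convention and should reproduce the hand-computed matrix.
In [ ]:
# np.cov treats rows as observations when rowvar=False and divides by m-1
print(np.allclose(np.cov(data, rowvar=False), covariance))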

Calculate Eigenvectors

  • Use either svd or eig to calculate the eigenvectors of the covariance matrix
  • svd is numerically more stable
    • even so, the direction (sign) of the unit vectors is not stable
    • a pair of unit vectors may rotate or swap if the variances along these two axes are close
  • The eigenvector matrix U is an orthogonal matrix, $U^{-1} = U^{T}$
In [13]:
# the columns of u are the eigenvectors, s holds the eigenvalues
u, s, v = np.linalg.svd(covariance)
print(u, s)
[[-0.50086556  0.77075698  0.39378595]
 [-0.63320197 -0.63647951  0.44039652]
 [ 0.59007538 -0.02876659  0.80683551]] [57.40088536 10.60016819  1.41561311]
In [14]:
# the columns of v are the eigenvectors, w holds the eigenvalues
w, v = np.linalg.eig(covariance)
print(v, w)
[[-0.50086556 -0.77075698  0.39378595]
 [-0.63320197  0.63647951  0.44039652]
 [ 0.59007538  0.02876659  0.80683551]] [57.40088536 10.60016819  1.41561311]
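
A quick sanity check of the orthogonality claim (assuming `u` from the SVD cell above): $U^{T}*U$ should be the identity matrix.
In [ ]:
# The eigenvector matrix is orthogonal, so U^T U = I
print(np.allclose(u.T.dot(u), np.eye(3)))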

Calculate Reduced Matrix

  • Choose the first k columns of U to create $U^{k}$; $U^{k}$ is an n*k matrix
  • Typically, k is chosen so that 95% of the variance is retained
  • $Z = X * U^{k}$, Z is an m*k matrix
In [15]:
# Reduced features m*k
Z = data.dot(v[:, :2])
print(Z)
[[-10.61319192   0.40259444]
 [  0.77652327   2.25712663]
 [  2.61835608  -4.72152263]
 [  7.21831257   2.06180157]]

Calculate Retained Variance

In [18]:
ratio = np.sum(w[:2])/np.sum(w)
print(ratio)
0.9796070140316309

Calculate Approximation Matrix

  • Change the dimension back to m*n
  • $Z*(U^{k})^{T} = X*U^{k}*(U^{k})^{T} \approx X$
  • $X_{approx} = Z*(U^{k})^{T}$; the columns of $U^{k}$ are orthonormal, so $(U^{k})^{T}*U^{k} = I$ and the transpose acts as the inverse of the projection
In [347]:
X_app = Z.dot(v[:, :2].T)
X_app = X_app + avg
print(X_app)
[[44.75547986 37.72653717  9.49899804]
 [37.62137014 31.69491878 16.27313709]
 [42.07770213 26.08689936 17.15920536]
 [34.54544787 27.49164469 20.0686595 ]]
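
A sanity check (assuming the centered `data`, the eigenvalues `w`, and `X_app` from the cells above are still in memory): the reconstruction error, normalized by m-1 like the covariance, should equal the variance of the discarded third component.
In [ ]:
# Average squared reconstruction error (per the 1/(m-1) convention) vs. the discarded eigenvalue
residual = (data + avg) - X_app
print(np.sum(residual**2) / (data.shape[0] - 1), w[2])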

PCA with Sklearn

In [67]:
data = np.array([[45,37,42,35], [38,31,26,28], [10,15,17,21]]).T # 4*3 matrix
print(data)
[[45 38 10]
 [37 31 15]
 [42 26 17]
 [35 28 21]]
In [68]:
from sklearn.decomposition import PCA
pca = PCA()
In [69]:
pca.fit(data)
Z = pca.transform(data)
print(Z)
[[10.61319192 -0.40259444 -0.62094684]
 [-0.77652327 -2.25712663  1.57793885]
 [-2.61835608  4.72152263  0.19732072]
 [-7.21831257 -2.06180157 -1.15431273]]
In [70]:
print(pca.components_) # principal components (one per row)
print(pca.get_covariance()) # sigma
print(pca.explained_variance_) # eigenvalues
print(pca.explained_variance_ratio_) # percentage of variance
[[ 0.50086556  0.63320197 -0.59007538]
 [ 0.77075698 -0.63647951 -0.02876659]
 [-0.39378595 -0.44039652 -0.80683551]]
[[ 20.91666667  13.25       -16.75      ]
 [ 13.25        27.58333333 -20.75      ]
 [-16.75       -20.75        20.91666667]]
[57.40088536 10.60016819  1.41561311]
[0.82690351 0.1527035  0.02039299]
In [42]:
X_app = pca.inverse_transform(Z) # all 3 components were kept, so the original data is recovered exactly
print(X_app)
[[45. 38. 10.]
 [37. 31. 15.]
 [42. 26. 17.]
 [35. 28. 21.]]

Choosing the Right Number of Dimensions

In [52]:
# choose the number of PCs that retains at least 95% of the variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
In [ ]:
# choose the ratio of variance to preserve
pca = PCA(n_components=0.95)
In [56]:
# search elbow
import matplotlib.pyplot as plt
fig, ax = plt.subplots();
ax.plot(range(len(pca.explained_variance_ratio_)), cumsum);

Incremental PCA (IPCA)

  • PCA requires the whole training set to fit in memory
  • IPCA splits the training set into mini-batches and feeds the algorithm one mini-batch at a time
In [61]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
data_X, data_Y = mnist['data'], mnist['target']
In [64]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(data_X, n_batches):
    pca.partial_fit(X_batch)
    
X_reduced = pca.transform(data_X)

2. Kernel PCA

  • Preserves clusters of instances after projection
  • Can sometimes even unroll datasets that lie close to a twisted manifold
  • Implicitly maps the data from the original space into a (possibly infinite-dimensional) feature space, then applies linear PCA there to reduce the dimension
  • The reduced data cannot be inverted back to the original data exactly; it can only be mapped to an approximate pre-image in the original space
In [102]:
from sklearn.datasets import make_swiss_roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
In [80]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import proj3d

axes = [-11.5, 14, -2, 23, -12, 15]

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap=plt.cm.hot)
ax.view_init(10, -70)
ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)
Out[80]:
Text(0.5, 0, '$x_3$')
In [108]:
from sklearn.decomposition import KernelPCA

pca = PCA(n_components=2) # plain PCA, equivalent to a linear kernel
X_default = pca.fit_transform(X)
rbf_pca = KernelPCA(n_components = 2, kernel='rbf', gamma=0.04) # rbf kernel
X_rbf = rbf_pca.fit_transform(X)
linear_pca = KernelPCA(n_components = 2, kernel= 'linear') # linear kernel
X_linear = linear_pca.fit_transform(X)
sig_pca = KernelPCA(n_components = 2, kernel='sigmoid', gamma=0.001) # sigmoid kernel
X_sig = sig_pca.fit_transform(X)
In [111]:
f, axs = plt.subplots(2, 2, figsize=(11, 8))
axs[0, 0].scatter(X_default[:, 0], X_default[:, 1], c=t, cmap=plt.cm.hot)
axs[0, 1].scatter(X_linear[:, 0], X_linear[:, 1], c=t, cmap=plt.cm.hot)
axs[1, 0].scatter(X_rbf[:, 0], X_rbf[:, 1], c=t, cmap=plt.cm.hot)
axs[1, 1].scatter(X_sig[:, 0], X_sig[:, 1], c=t, cmap=plt.cm.hot)
Out[111]:
<matplotlib.collections.PathCollection at 0x7fca956a9b90>

Selecting a Kernel and Tuning Hyperparameters

In [96]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('kpca', KernelPCA(n_components=2)),
    ('log_reg', LogisticRegression()),
])
In [ ]:
param_grid = [{
    'kpca__gamma': np.linspace(0.03, 0.05, 10),
    'kpca__kernel': ['rbf', 'sigmoid'],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
y = (t > 6.9)  # example binary target derived from the swiss-roll position t (assumed here for illustration)
grid_search.fit(X, y)
In [113]:
# Alternatively, select the kernel and hyperparameters that give the lowest reconstruction (pre-image) error
rbf_pca = KernelPCA(n_components = 2, kernel='rbf', gamma=0.0433, fit_inverse_transform=True)
X_rbf = rbf_pca.fit_transform(X);
X_preimage = rbf_pca.inverse_transform(X_rbf)
In [136]:
# Original Plot
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 3, 1, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap=plt.cm.hot)
ax.view_init(10, -70)
# Reduced Plot
ax = fig.add_subplot(1, 3, 2)
ax.scatter(X_rbf[:, 0], X_rbf[:, 1], c=t, cmap=plt.cm.hot)
# Pre-image Plot
ax = fig.add_subplot(1, 3, 3, projection='3d')
ax.scatter(X_preimage[:, 0], X_preimage[:, 1], X_preimage[:, 2], c=t, cmap=plt.cm.hot)
Out[136]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7fca9f1ea210>
In [133]:
from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)
Out[133]:
32.78630879576614

3. Locally Linear Embedding (LLE)

  • Manifold Learning technique
  • Measures how each instance linearly relates to its closest neighbors, then looks for a low-dimensional representation of the instances where these local relationships are best preserved
  • Complexity: $O(m\log(m)\,n\log(k))$ for finding the k nearest neighbors, $O(mnk^{3})$ for optimizing the weights, $O(dm^{2})$ for constructing the low-dimensional representations
  • The last step makes it scale poorly to very large datasets
In [161]:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8)
X_reduced = lle.fit_transform(X)
In [162]:
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[162]:
<matplotlib.collections.PathCollection at 0x7fca98e99390>

4. Modified LLE

  • When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient (a regularization problem)
  • Modified LLE uses multiple weight vectors in each neighborhood to address this
In [173]:
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8, method = 'modified')
X_reduced = lle.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[173]:
<matplotlib.collections.PathCollection at 0x7fca9b64e3d0>

5. Hessian Eigenmapping (Hessian-based LLE, HLLE)

  • another method of solving the regularization problem of LLE
In [171]:
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8, method = 'hessian')
X_reduced = lle.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[171]:
<matplotlib.collections.PathCollection at 0x7fca9971fa10>

6. Local Tangent Space Alignment (LTSA)

  • Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space
In [177]:
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8, method = 'ltsa')
X_reduced = lle.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[177]:
<matplotlib.collections.PathCollection at 0x7fca9b2f5650>

7. Random Projections

  • Computationally efficient, but it may introduce relatively high distortion when the target dimension is small
  • $X_{reduced} = \dfrac{1}{\sqrt{d}}XM$
  • X is a $m*n$ matrix, M is a $n*d$ random matrix
  • Gaussian Random Projection
    • each entry $r_{ij}$ of M is drawn from a standard Gaussian distribution
  • Sparse Random Projection
    • the entries of M are drawn from a sparse distribution (mostly zeros), with the nonzero values chosen according to fixed probabilities
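A minimal NumPy sketch of the formula above (illustrative only, assuming `X` from the swiss-roll cells; sklearn's GaussianRandomProjection folds the $1/\sqrt{d}$ scaling into the variance of the random matrix):
In [ ]:
# Manual Gaussian random projection: X_reduced = (1/sqrt(d)) * X * M
rng = np.random.default_rng(42)
d = 2
M = rng.normal(size=(X.shape[1], d))   # n*d matrix of standard Gaussian entries
X_manual = X.dot(M) / np.sqrt(d)
print(X_manual.shape)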
In [166]:
from sklearn import random_projection
rp_gaussian = random_projection.GaussianRandomProjection(n_components=2)
X_reduced = rp_gaussian.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[166]:
<matplotlib.collections.PathCollection at 0x7fca98fad5d0>
In [167]:
rp_sparse = random_projection.SparseRandomProjection(n_components=2)
X_reduced = rp_sparse.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[167]:
<matplotlib.collections.PathCollection at 0x7fca994f36d0>

8. Multi-dimensional Scaling (MDS)

  • seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space
  • Metric MDS
    1. Calculate the dissimilarity matrix D of the original data, where $d_{ij}$ is the Euclidean distance between points i and j
    2. Assign each data point arbitrary initial coordinates in the d-dimensional space
    3. Calculate the dissimilarity matrix M of the reduced data, where $m_{ij}$ is the Euclidean distance in the embedding
    4. Adjust the coordinates (e.g., by steepest descent) to minimize the stress, i.e. the discrepancy between M and D
  • Non-Metric MDS
    • Focuses on the ordination of the data
    • If $d_{ij} < d_{kl}$ in the original space, then the embedding should satisfy $m_{ij} < m_{kl}$ (only the rank order of the dissimilarities is preserved)
In [179]:
# metric MDS
from sklearn.manifold import MDS
embedding = MDS(n_components=2)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[179]:
<matplotlib.collections.PathCollection at 0x7fca9b628d90>
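A quick check of the distance-preservation claim (assuming `X` and the metric-MDS `X_reduced` from the cell above): compare the pairwise distance matrices and inspect the stress that metric MDS minimizes.
In [ ]:
from sklearn.metrics import pairwise_distances
D_orig = pairwise_distances(X)          # distances in the original 3-D space
D_low = pairwise_distances(X_reduced)   # distances in the 2-D embedding
print(np.corrcoef(D_orig.ravel(), D_low.ravel())[0, 1])
print(embedding.stress_)                # residual stress after fitting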
In [180]:
# nonmetric MDS
from sklearn.manifold import MDS
embedding = MDS(n_components=2, metric=False)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[180]:
<matplotlib.collections.PathCollection at 0x7fca96221a10>

9. Isometric Mapping (Isomap)

  • A nonlinear method for dimensionality reduction
  • Steps
    • Construct neighborhood graph
    • Compute shortest paths between all pairs
    • Construct d-dimensional coordinate vectors with MDS
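The three steps can be sketched with generic tools (an illustrative approximation, assuming `X` from the swiss-roll cells above; not sklearn's internal implementation):
In [ ]:
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

knn = kneighbors_graph(X, n_neighbors=8, mode='distance')   # 1. neighborhood graph
geo = shortest_path(knn, directed=False)                    # 2. all-pairs geodesic distances
m = geo.shape[0]                                            # 3. classical MDS on the geodesic distances
J = np.eye(m) - np.ones((m, m)) / m
B = -0.5 * J.dot(geo**2).dot(J)                             # double-centered Gram matrix
w_iso, v_iso = np.linalg.eigh(B)
X_iso = v_iso[:, -2:] * np.sqrt(w_iso[-2:])                 # coordinates from the top 2 eigenpairs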
In [178]:
from sklearn.manifold import Isomap
embedding = Isomap(n_components=2)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[178]:
<matplotlib.collections.PathCollection at 0x7fca9fc76e90>

10. Spectral Embedding

  • finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian
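A minimal Laplacian-eigenmaps sketch of that idea (illustrative only, assuming `X` from above; sklearn's SpectralEmbedding chooses affinities and eigensolvers more carefully):
In [ ]:
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

A = kneighbors_graph(X, n_neighbors=8, include_self=False)
A = 0.5 * (A + A.T)                        # symmetrize the k-NN adjacency graph
L = laplacian(A, normed=True).toarray()    # normalized graph Laplacian
w_se, v_se = np.linalg.eigh(L)             # spectral decomposition, eigenvalues ascending
X_se = v_se[:, 1:3]                        # skip the trivial constant eigenvector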
In [176]:
from sklearn.manifold import SpectralEmbedding
embedding = SpectralEmbedding(n_components=2)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[176]:
<matplotlib.collections.PathCollection at 0x7fca9ed38290>

11. t-distributed Stochastic Neighbor Embedding (t-SNE)

  1. Each point in the high-dimensional space has a conditional probability of picking each other point as its neighbor; the probability is computed from a Gaussian kernel over the distance/similarity between the two points.
  2. Give each data point a location in the low-dimensional space and compute analogous neighbor probabilities there, using a Student t-distribution.
  3. Minimize the Kullback–Leibler divergence $\sum_{i}\sum_{j}p_{j|i}\log\dfrac{p_{j|i}}{q_{j|i}}$, where $p_{j|i}$ is the probability that point i picks j as its neighbor in the high-dimensional space and $q_{j|i}$ is the corresponding probability in the low-dimensional space
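A small NumPy sketch of the objective in step 3 (illustrative only; `P` and `Q` are assumed to be the matrices of conditional probabilities, with `P[i, j]` holding $p_{j|i}$):
In [ ]:
import numpy as np

# KL divergence between the high- and low-dimensional neighbor distributions
def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0                     # terms with p_{j|i} = 0 contribute nothing
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))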

Pros

* Revealing the structure at many scales on a single map
* Revealing data that lie in multiple, different manifolds or clusters, which helps visually disentangle such datasets
* Reducing the tendency to crowd points together at the center

Cons

* The KL divergence is not convex, i.e. multiple restarts with different initializations can end up in different local minima
* t-SNE is computationally expensive
* The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings
* Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA.

Optimization

  1. perplexity, $k = 2^{S}$ where S is the Shannon entropy of the conditional neighbor distribution; roughly the effective number of nearest neighbors; larger perplexities use more neighbors and are less sensitive to small-scale structure
  2. early exaggeration factor, controls how tight natural clusters in the original space are in the embedded space and how much space will be between them, for larger values, the space between natural clusters will be larger in the embedded space
  3. learning rate
  4. maximum number of iterations
  5. angle (Barnes-Hut only), a tradeoff between performance and accuracy; larger angles give better speed but less accurate results
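These knobs map directly onto scikit-learn's TSNE constructor; the values below are placeholders for illustration, not tuned recommendations:
In [ ]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2,
            perplexity=30,           # 1. perplexity
            early_exaggeration=12,   # 2. early exaggeration factor
            learning_rate=200,       # 3. learning rate
            angle=0.5,               # 5. Barnes-Hut angle (speed vs. accuracy)
            init='pca')              # 4. the iteration budget can also be set via the constructor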
In [185]:
from sklearn.manifold import TSNE
embedding = TSNE(n_components=2, init='pca')
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[185]:
<matplotlib.collections.PathCollection at 0x7fca9616b890>