Dimensionality Reduction

The Curse of Dimensionality

  • Most points in a high-dimensional hypercube lie very close to the border (see the quick simulation after this list)
  • High-dimensional datasets are at risk of being very sparse
  • The more dimensions the training set has, the greater the risk of overfitting it
  • Solutions
    • Increase the size of the training set (often impractical: the number of instances needed grows exponentially with the number of dimensions)
    • Reduce the number of dimensions
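A quick simulation (not part of the original notes) makes the border effect concrete: as the number of dimensions grows, almost every uniformly sampled point falls within a thin shell near the hypercube's surface.
In [ ]:
# Fraction of uniform points within 0.01 of the border of a d-dimensional unit hypercube
import numpy as np
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((10000, d))
    near_border = np.any((points < 0.01) | (points > 0.99), axis=1)
    print(d, near_border.mean())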

Pros

  • Speed up training
  • Data visualization
  • May filter out some noise and unnecessary details and result in higher performance

Cons

  • In general, causes some information loss, so models may perform slightly worse

Major Approaches of Reducing Dimension

  • Projection
    • Project data perpendicularly onto a lower-dimensional subspace
    • PCA
  • Manifold
    • Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold
    • LLE

1. PCA

  • Project the data onto the lower-dimensional hyperplane that preserves the maximum variance
In [9]:
import numpy as np
In [10]:
data = np.array([[45,37,42,35], [38,31,26,28], [10,15,17,21]]).T # 4*3 matrix (m=4 samples, n=3 features)
print(data)
[[45 38 10]
 [37 31 15]
 [42 26 17]
 [35 28 21]]

Center the Data

In [19]:
avg = np.average(data, axis=0)
#std = np.std(data, axis=0)
data = data - avg   # center each feature at zero mean
#data = data/std    # optionally scale each feature to unit variance

Calculate Covariance Matrix

  • Coursera, $\Sigma = \frac{1}{m}\sum_{i=1}^{m}(X^{i})(X^{i})^{T}$
  • Numpy, $\Sigma = \frac{1}{m-1}\sum_{i=1}^{m}(X^{i})(X^{i})^{T}$
  • $X^{i}$ is an n*1 vector (one centered sample)
  • The covariance matrix $\Sigma$ is an n*n matrix
In [12]:
covariance = data.T.dot(data)/(data.shape[0]-1)
print(covariance)
[[ 20.91666667  13.25       -16.75      ]
 [ 13.25        27.58333333 -20.75      ]
 [-16.75       -20.75        20.91666667]]
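
As a quick check (assuming the centered `data` and the `covariance` computed above are still in memory), `np.cov` uses the same 1/(m-1) convention and should reproduce the hand-computed matrix.
In [ ]:
# np.cov treats rows as observations when rowvar=False and divides by m-1
print(np.allclose(np.cov(data, rowvar=False), covariance))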

Calculate Eigenvectors

  • Use either svd or eig to calculate the eigenvectors of the covariance matrix
  • svd is numerically more stable
    • even so, the direction (sign) of the unit vectors is not stable
    • a pair of unit vectors may rotate or swap if the variances along these two axes are close
  • The eigenvector matrix U is an orthogonal matrix, $U^{-1} = U^{T}$
In [13]:
# the columns of u are the eigenvectors, s holds the eigenvalues
u, s, v = np.linalg.svd(covariance)
print(u, s)
[[-0.50086556  0.77075698  0.39378595]
 [-0.63320197 -0.63647951  0.44039652]
 [ 0.59007538 -0.02876659  0.80683551]] [57.40088536 10.60016819  1.41561311]
In [14]:
# the columns of v are the eigenvectors, w holds the eigenvalues
w, v = np.linalg.eig(covariance)
print(v, w)
[[-0.50086556 -0.77075698  0.39378595]
 [-0.63320197  0.63647951  0.44039652]
 [ 0.59007538  0.02876659  0.80683551]] [57.40088536 10.60016819  1.41561311]
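
A quick sanity check of the orthogonality claim (assuming `u` from the SVD cell above): $U^{T}*U$ should be the identity matrix.
In [ ]:
# The eigenvector matrix is orthogonal, so U^T U = I
print(np.allclose(u.T.dot(u), np.eye(3)))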

Calculate Reduced Matrix

  • Choose the first k columns of U to create $U^{k}$; $U^{k}$ is an n*k matrix
  • Typically, k is chosen so that 95% of the variance is retained
  • $Z = X * U^{k}$, Z is an m*k matrix
In [15]:
# Reduced features m*k
Z = data.dot(v[:, :2])
print(Z)
[[-10.61319192   0.40259444]
 [  0.77652327   2.25712663]
 [  2.61835608  -4.72152263]
 [  7.21831257   2.06180157]]

Calculate Retained Variance

In [18]:
ratio = np.sum(w[:2])/np.sum(w)
print(ratio)
0.9796070140316309

Calculate Approximation Matrix

  • Change the dimension back to m*n
  • $Z*(U^{k})^{T} = X*U^{k}*(U^{k})^{T} \approx X$
  • $X_{approx} = Z*(U^{k})^{T}$; the columns of $U^{k}$ are orthonormal, so $(U^{k})^{T}*U^{k} = I$ and the transpose acts as the inverse of the projection
In [347]:
X_app = Z.dot(v[:, :2].T)
X_app = X_app + avg
print(X_app)
[[44.75547986 37.72653717  9.49899804]
 [37.62137014 31.69491878 16.27313709]
 [42.07770213 26.08689936 17.15920536]
 [34.54544787 27.49164469 20.0686595 ]]
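
A sanity check (assuming the centered `data`, the eigenvalues `w`, and `X_app` from the cells above are still in memory): the reconstruction error, normalized by m-1 like the covariance, should equal the variance of the discarded third component.
In [ ]:
# Average squared reconstruction error (per the 1/(m-1) convention) vs. the discarded eigenvalue
residual = (data + avg) - X_app
print(np.sum(residual**2) / (data.shape[0] - 1), w[2])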

PCA with Sklearn

In [67]:
data = np.array([[45,37,42,35], [38,31,26,28], [10,15,17,21]]).T # 4*3 matrix
print(data)
[[45 38 10]
 [37 31 15]
 [42 26 17]
 [35 28 21]]
In [68]:
from sklearn.decomposition import PCA
pca = PCA()
In [69]:
pca.fit(data)
Z = pca.transform(data)
print(Z)
[[10.61319192 -0.40259444 -0.62094684]
 [-0.77652327 -2.25712663  1.57793885]
 [-2.61835608  4.72152263  0.19732072]
 [-7.21831257 -2.06180157 -1.15431273]]
In [70]:
print(pca.components_) # principal components (one per row)
print(pca.get_covariance()) # sigma
print(pca.explained_variance_) # eigenvalues
print(pca.explained_variance_ratio_) # percentage of variance
[[ 0.50086556  0.63320197 -0.59007538]
 [ 0.77075698 -0.63647951 -0.02876659]
 [-0.39378595 -0.44039652 -0.80683551]]
[[ 20.91666667  13.25       -16.75      ]
 [ 13.25        27.58333333 -20.75      ]
 [-16.75       -20.75        20.91666667]]
[57.40088536 10.60016819  1.41561311]
[0.82690351 0.1527035  0.02039299]
In [42]:
X_app = pca.inverse_transform(Z) # all 3 components were kept, so the original data is recovered exactly
print(X_app)
[[45. 38. 10.]
 [37. 31. 15.]
 [42. 26. 17.]
 [35. 28. 21.]]

Choosing the Right Number of Dimensions

In [52]:
# choose the number of PCs that retains at least 95% of the variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
In [ ]:
# choose the ratio of variance to preserve
pca = PCA(n_components=0.95)
In [56]:
# search elbow
import matplotlib.pyplot as plt
fig, ax = plt.subplots();
ax.plot(range(len(pca.explained_variance_ratio_)), cumsum);

Incremental PCA (IPCA)

  • PCA requires the whole training set to fit in memory
  • IPCA splits the training set into mini-batches and feeds the algorithm one mini-batch at a time
In [61]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
data_X, data_Y = mnist['data'], mnist['target']
In [64]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(data_X, n_batches):
    pca.partial_fit(X_batch)
    
X_reduced = pca.transform(data_X)

2. Kernel PCA

  • Preserves clusters of instances after projection
  • Can sometimes even unroll datasets that lie close to a twisted manifold
  • Implicitly maps the data from the original space into a (possibly infinite-dimensional) feature space, then applies linear PCA there to reduce the dimension
  • The reduced data cannot be inverted back to the original data exactly; it can only be mapped to an approximate pre-image in the original space
In [102]:
from sklearn.datasets import make_swiss_roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
In [80]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import proj3d

axes = [-11.5, 14, -2, 23, -12, 15]

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap=plt.cm.hot)
ax.view_init(10, -70)
ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)
Out[80]:
Text(0.5, 0, '$x_3$')
In [108]:
from sklearn.decomposition import KernelPCA

pca = PCA(n_components=2) # plain PCA, equivalent to a linear kernel
X_default = pca.fit_transform(X)
rbf_pca = KernelPCA(n_components = 2, kernel='rbf', gamma=0.04) # rbf kernel
X_rbf = rbf_pca.fit_transform(X)
linear_pca = KernelPCA(n_components = 2, kernel= 'linear') # linear kernel
X_linear = linear_pca.fit_transform(X)
sig_pca = KernelPCA(n_components = 2, kernel='sigmoid', gamma=0.001) # sigmoid kernel
X_sig = sig_pca.fit_transform(X)
In [111]:
f, axs = plt.subplots(2, 2, figsize=(11, 8))
axs[0, 0].scatter(X_default[:, 0], X_default[:, 1], c=t, cmap=plt.cm.hot)
axs[0, 1].scatter(X_linear[:, 0], X_linear[:, 1], c=t, cmap=plt.cm.hot)
axs[1, 0].scatter(X_rbf[:, 0], X_rbf[:, 1], c=t, cmap=plt.cm.hot)
axs[1, 1].scatter(X_sig[:, 0], X_sig[:, 1], c=t, cmap=plt.cm.hot)
Out[111]:
<matplotlib.collections.PathCollection at 0x7fca956a9b90>

Selecting a Kernel and Tuning Hyperparameters

In [96]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('kpca', KernelPCA(n_components=2)),
    ('log_reg', LogisticRegression()),
])
In [ ]:
param_grid = [{
    'kpca__gamma': np.linspace(0.03, 0.05, 10),
    'kpca__kernel': ['rbf', 'sigmoid'],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
y = (t > 6.9)  # example binary target derived from the swiss-roll position t (assumed here for illustration)
grid_search.fit(X, y)
In [113]:
# Alternatively, select the kernel and hyperparameters that give the lowest reconstruction (pre-image) error
rbf_pca = KernelPCA(n_components = 2, kernel='rbf', gamma=0.0433, fit_inverse_transform=True)
X_rbf = rbf_pca.fit_transform(X);
X_preimage = rbf_pca.inverse_transform(X_rbf)
In [136]:
# Original Plot
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 3, 1, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap=plt.cm.hot)
ax.view_init(10, -70)
# Reduced Plot
ax = fig.add_subplot(1, 3, 2)
ax.scatter(X_rbf[:, 0], X_rbf[:, 1], c=t, cmap=plt.cm.hot)
# Pre-image Plot
ax = fig.add_subplot(1, 3, 3, projection='3d')
ax.scatter(X_preimage[:, 0], X_preimage[:, 1], X_preimage[:, 2], c=t, cmap=plt.cm.hot)
Out[136]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7fca9f1ea210>
In [133]:
from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)
Out[133]:
32.78630879576614

3. Locally Linear Embedding (LLE)

  • Manifold Learning technique
  • Measures how each instance linearly relates to its closest neighbors, then looks for a low-dimensional representation of the instances where these local relationships are best preserved
  • Complexity: $O(m\log(m)\,n\log(k))$ for finding the k nearest neighbors, $O(mnk^{3})$ for optimizing the weights, $O(dm^{2})$ for constructing the low-dimensional representations
  • The last step makes it scale poorly to very large datasets
In [161]:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8)
X_reduced = lle.fit_transform(X)
In [162]:
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[162]:
<matplotlib.collections.PathCollection at 0x7fca98e99390>

4. Modified LLE

  • When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient (a regularization problem)
  • Modified LLE uses multiple weight vectors in each neighborhood to address this
In [173]:
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8, method = 'modified')
X_reduced = lle.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[173]:
<matplotlib.collections.PathCollection at 0x7fca9b64e3d0>

5. Hessian Eigenmapping (Hessian-based LLE, HLLE)

  • another method of solving the regularization problem of LLE
In [171]:
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8, method = 'hessian')
X_reduced = lle.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[171]:
<matplotlib.collections.PathCollection at 0x7fca9971fa10>

6. Local Tangent Space Alignment (LTSA)

  • Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space
In [177]:
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=8, method = 'ltsa')
X_reduced = lle.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[177]:
<matplotlib.collections.PathCollection at 0x7fca9b2f5650>

7. Random Projections

  • Computationally efficient, but it may introduce relatively high distortion when the target dimension is small
  • $X_{reduced} = \dfrac{1}{\sqrt{d}}XM$
  • X is a $m*n$ matrix, M is a $n*d$ random matrix
  • Gaussian Random Projection
    • each entry $r_{ij}$ of M is drawn from a standard Gaussian distribution
  • Sparse Random Projection
    • the entries of M are drawn from a sparse distribution (mostly zeros), with the nonzero values chosen according to fixed probabilities
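A minimal NumPy sketch of the formula above (illustrative only, assuming `X` from the swiss-roll cells; sklearn's GaussianRandomProjection folds the $1/\sqrt{d}$ scaling into the variance of the random matrix):
In [ ]:
# Manual Gaussian random projection: X_reduced = (1/sqrt(d)) * X * M
rng = np.random.default_rng(42)
d = 2
M = rng.normal(size=(X.shape[1], d))   # n*d matrix of standard Gaussian entries
X_manual = X.dot(M) / np.sqrt(d)
print(X_manual.shape)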
In [166]:
from sklearn import random_projection
rp_gaussian = random_projection.GaussianRandomProjection(n_components=2)
X_reduced = rp_gaussian.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[166]:
<matplotlib.collections.PathCollection at 0x7fca98fad5d0>
In [167]:
rp_sparse = random_projection.SparseRandomProjection(n_components=2)
X_reduced = rp_sparse.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[167]:
<matplotlib.collections.PathCollection at 0x7fca994f36d0>

8. Multi-dimensional Scaling (MDS)

  • seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space
  • Metric MDS
    1. Calculate the dissimilarity matrix D of the original data, where $d_{ij}$ is the Euclidean distance between points i and j
    2. Assign each data point arbitrary initial coordinates in the d-dimensional space
    3. Calculate the dissimilarity matrix M of the reduced data, where $m_{ij}$ is the Euclidean distance in the embedding
    4. Adjust the coordinates (e.g., by steepest descent) to minimize the stress, i.e. the discrepancy between M and D
  • Non-Metric MDS
    • Focuses on the ordination of the data
    • If $d_{ij} < d_{kl}$ in the original space, then the embedding should satisfy $m_{ij} < m_{kl}$ (only the rank order of the dissimilarities is preserved)
In [179]:
# metric MDS
from sklearn.manifold import MDS
embedding = MDS(n_components=2)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[179]:
<matplotlib.collections.PathCollection at 0x7fca9b628d90>
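A quick check of the distance-preservation claim (assuming `X` and the metric-MDS `X_reduced` from the cell above): compare the pairwise distance matrices and inspect the stress that metric MDS minimizes.
In [ ]:
from sklearn.metrics import pairwise_distances
D_orig = pairwise_distances(X)          # distances in the original 3-D space
D_low = pairwise_distances(X_reduced)   # distances in the 2-D embedding
print(np.corrcoef(D_orig.ravel(), D_low.ravel())[0, 1])
print(embedding.stress_)                # residual stress after fitting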
In [180]:
# nonmetric MDS
from sklearn.manifold import MDS
embedding = MDS(n_components=2, metric=False)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[180]:
<matplotlib.collections.PathCollection at 0x7fca96221a10>

9. Isometric Mapping (Isomap)

  • A nonlinear method for dimensionality reduction
  • Steps
    • Construct neighborhood graph
    • Compute shortest paths between all pairs
    • Construct d-dimensional coordinate vectors with MDS
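The three steps can be sketched with generic tools (an illustrative approximation, assuming `X` from the swiss-roll cells above; not sklearn's internal implementation):
In [ ]:
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

knn = kneighbors_graph(X, n_neighbors=8, mode='distance')   # 1. neighborhood graph
geo = shortest_path(knn, directed=False)                    # 2. all-pairs geodesic distances
m = geo.shape[0]                                            # 3. classical MDS on the geodesic distances
J = np.eye(m) - np.ones((m, m)) / m
B = -0.5 * J.dot(geo**2).dot(J)                             # double-centered Gram matrix
w_iso, v_iso = np.linalg.eigh(B)
X_iso = v_iso[:, -2:] * np.sqrt(w_iso[-2:])                 # coordinates from the top 2 eigenpairs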
In [178]:
from sklearn.manifold import Isomap
embedding = Isomap(n_components=2)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[178]:
<matplotlib.collections.PathCollection at 0x7fca9fc76e90>

10. Spectral Embedding

  • finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian
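A minimal Laplacian-eigenmaps sketch of that idea (illustrative only, assuming `X` from above; sklearn's SpectralEmbedding chooses affinities and eigensolvers more carefully):
In [ ]:
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

A = kneighbors_graph(X, n_neighbors=8, include_self=False)
A = 0.5 * (A + A.T)                        # symmetrize the k-NN adjacency graph
L = laplacian(A, normed=True).toarray()    # normalized graph Laplacian
w_se, v_se = np.linalg.eigh(L)             # spectral decomposition, eigenvalues ascending
X_se = v_se[:, 1:3]                        # skip the trivial constant eigenvector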
In [176]:
from sklearn.manifold import SpectralEmbedding
embedding = SpectralEmbedding(n_components=2)
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[176]:
<matplotlib.collections.PathCollection at 0x7fca9ed38290>

11. t-distributed Stochastic Neighbor Embedding (t-SNE)

  1. Each point in the high-dimensional space has a conditional probability of picking each other point as its neighbor; the probability is computed from a Gaussian kernel over the distance/similarity between the two points.
  2. Give each data point a location in the low-dimensional space and compute analogous neighbor probabilities there, using a Student t-distribution.
  3. Minimize the Kullback–Leibler divergence $\sum_{i}\sum_{j}p_{j|i}\log\dfrac{p_{j|i}}{q_{j|i}}$, where $p_{j|i}$ is the probability that point i picks j as its neighbor in the high-dimensional space and $q_{j|i}$ is the corresponding probability in the low-dimensional space
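A small NumPy sketch of the objective in step 3 (illustrative only; `P` and `Q` are assumed to be the matrices of conditional probabilities, with `P[i, j]` holding $p_{j|i}$):
In [ ]:
import numpy as np

# KL divergence between the high- and low-dimensional neighbor distributions
def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0                     # terms with p_{j|i} = 0 contribute nothing
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))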

Pros

* Revealing the structure at many scales on a single map
* Revealing data that lie in multiple, different manifolds or clusters, which helps visually disentangle such datasets
* Reducing the tendency to crowd points together at the center

Cons

* The KL divergence is not convex, i.e. multiple restarts with different initializations can end up in different local minima
* t-SNE is computationally expensive
* The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings
* Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA.

Optimization

  1. perplexity, $k = 2^{S}$ where S is the Shannon entropy of the conditional neighbor distribution; roughly the effective number of nearest neighbors; larger perplexities use more neighbors and are less sensitive to small-scale structure
  2. early exaggeration factor, controls how tight natural clusters in the original space are in the embedded space and how much space will be between them, for larger values, the space between natural clusters will be larger in the embedded space
  3. learning rate
  4. maximum number of iterations
  5. angle (Barnes-Hut only), a tradeoff between performance and accuracy; larger angles give better speed but less accurate results
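These knobs map directly onto scikit-learn's TSNE constructor; the values below are placeholders for illustration, not tuned recommendations:
In [ ]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2,
            perplexity=30,           # 1. perplexity
            early_exaggeration=12,   # 2. early exaggeration factor
            learning_rate=200,       # 3. learning rate
            angle=0.5,               # 5. Barnes-Hut angle (speed vs. accuracy)
            init='pca')              # 4. the iteration budget can also be set via the constructor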
In [185]:
from sklearn.manifold import TSNE
embedding = TSNE(n_components=2, init='pca')
X_reduced = embedding.fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
Out[185]:
<matplotlib.collections.PathCollection at 0x7fca9616b890>