PCA - Principal Component Analysis
PCA is a popular method for dimensionality reduction. It projects the data onto the lower-dimensional hyperplane (here a simple 2D plane) that lies closest to the data, for example 3D to 2D or 2D to 1D, though the dimensions involved can be much higher.
This example shows how to project data from 3D space onto a 2D plane.
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib notebook
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
Create test 3D dataset:
m=60
w1, w2 = 0.1, 0.3
noise = 0.1
angles = np.random.rand(m)*3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)
Display the test data:
fig = plt.figure(figsize=(8,6))
ax = fig.gca(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], label='Test data')
ax.legend()
fig.show()
Using Singular Value Decomposition
Print first 20 rows of the data:
print(X[0:20,:])
[[ 0.7039474 0.65945649 0.3645693 ]
[-1.0763742 -0.50036633 -0.21646923]
[-0.90914129 0.22798983 0.05968884]
[-0.26327638 0.52076062 0.31957985]
[ 1.10617261 0.17601773 0.13888377]
[ 1.0008109 0.15934054 0.07250963]
[ 0.87852969 -0.25299137 -0.07699588]
[-1.13697989 -0.29958862 -0.2851556 ]
[-0.36237145 0.50948368 0.10889779]
[-0.7732104 0.3332867 0.05678017]
[ 0.77535051 -0.28414251 0.01996138]
[-0.95258223 -0.54563776 -0.17623123]
[-1.14143853 -0.19599228 -0.17164135]
[ 1.10180912 0.27754754 0.33879858]
[ 1.12822483 0.30165509 0.17685333]
[ 1.16129536 0.28698597 0.47424224]
[ 0.97281446 0.60223335 0.34051818]
[ 0.05953046 0.59871039 0.09985041]
[ 0.47966775 0.76970355 0.17178859]
[ 0.9661238 0.46601271 0.28466344]]
Calculate mean for each column:
X.mean(axis=0)
array([0.15681044, 0.17698876, 0.0862971 ])
Make the columns zero-mean (subtract the column means from the matrix):
X_centered = X - X.mean(axis=0)
X_centered[0:20,:]
array([[ 5.47136958e-01, 4.82467738e-01, 2.78272202e-01],
[-1.23318464e+00, -6.77355084e-01, -3.02766325e-01],
[-1.06595173e+00, 5.10010710e-02, -2.66082639e-02],
[-4.20086817e-01, 3.43771864e-01, 2.33282748e-01],
[ 9.49362168e-01, -9.71028359e-04, 5.25866684e-02],
[ 8.44000462e-01, -1.76482181e-02, -1.37874640e-02],
[ 7.21719251e-01, -4.29980124e-01, -1.63292983e-01],
[-1.29379033e+00, -4.76577375e-01, -3.71452702e-01],
[-5.19181893e-01, 3.32494922e-01, 2.26006880e-02],
[-9.30020839e-01, 1.56297940e-01, -2.95169326e-02],
[ 6.18540077e-01, -4.61131262e-01, -6.63357193e-02],
[-1.10939267e+00, -7.22626519e-01, -2.62528326e-01],
[-1.29824897e+00, -3.72981033e-01, -2.57938446e-01],
[ 9.44998681e-01, 1.00558780e-01, 2.52501481e-01],
[ 9.71414397e-01, 1.24666339e-01, 9.05562297e-02],
[ 1.00448492e+00, 1.09997216e-01, 3.87945146e-01],
[ 8.16004024e-01, 4.25244590e-01, 2.54221086e-01],
[-9.72799781e-02, 4.21721631e-01, 1.35533075e-02],
[ 3.22857312e-01, 5.92714789e-01, 8.54914899e-02],
[ 8.09313362e-01, 2.89023958e-01, 1.98366337e-01]])
Use np.linalg.svd() to compute the singular value decomposition components (U, s, Vt):
U, s, Vt = np.linalg.svd(X_centered)
These three components can fully reconstruct our mean-centered 3D data. Note that np.linalg.svd() returns s as a vector of singular values, so the reconstruction is U · diag(s) · Vt rather than a plain product of three matrices. The rows of Vt hold all the principal components we are looking for.
Vt
array([[-0.95250178, -0.24902446, -0.17529172],
[ 0.29267159, -0.9076305 , -0.30091563],
[ 0.08416476, 0.33792558, -0.93740205]])
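As a quick sanity check (this cell is an addition to the original notebook), the mean-centered matrix can be rebuilt from the three factors. Since s is a plain vector, it first has to be placed on the diagonal of a 60x3 matrix:
# Rebuild the mean-centered data from the three SVD factors
Sigma = np.zeros(X_centered.shape)      # 60x3 matrix of zeros
Sigma[:3, :3] = np.diag(s)              # singular values on the diagonal
np.allclose(U.dot(Sigma).dot(Vt), X_centered)   # should print True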
We are projecting from 3D to 2D, so we need the first two principal components, i.e. the first two rows of the Vt matrix.
Extract first 2 principal components from Vt:
c1 = Vt.T[:,0]
c2 = Vt.T[:,1]
c1
array([-0.95250178, -0.24902446, -0.17529172])
c2
array([ 0.29267159, -0.9076305 , -0.30091563])
They are needed to create the projection matrix that maps points from 3D space onto the 2D plane:
np.column_stack((c1, c2))
array([[-0.95250178, 0.29267159],
[-0.24902446, -0.9076305 ],
[-0.17529172, -0.30091563]])
W2 = np.stack((c1,c2)).T
X2D = X_centered.dot(W2)
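As a side note (an addition to the original notebook), the singular values also reveal how much of the total variance each principal component captures; for mean-centered data the ratio is each squared singular value divided by their sum:
# Fraction of the total variance captured by each principal component
s**2 / np.sum(s**2)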
Plot the projection in 2D:
fig, ax = plt.subplots(figsize=(6,5))
ax.scatter(X2D[:,0], X2D[:,1], label='2d projection')
ax.legend()
ax.plot()
PCA using Scikit-Learn
The same result can be achieved with less code using Scikit-Learn's PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D_scikit = pca.fit_transform(X)
fig, ax = plt.subplots(figsize=(6,5))
ax.scatter(X2D_scikit[:,0], X2D_scikit[:,1], label='2d projection')
ax.legend()
ax.plot()
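Scikit-Learn also exposes how much variance the two kept components capture; pca.explained_variance_ratio_ should roughly match the manual ratio computed from the singular values earlier (shown here as an extra check, not part of the original walkthrough):
# Fraction of the dataset's variance captured by each of the two components
pca.explained_variance_ratio_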
Comparison between SVD and Scikit-Learn
X2D[:5]
array([[-0.690074 , -0.36150744],
[ 1.39636097, 0.34497714],
[ 1.00728461, -0.35025708],
[ 0.2736333 , -0.50516373],
[-0.91324535, 0.26290852]])
X2D_scikit[:5]
array([[-0.690074 , -0.36150744],
[ 1.39636097, 0.34497714],
[ 1.00728461, -0.35025708],
[ 0.2736333 , -0.50516373],
[-0.91324535, 0.26290852]])
- Scikit-Learn PCA does matrix mean centering automatically
- Sometimes one of the methods can produce flipped axes (components multiplied by -1); this is not really relevant for further learning on the lower-dimensional dataset, and the sketch below shows how to check for it
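A minimal sketch of that check (added here, not part of the original notebook): pca.components_ stores the principal components as rows, so it should match the first two rows of Vt up to a possible per-row sign flip:
# Principal components found by Scikit-Learn, one per row
pca.components_
# Compare with the first two rows of Vt, ignoring a possible per-row sign flip
np.allclose(np.abs(pca.components_), np.abs(Vt[:2]))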
Inverse transform from 2d to 3d using Scikit-Learn
X3D_inv = pca.inverse_transform(X2D_scikit)
fig = plt.figure(figsize=(8,6))
ax = fig.gca(projection='3d')
ax.scatter(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], label='Inversion using Scikit-Learn')
ax.legend()
fig.show()
It is visible that both 3D charts are similar to each other, but not exactly the same. Even though the recovered data has three dimensions again and the overall shape of the curve is similar, the points appear more flattened than in the original chart, because the variance lost along the dropped third component cannot be recovered.
There is some reconstruction error, which can be computed as the mean squared distance between the original and the recovered points:
np.mean(np.sum(np.square(X3D_inv - X), axis=1))
0.009421417196957218
Inversion using NumPy's SVD
X3D_svd = X2D.dot(Vt[:2, :])
fig = plt.figure(figsize=(8,6))
ax = fig.gca(projection='3d')
ax.scatter(X3D_svd[:, 0], X3D_svd[:, 1], X3D_svd[:, 2], label='Inversion with SVD')
ax.legend()
fig.show()
The reconstruction error after reversing manually is:
np.mean(np.sum(np.square(X3D_svd - X), axis=1))
0.07278313972817166
The reconstruction error with the manual SVD approach is higher than with Scikit-Learn. This can even be observed on the 3D charts: although they look the same at first glance, the axes are positioned slightly differently. That's because Scikit-Learn automatically takes care of reversing the mean centering.
To achieve the same result as with Scikit-Learn's PCA we need to add the mean back to the reconstructed 3D data manually:
X3D_svd += X.mean(axis=0)
np.mean(np.sum(np.square(X3D_svd - X), axis=1))
0.009421417196957218
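For completeness (again an addition, relying only on the pca object fitted above), Scikit-Learn's inverse_transform performs exactly these two manual steps: project back through the components and add the stored mean, which is available as pca.mean_:
# Manual equivalent of pca.inverse_transform(): back-project, then undo the centering
X3D_manual = X2D_scikit.dot(pca.components_) + pca.mean_
np.allclose(X3D_manual, X3D_inv)   # should print True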
This experiment was put together after reading Hands-On Machine Learning with Scikit-Learn and TensorFlow. If you don't have it yet, I highly recommend buying it.
The source code is available for download from my GitHub repository: