Mahavir Dabas

Demystifying Dimensionality Reduction-PCA on MNIST Dataset


Why Dimensionality Reduction?


Dealing with data comes as naturally to a Machine Learning engineer as breathing. One of the most notable aspects of data analysis is data visualization which is nothing but the process of translating information into visual contexts like maps, graphs, etc. Visualizing the given information gives us a feel of the data before we can go ahead and apply complicated machine learning algorithms.


The input variables which are used as indicators for prediction are also known as the features of the dataset. The ever-famous "Iris Dataset" has 4 features- Sepal Length, Sepal Width, Petal Length, Petal Width. It is important to note that in Machine Learning, a dataset with n features and a dataset of n dimensions mean the same thing. Thus we can use the words features and dimensions interchangeably.

Now coming back to visualization- we know that a 2-dimensional point can be visualized on a plane surface. Similarly, a 3-dimensional point can be represented by taking three mutually perpendicular axes. Now try to visualize a point having 4 dimensions like the one present in the Iris Dataset- if you can, then I'd suggest you visit your family doctor ASAP. The thing is, the human brain is capable of achieving many wonders, but it cannot imagine a space beyond 3 dimensions.

So the natural question arises- how do we then visualize data with hundreds and possibly even thousands of features/dimensions? We need a way of embedding our data from higher to lower dimensions without losing its essence, and this is where "Dimensionality Reduction", the hero of this article, comes into the picture.


The MNIST Dataset of 784 dimensions


The MNIST Dataset is the de facto “hello world” dataset of computer vision which was released in 1999. It is a database of handwritten digits which can be easily found here on Kaggle as the Digit Recognizer problem. The handwritten digits are stored as black and white images of 28x28 pixels. Now each of these images is a datapoint which is represented by a 28x28 matrix.


Since a datapoint must be a row/column vector, row flattening is performed on each of these matrices to obtain row vectors of 784 dimensions. Thus each of the 784 pixels present in the original 28x28 matrix acts as a feature of the input variable.
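As a quick illustration of what this flattening looks like (the array below is just a placeholder, not an actual MNIST image), a 28x28 matrix can be reshaped into a 784-dimensional row vector with NumPy:

import numpy as np
pixel_matrix = np.zeros((28, 28))        # placeholder image, standing in for one handwritten digit
row_vector = pixel_matrix.reshape(-1)    # row flattening: 28x28 -> 784
print(row_vector.shape)                  # (784,)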

Loading The Dataset


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the Kaggle Digit Recognizer training data
d0 = pd.read_csv("mnist_train.csv")

# separate the output labels from the 784 pixel features
label = d0["label"]
d0 = d0.drop("label", axis=1)

# inspect the two objects (in a notebook, these display the data)
d0
label

The d0 data frame has 784 columns representing the 784 pixel features and 42000 rows, each acting as a datapoint. The label series, which has been separated from the original data frame, stores the output variable for each row present in d0. Hence label stores the "digit" represented by each of the row-flattened pixel matrices.


I have already marked that the pixel matrix at index 3 represents the handwritten digit 4, so let's use this fact to check whether we have loaded the data correctly or not.

index = 3
number_at_index_3 = d0.iloc[index].to_numpy().reshape(28,28)  # reshape from 1d to 2d pixel array
plt.imshow(number_at_index_3, interpolation = "none", cmap = "gray")
plt.show()



Some Prerequisites Required Before Moving Ahead


Data Preprocessing Techniques


What follows after extracting information is data preprocessing. It transforms the raw data into a format that is far more useful for the algorithms that follow. There are two main methods of data preprocessing-


  1. Column Normalisation- Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

X' = (X - Xmin) / (Xmax - Xmin)

where X' = transformed feature, Xmin = minimum value of the feature, Xmax = maximum value of the feature


The advantage of using column normalisation is that it converts features belonging to different scales (cm, km, kg, Celsius, etc.) to the same scale, ranging between 0 and 1. This is of huge help when algorithms based on Euclidean distances (KNN, K-means, SVM, etc.) are used, as it prevents the algorithm from being biased by the high-magnitude values of one feature, that is, it makes all features contribute equally to the result.
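As a small illustration (the toy dataframe and its column names below are made up for this example), min-max scaling can be applied column-wise directly in pandas:

import pandas as pd
# toy dataframe with two hypothetical features measured on very different scales
toy = pd.DataFrame({"height_cm": [150.0, 165.0, 180.0],
                    "weight_kg": [50.0, 65.0, 90.0]})
# X' = (X - Xmin) / (Xmax - Xmin), applied to every column
toy_normalised = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_normalised)   # every value now lies between 0 and 1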


2. Column Standardisation- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

X' = (X - μ) / σ

where X' = transformed feature, μ = mean of the feature values, σ = standard deviation of the feature values

Column standardisation matters less for tree-based algorithms, since splitting nodes on individual features makes them largely invariant to the scale of the features; it is most useful for algorithms that rely on distances or gradient-based optimisation.
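A minimal sketch of column standardisation, reusing the same hypothetical toy dataframe as in the normalisation example above:

import pandas as pd
# same toy dataframe as in the normalisation example above
toy = pd.DataFrame({"height_cm": [150.0, 165.0, 180.0],
                    "weight_kg": [50.0, 65.0, 90.0]})
# X' = (X - mean) / std, applied to every column
toy_standardised = (toy - toy.mean()) / toy.std()
print(toy_standardised.mean().round(12))   # approximately 0 for every column
print(toy_standardised.std())              # 1 for every column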


Covariance Matrix


In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.


For a column-standardised dataset X with d features and n data points (arranged as an n x d matrix), the covariance matrix S can be calculated by the following formula-

S = (1/n) XᵀX

Some important points regarding the covariance matrix-

  • S is square-symmetric with size d x d, where d is the number of features in the dataset

  • The element Sij = covariance(feature i, feature j)
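A small sanity check of these properties (the toy matrix below is randomly generated purely for illustration) looks like this:

import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data: n = 100 points, d = 3 features
X = (X - X.mean(axis=0)) / X.std(axis=0)      # column standardise
S = X.T @ X / X.shape[0]                      # S = (1/n) * X^T * X
print(S.shape)                                # (3, 3) -- square, d x d
print(np.allclose(S, S.T))                    # True -- symmetric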


Principal Component Analysis


Geometric Intuition Behind PCA- VERY IMPORTANT!


Case-1


For the purpose of understanding, consider a situation where you have been provided with a dataset of 2 dimensions and the task is to reduce it to a unidimensional dataset.


Consider feature 1 to be the degree of blackness of hair of people from India and feature 2 to be their corresponding heights.


It is quite easy to say that the variability in feature 1 would be much less than the variability in feature 2, as most of the people living in India have black hair. The graph corresponding to this information would look something like this-



variability/spread in feature 1: less

variability/spread in feature 2: more


Now if you are forced to drop one feature, which one would you drop?


You would want to preserve that feature/dimension which has the most spread/variance/variability as it contains the most information leading to less information being lost when you drop the other feature.


Preserve Feature 2!

We have reduced the dimensionality from 2 to 1.


Case-2


Now consider another dataset having 2 features and the following plot-



For this particular case, it is quite difficult to assess which feature has more variability.


So what do we do if both the features have almost the same spread? In such cases, we find a new axis along which the spread is maximum, project the data points onto this new axis, and select it as the final dimension.



Take projections on Feature x and preserve it!


We have again successfully reduced the dimensionality from 2 to 1.


Workflow-5 steps towards PCA


Principal Component Analysis performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance matrix of the data is constructed and the eigenvectors of this matrix are computed.


Since we already know that the covariance matrix is a square-symmetric matrix of size d x d (where d = number of features), it will have d eigenvalues.


The eigenvector corresponding to the largest eigenvalue- Unit vector in direction of maximum variance

The eigenvector corresponding to the 2nd largest eigenvalue- Unit vector in direction of 2nd maximum variance

The eigenvector corresponding to the 3rd largest eigenvalue- Unit vector in direction of 3rd maximum variance

And this trend follows for all the eigenvalues.


So we have the following steps for dimensionality reduction using PCA-

  1. Column standardise the data matrix X

  2. Find the covariance matrix of X and call it S

  3. Find the eigenvalues and the eigenvectors of S. The unit vector in the direction of maximum spread/variance is the eigenvector v1 corresponding to the largest eigenvalue

  4. Find the projections of each of the datapoints on v1

  5. Take v1 as the final feature, reducing the dimensionality to 1


Let's Code!


Step 1- To column standardise the dataset, we use the StandardScaler class provided by the sklearn library.
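The original code cell is not reproduced here, so below is a minimal sketch of this step, assuming the d0 dataframe loaded earlier; it produces the standardised_data array used in the later steps:

from sklearn.preprocessing import StandardScaler
# column standardise the 42000 x 784 pixel matrix: each column gets mean 0 and unit variance
standardised_data = StandardScaler().fit_transform(d0)
print(standardised_data.shape)   # (42000, 784)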

Step 2- We have already seen the formula for the covariance matrix of a column-standardised dataset, so let's apply it to our standardised_data.
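A sketch of this step, using the formula S = (1/n) XᵀX on the standardised data (the variable name covar_matrix is my own choice, not from the original post):

# covariance matrix of the column-standardised data: (784 x 42000) x (42000 x 784) = (784 x 784)
covar_matrix = np.matmul(standardised_data.T, standardised_data) / standardised_data.shape[0]
print(covar_matrix.shape)   # (784, 784)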

Step 3- To find the top eigenvectors of the covariance matrix, the SciPy function eigh is used.
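A sketch of this step with scipy.linalg.eigh; note that subset_by_index is the keyword in recent SciPy versions (older versions used eigvals=(782, 783) for the same purpose):

from scipy.linalg import eigh
# eigh returns eigenvalues in ascending order, so indices 782 and 783 pick the two largest
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])
print(vectors.shape)   # (784, 2) -- each column is an eigenvector of the covariance matrix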


The function returns the eigenvalues in ascending order. For 2-d visualisation, the top 2 eigenvalues have been chosen (indices 782 and 783, since indexing starts from 0). The two columns of vectors are the eigenvectors corresponding to the 2 largest eigenvalues.


Step 4- Project the data points on to the new dimensions and create the final data matrix.

# for simplification, vectors has been transposed so that its new shape is 2x784
vectors = vectors.T
# project the standardised data onto the top 2 eigenvectors: (2x784) x (784x42000) = (2x42000)
new_coordinates = np.matmul(vectors, standardised_data.T)
print("new data matrix' shape is {} X {} = {}".format(vectors.shape, standardised_data.T.shape, new_coordinates.shape))
# creating the final dataframe with 2 dimensions
new_data_matrix = new_coordinates.T
final_data = pd.DataFrame(data=new_data_matrix, columns=("1st_principal", "2nd_principal"))
# appending the handwritten number labels to the dataframe
final_data["labels"] = label
final_data


Step 5- Now plot the result based on the final data. Seaborn has been used to plot the scatter plot.

# plotting the 2d data points with seaborn
import seaborn as sn
# note: newer seaborn versions use height instead of the old size argument
sn.FacetGrid(final_data, hue="labels", height=8).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

We have successfully reduced the dimensionality of a 784-dimensional dataset and used the result for 2-d visualisation of the data!


It is interesting to observe that there is a huge overlap between the colours of cyan and grey representing the digits 4 and 9 as both of them are visually similar to each other in a handwritten context.


Why fear when scikit learn is there?


All this hassle can be avoided by using just one function provided by the most beloved library to get practically the same result!


# initializing the pca
from sklearn import decomposition
pca = decomposition.PCA()

# configuring the parameters: the number of components = 2
pca.n_components = 2
pca_data = pca.fit_transform(d0)
print(pca_data.shape)

# creating a new data frame which helps us in plotting the result data
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal"))
# adding the labels
pca_df["labels"] = label
sn.FacetGrid(pca_df, hue="labels", height=8).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()


Information vs Number of Dimensions


In real world applications, there is not only a constraint on the number of dimensions but also on the percentage of information that has to be preserved after dimensionality reduction.


For example- If you are given the task to reduce the dimensionality but are also told to preserve no less than 90 percent of the information, what is the method to decide how many final dimensions to take?


The percentage of original information retained depends on the eigenvalues of the eigenvectors that are kept. If λ1 ≥ λ2 ≥ ... ≥ λd are the eigenvalues of the covariance matrix, the fraction of information retained when the top k dimensions are kept is-

(λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λd)

Using this formula, a curve of Information Retained vs Number of dimensions taken can be plotted.

# fit PCA with all 784 components on the standardised data
pca.n_components = 784
pca_data = pca.fit_transform(standardised_data)

# fraction of the total variance explained by each component
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)

# Plot the PCA spectrum
plt.figure(1, figsize=(8, 6))
plt.clf()
plt.plot(cum_var_explained, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('No of features taken')
plt.ylabel('Information Retained')
plt.show()

It can be seen from this plot that when 200 dimensions are taken, almost 90% of the original information is retained with roughly a 75% reduction in dimensionality (200 out of 784 features).


Conclusion


We have successfully reduced the dimensionality of the 784-dimensional MNIST dataset for 2-d visualisation through PCA.
