Principal Component Analysis

Principal component analysis (PCA) is a dimensionality reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set. In other words, PCA reduces the number of variables in a data set while preserving as much information as possible.

In this cookbook we will implement PCA to reduce the number of variables in our data set while preserving 95% of the data’s variance.
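Before loading the data, here is a minimal NumPy sketch of what PCA computes under the hood: the principal directions come from a singular value decomposition (equivalently, an eigendecomposition of the covariance matrix) of the centered data. This is an illustration on random numbers only; in the rest of the cookbook we rely on scikit-learn’s PCA estimator.

import numpy as np

# Illustration only: PCA of a small random matrix via an SVD of the centered data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_centered = X_demo - X_demo.mean(axis=0)

# The rows of Vt are the principal directions (the "components")
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component and its share of the total variance
explained_var_demo = S**2 / (X_demo.shape[0] - 1)
explained_var_ratio_demo = explained_var_demo / explained_var_demo.sum()
print(explained_var_ratio_demo)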

# load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

The first step is to read the data. Please refer to the Random Forest Regression Cookbook for a detailed explanation of the feature engineering. Here we will read the dataset containing the new variables obtained after feature engineering.

dust_df = pd.read_csv('../saharan_dust_met_cat_vars.csv', index_col='time')
dust_df
PM10 T2 rh2 slp PBLH wind_speed_10m wind_speed_925hPa WIND_DIR RAIN
time
1960-01-01 2000.1490 288.24875 32.923786 1018.89420 484.91812 6.801503 13.483623 NE 0
1960-01-02 4686.5370 288.88450 30.528862 1017.26575 601.58310 8.316340 18.027075 NE 0
1960-01-03 5847.7515 290.97128 26.504536 1015.83514 582.38540 9.148216 17.995173 NE 0
1960-01-04 5252.0586 292.20060 30.678936 1013.92230 555.11860 8.751743 15.806478 NE 0
1960-01-05 3379.3190 293.06076 27.790462 1011.94934 394.95440 6.393228 9.160809 NE 0
... ... ... ... ... ... ... ... ... ...
2010-12-27 2681.4685 292.38474 18.858383 1011.69574 315.81320 4.749993 7.846004 NE 4
2010-12-28 1345.8488 291.46680 26.357006 1010.66340 232.03355 3.051484 3.346668 NE 4
2010-12-29 4500.9810 289.62990 23.169529 1014.53740 557.29913 6.249619 13.007574 NE 4
2010-12-30 5150.3840 290.11844 43.158295 1017.36230 745.95575 8.769048 19.371056 NE 4
2010-12-31 4753.6760 291.39554 38.372738 1016.40400 697.51697 8.137861 17.666280 NE 4

18466 rows × 9 columns

One Hot Encoding

Let’s look at the unique values of the WIND_DIR feature.

dust_df['WIND_DIR'].unique()
array(['NE', 'NW', 'SE', 'SW'], dtype=object)

Since WIND_DIR is a categorical variable, we have to perform One Hot Encoding (OHE). One Hot Encoding creates additional features based on the number of unique values in the categorical feature: every unique value in the category is added as a new binary feature, also known as a dummy variable.

one_hot = pd.get_dummies(dust_df['WIND_DIR'], dtype=int)
one_hot
NE NW SE SW
time
1960-01-01 1 0 0 0
1960-01-02 1 0 0 0
1960-01-03 1 0 0 0
1960-01-04 1 0 0 0
1960-01-05 1 0 0 0
... ... ... ... ...
2010-12-27 1 0 0 0
2010-12-28 1 0 0 0
2010-12-29 1 0 0 0
2010-12-30 1 0 0 0
2010-12-31 1 0 0 0

18466 rows × 4 columns

df_new = dust_df.drop('WIND_DIR', axis=1)

Let’s now join the One Hot Encoded features to the original dataframe.

df_new = df_new.join(one_hot)
df_new
PM10 T2 rh2 slp PBLH wind_speed_10m wind_speed_925hPa RAIN NE NW SE SW
time
1960-01-01 2000.1490 288.24875 32.923786 1018.89420 484.91812 6.801503 13.483623 0 1 0 0 0
1960-01-02 4686.5370 288.88450 30.528862 1017.26575 601.58310 8.316340 18.027075 0 1 0 0 0
1960-01-03 5847.7515 290.97128 26.504536 1015.83514 582.38540 9.148216 17.995173 0 1 0 0 0
1960-01-04 5252.0586 292.20060 30.678936 1013.92230 555.11860 8.751743 15.806478 0 1 0 0 0
1960-01-05 3379.3190 293.06076 27.790462 1011.94934 394.95440 6.393228 9.160809 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
2010-12-27 2681.4685 292.38474 18.858383 1011.69574 315.81320 4.749993 7.846004 4 1 0 0 0
2010-12-28 1345.8488 291.46680 26.357006 1010.66340 232.03355 3.051484 3.346668 4 1 0 0 0
2010-12-29 4500.9810 289.62990 23.169529 1014.53740 557.29913 6.249619 13.007574 4 1 0 0 0
2010-12-30 5150.3840 290.11844 43.158295 1017.36230 745.95575 8.769048 19.371056 4 1 0 0 0
2010-12-31 4753.6760 291.39554 38.372738 1016.40400 697.51697 8.137861 17.666280 4 1 0 0 0

18466 rows × 12 columns

X_features = df_new.drop(labels='PM10', axis = 1)
X_features
T2 rh2 slp PBLH wind_speed_10m wind_speed_925hPa RAIN NE NW SE SW
time
1960-01-01 288.24875 32.923786 1018.89420 484.91812 6.801503 13.483623 0 1 0 0 0
1960-01-02 288.88450 30.528862 1017.26575 601.58310 8.316340 18.027075 0 1 0 0 0
1960-01-03 290.97128 26.504536 1015.83514 582.38540 9.148216 17.995173 0 1 0 0 0
1960-01-04 292.20060 30.678936 1013.92230 555.11860 8.751743 15.806478 0 1 0 0 0
1960-01-05 293.06076 27.790462 1011.94934 394.95440 6.393228 9.160809 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2010-12-27 292.38474 18.858383 1011.69574 315.81320 4.749993 7.846004 4 1 0 0 0
2010-12-28 291.46680 26.357006 1010.66340 232.03355 3.051484 3.346668 4 1 0 0 0
2010-12-29 289.62990 23.169529 1014.53740 557.29913 6.249619 13.007574 4 1 0 0 0
2010-12-30 290.11844 43.158295 1017.36230 745.95575 8.769048 19.371056 4 1 0 0 0
2010-12-31 291.39554 38.372738 1016.40400 697.51697 8.137861 17.666280 4 1 0 0 0

18466 rows × 11 columns

We will now separate the features from the target variable.

X = df_new.drop(labels='PM10', axis=1).values
y = df_new['PM10'].values
X.shape
(18466, 11)
y.shape
(18466,)

Here we split the data into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Feature Scaling

Feature scaling is a vital pre-processing step in machine learning that transforms numerical features to a common scale. Scaling techniques normalize the range and magnitude of the features, reducing the biases and inconsistencies that can arise when variables are measured on very different scales.

The next step is to scale the data. We will use RobustScaler, which scales features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to a quantile range, which defaults to the IQR (interquartile range): the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). The quantile range can be configured.

When applying feature scaling, it’s crucial to fit the scaler on the training data and then use the same scaler to transform the test data. This ensures consistency and prevents data leakage.
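Concretely, with RobustScaler’s default settings (centering on the median and scaling by the interquartile range), each feature \(x\) is transformed independently as

\[
x' = \frac{x - \mathrm{median}(x)}{Q_{3}(x) - Q_{1}(x)},
\]

where the median, \(Q_{1}\) (25th percentile) and \(Q_{3}\) (75th percentile) are computed on the training data.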

scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We will now use Scikit-Learn’s PCA estimator to find the number of principal components we need. Since we are not sure how many principal components this data requires, we will first use Minka’s MLE to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’.

pca = PCA(n_components = 'mle', svd_solver = 'full')

X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
#Print dataset shape before and after PCA
print(f'\nTraining Data shape: {X_train.shape}')
print(f'PCA Training Data shape: {X_train_pca.shape}')
Training Data shape: (14772, 11)
PCA Training Data shape: (14772, 10)

From the Scikit-learn implementation, we can get the information about the explained variance and plot the cumulative variance. The importance of each component is represented by the explained variance ratio, which indicates the portion of the variance that lies along each principal component.
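In symbols, writing \(\lambda_i\) for the variance of the data along the \(i\)-th principal component (the \(i\)-th eigenvalue of the covariance matrix) and \(d\) for the total number of components, the explained variance ratio of component \(i\) is

\[
\frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}.
\]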

explained_variance = pca.explained_variance_ratio_
explained_variance
array([0.37721146, 0.25138801, 0.14908775, 0.12167106, 0.04221808,
       0.01815109, 0.01510624, 0.01231552, 0.00793525, 0.00491554])

By summing the explained variance ratio of the first \(N_{pc}\) components, we obtain the so-called cumulative explained variance.
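With the same notation as above, the cumulative explained variance of the first \(N_{pc}\) components is

\[
\mathrm{CEV}(N_{pc}) = \sum_{i=1}^{N_{pc}} \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}.
\]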

cumulative_var_ratio = np.cumsum(explained_variance)
cumulative_var_ratio
array([0.37721146, 0.62859947, 0.77768722, 0.89935828, 0.94157636,
       0.95972744, 0.97483369, 0.98714921, 0.99508446, 1.        ])

The minimum number of principal components required to preserve 95% of the data’s variance can be computed with the following command.

d = np.argmax(cumulative_var_ratio >= 0.95) + 1
d
np.int64(6)

Let’s plot the cumulative explained variance against the number of principal components.

plt.rcParams["figure.figsize"] = (10, 6)
fig, ax = plt.subplots()

# x-axis: number of components (1 to 10); y-axis: cumulative explained variance
xi = np.arange(1, len(cumulative_var_ratio) + 1, step=1)
plt.xlim([0, 11])
plt.ylim([0.0, 1.1])
plt.plot(xi, cumulative_var_ratio, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(1, len(cumulative_var_ratio) + 1, step=1))
plt.ylabel('Cumulative explained variance')
plt.title('The number of components needed to explain variance')

# 95% variance threshold
plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color='red', fontsize=10)

ax.grid(axis='x')
plt.show()
[Figure: cumulative explained variance versus the number of principal components, with the 95% cut-off threshold marked in red.]

This plot illustrates how the cumulative explained variance increases as more components are added, and it helps determine the number of components needed to capture the majority of the variance in the dataset. Both the plot and the calculation above lead to the same conclusion: to preserve 95% of the data’s variance, we need 6 principal components.
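As a side note, scikit-learn can also select this number automatically: when n_components is set to a float between 0 and 1 (with svd_solver = 'full'), PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch (the pca_95 name is just for illustration):

# Let scikit-learn pick the number of components that preserve 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver='full')
pca_95.fit(X_train)
print(pca_95.n_components_)  # should agree with the value of d computed above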

Now let’s run the PCA estimator with n_components set to 6.

pca = PCA(n_components = 6, svd_solver = 'full')

X_train_pca_6 = pca.fit_transform(X_train)
X_test_pca_6 = pca.transform(X_test)
print(pca.components_)
[[-3.61778391e-01 -3.38369866e-01  4.51569378e-01 -1.77126326e-01
   4.96939306e-01  4.18413802e-01 -1.24063367e-01  2.35693347e-01
  -5.11995608e-02 -3.42324113e-02 -1.50261375e-01]
 [-1.62573966e-01  6.98539654e-01  3.53914136e-01  4.49498676e-01
   2.20612234e-01  1.69030327e-01  1.78128216e-01 -1.60511838e-01
  -7.38533169e-03  1.53053509e-02  1.52591819e-01]
 [ 3.29626953e-01 -2.98494821e-01 -2.85023040e-01  6.97675877e-01
   2.57328897e-01  2.60838223e-01 -3.15504826e-01 -1.39799500e-02
  -1.47564894e-03 -7.00179657e-04  1.61557786e-02]
 [ 1.65111754e-01 -2.63244559e-01 -7.53441771e-02  1.07850312e-01
   1.06558189e-01  1.34670039e-01  9.20119383e-01  8.20988846e-02
  -1.97557125e-02 -9.40257531e-03 -5.29405967e-02]
 [-1.43147624e-01 -3.46012782e-01  5.56860207e-01  4.37957853e-01
  -4.84413002e-01 -2.75072184e-01  1.18798913e-02 -8.92497782e-03
   1.35216504e-01  3.95280255e-02 -1.65819552e-01]
 [ 9.90391435e-02  2.46357985e-01 -2.31771115e-02  1.33495943e-01
  -1.09713471e-01 -1.00560474e-01 -3.16829952e-02  7.63185470e-01
  -4.79819794e-01 -9.09281230e-03 -2.74272863e-01]]
#Print dataset shape before and after PCA
print(f'\nTraining Data shape: {X_train.shape}')
print(f'PCA Training Data shape: {X_train_pca_6.shape}')
Training Data shape: (14772, 11)
PCA Training Data shape: (14772, 6)
# number of components
n_pcs= pca.components_.shape[0]
n_pcs
6

PCA has reduced the 11 original features to 6 principal components, but each component is a linear combination of all the original features, so it is not immediately obvious which original variables drive each component. We can identify the dominant feature of each component by first finding the index of the feature with the largest absolute value in each principal component’s loading vector, using a list comprehension. Then we map these indices to their corresponding feature names from the original dataset. Finally, we create a dictionary that associates each principal component with its most important feature and convert the dictionary to a dataframe.

# get the index of the most important feature on EACH component(largest absolute value)

most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]
#get the corresponding feature names from the original dataset
initial_feature_names = list(X_features.columns.values)
initial_feature_names
['T2',
 'rh2',
 'slp',
 'PBLH',
 'wind_speed_10m',
 'wind_speed_925hPa',
 'RAIN',
 'NE',
 'NW',
 'SE',
 'SW']
# map the indices to their corresponding feature names from the original dataset
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# create a dictionary that associates each principal component with its most important feature 
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(sorted(dic.items()))
df
0 1
0 PC1 wind_speed_10m
1 PC2 rh2
2 PC3 PBLH
3 PC4 RAIN
4 PC5 slp
5 PC6 NE

Now we know which original feature contributes most to each of the 6 principal components. They are

  • wind_speed_10m

  • rh2

  • PBLH

  • RAIN

  • slp

  • NE

Let’s visualize our results.

The PCA-transformed data is clustered using KMeans with 6 clusters.

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K clusters. It works by iteratively assigning each data point to the nearest centroid and then updating the centroids based on the mean of the data points assigned to each cluster.

# Cluster the PCA-transformed data
kmeans = KMeans(n_clusters=6, random_state=42)
cluster_labels = kmeans.fit_predict(X_train_pca_6)

# Create a DataFrame combining PCA components and cluster labels
pca_cluster_df = pd.DataFrame(X_train_pca_6, columns=[f'PC{i+1}' for i in range(6)])
pca_cluster_df['Cluster'] = cluster_labels

# Plot the first two principal components, coloring each point by its KMeans cluster assignment
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_cluster_df, palette='tab10')
plt.title('KMeans Clusters in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Label the legend with the cluster indices
plt.legend(title='Cluster')

plt.show()
[Figure: scatter plot of the first two principal components, with points colored by KMeans cluster.]

Here the PCA components and cluster labels are combined into a DataFrame. A scatter plot is created where each point represents a data sample, and the points are colored based on their cluster assignment.

In the upcoming notebook, we’ll delve into Self-Organizing Maps (SOM) and explore their application on this dataset. We will discover how SOMs are leveraged for clustering and visualization, offering insights into the underlying structures and patterns within the data.