{ "cells": [ { "cell_type": "markdown", "id": "fab43fd0-8697-4332-a5d6-a08ea87b5409", "metadata": {}, "source": [ "# Principal Component Analysis" ] }, { "cell_type": "markdown", "id": "dbd558c7-e808-49b8-9c0b-1c68c88776bd", "metadata": {}, "source": [ "Principal component analysis (PCA) is a dimensionality reduction method often used on large data sets: it transforms a large set of variables into a smaller set that still contains most of the information in the original. The goal is to reduce the number of variables in a data set while preserving as much information as possible.\n", "\n", "In this cookbook we will implement PCA to reduce the number of variables in our data set while preserving 95% of the data’s variance.\n" ] }, { "cell_type": "code", "execution_count": 44, "id": "deb4403b-0ff4-4657-b629-22181ae5fc5d", "metadata": {}, "outputs": [], "source": [ "# load the required packages\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import sklearn\n", "from sklearn.preprocessing import scale\n", "from sklearn import preprocessing \n", "from sklearn.decomposition import PCA\n", "from sklearn.cluster import KMeans\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import RobustScaler\n", "from sklearn.preprocessing import LabelEncoder\n", "import warnings\n", "warnings.filterwarnings('ignore')\n" ] }, { "cell_type": "markdown", "id": "e68f9250-7e8a-4bf5-bbd4-db0ea12104db", "metadata": {}, "source": [ "The first step is to read the data.\n", "Please refer to the Random Forest Regression Cookbook for a detailed explanation of the feature engineering. Here we read the data set with the new variables obtained from feature engineering."
] }, { "cell_type": "code", "execution_count": 2, "id": "247fd82f-d239-4f03-b316-dacc58bdbb41", "metadata": {}, "outputs": [], "source": [ "\n", "dust_df = pd.read_csv('../saharan_dust_met_cat_vars.csv', index_col='time')\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "bda74152-fd6a-4915-8ea2-14537cc8baeb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| time | PM10 | T2 | rh2 | slp | PBLH | wind_speed_10m | wind_speed_925hPa | WIND_DIR | RAIN |
|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 2000.1490 | 288.24875 | 32.923786 | 1018.89420 | 484.91812 | 6.801503 | 13.483623 | NE | 0 |
| 1960-01-02 | 4686.5370 | 288.88450 | 30.528862 | 1017.26575 | 601.58310 | 8.316340 | 18.027075 | NE | 0 |
| 1960-01-03 | 5847.7515 | 290.97128 | 26.504536 | 1015.83514 | 582.38540 | 9.148216 | 17.995173 | NE | 0 |
| 1960-01-04 | 5252.0586 | 292.20060 | 30.678936 | 1013.92230 | 555.11860 | 8.751743 | 15.806478 | NE | 0 |
| 1960-01-05 | 3379.3190 | 293.06076 | 27.790462 | 1011.94934 | 394.95440 | 6.393228 | 9.160809 | NE | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2010-12-27 | 2681.4685 | 292.38474 | 18.858383 | 1011.69574 | 315.81320 | 4.749993 | 7.846004 | NE | 4 |
| 2010-12-28 | 1345.8488 | 291.46680 | 26.357006 | 1010.66340 | 232.03355 | 3.051484 | 3.346668 | NE | 4 |
| 2010-12-29 | 4500.9810 | 289.62990 | 23.169529 | 1014.53740 | 557.29913 | 6.249619 | 13.007574 | NE | 4 |
| 2010-12-30 | 5150.3840 | 290.11844 | 43.158295 | 1017.36230 | 745.95575 | 8.769048 | 19.371056 | NE | 4 |
| 2010-12-31 | 4753.6760 | 291.39554 | 38.372738 | 1016.40400 | 697.51697 | 8.137861 | 17.666280 | NE | 4 |

18466 rows × 9 columns
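
The following tables show `WIND_DIR` split into the indicator columns `NE`, `NW`, `SE`, `SW` and joined back onto the numeric variables. The cell that does this is not visible here, so the snippet below is only a minimal sketch of one common way to get that result, assuming `pd.get_dummies`. The `toy_df` frame and its values are made up for illustration and are not taken from the dust data set.

```python
import pandas as pd

# Hypothetical stand-in for dust_df: one numeric and one categorical column.
toy_df = pd.DataFrame(
    {"PM10": [2000.1, 4686.5, 1345.8], "WIND_DIR": ["NE", "SW", "NE"]},
    index=pd.Index(["1960-01-01", "1960-01-02", "1960-01-03"], name="time"),
)

# Build one 0/1 indicator column per wind-direction category.
wind_dummies = pd.get_dummies(toy_df["WIND_DIR"], dtype=int)

# Swap the categorical column for its indicator columns.
encoded_df = pd.concat([toy_df.drop(columns="WIND_DIR"), wind_dummies], axis=1)
print(encoded_df)
```

PCA works on numeric input only, which is why a text column such as `WIND_DIR` has to be encoded before it can be included.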

| time | NE | NW | SE | SW |
|---|---|---|---|---|
| 1960-01-01 | 1 | 0 | 0 | 0 |
| 1960-01-02 | 1 | 0 | 0 | 0 |
| 1960-01-03 | 1 | 0 | 0 | 0 |
| 1960-01-04 | 1 | 0 | 0 | 0 |
| 1960-01-05 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 2010-12-27 | 1 | 0 | 0 | 0 |
| 2010-12-28 | 1 | 0 | 0 | 0 |
| 2010-12-29 | 1 | 0 | 0 | 0 |
| 2010-12-30 | 1 | 0 | 0 | 0 |
| 2010-12-31 | 1 | 0 | 0 | 0 |

18466 rows × 4 columns

| time | PM10 | T2 | rh2 | slp | PBLH | wind_speed_10m | wind_speed_925hPa | RAIN | NE | NW | SE | SW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 2000.1490 | 288.24875 | 32.923786 | 1018.89420 | 484.91812 | 6.801503 | 13.483623 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-02 | 4686.5370 | 288.88450 | 30.528862 | 1017.26575 | 601.58310 | 8.316340 | 18.027075 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-03 | 5847.7515 | 290.97128 | 26.504536 | 1015.83514 | 582.38540 | 9.148216 | 17.995173 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-04 | 5252.0586 | 292.20060 | 30.678936 | 1013.92230 | 555.11860 | 8.751743 | 15.806478 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-05 | 3379.3190 | 293.06076 | 27.790462 | 1011.94934 | 394.95440 | 6.393228 | 9.160809 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2010-12-27 | 2681.4685 | 292.38474 | 18.858383 | 1011.69574 | 315.81320 | 4.749993 | 7.846004 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-28 | 1345.8488 | 291.46680 | 26.357006 | 1010.66340 | 232.03355 | 3.051484 | 3.346668 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-29 | 4500.9810 | 289.62990 | 23.169529 | 1014.53740 | 557.29913 | 6.249619 | 13.007574 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-30 | 5150.3840 | 290.11844 | 43.158295 | 1017.36230 | 745.95575 | 8.769048 | 19.371056 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-31 | 4753.6760 | 291.39554 | 38.372738 | 1016.40400 | 697.51697 | 8.137861 | 17.666280 | 4 | 1 | 0 | 0 | 0 |

18466 rows × 12 columns

| time | T2 | rh2 | slp | PBLH | wind_speed_10m | wind_speed_925hPa | RAIN | NE | NW | SE | SW |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 288.24875 | 32.923786 | 1018.89420 | 484.91812 | 6.801503 | 13.483623 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-02 | 288.88450 | 30.528862 | 1017.26575 | 601.58310 | 8.316340 | 18.027075 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-03 | 290.97128 | 26.504536 | 1015.83514 | 582.38540 | 9.148216 | 17.995173 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-04 | 292.20060 | 30.678936 | 1013.92230 | 555.11860 | 8.751743 | 15.806478 | 0 | 1 | 0 | 0 | 0 |
| 1960-01-05 | 293.06076 | 27.790462 | 1011.94934 | 394.95440 | 6.393228 | 9.160809 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2010-12-27 | 292.38474 | 18.858383 | 1011.69574 | 315.81320 | 4.749993 | 7.846004 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-28 | 291.46680 | 26.357006 | 1010.66340 | 232.03355 | 3.051484 | 3.346668 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-29 | 289.62990 | 23.169529 | 1014.53740 | 557.29913 | 6.249619 | 13.007574 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-30 | 290.11844 | 43.158295 | 1017.36230 | 745.95575 | 8.769048 | 19.371056 | 4 | 1 | 0 | 0 | 0 |
| 2010-12-31 | 291.39554 | 38.372738 | 1016.40400 | 697.51697 | 8.137861 | 17.666280 | 4 | 1 | 0 | 0 | 0 |

18466 rows × 11 columns
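
With the target `PM10` removed, the 11 remaining columns are the inputs to PCA. The snippet below is a minimal sketch of how the 95% variance goal can be expressed with scikit-learn, under assumptions of ours: the features are standardised with `scale`, a fractional `n_components=0.95` is passed to `PCA`, and each component is labelled with the feature that has the largest absolute loading. The synthetic frame `X` and its column names are hypothetical placeholders, not the dust data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))

# Hypothetical feature table: column "b" nearly duplicates column "a",
# so one dimension carries almost no extra variance.
X = pd.DataFrame({
    "a": base[:, 0],
    "b": base[:, 0] + 0.05 * rng.normal(size=200),
    "c": base[:, 1],
    "d": base[:, 2],
})

# Standardise so that no variable dominates purely because of its units.
X_scaled = scale(X)

# A float in (0, 1) asks PCA for the fewest components whose cumulative
# explained variance ratio reaches that fraction.
pca = PCA(n_components=0.95)
scores = pca.fit_transform(X_scaled)

# For each component, pick the feature with the largest absolute loading.
top_idx = np.abs(pca.components_).argmax(axis=1)
summary = pd.DataFrame({
    0: [f"PC{i + 1}" for i in range(pca.n_components_)],
    1: X.columns[top_idx],
})

print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
print(summary)
```

Standardising first matters for data like this, where pressure is around 1000 and the indicator columns are 0 or 1: without it, the leading components would mostly reflect the variables with the largest raw variance.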

|  | 0 | 1 |
|---|---|---|
| 0 | PC1 | wind_speed_10m |
| 1 | PC2 | rh2 |
| 2 | PC3 | PBLH |
| 3 | PC4 | RAIN |
| 4 | PC5 | slp |
| 5 | PC6 | NE |