
ERA5 Data Preprocessing


Overview

Here, we will use the processed IBTrACS data to select the ERA5 environmental variables associated with each cyclone.

Prerequisites

Concepts             Importance   Notes
Intro to NumPy       Necessary
Intro to Pandas      Necessary
Intro to Xarray      Necessary
Project management   Helpful
  • Time to learn: ~15 minutes

Imports

Let's begin by importing all of the packages we will use in this notebook:

import xarray as xr 
from dask.distributed import Client
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import glob
from global_land_mask import globe
import cartopy.feature as cfeature
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import patheffects
import numpy as np
import dask

Edit and pad ERA5 data

In this section, we will select ERA5 data within a 5x5 latitude/longitude grid centered on each cyclone at each time step in our dataset. We will then fill grid cells that fall over land with zeros and pad each storm's time series to a common length.
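As an aside, the global_land_mask package imported above can flag which cells of such a window fall over land. Here is a minimal sketch using placeholder coordinates rather than a real storm position:

import numpy as np
from global_land_mask import globe

# Hypothetical 5x5 window centered near 20N, 60W (placeholder values)
lats = np.arange(18, 23)        # 18N .. 22N
lons = np.arange(-62, -57)      # 62W .. 58W
lon_grid, lat_grid = np.meshgrid(lons, lats)

# globe.is_land expects latitudes in [-90, 90] and longitudes in [-180, 180]
land = globe.is_land(lat_grid, lon_grid)
print(land)  # boolean 5x5 array, True where a cell falls over land

In the processing below we take the simpler route of filling NaN cells (such as sea surface temperature over land) with zeros.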

input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc')
Note: if this call raises a ValueError stating that xarray "did not find a match in any of xarray's currently installed IO backends", no netCDF reader is available in your environment. Install one (e.g., netcdf4 or h5netcdf), or select an installed engine explicitly via the engine parameter.
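Assuming the netcdf4 package is installed in your environment, the engine can be selected explicitly when opening the file:

# Explicitly select an installed netCDF backend instead of letting xarray guess
input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc', engine='netcdf4')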
With the data loaded, we compute the Coriolis parameter, f = 2Ω sin(φ), where Ω ≈ 7.29 × 10⁻⁵ s⁻¹ is Earth's angular velocity and φ is latitude, and broadcast it to the shape of the other three-dimensional predictors. We also read in the processed IBTrACS records and find the longest storm, which sets the length we will pad every storm out to.

# Calculate the Coriolis parameter: f = 2 * Omega * sin(latitude)
cor_parms = 2 * 7.29e-5 * np.sin(np.radians(input_dsets['latitude']))

# Broadcast f to match the shape of the relative humidity field 'r'
input_dsets['cor_params'] = xr.DataArray(cor_parms,
                                         name='cor_params'
                                         ).broadcast_like(input_dsets['r'])

ib_data_processed_6h = pd.read_csv('../test_folder/ib_data_processed_6h.csv')
final_data = []
# Longest storm in the dataset, in number of 6-hourly time steps;
# shorter storms will be zero-padded up to this length
max_len = ib_data_processed_6h.groupby('id').size().max()
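If you want to confirm the broadcast produced the expected shape, a quick check (assuming the 'r' field carries the full set of dimensions):

# cor_params should now share dimensions with the field it was broadcast against
print(input_dsets['cor_params'].dims)
print(input_dsets['r'].dims)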

Edit predictors for each cyclone

Now we can move on to our second objective: to explicitly edit the predictors that will be used by the machine learning model. We wish to center each cyclone within a 5x5 grid at each time step, and then select the data at each grid cell for each variable of interest, including sea surface temperature, 500 hPa relative humidity, pressure, vertical wind shear, 850 hPa relative vorticity, and the Coriolis parameter.
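To make the selection logic concrete before running the full loop, here is what a single window looks like for one hypothetical storm position and time (the coordinates below are placeholders, not from IBTrACS):

# Hypothetical storm center (placeholder values)
lat, lon, time = 20, -60, '2015-09-01T00:00:00'

# ERA5 latitudes run north-to-south, so slice from north (lat + 2) to south (lat - 2)
window = input_dsets.sel(latitude=slice(lat + 2, lat - 2),
                         longitude=slice(lon - 2, lon + 2),
                         time=time)
print(window.sizes)  # on a 1-degree grid this yields a 5x5 window

The full loop below applies the same selection to every storm at every time step: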

for id_number,group in ib_data_processed_6h.groupby('id'):
    events_data = []
    for index,row in group.iterrows():
        lat = int(row['LAT'])
        lon = int(row['LON'])
        time = row['datetime']
        
        #We want data in a 5x5 latitude/longitude grid centered on the cyclone latitude/longitude
        latmin = lat - 2
        latmax = lat + 2
        lonmin = lon - 2
        lonmax = lon + 2
        # ERA5 latitudes are ordered north-to-south, so slice from latmax down to latmin
        sel_data = input_dsets.sel(latitude=slice(latmax, latmin), longitude=slice(lonmin, lonmax), time=time)

        # Replace latitude/longitude with storm-relative integer grid indices
        final_xr = sel_data.rename({'latitude': 'y', 'longitude': 'x'})
        final_xr['x'] = np.arange(0, final_xr.sizes['x'])
        final_xr['y'] = np.arange(0, final_xr.sizes['y'])

        # Fill NaN values (e.g., variables undefined over land) with zeros
        for jj in final_xr.data_vars:
            final_xr[jj] = final_xr[jj].fillna(0)
        
        #Recall that we are trying to predict the wind speed.
        #Hence, our target is USA_WIND
        final_xr['target'] = row['USA_WIND']    
        events_data.append(final_xr)
    
    final_event = xr.concat(events_data,dim='time')
    
    # Pad shorter storms with zeros out to max_len 6-hourly time steps
    if len(final_event.time) <= max_len:
        new_time = pd.date_range(start=final_event['time'].min().values, periods=max_len, freq='6h')
        padded_data = final_event.reindex(time=new_time, fill_value=0.0)
    else:
        padded_data = final_event
    
    # Lead time in hours since the first record: 0, 6, 12, ...
    lead_time = np.arange(0, max_len * 6, 6)
    padded_data['lead'] = ('time', lead_time)
    padded_data = padded_data.assign_coords({'lead': padded_data['lead'].astype(int)})

    # Swap the time dimension for lead; time remains as a non-dimension coordinate
    padded_data = padded_data.swap_dims({'time': 'lead'})
    padded_data['id'] = id_number
    padded_data = padded_data.set_coords('id')

    final_data.append(padded_data)
final_input_padded = xr.concat(final_data, dim='id')
final_input_padded
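The padded dataset should now be organized along the id, lead, y, and x dimensions. To inspect a single storm's predictors (a minimal sketch; isel(id=0) simply grabs the first storm in the dataset):

# Select the first storm by position and look at its zero-padded lead dimension
first_storm = final_input_padded.isel(id=0)
print(first_storm.sizes)
print(first_storm['lead'].values[:5])  # lead times in hours: 0, 6, 12, 18, 24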

final_input_padded.to_netcdf('../test_folder/input_predictands.nc')
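As a quick sanity check (assuming the write above succeeded), the file can be re-opened to confirm it round-trips:

# Re-open the saved predictors and confirm the dimensions survived the round trip
check = xr.open_dataset('../test_folder/input_predictands.nc')
print(check.sizes)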

Summary

Here, we selected and edited the ERA5 data associated with each cyclone at every time step in our dataset. This involved gathering each variable of interest within a 5x5 grid centered on the storm and padding every storm's record to a common length. We also zeroed out all grid cells corresponding to land, as our AI model will only take grid cells over water into account.

What’s next?

We have now officially preprocessed all of our data! Next, we will test each variable of interest to get a sense of how well it can act as a predictor for cyclone intensity. After this, we will begin setting up our AI model!