
ERA5 Data Preprocessing


Overview

Here, we will use the processed IBTrACS data to select the ERA5 environmental variables associated with each cyclone.

Prerequisites

Concepts             Importance   Notes
Intro to NumPy       Necessary
Intro to Pandas      Necessary
Intro to Xarray      Necessary
Project management   Helpful
  • Time to learn: ~15 minutes

Imports

Let's begin by importing all of the packages we will use in this notebook:

import xarray as xr 
from dask.distributed import Client
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import glob
from global_land_mask import globe
import cartopy.feature as cfeature
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import patheffects
import numpy as np
import dask

Edit and pad ERA5 data

In this section, we will select ERA5 data within a 5x5 latitude/longitude grid centered on each cyclone at each time step in our dataset. We will then fill grid cells that fall over land with zeros and pad each storm's time series to a common length.
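As an aside, the global_land_mask package imported above can flag which cells of such a window fall over land. Here is a minimal sketch using placeholder coordinates rather than a real storm position:

import numpy as np
from global_land_mask import globe

# Hypothetical 5x5 window centered near 20N, 60W (placeholder values)
lats = np.arange(18, 23)        # 18N .. 22N
lons = np.arange(-62, -57)      # 62W .. 58W
lon_grid, lat_grid = np.meshgrid(lons, lats)

# globe.is_land expects latitudes in [-90, 90] and longitudes in [-180, 180]
land = globe.is_land(lat_grid, lon_grid)
print(land)  # boolean 5x5 array, True where a cell falls over land

In the processing below we take the simpler route of filling NaN cells (such as sea surface temperature over land) with zeros.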

input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc')
Note: if this call raises a ValueError stating that xarray "did not find a match in any of xarray's currently installed IO backends", no netCDF reader is available in your environment. Install one (e.g., netcdf4 or h5netcdf), or select an installed engine explicitly via the engine parameter.
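Assuming the netcdf4 package is installed in your environment, the engine can be selected explicitly when opening the file:

# Explicitly select an installed netCDF backend instead of letting xarray guess
input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc', engine='netcdf4')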
With the data loaded, we compute the Coriolis parameter, f = 2Ω sin(φ), where Ω ≈ 7.29 × 10⁻⁵ s⁻¹ is Earth's angular velocity and φ is latitude, and broadcast it to the shape of the other three-dimensional predictors. We also read in the processed IBTrACS records and find the longest storm, which sets the length we will pad every storm out to.

# Calculate the Coriolis parameter: f = 2 * Omega * sin(latitude)
cor_parms = 2 * 7.29e-5 * np.sin(np.radians(input_dsets['latitude']))

# Broadcast f to match the shape of the relative humidity field 'r'
input_dsets['cor_params'] = xr.DataArray(cor_parms,
                                         name='cor_params'
                                         ).broadcast_like(input_dsets['r'])

ib_data_processed_6h = pd.read_csv('../test_folder/ib_data_processed_6h.csv')
final_data = []
# Longest storm in the dataset, in number of 6-hourly time steps;
# shorter storms will be zero-padded up to this length
max_len = ib_data_processed_6h.groupby('id').size().max()
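If you want to confirm the broadcast produced the expected shape, a quick check (assuming the 'r' field carries the full set of dimensions):

# cor_params should now share dimensions with the field it was broadcast against
print(input_dsets['cor_params'].dims)
print(input_dsets['r'].dims)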

Edit predictors for each cyclone

Now we can move on to our second objective: to explicitly edit the predictors that will be used by the machine learning model. We wish to center each cyclone within a 5x5 grid at each time step, and then select the data at each grid cell for each variable of interest, including sea surface temperature, 500 hPa relative humidity, pressure, vertical wind shear, 850 hPa relative vorticity, and the Coriolis parameter.
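To make the selection logic concrete before running the full loop, here is what a single window looks like for one hypothetical storm position and time (the coordinates below are placeholders, not from IBTrACS):

# Hypothetical storm center (placeholder values)
lat, lon, time = 20, -60, '2015-09-01T00:00:00'

# ERA5 latitudes run north-to-south, so slice from north (lat + 2) to south (lat - 2)
window = input_dsets.sel(latitude=slice(lat + 2, lat - 2),
                         longitude=slice(lon - 2, lon + 2),
                         time=time)
print(window.sizes)  # on a 1-degree grid this yields a 5x5 window

The full loop below applies the same selection to every storm at every time step: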

for id_number,group in ib_data_processed_6h.groupby('id'):
    events_data = []
    for index,row in group.iterrows():
        lat = int(row['LAT'])
        lon = int(row['LON'])
        time = row['datetime']
        
        #We want data in a 5x5 latitude/longitude grid centered on the cyclone latitude/longitude
        latmin = lat - 2
        latmax = lat + 2
        lonmin = lon - 2
        lonmax = lon + 2
        # ERA5 latitudes are ordered north-to-south, so slice from latmax down to latmin
        sel_data = input_dsets.sel(latitude=slice(latmax, latmin), longitude=slice(lonmin, lonmax), time=time)

        # Replace latitude/longitude with storm-relative integer grid indices
        final_xr = sel_data.rename({'latitude': 'y', 'longitude': 'x'})
        final_xr['x'] = np.arange(0, final_xr.sizes['x'])
        final_xr['y'] = np.arange(0, final_xr.sizes['y'])

        # Fill NaN values (e.g., variables undefined over land) with zeros
        for jj in final_xr.data_vars:
            final_xr[jj] = final_xr[jj].fillna(0)
        
        #Recall that we are trying to predict the wind speed.
        #Hence, our target is USA_WIND
        final_xr['target'] = row['USA_WIND']    
        events_data.append(final_xr)
    
    final_event = xr.concat(events_data,dim='time')
    
    # Pad shorter storms with zeros out to max_len 6-hourly time steps
    if len(final_event.time) <= max_len:
        new_time = pd.date_range(start=final_event['time'].min().values, periods=max_len, freq='6h')
        padded_data = final_event.reindex(time=new_time, fill_value=0.0)
    else:
        padded_data = final_event
    
    # Lead time in hours since the first record: 0, 6, 12, ...
    lead_time = np.arange(0, max_len * 6, 6)
    padded_data['lead'] = ('time', lead_time)
    padded_data = padded_data.assign_coords({'lead': padded_data['lead'].astype(int)})

    # Swap the time dimension for lead; time remains as a non-dimension coordinate
    padded_data = padded_data.swap_dims({'time': 'lead'})
    padded_data['id'] = id_number
    padded_data = padded_data.set_coords('id')

    final_data.append(padded_data)
final_input_padded = xr.concat(final_data, dim='id')
final_input_padded
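The padded dataset should now be organized along the id, lead, y, and x dimensions. To inspect a single storm's predictors (a minimal sketch; isel(id=0) simply grabs the first storm in the dataset):

# Select the first storm by position and look at its zero-padded lead dimension
first_storm = final_input_padded.isel(id=0)
print(first_storm.sizes)
print(first_storm['lead'].values[:5])  # lead times in hours: 0, 6, 12, 18, 24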

final_input_padded.to_netcdf('../test_folder/input_predictands.nc')
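As a quick sanity check (assuming the write above succeeded), the file can be re-opened to confirm it round-trips:

# Re-open the saved predictors and confirm the dimensions survived the round trip
check = xr.open_dataset('../test_folder/input_predictands.nc')
print(check.sizes)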

Summary

Here, we selected and edited the ERA5 data associated with each cyclone at every time step in our dataset. This involved gathering each variable of interest within a 5x5 grid centered on the storm and padding every storm's record to a common length. We also zeroed out all grid cells corresponding to land, as our AI model will only take grid cells over water into account.

What’s next?

We have now officially preprocessed all of our data! Next, we will test each variable of interest to get a sense of how well it can act as a predictor for cyclone intensity. After this, we will begin setting up our AI model!