Overview¶
Here, we will use the processed IBTrACS data to select the ERA5 environmental variables associated with each cyclone.
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Intro to NumPy | Necessary | |
| Intro to Pandas | Necessary | |
| Intro to Xarray | Necessary | |
| Project management | Helpful | |
- Time to learn: ~15 minutes
Imports¶
Begin by importing all of the necessary Python packages up front:
import xarray as xr
from dask.distributed import Client
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import glob
from global_land_mask import globe
import cartopy.feature as cfeature
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import patheffects
import numpy as np
import dask
Edit and pad ERA5 data¶
In this section, we will select ERA5 data within a 5x5 degree latitude/longitude box centered on each cyclone center at each time step in our dataset. We will then zero-fill grid cells that occur over land and pad each storm's time series out to a common length.
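As an aside, the `global_land_mask` package from our imports can tell us which cells of such a box fall over land. Here is a minimal sketch (the box coordinates below are hypothetical, chosen near the Florida coast for illustration):

```python
# Sketch: flag which cells of a 5x5 degree box fall over land
# (uses numpy and global_land_mask from the imports cell above)
lats = np.arange(25, 30)                 # hypothetical box: 25N-29N
lons = np.arange(-81, -76)               # 81W-77W
lon2d, lat2d = np.meshgrid(lons, lats)
land_mask = globe.is_land(lat2d, lon2d)  # boolean array, True over land
print(land_mask)
```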
input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc')
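Note that if xarray cannot automatically detect an I/O backend for this file in your environment, `open_dataset` will raise a `ValueError`. In that case you can select one of the installed engines explicitly, for example (assuming the netCDF4 library is installed):

```python
# Explicitly choose a backend if automatic engine detection fails
input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc', engine='netcdf4')
```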
# Calculate the Coriolis parameter f = 2 * Omega * sin(latitude),
# where Omega = 7.29e-5 rad/s is Earth's rotation rate
cor_parms = 2 * 7.29e-5 * np.sin(np.radians(input_dsets['latitude']))
input_dsets['cor_params'] = xr.DataArray(cor_parms,
                                         name='cor_params'
                                         ).broadcast_like(input_dsets['r'])
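Here $f = 2\Omega\sin(\varphi)$, which vanishes at the equator and reaches about $1.03\times10^{-4}\,\mathrm{s^{-1}}$ at 45 degrees. As a quick sanity check, we can evaluate the formula at a few reference latitudes:

```python
# Quick sanity check of f = 2 * Omega * sin(lat):
# expect 0 at the equator and ~1.03e-4 s^-1 at 45 degrees
for lat in [0, 15, 30, 45]:
    f = 2 * 7.29e-5 * np.sin(np.radians(lat))
    print(f"f({lat} deg) = {f:.3e} s^-1")
```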
ib_data_processed_6h = pd.read_csv('../test_folder/ib_data_processed_6h.csv')
final_data = []
# number of 6-hourly time steps in the longest storm record
max_len = ib_data_processed_6h.groupby('id').size().max()
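`max_len` is the number of 6-hourly steps in the longest storm record; shorter storms will be zero-padded out to this length in the loop below. As a toy illustration (not part of the pipeline) of that `reindex`-based padding:

```python
# Toy illustration: zero-pad a short 6-hourly series out to 4 time steps
toy = xr.Dataset({'v': ('time', [1.0, 2.0])},
                 coords={'time': pd.date_range('2000-01-01', periods=2, freq='6h')})
full_time = pd.date_range('2000-01-01', periods=4, freq='6h')
print(toy.reindex(time=full_time, fill_value=0.0)['v'].values)  # [1. 2. 0. 0.]
```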
Edit predictors for each cyclone¶
Here we move on to our second objective: explicitly constructing the predictors that will be used by the machine learning model. We center each cyclone within a 5x5 degree box at each time step, then select the data at each grid cell for each variable of interest, including sea surface temperature, 500 hPa relative humidity, pressure, vertical wind shear, 850 hPa relative vorticity, and the Coriolis parameter.
for id_number, group in ib_data_processed_6h.groupby('id'):
    events_data = []
    for index, row in group.iterrows():
        lat = int(row['LAT'])
        lon = int(row['LON'])
        time = row['datetime']
        # We want data in a 5x5 degree latitude/longitude grid centered
        # on the cyclone latitude/longitude
        latmin = lat - 2
        latmax = lat + 2
        lonmin = lon - 2
        lonmax = lon + 2
        # ERA5 latitudes are ordered north-to-south, so slice from latmax to latmin
        sel_data = input_dsets.sel(latitude=slice(latmax, latmin),
                                   longitude=slice(lonmin, lonmax),
                                   time=time)
        final_xr = sel_data.rename({'latitude': 'y', 'longitude': 'x'})
        final_xr['x'] = np.arange(0, final_xr.sizes['x'])
        final_xr['y'] = np.arange(0, final_xr.sizes['y'])
        # Fill NaN values (e.g., grid cells over land) with zeros
        for jj in final_xr.data_vars:
            final_xr[jj] = final_xr[jj].fillna(0)
        # Recall that we are trying to predict the wind speed,
        # so our target is USA_WIND
        final_xr['target'] = row['USA_WIND']
        events_data.append(final_xr)
    final_event = xr.concat(events_data, dim='time')
    # Pad shorter storms with zeros out to the maximum record length
    if len(final_event.time) <= max_len:
        new_time = pd.date_range(start=final_event['time'].min().values,
                                 periods=max_len, freq='6h')
        padded_data = final_event.reindex(time=new_time, fill_value=0.0)
    else:
        padded_data = final_event
    # Replace absolute times with lead times in hours since the first record
    lead_time = np.arange(0, max_len * 6, 6)
    padded_data['lead'] = ('time', lead_time)
    padded_data = padded_data.assign_coords({'lead': padded_data['lead'].astype(int)})
    # Swap the time and lead dimensions; time remains as a coordinate along lead
    padded_data = padded_data.swap_dims({'time': 'lead'})
    # Tag the storm with its id as a coordinate
    padded_data['id'] = id_number
    padded_data = padded_data.set_coords('id')
    final_data.append(padded_data)
final_input_padded = xr.concat(final_data, dim='id')
final_input_padded
final_input_padded.to_netcdf('../test_folder/input_predictands.nc')
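As an optional sanity check, we can re-open the saved file and confirm it round-trips with the expected dimensions (the exact variable set depends on your ERA5 subset):

```python
# Optional sanity check: re-open the saved predictors and inspect dimensions
check = xr.open_dataset('../test_folder/input_predictands.nc')
print(check.sizes)  # expect id, lead, y, and x dimensions
```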
Summary¶
Here, we selected and edited the ERA5 data associated with the cyclones at each time step in our dataset. This involved gathering data for each variable of interest within a 5x5 degree box centered on each storm. We also needed to mask out (zero-fill) all grid cells corresponding to land, as our AI model will only take grid cells over water into account.
What’s next?¶
We have now officially preprocessed all of our data! Next, we will test each variable of interest to get a sense of how well it can act as a predictor for cyclone intensity. After this, we will begin setting up our AI model!