
Overview¶
The first step of any machine learning workflow is to load, filter, and process the necessary data. Here, we will load the hurricane track data from IBTrACS and then filter the tracks by requirements that we set.
Prerequisites¶
| Concepts | Importance | Notes |
|---|---|---|
| Intro to NumPy | Necessary | |
| Intro to Pandas | Necessary | |
| Intro to Xarray | Necessary | |
| Project management | Helpful | |
Time to learn: 45 minutes
System requirements: the global_land_mask package is required for the land-masking step; all other dependencies are covered by the concepts table above and the Imports section below.
Imports¶
The following code cell contains all necessary Python imports up front:
import xarray as xr
from dask.distributed import Client
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import glob
from global_land_mask import globe
import cartopy.feature as cfeature
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import patheffects
Read in IBTrACS data¶
First, set the path to the IBTrACS CSV file:
ib_data = '../test_folder/ibtracs.NA.list.v04r00.csv'
Method to preprocess the track data¶
The following method sets custom ranges for the time period and spatial domain of the analysis, keeps only persistent North Atlantic storms, removes extratropical track segments, and truncates each track at landfall.
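Before defining the method, it helps to know one quirk of the IBTrACS CSV layout: the first data row holds the units for each column, not an observation, so it has to be split off before analysis. A minimal sketch with a hypothetical two-point excerpt:

```python
from io import StringIO
import pandas as pd

# Hypothetical excerpt mimicking the IBTrACS layout:
# the first data row holds units, not observations.
csv_text = (
    "SID,ISO_TIME,LAT,LON\n"
    " , ,degrees_north,degrees_east\n"
    "2000123N10300,2000-05-02 00:00:00,10.0,-40.0\n"
    "2000123N10300,2000-05-02 06:00:00,10.5,-41.0\n"
)
raw = pd.read_csv(StringIO(csv_text), keep_default_na=False)
units = raw.iloc[0, :]    # the units row
tracks = raw.iloc[1:, :]  # the actual observations
print(units['LAT'])       # degrees_north
print(len(tracks))        # 2
```

Note that because the units row forces every column to be read as strings, numeric columns such as `LAT` and `LON` must be explicitly converted to floats afterward.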
def process_ibrack(ib_loc, periods=[2000, 2005]):
    # Read in the IBTrACS data; low_memory=False silences the mixed-type warning
    read_ib_data = pd.read_csv(ib_loc, keep_default_na=False, low_memory=False)
    # The first row holds the units for each column
    units = read_ib_data.iloc[0, :]
    # The remaining rows hold the actual track data
    ib_original_dft = read_ib_data.iloc[1:, :].copy()
    # Convert coordinates to floats so they can be compared numerically
    ib_original_dft['LAT'] = ib_original_dft['LAT'].astype(float)
    ib_original_dft['LON'] = ib_original_dft['LON'].astype(float)
    # Set a custom date and time range based on the user's choosing
    ib_original_dft['datetime'] = pd.to_datetime(ib_original_dft['ISO_TIME'], format='%Y-%m-%d %H:%M:%S')
    year_mask = (ib_original_dft['datetime'] > f'{periods[0]}-1-1') & (ib_original_dft['datetime'] <= f'{periods[1]}-11-30')
    # Only use cyclones over the North Atlantic basin; combining the masks in a
    # single selection avoids the reindexing warning from chained indexing
    ib_new_period = ib_original_dft[year_mask & (ib_original_dft['BASIN'] == 'NA')]
    # Only keep cyclones whose first recorded point is east of 55W
    # This can be changed to include more cyclones outside of the Northeast Atlantic
    def only_na_basin(df):
        lon_wise = df.sort_values(by='datetime')
        return lon_wise['LON'].iloc[0] > -55
    # filter (rather than apply) keeps the SID column intact across pandas versions
    only_neatlantic = ib_new_period.groupby('SID').filter(only_na_basin)
    # Get the number of time steps in each cyclone, or event
    counts = only_neatlantic.groupby('SID').size()
    # Keep cyclones that last more than 12 time steps
    counts_12 = counts[counts > 12].index
    persist_storms = ib_new_period[ib_new_period['SID'].isin(counts_12)].copy()
    persist_storms['month'] = persist_storms['datetime'].dt.month
    # Truncate each track at its first landfall point
    def mask_lands(df):
        ordered_df = df.sort_values(by='datetime')
        ocean_mask = pd.Series(globe.is_ocean(lat=ordered_df['LAT'].values,
                                              lon=ordered_df['LON'].values))
        if ocean_mask.all():
            # Track never touches land: keep everything
            return ordered_df
        # Keep only the points before the first land point
        return ordered_df.iloc[:ocean_mask.idxmin(), :]
    # Remove the extratropical parts of the storm tracks (poleward of 35N);
    # this is a plain row filter, so no groupby is needed
    exclude_et = persist_storms[persist_storms['LAT'] <= 35]
    # Looping over the groups preserves the SID column in every pandas version
    final_dft = pd.concat([mask_lands(df) for _, df in exclude_et.groupby('SID')]).reset_index(drop=True)
    return final_dft
Process IBTrACS data using the above methods¶
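Before running the full dataset through `process_ibrack`, the time and basin masks it applies can be sanity-checked on a tiny hypothetical frame (the SIDs and dates below are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-frame to sanity-check the masks used in process_ibrack
df = pd.DataFrame({
    'SID': ['A', 'A', 'B'],
    'BASIN': ['NA', 'NA', 'EP'],
    'datetime': pd.to_datetime(['2001-08-01', '2006-08-01', '2001-08-01']),
})
year_mask = (df['datetime'] > '2000-1-1') & (df['datetime'] <= '2005-11-30')
basin_mask = df['BASIN'] == 'NA'
kept = df[year_mask & basin_mask]
print(kept['SID'].tolist())  # ['A'] — only the 2001 North Atlantic point survives
```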
ib_data_processed = process_ibrack(ib_data, periods=[2000, 2005])
ib_data_processed['LAT'] = ib_data_processed['LAT'].astype(float)
ib_data_processed['LON'] = ib_data_processed['LON'].astype(float)
# USA_WIND can contain blank entries, so coerce them to NaN rather than erroring
ib_data_processed['USA_WIND'] = pd.to_numeric(ib_data_processed['USA_WIND'], errors='coerce')
ib_data_processed['datetime'] = pd.to_datetime(ib_data_processed['datetime'], format='%Y-%m-%d %H:%M:%S')
ib_data_processed['SID'] = ib_data_processed['SID'].astype(str)
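The `mask_lands` helper above truncates a track at its first land point. The same idea can be sketched with a plain boolean mask standing in for `global_land_mask.globe.is_ocean` output (the flags and latitudes below are hypothetical):

```python
import pandas as pd

# Hypothetical ocean/land flags along one track (True = over ocean)
ocean_mask = pd.Series([True, True, False, True])
track = pd.DataFrame({'LAT': [12.0, 13.0, 14.0, 15.0]})

if ocean_mask.all():
    truncated = track  # never touches land: keep everything
else:
    first_land = ocean_mask.idxmin()     # index of the first False
    truncated = track.iloc[:first_land]  # drop the landfall point onward

print(len(truncated))  # 2 — only the points before landfall remain
```

`idxmin` works here because `False < True`, so it returns the index of the first `False` in the mask.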
Organize the edited IBTrACS file¶
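The cell below regularizes the tracks onto 6-hourly time steps by flooring each timestamp to its 6-hour bin before averaging. Flooring behaves like this:

```python
import pandas as pd

# Timestamps snap down to the start of their 6-hour bin
times = pd.to_datetime(['2000-09-01 04:30:00', '2000-09-01 07:00:00'])
floored = times.floor('6h')
print(floored[0])  # 2000-09-01 00:00:00
print(floored[1])  # 2000-09-01 06:00:00
```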
# Assign each storm an integer id derived from its SID
ib_data_processed['id'] = ib_data_processed['SID'].astype('category').cat.codes
req_cols = ['datetime', 'LAT', 'LON', 'USA_WIND', 'id']
# Floor timestamps to 6-hour bins
ib_data_processed['datetime'] = ib_data_processed['datetime'].dt.floor('6h')
# Average within each storm and 6-hour window; grouping by id as well keeps
# concurrent storms from being blended together
ib_data_processed_6h = ib_data_processed[req_cols].groupby(['id', 'datetime']).mean().reset_index()
# Copy the SID back in based on the id
id_to_sid = ib_data_processed.drop_duplicates('id').set_index('id')['SID']
ib_data_processed_6h['SID'] = ib_data_processed_6h['id'].map(id_to_sid)
See the organized and edited IBTrACS file¶
ib_data_processed_6h
Plot the track data we have edited and organized¶
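The plotting method below draws each track as a matplotlib `Path`: one `MOVETO` code for the first point, then a `LINETO` for every remaining point. The construction in isolation, with hypothetical coordinates:

```python
from matplotlib.path import Path

# A track with four points becomes one MOVETO followed by three LINETOs
vertices = [(-40.0, 12.0), (-42.0, 13.0), (-44.0, 14.5), (-47.0, 16.0)]
codes = [Path.MOVETO] + [Path.LINETO] * (len(vertices) - 1)
path = Path(vertices, codes)
print(len(path.vertices))  # 4
```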
def plot_tracks(filtered_ib):
    # Work on a copy so the caller's frame is not modified
    filtered_ib = filtered_ib.copy()
    filtered_ib['datetime'] = pd.to_datetime(filtered_ib['datetime'])
    events = filtered_ib.groupby('id')
    fig, ax = plt.subplots(figsize=(20, 10), subplot_kw={"projection": ccrs.PlateCarree()})
    ax.add_feature(cfeature.COASTLINE)
    ax.add_feature(cfeature.BORDERS)
    ax.add_feature(cfeature.LAND)
    ax.add_feature(cfeature.OCEAN)
    ax.gridlines(draw_labels=True, dms=True, x_inline=False, y_inline=False)
    for event_num, event in events:
        lon = event['LON'].values
        lat = event['LAT'].values
        vertices = list(zip(lon, lat))
        # One MOVETO for the first point, then a LINETO for each remaining point
        codes = [Path.MOVETO] + [Path.LINETO] * (len(event) - 1)
        path = Path(vertices, codes)
        patch = patches.PathPatch(path, lw=1, fc='none',
                                  path_effects=[patheffects.withStroke(linewidth=2.5, foreground="black")],
                                  zorder=5)
        ax.add_patch(patch)
    ax.set_xlim(-100, 0)
    ax.set_ylim(5, 35)
    return fig, ax
plot_tracks(ib_data_processed_6h)
ib_data_processed_6h.to_csv('../test_folder/ib_data_processed_6h.csv')
Summary¶
Here, we filtered the cyclone track data to our spatial and temporal domain of interest and organized it into regular 6-hourly records. We also wrote an output file that will be used in the eventual creation and training of our machine learning model.
What’s next?¶
Next, we will process the ERA5 data. In particular, we will take our variables of interest and organize them for ingestion into the machine learning model.