Using PelicanFS via FSSpec to Access Data on the OSDF


Overview

Now that you’ve learned about the OSDF and the Pelican command line client, you may be wondering how to access that data from within a notebook using Python.

You can do this using PelicanFS, an FSSpec implementation of the Pelican client.

This notebook will contain:

  1. A brief explanation of FSSpec and PelicanFS

  2. A real-world example using FSSpec, Pelican, Xarray, and Zarr

  3. Other common access patterns

  4. FAQs

Prerequisites

To better understand this notebook, please familiarize yourself with the following concepts:

| Concepts | Importance | Notes |
| --- | --- | --- |
| Intro to OSDF | Necessary | |
| Understanding of Xarray | Helpful | To better understand the example workflow |
| Overview of FSSpec | Helpful | To better understand the FSSpec library |

  • Time to learn: 20-30 minutes


Imports

import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import metpy.calc as mpcalc
from metpy.units import units
import fsspec
import intake

What are PelicanFS and FSSpec?

First, let’s understand PelicanFS and how it integrates with FSSpec.

FSSpec

Filesystem Spec (fsspec) is a Python library that aims to provide a unified interface to many different storage backends, including POSIX file systems, HTTPS, and S3. It is used by data processing libraries such as Xarray, pandas, and Intake.

To learn more about FSSpec, visit its information page.

Schemes

FSSpec figures out how to interact with data from different storage backends through the scheme in the data path. For example, FSSpec knows to use its “Hypertext Transfer Protocol” interface whenever it sees URLs with the https: scheme. This lets users interact with data from a variety of storage technologies without needing to know how those technologies work under the hood.
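As a rough illustration of what "the scheme in the data path" means, the sketch below uses only Python's standard library to split a few paths into a scheme and a remainder. FSSpec does its own parsing internally, so this is not its implementation; the helper name describe_path and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def describe_path(path: str) -> tuple[str, str]:
    """Return (scheme, remainder) for a data path; hypothetical helper."""
    parts = urlsplit(path)
    # A path with no scheme is treated here as a local file, following the
    # common convention that bare paths refer to the local file system.
    scheme = parts.scheme or "file"
    remainder = parts.netloc + parts.path
    return scheme, remainder


for example in (
    "https://example.org/data/file.csv",  # made-up URL
    "s3://example-bucket/data/file.csv",  # made-up bucket
    "/tmp/data/file.csv",                 # plain local path
):
    print(describe_path(example))
```

A library that dispatches on the scheme only needs the first element of that pair to decide which backend should handle the rest of the path.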

PelicanFS

PelicanFS is an implementation of FSSpec that introduces two new schemes to FSSpec: pelican and osdf. PelicanFS enables you to use the pelican:// scheme to access data via Pelican Federations like the OSDF in any software that already understands FSSpec. To use it, you must specify the federation host name. A Pelican path looks like:

pelican://<federation-host-name>/<namespace-path>

The osdf scheme is a specific instance of the pelican scheme that knows how to access the OSDF. A path using the osdf scheme should not provide the federation root. An OSDF path looks like:

osdf:///<namespace-path>

PelicanFS teaches FSSpec how to interact with the Pelican protocol using the above pelican:-schemed or osdf:-schemed URLs.
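The two URL shapes above can be sketched with a pair of small string helpers. These helpers are hypothetical and not part of PelicanFS, and federation.example.org is a placeholder host rather than a real federation; the point is only that a pelican URL carries the federation host while an osdf URL leaves it empty.

```python
def pelican_url(federation_host: str, namespace_path: str) -> str:
    """Build a pelican:// URL; hypothetical helper, not part of PelicanFS."""
    return f"pelican://{federation_host}/{namespace_path.lstrip('/')}"


def osdf_url(namespace_path: str) -> str:
    """Build an osdf:/// URL; the federation is implied by the scheme."""
    return f"osdf:///{namespace_path.lstrip('/')}"


# "federation.example.org" is a placeholder host, not a real federation.
print(pelican_url("federation.example.org", "/my-namespace/data/file.csv"))
print(osdf_url("/aws-opendata/us-west-1/hrrrzarr"))
```

Note the three slashes in the osdf form: the host portion between the second and third slash is intentionally empty.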

If you’d like to understand more about how Pelican works, check out the documentation here.

Putting it all together

What does this mean in practice?

If you want to access data from the OSDF using FSSpec, or any library that uses FSSpec, build the osdf-schemed URL for the data and use it as your data path. FSSpec and PelicanFS will then do the work of resolving it behind the scenes.


A PelicanFS Example using Real Data

The following example shows how PelicanFS works on real-world data, using FSSpec and Xarray to access Zarr data from AWS.

This portion of the notebook is based on the Project Pythia HRRR AWS Cookbook.

Setting the Proper Path

The data for this tutorial is part of AWS Open Data, hosted in the us-west-1 region. The OSDF provides access to that region using the /aws-opendata/us-west-1 namespace.

Let’s first create a path which uses the osdf scheme.

# Set the date, hour, variable, and level for the HRRR data
date = '20211016'
hour = '21'
var = 'TMP'
level = '2m_above_ground'

# Construct object paths for the Zarr datasets using the osdf scheme
namespace_object1 = f'osdf:///aws-opendata/us-west-1/hrrrzarr/sfc/{date}/{date}_{hour}z_anl.zarr/{level}/{var}/{level}/'
namespace_object2 = f'osdf:///aws-opendata/us-west-1/hrrrzarr/sfc/{date}/{date}_{hour}z_anl.zarr/{level}/{var}/'
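If you plan to load several variables or analysis times, the same path pattern can be wrapped in a small function. The sketch below is a hypothetical helper (hrrr_zarr_paths is not part of any library) that simply reproduces the two f-strings from the cell above for arbitrary inputs.

```python
def hrrr_zarr_paths(date: str, hour: str, var: str, level: str) -> list[str]:
    """Return the two OSDF paths used for one HRRR surface variable."""
    base = (
        "osdf:///aws-opendata/us-west-1/hrrrzarr/sfc/"
        f"{date}/{date}_{hour}z_anl.zarr/{level}/{var}/"
    )
    # Same two paths as in the cell above: the nested group first,
    # then its parent group.
    return [f"{base}{level}/", base]


for path in hrrr_zarr_paths("20211016", "21", "TMP", "2m_above_ground"):
    print(path)
```

Printing the expanded paths before opening them is a quick way to catch a typo in the date, variable, or level.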

Using FSSpec to access the data

Now we can access the data using Xarray as usual. The two objects will be accessed using fsspec’s get_mapper function, which knows to use PelicanFS because we created the paths using the osdf scheme.

# Get mappers for the Zarr datasets

object1 = fsspec.get_mapper(namespace_object1)
object2 = fsspec.get_mapper(namespace_object2)

# Open the datasets
ds = xr.open_mfdataset([object1, object2], engine='zarr', decode_timedelta=True)

# Display the dataset
ds

Continue the workflow

As you can see, Xarray streamed the data correctly into the datasets. To prove the workflow works, the next cell continues the computation and generates two plots. This tutorial will not go in depth as to what this code is accomplishing.

If you’d like to know more about the following workflow, please refer to the Project Pythia HRRR AWS Cookbook.

# Define coordinates for projection
lon1 = -97.5
lat1 = 38.5
slat = 38.5

# Define the Lambert Conformal projection
projData = ccrs.LambertConformal(
    central_longitude=lon1,
    central_latitude=lat1,
    standard_parallels=[slat, slat],
    globe=ccrs.Globe(
        semimajor_axis=6371229,
        semiminor_axis=6371229
    )
)

# Display dataset coordinates
ds.coords

# Extract temperature data
airTemp = ds.TMP

# Display the temperature data
airTemp

# Convert temperature units to Celsius
airTemp = airTemp.metpy.convert_units('degC')

# Display the converted temperature data
airTemp

# Extract projection coordinates
x = airTemp.projection_x_coordinate
y = airTemp.projection_y_coordinate

# Plot temperature data
airTemp.plot(figsize=(11, 8.5))

# Compute minimum and maximum temperatures
minTemp = airTemp.min().compute()
maxTemp = airTemp.max().compute()

# Display minimum and maximum temperature values
minTemp.values, maxTemp.values

# Define contour levels
fint = np.arange(np.floor(minTemp.values), np.ceil(maxTemp.values) + 2, 2)

# Define plot bounds and resolution
latN = 50.4
latS = 24.25
lonW = -123.8
lonE = -71.2
res = '50m'

# Create a figure and axis with projection
fig = plt.figure(figsize=(18, 12))
ax = plt.subplot(1, 1, 1, projection=projData)
ax.set_extent([lonW, lonE, latS, latN], crs=ccrs.PlateCarree())
ax.add_feature(cfeature.COASTLINE.with_scale(res))
ax.add_feature(cfeature.STATES.with_scale(res))

# Add the title
tl1 = 'HRRR 2m temperature ($^\\circ$C)'
tl2 = f'Analysis valid at: {hour}00 UTC {date}'
plt.title(f'{tl1}\n{tl2}', fontsize=16)

# Contour fill
CF = ax.contourf(x, y, airTemp, levels=fint, cmap=plt.get_cmap('coolwarm'))

# Make a colorbar for the ContourSet returned by the contourf call
cbar = fig.colorbar(CF, shrink=0.5)
cbar.set_label(r'2m Temperature ($^\circ$C)', size='large')

# Show the plot
plt.show()
[Figure: Xarray's default plot of the 2 m temperature field]
[Figure: filled-contour map titled "HRRR 2m temperature (°C)", analysis valid at 2100 UTC 20211016, with coastlines and state borders]

Other Ways to Access

There are other common ways to access and use data with FSSpec and PelicanFS. This section will cover the following topics:

  1. Using an Intake Catalog

  2. Directly Accessing Data

Intake Catalog

In order to use PelicanFS with an Intake catalog, the paths in the catalog itself need to use the osdf or pelican schemes.

Here’s an example using the catalog located at https://data.gdex.ucar.edu/d850001/catalogs/osdf/cmip6-aws/cmip6-osdf-zarr.json

An entry in the catalog’s CSV file looks like:

HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,ta,gn,osdf:///aws-opendata/us-west-2/cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/ta/gn/v20170706/,,20170706

Notice that the path uses the osdf scheme and the /aws-opendata/us-west-2 namespace. If all the paths in the CSV file are formatted like this, then you can use the Intake catalog exactly as usual.
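One way to check that condition before relying on a catalog is to scan its CSV for URL-like fields that use some other scheme. The sketch below is only an illustration using the standard library: non_pelican_paths is a hypothetical helper, and the second row of the sample is made up (with a placeholder bucket name) to show what would be flagged.

```python
import csv
import io
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"osdf", "pelican"}


def non_pelican_paths(csv_text: str) -> list[str]:
    """Return URL-like fields whose scheme is not osdf or pelican."""
    offenders = []
    for row in csv.reader(io.StringIO(csv_text)):
        for field in row:
            if "://" in field and urlsplit(field).scheme not in ALLOWED_SCHEMES:
                offenders.append(field)
    return offenders


sample = (
    "HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,ta,gn,"
    "osdf:///aws-opendata/us-west-2/cmip6-pds/CMIP6/HighResMIP/CMCC/"
    "CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/ta/gn/v20170706/,,20170706\n"
    # Made-up row with an s3 path, to show what the check would flag.
    "HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,ta,gn,"
    "s3://example-bucket/path/,,20170706\n"
)
print(non_pelican_paths(sample))
```

An empty result means every path in the scanned text already uses the osdf or pelican scheme.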

Here is a workflow and plot which uses an Intake catalog and the osdf scheme. If you want to understand more about the underlying workflow, please look at the Global Mean Surface Temperature Anomalies (GMSTA) from CMIP6 data notebook.

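# Location of the intake-esm catalog whose paths use the osdf scheme
gdex_url = 'https://data.gdex.ucar.edu/'
cat_url = gdex_url + 'd850001/catalogs/osdf/cmip6-aws/cmip6-osdf-zarr.json'

# Open the catalog
col = intake.open_esm_datastore(cat_url)

# Search for monthly near-surface air temperature from the historical experiment
expts = ['historical']

query = dict(
    experiment_id=expts,
    table_id='Amon',
    variable_id=['tas'],
    member_id='r1i1p1f1',
    #activity_id = 'CMIP',
)

col_subset = col.search(require_all_on=["source_id"], **query)

# Open the first matching Zarr store; its path uses the osdf scheme
ds = xr.open_zarr(col_subset.df['zstore'].iloc[0])

# Plot the first time step
ds.tas.isel(time=0).plot()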
When this page was built, the cell above did not complete: intake.open_esm_datastore raised "HTTPError: HTTP Error 404: Not Found" while reading the catalog's CSV file, so no plot is shown here.

Direct Access of Data

You can also access the data directly using normal file system calls.

For example, let’s say you want to read in a CSV object from the OSDF. Just use the same pattern we’ve shown before of

osdf:///<namespace-path>

for your path.

with fsspec.open('osdf:///ndp/burnpro3d/YosemiteBurnExample/burnpro3d-yosemite-example.csv') as ex_csv:
    content = ex_csv.read()
    print(content.decode())

Summary

In this notebook we demonstrated how to use PelicanFS and gave an overview of a few common usage patterns. The main example showed how to use PelicanFS and Xarray to open a Zarr store. We also showed how to use PelicanFS with an Intake catalog and how to read an object directly.

What’s next?

The following notebooks all demonstrate workflows that use PelicanFS to access data from the OSDF.