Data Ingestion - Intake
Overview
In the last notebook, you learned how to efficiently load data from the Microsoft Planetary Computer platform. If that approach works for you, please proceed to a workflow example. However, there are many other established approaches and platforms (e.g. NASA EarthData, USGS EarthExplorer, Google, and AWS), for loading Landsat data that may not preprocess the data in the exact same way or use a similar API. For this reason, in this notebook we will demonstrate common alternative approaches and techniques for general data access, centered around Intake.
First, to demonstrate simple cases, we will begin by reading in a local files with Pandas and xarray. Then we will demonstrate the combined use of Intake and Dask libraries to efficiently fetch data from a remote server.
Info
A great way to contribute to this cookbook is to create a notebook that focuses on data access from a specific provider.
Prerequisites
Concepts |
Importance |
Notes |
---|---|---|
Necessary |
Background |
|
Helpful |
||
Helpful |
||
Necessary |
||
Helpful |
||
Necessary |
Time to learn: 20 minutes
Imports
import pandas as pd
import xarray as xr
import intake
import hvplot.intake
import hvplot.xarray
import warnings
warnings.simplefilter('ignore', FutureWarning) # Ignore warning about the format of epsg codes
Loading local metadata with Pandas
Often, it’s common to have local files that could contain metadata. Therefore, we’ll start with the simple case of loading small local tabular data, such as .csv
files into Pandas Dataframes.
Let’s load information about Landsat 5 bands:
landsat5_bands = pd.read_csv('data/landsat5_bands.csv').set_index('Band')
landsat5_bands
Description | Range (nm) | |
---|---|---|
Band | ||
1 | Blue | 450-520 |
2 | Green | 520-600 |
3 | Red | 630-690 |
4 | Near-Infrared | 760-900 |
5 | Short Wavelength Infrared 1 | 1550-1750 |
6 | Thermal Infrared | 10400-12500 |
7 | Short Wavelength Infrared 2 | 2080-2350 |
…and the relevant subset of Landsat 8 bands:
landsat8_bands = pd.read_csv('data/landsat8_bands.csv').set_index('Band')
landsat8_bands
Description | Range (nm) | |
---|---|---|
Band | ||
1 | Coastal Aerosol | 435-451 |
2 | Blue | 452-512 |
3 | Green | 533-590 |
4 | Red | 636-673 |
5 | Near-Infrared | 851-879 |
6 | Short Wavelength Infrared 1 | 1566-1651 |
7 | Short Wavelength Infrared 2 | 2107-2294 |
Interestingly, we can see that the range and index of the bands differ between the satellites. For example, ‘Red’ is Band-4 within Landsat 8, but Band-3 within Landsat 5. This is important to keep in mind whenever comparing data across Landsat missions.
Loading small and local data with xarray
Although initiatives are underway to enhance cloud-based data workflows, at some point, it will likely be the case that you will have some local gridded data that you want to load and preview without much fuss. For this simple case, it makes sense to use an appropriately simple API. Using xarray and hvplot, we can quickly complete this task of previewing a local netCDF file.
landsat_5_crop = xr.open_dataarray('data/landsat5_crop.nc')
landsat_5_crop
<xarray.DataArray (y: 40, x: 40)> [1600 values with dtype=float64] Coordinates: band int64 ... * y (y) float64 4.297e+06 4.297e+06 4.297e+06 ... 4.292e+06 4.291e+06 * x (x) float64 3.384e+05 3.386e+05 3.387e+05 ... 3.441e+05 3.442e+05 Attributes: transform: [ 1.500000e+02 0.000000e+00 3.323250e+05 0.000000e+00 ... crs: +init=epsg:32611 res: [150. 150.] is_tiled: 0 nodatavals: nan scales: 1.0 offsets: 0.0 AREA_OR_POINT: Area
landsat_5_crop_image = landsat_5_crop.hvplot.image(geo=True, tiles='ESRI', alpha=.7)
landsat_5_crop_image
/home/runner/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/geoviews/operation/__init__.py:14: HoloviewsDeprecationWarning: 'ResamplingOperation' is deprecated and will be removed in version 1.18, use 'ResampleOperation2D' instead.
from holoviews.operation.datashader import (
This image is just a cropped portion of a Landsat tile, solely for demonstration of loading with xarray. Also, as we learned in the last notebook, as long as the data has a valid crs
attribute, passing geo=True
to hvPlot will plot the data in a geographic coordinate system, allowing us to use the tiles
option to overlay the plot on top of map tiles.
Loading data of any size and location using Intake
Even if your current data files are local, earth science data is often complex, remote, and large, so it’s a good idea to become familiar with an approach that can handle data of nearly any complexity, location, and size. For this reason, we will demonstrate a couple different approaches of the Intake library.
First, we will start with a simpler case of reading the same local tabular file as above with Pandas. Then we will progressively cover more complex cases, such as loading multiple files and loading gridded data from a remote location with an Intake catalog.
ds = intake.open_csv('./data/landsat5_bands.csv')
Importantly, the data is not loaded yet - open_csv
only accesses the metadata:
ds
csv:
args:
urlpath: ./data/landsat5_bands.csv
description: ''
driver: intake.source.csv.CSVSource
metadata: {}
We can also peak at what format the data will be in when it gets loaded using .container
:
ds.container
'dataframe'
Now, to get better insight into the data without loading it all in just yet, we can inspect the data using .to_dask()
. You could also use .read_chunked()
to return an iterator over the chunks (or partitions) of data. Refer to the ‘Reading Data’ Intake documentation for more information about these options.
dd = ds.to_dask()
dd.head()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[9], line 1
----> 1 dd = ds.to_dask()
2 dd.head()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/csv.py:32, in CSVSource.to_dask(self)
31 def to_dask(self):
---> 32 return DaskCSV(self.data).read(**self.kwargs)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/readers/readers.py:96, in BaseReader.read(self, *args, **kwargs)
94 kw.update(kwargs)
95 args = kw.pop("args", ()) or args
---> 96 return self._read(*args, **kw)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/readers/readers.py:151, in FileReader._read(self, data, **kw)
149 if self.storage_options and data.storage_options:
150 kw["storage_options"] = data.storage_options
--> 151 return self._func(**kw)
TypeError: read_csv() missing 1 required positional argument: 'path'
dd.info()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[10], line 1
----> 1 dd.info()
NameError: name 'dd' is not defined
Notice that now running .info()
gives us a lot less information than when we loaded the data earlier using pd.read_csv
, this is because the Dask
representation of the data only loads what it needs, when it is needed for a computation. However, you can always explicitly load all the data into memory using .read()
(but be careful to only run this on a dataset that will fit on your computer’s memory). Our sample dataset is very small, so let’s try it:
df = ds.read()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Band 7 non-null int64
1 Description 7 non-null object
2 Range (nm) 7 non-null object
dtypes: int64(1), object(2)
memory usage: 296.0+ bytes
Loading multiple local files
Intake also lets the user load and concatenate data across multiple files in one command
ds = intake.open_csv(['./data/landsat5_bands.csv', './data/landsat8_bands.csv'])
If the data file names allow, this can be more simply expressed using *
:
ds = intake.open_csv('./data/landsat*_bands.csv')
df = ds.read()
df.info()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[14], line 1
----> 1 df = ds.read()
2 df.info()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/csv.py:35, in CSVSource.read(self)
34 def read(self):
---> 35 return PandasCSV(self.data).read(**self.kwargs)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/readers/readers.py:96, in BaseReader.read(self, *args, **kwargs)
94 kw.update(kwargs)
95 args = kw.pop("args", ()) or args
---> 96 return self._read(*args, **kw)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/readers/readers.py:151, in FileReader._read(self, data, **kw)
149 if self.storage_options and data.storage_options:
150 kw["storage_options"] = data.storage_options
--> 151 return self._func(**kw)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
617 _validate_names(kwds.get("names", None))
619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
622 if chunksize or iterator:
623 return parser
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
1617 self.options["has_index_names"] = kwds["has_index_names"]
1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
1878 if "b" not in mode:
1879 mode += "b"
-> 1880 self.handles = get_handle(
1881 f,
1882 mode,
1883 encoding=self.options.get("encoding", None),
1884 compression=self.options.get("compression", None),
1885 memory_map=self.options.get("memory_map", False),
1886 is_text=is_text,
1887 errors=self.options.get("encoding_errors", "strict"),
1888 storage_options=self.options.get("storage_options", None),
1889 )
1890 assert self.handles is not None
1891 f = self.handles.handle
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
868 elif isinstance(handle, str):
869 # Check whether the filename is to be opened in binary mode.
870 # Binary mode does not support 'encoding' and 'newline'.
871 if ioargs.encoding and "b" not in ioargs.mode:
872 # Encoding
--> 873 handle = open(
874 handle,
875 ioargs.mode,
876 encoding=ioargs.encoding,
877 errors=errors,
878 newline="",
879 )
880 else:
881 # Binary mode
882 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: './data/landsat*_bands.csv'
But sometimes there is data encoded in a file name or path that causes concatenated data to lose some important context. In this example, we lose the information about which version of landsat the training was done on (5 vs 8). To keep track of that information, we can use a python format string {variable:type}
to specify our path and declare a new field on our data. That field will get populated based on its value in the path.
ds = intake.open_csv('./data/landsat{Landsat:d}_bands.csv')
df = ds.read().set_index(['Landsat', 'Band'])
df
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[15], line 2
1 ds = intake.open_csv('./data/landsat{Landsat:d}_bands.csv')
----> 2 df = ds.read().set_index(['Landsat', 'Band'])
3 df
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/csv.py:35, in CSVSource.read(self)
34 def read(self):
---> 35 return PandasCSV(self.data).read(**self.kwargs)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/readers/readers.py:96, in BaseReader.read(self, *args, **kwargs)
94 kw.update(kwargs)
95 args = kw.pop("args", ()) or args
---> 96 return self._read(*args, **kw)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/readers/readers.py:151, in FileReader._read(self, data, **kw)
149 if self.storage_options and data.storage_options:
150 kw["storage_options"] = data.storage_options
--> 151 return self._func(**kw)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
617 _validate_names(kwds.get("names", None))
619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
622 if chunksize or iterator:
623 return parser
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
1617 self.options["has_index_names"] = kwds["has_index_names"]
1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
1878 if "b" not in mode:
1879 mode += "b"
-> 1880 self.handles = get_handle(
1881 f,
1882 mode,
1883 encoding=self.options.get("encoding", None),
1884 compression=self.options.get("compression", None),
1885 memory_map=self.options.get("memory_map", False),
1886 is_text=is_text,
1887 errors=self.options.get("encoding_errors", "strict"),
1888 storage_options=self.options.get("storage_options", None),
1889 )
1890 assert self.handles is not None
1891 f = self.handles.handle
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
868 elif isinstance(handle, str):
869 # Check whether the filename is to be opened in binary mode.
870 # Binary mode does not support 'encoding' and 'newline'.
871 if ioargs.encoding and "b" not in ioargs.mode:
872 # Encoding
--> 873 handle = open(
874 handle,
875 ioargs.mode,
876 encoding=ioargs.encoding,
877 errors=errors,
878 newline="",
879 )
880 else:
881 # Binary mode
882 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: './data/landsat{Landsat:d}_bands.csv'
Loading remote data with a catalog
Probably the most versatile and reproducible approach to loading data with Intake is to use a simple Catalog file to declare how the data should be loaded. The catalog lays out how the data should be loaded, defines some metadata, and specifies any patterns in the file path that should be included in the data. Here is an example of an entry in a catalog YAML file:
sources:
landsat_5_small:
description: Small version of Landsat 5 Surface Reflectance Level-2 Science Product.
driver: rasterio
cache:
- argkey: urlpath
regex: 'earth-data/landsat'
type: file
args:
urlpath: 's3://earth-data/landsat/small/LT05_L1TP_042033_19881022_20161001_01_T1_sr_band{band:d}.tif'
chunks:
band: 1
x: 50
y: 50
concat_dim: band
storage_options: {'anon': True}
The urlpath
can be a path to a file, list of files, or a path with glob notation. Alternatively, the path can be written as a python style format_string. In the case where the urlpath
is a format string (for example, {band:d}
), the fields specified in that string will be parsed from the filenames and returned in the data.
Let’s look at what our catalog contains using open_catalog
:
cat = intake.open_catalog('./data/catalog.yml')
list(cat)
['landsat_5_small',
'landsat_8_small',
'landsat_5',
'landsat_8',
'google_landsat_band',
'amazon_landsat_band',
'fluxnet_daily',
'fluxnet_metadata',
'seattle_lidar']
Here, we see a list of datasets specified by the catalog.yml file. Lets’s look at a description
and the urlpath
for one of them:
landsat_5 = cat.landsat_5_small
landsat_5.description
'Small version of Landsat 5 Surface Reflectance Level-2 Science Product.'
landsat_5.container
'xarray'
From .container
we can see that ultimately the data will be an xarray object
landsat_5.urlpath
's3://earth-data/landsat/small/LT05_L1TP_042033_19881022_20161001_01_T1_sr_band{band:d}.tif'
From the urlpath
we can see that this dataset is hosted remotely on Amazon S3
Let’s preview this dataset using .to_dask()
.
landsat_5 = cat.landsat_5_small
landsat_5.to_dask()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[21], line 2
1 landsat_5 = cat.landsat_5_small
----> 2 landsat_5.to_dask()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
67 def to_dask(self):
68 """Return xarray object where variables are dask arrays"""
---> 69 return self.read_chunked()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
42 def read_chunked(self):
43 """Return xarray object (which will have chunks)"""
---> 44 self._load_metadata()
45 return self._ds
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/base.py:84, in DataSourceBase._load_metadata(self)
82 """load metadata only if needed"""
83 if self._schema is None:
---> 84 self._schema = self._get_schema()
85 self.dtype = self._schema.dtype
86 self.shape = self._schema.shape
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
99 self.urlpath, *_ = self._get_cache(self.urlpath)
101 if self._ds is None:
--> 102 self._open_dataset()
104 ds2 = xr.Dataset({'raster': self._ds})
105 metadata = {
106 'dims': dict(ds2.dims),
107 'data_vars': {k: list(ds2[k].coords)
(...)
110 'array': 'raster'
111 }
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
88 self._ds = self._open_files(files)
89 else:
---> 90 self._ds = xr.open_rasterio(files, chunks=self.chunks,
91 **self._kwargs)
AttributeError: module 'xarray' has no attribute 'open_rasterio'
From this information we can see that there are three dimensions with coordinates x
, y
, and band
. This is satellite data where the x
and y
are spatial earth coordinates and the band
refers to the wavelength range of light used to capture the satellite image.
Visualizing the data
Often, data ingestion involves quickly visualizing your raw data to get a sense that things are proceeding accordingly. You have already seen how we can easily use hvPlot for bare xarray data, but hvPlot also provides interactive plotting commands for Intake, Pandas, Dask, and GeoPandas objects. We’ll look more closely at hvPlot and its options in later tutorials.
For now, let’s just plot out our different band images as columns (using col='band'
). We can also set rasterize=True
to use Datashader (another HoloViz tool) to render large data into a 2D histogram, where every array cell counts the data points falling into that pixel, as set by the resolution of your screen.
landsat_5.hvplot.image(col='band', cmap='viridis', xaxis=False, yaxis=False, colorbar=False, rasterize=True)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 landsat_5.hvplot.image(col='band', cmap='viridis', xaxis=False, yaxis=False, colorbar=False, rasterize=True)
AttributeError: 'RasterIOSource' object has no attribute 'hvplot'
Hint: Try zooming in on one plot to see that they are all linked!
If we increase the complexity a bit and consider the likely situation where you are constantly loading and visualizing a similar type of data using an Intake catalog, you can instead just declare plotting parameters directly into the catalog!
Here is the relevant part of our catalog.yml
:
metadata:
plots:
band_image:
kind: 'image'
x: 'x'
y: 'y'
groupby: 'band'
rasterize: True
width: 400
dynamic: False
landsat_5 = cat.landsat_5_small
landsat_5.to_dask()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[23], line 2
1 landsat_5 = cat.landsat_5_small
----> 2 landsat_5.to_dask()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
67 def to_dask(self):
68 """Return xarray object where variables are dask arrays"""
---> 69 return self.read_chunked()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
42 def read_chunked(self):
43 """Return xarray object (which will have chunks)"""
---> 44 self._load_metadata()
45 return self._ds
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/base.py:84, in DataSourceBase._load_metadata(self)
82 """load metadata only if needed"""
83 if self._schema is None:
---> 84 self._schema = self._get_schema()
85 self.dtype = self._schema.dtype
86 self.shape = self._schema.shape
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
99 self.urlpath, *_ = self._get_cache(self.urlpath)
101 if self._ds is None:
--> 102 self._open_dataset()
104 ds2 = xr.Dataset({'raster': self._ds})
105 metadata = {
106 'dims': dict(ds2.dims),
107 'data_vars': {k: list(ds2[k].coords)
(...)
110 'array': 'raster'
111 }
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
88 self._ds = self._open_files(files)
89 else:
---> 90 self._ds = xr.open_rasterio(files, chunks=self.chunks,
91 **self._kwargs)
AttributeError: module 'xarray' has no attribute 'open_rasterio'
landsat_5.hvplot.band_image()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[24], line 1
----> 1 landsat_5.hvplot.band_image()
AttributeError: 'RasterIOSource' object has no attribute 'hvplot'
Excellent! Since we didn’t specify that the images should to be plotted as columns, hvPlot conveniently gives us a widget to scrub through them.
Side-note: Drilling down for data access
As part of accessing data, it’s important to understand how to drill down to different representations of our data that certain processing steps might require.
Numpy Array
Earlier, we saw that if the selected data are small enough, we can load return all the data into one in-memory xarray DataArray container. Once in an DataArray object, the data can be easily manipulated and visualized. Under the hood, this xarray object is essentially comprised of NumPy arrays with the addition of coordinate labels (and other superpowers). However, sometimes machine learning tools (e.g. scikit-learn) only accept NumPy arrays as input. These arrays are accessible to us from the xarray DataArray objects by using the .values
attribute.
npa = cat.landsat_5_small.read().values
type(npa)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[25], line 1
----> 1 npa = cat.landsat_5_small.read().values
2 type(npa)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:39, in DataSourceMixin.read(self)
37 def read(self):
38 """Return a version of the xarray with all the data in memory"""
---> 39 self._load_metadata()
40 return self._ds.load()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/base.py:84, in DataSourceBase._load_metadata(self)
82 """load metadata only if needed"""
83 if self._schema is None:
---> 84 self._schema = self._get_schema()
85 self.dtype = self._schema.dtype
86 self.shape = self._schema.shape
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
99 self.urlpath, *_ = self._get_cache(self.urlpath)
101 if self._ds is None:
--> 102 self._open_dataset()
104 ds2 = xr.Dataset({'raster': self._ds})
105 metadata = {
106 'dims': dict(ds2.dims),
107 'data_vars': {k: list(ds2[k].coords)
(...)
110 'array': 'raster'
111 }
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
88 self._ds = self._open_files(files)
89 else:
---> 90 self._ds = xr.open_rasterio(files, chunks=self.chunks,
91 **self._kwargs)
AttributeError: module 'xarray' has no attribute 'open_rasterio'
npa.shape
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[26], line 1
----> 1 npa.shape
NameError: name 'npa' is not defined
npa[:3, :3, :3]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[27], line 1
----> 1 npa[:3, :3, :3]
NameError: name 'npa' is not defined
Side-note: Organizing Multiple Datasets into a DataTree
In addition to drilling down to NumPy Arrays for data access, we can also zoom out to use an even higher level container that an xarray Dataset. The Datatree library provides a tree-like container that can hold xarray Datasets that are organised into arbitrarily nested groups like a Python dictionary. This can come in handy if you are working with several DataSets that are not alignable (i.e. they live on different grids), but are still related to each other. For example, you may have Landsat data from different missions or on a different underlying grid/resolution. The main benefit of using a DataTree is that you can map standard xarray operations recursively over all nodes.
Warning
DataTree is currently in incubation and may be upstreamed into xarray in the future. Check the status on the Datatree repository before use
Let’s import datatree and load datasets from different missions that have different underlying grids, resolutions, and offsets, but cover the same lake at different times.
from datatree import DataTree
ls5_da = cat.landsat_5_small.read()
ls5_da.head(2)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[29], line 1
----> 1 ls5_da = cat.landsat_5_small.read()
2 ls5_da.head(2)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:39, in DataSourceMixin.read(self)
37 def read(self):
38 """Return a version of the xarray with all the data in memory"""
---> 39 self._load_metadata()
40 return self._ds.load()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/base.py:84, in DataSourceBase._load_metadata(self)
82 """load metadata only if needed"""
83 if self._schema is None:
---> 84 self._schema = self._get_schema()
85 self.dtype = self._schema.dtype
86 self.shape = self._schema.shape
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
99 self.urlpath, *_ = self._get_cache(self.urlpath)
101 if self._ds is None:
--> 102 self._open_dataset()
104 ds2 = xr.Dataset({'raster': self._ds})
105 metadata = {
106 'dims': dict(ds2.dims),
107 'data_vars': {k: list(ds2[k].coords)
(...)
110 'array': 'raster'
111 }
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
88 self._ds = self._open_files(files)
89 else:
---> 90 self._ds = xr.open_rasterio(files, chunks=self.chunks,
91 **self._kwargs)
AttributeError: module 'xarray' has no attribute 'open_rasterio'
ls8_da = cat.landsat_8_small.read()
ls8_da.head(2)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[30], line 1
----> 1 ls8_da = cat.landsat_8_small.read()
2 ls8_da.head(2)
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/base.py:39, in DataSourceMixin.read(self)
37 def read(self):
38 """Return a version of the xarray with all the data in memory"""
---> 39 self._load_metadata()
40 return self._ds.load()
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake/source/base.py:84, in DataSourceBase._load_metadata(self)
82 """load metadata only if needed"""
83 if self._schema is None:
---> 84 self._schema = self._get_schema()
85 self.dtype = self._schema.dtype
86 self.shape = self._schema.shape
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
99 self.urlpath, *_ = self._get_cache(self.urlpath)
101 if self._ds is None:
--> 102 self._open_dataset()
104 ds2 = xr.Dataset({'raster': self._ds})
105 metadata = {
106 'dims': dict(ds2.dims),
107 'data_vars': {k: list(ds2[k].coords)
(...)
110 'array': 'raster'
111 }
File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
88 self._ds = self._open_files(files)
89 else:
---> 90 self._ds = xr.open_rasterio(files, chunks=self.chunks,
91 **self._kwargs)
AttributeError: module 'xarray' has no attribute 'open_rasterio'
These two sets of Landsat bands cover roughly the same area but were taken in 1988 and 2017. They have different resolutions, different numbers of grid cells per image, and different x and y offsets. We can see the offset by printing the first x
value for each year and seeing that they are not equivalent.
print(f'Landsat 5 data first x coord: {ls5_da.x[0].values}')
print(f'Landsat 8 data first x coord: {ls8_da.x[0].values}')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[31], line 1
----> 1 print(f'Landsat 5 data first x coord: {ls5_da.x[0].values}')
2 print(f'Landsat 8 data first x coord: {ls8_da.x[0].values}')
NameError: name 'ls5_da' is not defined
Before we organize the data into a DataTree, let’s promote our DataArrays into DataSets. Each node of the DataTree contains exactly one xarray DataSet.
landsat_8_ds = ls8_da.to_dataset(name='Walker Lake')
landsat_5_ds = ls5_da.to_dataset(name='Walker Lake')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[32], line 1
----> 1 landsat_8_ds = ls8_da.to_dataset(name='Walker Lake')
2 landsat_5_ds = ls5_da.to_dataset(name='Walker Lake')
NameError: name 'ls8_da' is not defined
Now let’s combine our DataSets into Landsat subgroups by mission. We could arbitrarily nest these groups as our research requires.
dt = DataTree.from_dict({"Landsat/Mission8": landsat_8_ds, "Landsat/Mission5": landsat_5_ds})
dt
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[33], line 1
----> 1 dt = DataTree.from_dict({"Landsat/Mission8": landsat_8_ds, "Landsat/Mission5": landsat_5_ds})
2 dt
NameError: name 'landsat_8_ds' is not defined
Now we could map operations across our Landsat DataTree nodes, even though they are misaligned.
dt.mean(dim='x')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[34], line 1
----> 1 dt.mean(dim='x')
NameError: name 'dt' is not defined
This was a very simple demonstration of DataTree. Check the DataTree documentation for a comprehensive overview.
Summary
Before accessing any data, it’s a good idea to start by learning about the context and details of the dataset. This will give you the intuition to make informed decisions as you form a processing and analysis pipeline. Once you have identified a dataset to use, tailor the data access approach given the features of the data, such as size and location. If your data is local and small, a reliable approach is to access the data with either Pandas or xarray, depending on whether the data is tabular or gridded. As earth science data is often large, multidimensional, and on a remote server, a powerful and increasingly common approach is to use the combination of Intake, Dask, and xarray. Once you have accessed data, visualize it with hvPlot to ensure that it matches your expectations.
What’s next?
After loading the data, you may need to process it in some way before it can be fed into your analysis workflow. We will cover an example of such processing in the next section.
Resources and references
Authored/adapted by Demetris Roumis circa Dec, 2022
The Landsat 8 banner image is from NASA