Data Ingestion - General Purpose Tooling

Overview

If the specialized geospatial tools discussed in the previous notebook suit your needs, feel free to proceed to a workflow example, such as Spectral Clustering. If you need a tool that adapts to a wider range of data types and sources, this notebook introduces Intake V2, a general-purpose data ingestion and management library.

While the geospatial-specific tooling approach is optimized for satellite data, Intake is a high-level library that offers a broader and potentially more flexible approach for multimodal data workflows, characterized by:

Unified Interface: Abstracts the details of data sources, enabling users to interact with a consistent API regardless of the data’s underlying format.
Dynamic and Shareable Catalogs: Facilitates the creation and sharing of data catalogs that can be version-controlled, updated, and maintained.
Extensible: Supports new data sources and formats through its plugin system.
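
To make the idea of a unified, extensible interface concrete, here is a minimal sketch in plain Python. It is not Intake's implementation: the `READERS` registry and the `read_any` function are hypothetical names, used only to show how one call can dispatch to format-specific parsers.

```python
import csv
import io
import json

# Hypothetical registry mapping a format name to a parser function.
# Intake's real plugin system is considerably richer than this.
READERS = {
    "json": lambda text: json.loads(text),
    "csv": lambda text: list(csv.DictReader(io.StringIO(text))),
}


def read_any(text, fmt):
    """Parse `text` with whichever reader is registered for `fmt`."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return reader(text)


# The caller uses the same function regardless of the underlying format.
records_from_json = read_any('[{"scene": "A", "cloud": "12"}]', "json")
records_from_csv = read_any("scene,cloud\nA,12\n", "csv")
print(records_from_json == records_from_csv)  # True
```

In this toy version, supporting a new format only requires adding one entry to the registry, which is the spirit of the "Extensible" point above.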

In the following sections, we will guide you through an introduction to various Intake functionalities that simplify data access and enhance both modularity and reproducibility in geospatial workflows.

Prerequisites

Concepts	Importance	Notes
Intro to Landsat	Necessary	Background
Data Ingestion - Geospatial-Specific Tooling	Helpful
Pandas Cookbook	Helpful
xarray Cookbook	Necessary
Intake Quickstart	Helpful
Intake Cookbook	Necessary

Time to learn: 20 minutes

Imports

import intake
import planetary_computer
from pprint import pprint

# Viz
import hvplot.xarray
import panel as pn

pn.extension()

Connecting to Data Sources

To get started, we provide a STAC URL (or any other data source URL) to Intake and ask it to recommend suitable data types.

url = "https://planetarycomputer.microsoft.com/api/stac/v1"
data_types = intake.readers.datatypes.recommend(url)
pprint(data_types)

Selecting the Appropriate Data Type

After identifying the possible data types, we choose the one that best suits our needs. To handle the STAC-formatted JSON served at our URL, we will proceed with STACJSON.

data_type = intake.datatypes.STACJSON(url)
data_type

This object now represents the specific data type we will work with, allowing us to streamline subsequent data operations.
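
For intuition about what distinguishes STAC JSON from arbitrary JSON, the sketch below inspects the `type` and `stac_version` fields that STAC 1.0 objects carry. The `classify_stac` helper and the sample document are hypothetical; they are not part of Intake, which performs its own detection.

```python
import json

# STAC 1.0 objects identify themselves through a "type" field:
# "Catalog", "Collection", or "Feature" (used for Items).
STAC_TYPES = {"Catalog": "catalog", "Collection": "collection", "Feature": "item"}


def classify_stac(text):
    """Return 'catalog', 'collection', 'item', or None if `text` is not STAC-like JSON."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict) or "stac_version" not in doc:
        return None
    kind = doc.get("type")
    return STAC_TYPES.get(kind) if isinstance(kind, str) else None


# Hypothetical, heavily trimmed catalog document for illustration only.
sample = json.dumps(
    {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": "example",
        "description": "demo",
        "links": [],
    }
)
print(classify_stac(sample))      # catalog
print(classify_stac('{"a": 1}'))  # None
```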

Initializing Data Readers

With the STACJSON data type specified, we explore available methods to read the data.

readers = data_type.possible_readers
pprint(readers)

The output lists the readers that can interpret the STACJSON data type. The StacCatalogReader is probably the most suitable for our use case. We can use it to read the STAC catalog and explore the available contents.

Reading the Catalog

Next, we can access the data catalog through our reader.

reader = intake.catalogs.StacCatalogReader(
    data_type, signer=planetary_computer.sign_inplace
)
reader

This reader is now configured to handle interactions with the data catalog.

List Catalog Contents

Once the catalog is accessible, we read() it and then collect each dataset’s description to identify datasets of interest. For our purposes, we will just print the entries that include the word 'landsat'.

stac_cat = reader.read()

description = {}
for data_description in stac_cat.data.values():
    data = data_description.kwargs["data"]
    description[data["id"]] = data["description"]

# Print only keys that include the word 'landsat'
pprint([key for key in description.keys() if 'landsat' in key.lower()])

Detailed Dataset Examination

Examining specific datasets more closely helps us understand their content and relevance to our project goals. We can now print the descriptions of the Landsat collection IDs we are interested in.

print("1:", description["landsat-c2-l1"])
print('-------------------------------\n')
print("2:", description["landsat-c2-l2"])
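
Collection descriptions can run to several paragraphs. As a small sketch, the standard library's `textwrap.shorten` can produce one-line previews before printing the full text. The `description` dictionary below is a made-up stand-in for the one built from the catalog, and `preview` is a hypothetical helper.

```python
import textwrap

# Hypothetical stand-in for the `description` dictionary built earlier.
description = {
    "landsat-c2-l1": "Placeholder text standing in for a long Level-1 collection description. " * 5,
    "landsat-c2-l2": "Placeholder text standing in for a long Level-2 collection description. " * 5,
}


def preview(descriptions, width=80):
    """Map each collection ID to a single-line summary no longer than `width`."""
    return {
        key: textwrap.shorten(text, width=width, placeholder=" ...")
        for key, text in descriptions.items()
    }


for key, summary in preview(description).items():
    print(f"{key}: {summary}")
```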

Selecting and Accessing Data

We want "landsat-c2-l2". With the dataset chosen, we can access it directly and view its metadata, which holds key details for analysis and interpretation. Since the output is long, we’ll use the HoloViz Panel library to wrap it in a scrollable element.

landsat_reader = stac_cat["landsat-c2-l2"]
landsat_metadata = landsat_reader.read().metadata

# View extensive metadata in scrollable block
json_pane = pn.pane.JSON(landsat_metadata, name='Metadata', max_height=400, sizing_mode='stretch_width', depth=-1, theme='light')
scrollable_output = pn.Column(json_pane, height=400, sizing_mode='stretch_width', scroll=True, styles={'background': 'lightgrey'})
scrollable_output

Visual Preview

To get a visual preview of the dataset, particularly to check its quality and relevance, we read its thumbnail asset:

landsat_reader["thumbnail"].read()

Accessing Geospatial Data Items

Once we have selected the appropriate dataset, the next step is to access the specific data items. These items typically represent individual data files or collections that are part of the dataset.

The following code retrieves a handle to the ‘geoparquet-items’ from the Landsat dataset, which are optimized for efficient geospatial operations and queries.

landsat_items = landsat_reader["geoparquet-items"]
landsat_items

Converting Data for Analysis

To facilitate analysis, the following code selects the last few entries (tail) of the dataset as a GeoDataFrame and then converts that GeoDataFrame back into a STAC catalog. The GeoDataFrame format is well suited to geospatial data and compatible with libraries like GeoPandas, while the resulting STAC catalog organizes each item's assets for the steps that follow.

cat = landsat_items.tail(output_instance="geopandas:GeoDataFrame").GeoDataFrameToSTACCatalog.read()
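
Conceptually, turning a table row back into a STAC Item means regrouping flat columns into the nested Item structure. The sketch below illustrates that idea with plain dictionaries. It assumes a layout in which Item properties are stored as top-level columns; `row_to_item`, `TOP_LEVEL`, and the sample row are hypothetical and do not reflect how Intake performs the conversion internally.

```python
# Columns assumed to map directly onto top-level STAC Item fields.
TOP_LEVEL = ("id", "geometry", "bbox", "assets", "links", "collection")


def row_to_item(row):
    """Rebuild a STAC Item dictionary from one flat table row (a dict)."""
    item = {"type": "Feature", "stac_version": "1.0.0"}
    item.update({key: row[key] for key in TOP_LEVEL if key in row})
    # Any remaining columns are assumed to be flattened Item properties.
    item["properties"] = {k: v for k, v in row.items() if k not in TOP_LEVEL}
    return item


# Made-up row for illustration only.
row = {
    "id": "example-scene",
    "collection": "landsat-c2-l2",
    "geometry": {"type": "Point", "coordinates": [-120.0, 38.0]},
    "datetime": "2024-04-01T18:30:00Z",
    "eo:cloud_cover": 3.2,
    "assets": {"red": {"href": "https://example.com/red.tif"}},
}
item = row_to_item(row)
print(sorted(item["properties"]))  # ['datetime', 'eo:cloud_cover']
```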

Exploring Data Collections

After conversion, we explore the structure of the data collection. Each “item” in this collection corresponds to a set of assets, providing a structured way to access multiple related data files. We’ll simply print the structure of the catalog to understand the available items and their organization.

cat

Accessing Sub-Collections

To dive deeper into the data, we access a specific sub-collection based on its key. This allows us to focus on a particular geographic area or time period. We’ll select the first item in the catalog for now.

item_key = list(cat.entries.keys())[0]
subcat = cat[item_key].read()
subcat

Reading Specific Data Bands

For detailed analysis, especially in remote sensing, accessing specific spectral bands is crucial. Here, we read the red spectral band, which is often used in vegetation analysis and other remote sensing applications.

subcat.red.read()
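
As an example of why the red band matters for vegetation analysis, the Normalized Difference Vegetation Index (NDVI) combines red and near-infrared (NIR) reflectance as (NIR - Red) / (NIR + Red). The sketch below applies the formula to made-up reflectance values in plain Python; it does not read any Landsat data, and a real workflow would apply the same arithmetic to whole arrays.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pair of reflectance values."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / total


# Made-up surface reflectance values (unitless, 0 to 1).
print(round(ndvi(nir=0.50, red=0.10), 3))  # 0.667, typical of dense vegetation
print(round(ndvi(nir=0.20, red=0.18), 3))  # 0.053, typical of sparse or no vegetation
```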

Preparing for Multiband Analysis

To analyze true color imagery, we need to stack multiple spectral bands. Here, we prepare for this by setting up a band-stacking operation. Note that re-signing might be necessary at this point.

catbands = cat[item_key].to_reader(reader="StackBands", bands=["red", "green", "blue"], signer=planetary_computer.sign_inplace)
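
One reason re-signing can be needed is that signed URLs expire. The sketch below assumes the signed asset URLs carry an Azure-style SAS token, whose `se` query parameter holds the signed expiry time, and checks whether that time has passed. The URL and token are made up, and `sas_expired` is a hypothetical helper rather than a `planetary_computer` function.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def sas_expired(url, now=None):
    """Return True if the URL's `se` (signed expiry) time is missing or has passed."""
    now = now or datetime.now(timezone.utc)
    query = parse_qs(urlparse(url).query)
    if "se" not in query:
        return True  # treat unsigned URLs as needing a signature
    # Python < 3.11 cannot parse a trailing "Z", so spell out the UTC offset.
    expiry = datetime.fromisoformat(query["se"][0].replace("Z", "+00:00"))
    return now >= expiry


# Made-up asset URL with a fake token, for illustration only.
url = "https://example.blob.core.windows.net/landsat/red.tif?se=2024-04-02T12%3A00%3A00Z&sig=FAKE"
print(sas_expired(url, now=datetime(2024, 4, 2, 11, 0, tzinfo=timezone.utc)))  # False
print(sas_expired(url, now=datetime(2024, 4, 2, 13, 0, tzinfo=timezone.utc)))  # True
```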

Loading and Visualizing True Color Imagery

After setting up the band-stacking, we read the multiband data and prepare it for visualization.

data = catbands.read(dim="band")
data

Visualizing Data

Finally, we visualize the true color imagery. This visualization helps in assessing the quality of the data and the appropriateness of the bands used.

data.plot.imshow(robust=True, figsize=(10, 10))
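
The `robust=True` option tells xarray to compute the display range from the 2nd and 98th percentiles rather than the extreme values, so a few very bright pixels do not wash out the image. The sketch below reproduces that idea in plain Python on made-up pixel values; `percentile` and `robust_stretch` are hypothetical helpers written for illustration, not xarray functions.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def robust_stretch(values, low=2, high=98):
    """Clip to the [low, high] percentile range and rescale to 0..1."""
    vmin, vmax = percentile(values, low), percentile(values, high)
    if vmax == vmin:
        return [0.0 for _ in values]
    return [min(max((v - vmin) / (vmax - vmin), 0.0), 1.0) for v in values]


# Made-up pixel values with one bright outlier (for example, a cloud).
pixels = list(range(100)) + [10_000]
stretched = robust_stretch(pixels)
print(stretched[0], stretched[50], stretched[-1])  # 0.0 0.5 1.0
```

With a plain min-max stretch, the single outlier at 10,000 would squeeze every other pixel into the bottom 1% of the display range.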

Summary

As earth science data becomes integrated with other types of data, a powerful approach is to utilize a general purpose set of tools, including Intake and Xarray. Once you have accessed data, visualize it with hvPlot to ensure that it matches your expectations.

What’s next?

Now that we know how to access the data, it’s time to proceed to analysis, where we will explore some simple machine learning approaches.

Resources and references

Authored by Demetris Roumis and Andrew Huang circa April, 2024, with guidance from Martin Durant.

The banner image is a mashup of a Landsat 8 image from NASA and the Intake logo.