# Data Ingestion - General Purpose Tooling

 ![intake landsat](images/intake_landsat.png "intake landsat")

---

## Overview

If the specialized geospatial tools discussed in the previous notebook suit your needs, feel free to proceed to a workflow example, such as [Spectral Clustering](2.0_Spectral_Clustering_PC.ipynb). However, if you need a tool that adapts to a wider range of data types and sources, this notebook introduces [Intake V2](https://intake.readthedocs.io), a general-purpose data ingestion and management library.

While the [geospatial-specific tooling](1.0_Data_Ingestion-Geospatial.ipynb) approach is optimized for satellite data, Intake is a high-level library that offers a broader and potentially more flexible approach for multimodal data workflows, characterized by:

- **Unified Interface**: Abstracts the details of data sources, enabling users to interact with a consistent API regardless of the data's underlying format.
- **Dynamic and Shareable Catalogs**: Facilitates the creation and sharing of data catalogs that can be version-controlled, updated, and maintained.
- **Extensible**: Facilitates the addition of new data sources and formats through its plugin system.

The following sections introduce Intake functionality that simplifies data access and improves both modularity and reproducibility in geospatial workflows.


## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| [Intro to Landsat](./0.0_Intro_Landsat.ipynb) | Necessary | Background |
| [Data Ingestion - Geospatial-Specific Tooling](1.0_Data_Ingestion-Geospatial.ipynb) | Helpful | |
| [Pandas Cookbook](https://foundations.projectpythia.org/core/pandas.html) | Helpful |  |
| [xarray Cookbook](https://foundations.projectpythia.org/core/xarray.html) | Necessary |  |
| [Intake Quickstart](https://intake.readthedocs.io/en/latest/index.html) | Helpful |  |
| [Intake Cookbook](https://projectpythia.org/intake-cookbook/README.html) | Necessary | |

- **Time to learn**: 20 minutes

---

## Imports

In [None]:
import intake
import planetary_computer
from pprint import pprint

# Viz
import hvplot.xarray
import panel as pn

pn.extension()

## Connecting to Data Sources

To get started, we provide a STAC URL (or any other data source URL) to Intake and ask it to recommend suitable data types.

In [None]:
url = "https://planetarycomputer.microsoft.com/api/stac/v1"
data_types = intake.readers.datatypes.recommend(url)
pprint(data_types)

## Selecting the Appropriate Data Type
After identifying the possible data types, we choose the one that best suits our needs. To handle the STAC-formatted JSON data behind our URL, we will proceed with `STACJSON`.

In [None]:
data_type = intake.datatypes.STACJSON(url)
data_type

This object now represents the specific data type we will work with, allowing us to streamline subsequent data operations.
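
For orientation, the document behind a STAC URL is plain JSON. The sketch below parses a hand-written, heavily trimmed catalog and lists its child links, which is roughly the kind of information a catalog reader walks through. The `id`, `description`, and `href` values are made up for illustration and are not copied from the Planetary Computer; the `child_links` helper is hypothetical and not part of Intake.

```python
import json

# Hand-written, trimmed STAC Catalog document. All values are illustrative.
SAMPLE_CATALOG = """
{
  "type": "Catalog",
  "stac_version": "1.0.0",
  "id": "example-catalog",
  "description": "A made-up catalog used only for this sketch.",
  "links": [
    {"rel": "self", "href": "https://example.com/stac"},
    {"rel": "child", "href": "https://example.com/stac/collections/landsat-demo"},
    {"rel": "child", "href": "https://example.com/stac/collections/other-demo"}
  ]
}
"""


def child_links(catalog_json):
    """Return the hrefs of all links whose relation type is 'child'."""
    doc = json.loads(catalog_json)
    return [link["href"] for link in doc["links"] if link["rel"] == "child"]


children = child_links(SAMPLE_CATALOG)
print(children)
```

A real catalog carries many more fields, but the `links` list with `rel` and `href` entries is what ties a catalog to its collections.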

## Initializing Data Readers

With the `STACJSON` data type specified, we explore available methods to read the data.

In [None]:
readers = data_type.possible_readers
pprint(readers)

The output lists the readers that can interpret the `STACJSON` data type. The `StacCatalogReader` is probably the most suitable for our use case. We can use it to read the STAC catalog and explore the available contents.

## Reading the Catalog
Next, we can access the data catalog through our reader.

In [None]:
reader = intake.catalogs.StacCatalogReader(
    data_type, signer=planetary_computer.sign_inplace
)
reader

This reader is now configured to handle interactions with the data catalog.
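
The `signer` argument is a callable that modifies asset URLs so that they can actually be downloaded. As a rough mental model only, the toy signer below appends a token to every asset `href` in place. The `make_signer` helper, the token strings, and the item layout are all hypothetical; this is not how `planetary_computer.sign_inplace` is implemented, only a sketch of the "mutate in place" idea.

```python
from urllib.parse import urlsplit, urlunsplit


def make_signer(token):
    """Build a hypothetical signer that sets `token` as the query string of every asset href."""

    def sign_inplace(item):
        for asset in item["assets"].values():
            parts = urlsplit(asset["href"])
            # Replace any previous query string so that re-signing swaps in the fresh token.
            asset["href"] = urlunsplit(parts._replace(query=token))
        return item

    return sign_inplace


# Made-up item with a single asset.
item = {"id": "demo-item", "assets": {"red": {"href": "https://example.com/data/red.tif"}}}

make_signer("sig=abc123")(item)
first_href = item["assets"]["red"]["href"]

# Signing again replaces the old token rather than stacking a second one.
make_signer("sig=def456")(item)
second_href = item["assets"]["red"]["href"]
print(first_href, second_href)
```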

## List Catalog Contents
Once the catalog is accessible, we `read()` it and then collect each dataset's `description` to identify datasets of interest. For our purposes, we will just print the entries that include the word `'landsat'`.

In [None]:
stac_cat = reader.read()

description = {}
for data_description in stac_cat.data.values():
    data = data_description.kwargs["data"]
    description[data["id"]] = data["description"]

In [None]:
# Print only keys that include the word 'landsat'
pprint([key for key in description.keys() if 'landsat' in key.lower()])

## Detailed Dataset Examination

Examining specific datasets more closely helps us understand their content and relevance to our project goals. We can now print the descriptions of the desired Landsat IDs.

In [None]:
print("1:", description["landsat-c2-l1"])
print('-------------------------------\n')
print("2:", description["landsat-c2-l2"])

## Selecting and Accessing Data

We want `"landsat-c2-l2"`, so we access that dataset directly and view its `metadata`, which holds key details for analysis and interpretation. Since the output is long, we'll use the HoloViz Panel library to wrap it in a scrollable element.

In [None]:
landsat_reader = stac_cat["landsat-c2-l2"]
landsat_metadata = landsat_reader.read().metadata

# View extensive metadata in scrollable block
json_pane = pn.pane.JSON(landsat_metadata, name='Metadata', max_height=400, sizing_mode='stretch_width', depth=-1, theme='light')
scrollable_output = pn.Column(json_pane, height=400, sizing_mode='stretch_width', scroll=True, styles={'background': 'lightgrey'})
scrollable_output

## Visual Preview

To get a visual preview of the dataset, particularly to check its quality and relevance, we read its `thumbnail` entry:

In [None]:
landsat_reader["thumbnail"].read()

## Accessing Geospatial Data Items

Once we have selected the appropriate dataset, the next step is to access the specific data items. These items typically represent individual data files or collections that are part of the dataset.

The following code retrieves a handle to the 'geoparquet-items' from the Landsat dataset, which are optimized for efficient geospatial operations and queries.

In [None]:
landsat_items = landsat_reader["geoparquet-items"]
landsat_items

## Converting Data for Analysis

To facilitate analysis, the following code selects the last few entries (`tail`) of the dataset, loads them as a GeoDataFrame (the tabular format used by GeoPandas), and then converts that GeoDataFrame into a STAC catalog that we can browse item by item.

In [None]:
cat = landsat_items.tail(output_instance="geopandas:GeoDataFrame").GeoDataFrameToSTACCatalog.read()

## Exploring Data Collections

After conversion, we explore the structure of the data collection. Each "item" in this collection corresponds to a set of assets, providing a structured way to access multiple related data files. We'll simply print the structure of the catalog to understand the available items and their organization.


In [None]:
cat
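
To make the "item with a set of assets" idea concrete, here is a minimal sketch of what a single STAC item looks like as a plain dictionary, together with a hypothetical `band_hrefs` helper that picks out the bands needed later. The item ID, date, and URLs are invented for illustration; real Landsat items contain many more assets and properties.

```python
# Made-up, trimmed STAC item: one scene with several named assets.
sample_item = {
    "type": "Feature",
    "id": "demo-scene",
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
    "assets": {
        "red": {"href": "https://example.com/demo-scene/red.tif"},
        "green": {"href": "https://example.com/demo-scene/green.tif"},
        "blue": {"href": "https://example.com/demo-scene/blue.tif"},
        "thumbnail": {"href": "https://example.com/demo-scene/thumb.png"},
    },
}


def band_hrefs(item, bands):
    """Map each requested band name to its asset href (hypothetical helper)."""
    missing = [band for band in bands if band not in item["assets"]]
    if missing:
        raise KeyError(f"Item {item['id']!r} has no assets named {missing}")
    return {band: item["assets"][band]["href"] for band in bands}


rgb_hrefs = band_hrefs(sample_item, ["red", "green", "blue"])
print(rgb_hrefs)
```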

## Accessing Sub-Collections

To dive deeper into the data, we access a specific sub-collection based on its key. This allows us to focus on a particular geographic area or time period. We'll select the first item in the catalog for now.

In [None]:
item_key = list(cat.entries.keys())[0]
subcat = cat[item_key].read()
subcat

## Reading Specific Data Bands

For detailed analysis, especially in remote sensing, accessing specific spectral bands is crucial. Here, we read the red spectral band, which is often used in vegetation analysis and other remote sensing applications.

In [None]:
subcat.red.read()

## Preparing for Multiband Analysis
To analyze true color imagery, we need to stack multiple spectral bands. Here, we prepare for this by setting up a band-stacking operation. Note that the assets might need to be signed again at this point, which is why the signer is passed once more.

In [None]:
catbands = cat[item_key].to_reader(reader="StackBands", bands=["red", "green", "blue"], signer=planetary_computer.sign_inplace)

## Loading and Visualizing True Color Imagery

After setting up the band-stacking, we read the multiband data and prepare it for visualization.

In [None]:
data = catbands.read(dim="band")
data
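
Conceptually, stacking along a new `band` dimension places the single-band rasters together in one array. The sketch below uses plain nested lists with made-up 2 by 2 values, and assumes the band axis comes first, so that it runs without any array library. The `to_pixels` helper is hypothetical and only illustrates how the stacked layout relates to per-pixel RGB triples.

```python
# Three tiny single-band "rasters" with made-up values, each 2 rows x 2 columns.
red = [[10, 20], [30, 40]]
green = [[11, 21], [31, 41]]
blue = [[12, 22], [32, 42]]

# Stacking adds a new leading axis: stacked[band][row][col].
stacked = [red, green, blue]


def to_pixels(bands):
    """Reorder (band, row, col) nested lists into (row, col, band) tuples."""
    n_rows, n_cols = len(bands[0]), len(bands[0][0])
    return [
        [tuple(band[r][c] for band in bands) for c in range(n_cols)]
        for r in range(n_rows)
    ]


rgb = to_pixels(stacked)
print(rgb[0][0])
```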

## Visualizing Data
Finally, we visualize the true color imagery. This visualization helps in assessing the quality of the data and the appropriateness of the bands used.

In [None]:
data.plot.imshow(robust=True, figsize=(10, 10))
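
The `robust=True` option matters because a few very bright pixels can otherwise wash out the whole image. Xarray documents this option as computing the color range from the 2nd and 98th percentiles instead of the extremes. The pure-Python sketch below illustrates that idea on made-up values; the `percentile` and `robust_stretch` helpers are written for this example and are not xarray functions.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def robust_stretch(values, low_q=2.0, high_q=98.0):
    """Rescale values to 0..1 between two percentiles, clipping the tails."""
    vmin = percentile(values, low_q)
    vmax = percentile(values, high_q)
    if vmax == vmin:
        return [0.0 for _ in values]
    return [min(1.0, max(0.0, (v - vmin) / (vmax - vmin))) for v in values]


# 99 moderate, made-up pixel values plus one very bright outlier.
pixels = list(range(100, 199)) + [60000]

# Min-max scaling lets the outlier squash everything else toward zero.
naive = [(v - min(pixels)) / (max(pixels) - min(pixels)) for v in pixels]

# Percentile-based scaling keeps the bulk of the values well spread out.
stretched = robust_stretch(pixels)
print(round(naive[50], 4), round(stretched[50], 4))
```

With min-max scaling, a mid-range pixel ends up almost black, while the percentile-based stretch places it near the middle of the displayable range.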

## Summary
As Earth science data becomes integrated with other types of data, a powerful approach is to utilize a general-purpose set of tools, including Intake and Xarray. Once you have accessed data, visualize it with hvPlot to ensure that it matches your expectations.



### What's next?
Now that we know how to access the data, it's time to proceed to analysis, where we will explore some simple machine learning approaches.


## Resources and references
Authored by Demetris Roumis and Andrew Huang circa April, 2024, with guidance from [Martin Durant](https://github.com/martindurant).

The banner image is a mashup of a Landsat 8 image from NASA and the Intake logo.
