# Kerchunk and Pangeo-Forge



## Overview

In this tutorial, we use `pangeo-forge-recipes`, an open-source ETL pipeline, to generate Kerchunk references.

Pangeo-Forge is a community project to build reproducible, cloud-native ARCO (Analysis-Ready, Cloud-Optimized) datasets. Its Python library, `pangeo-forge-recipes`, is the ETL pipeline that processes these dataset "recipes". While the majority of recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes.

Note that `Kerchunk` can be used independently of `pangeo-forge-recipes`; in this example, `pangeo-forge-recipes` acts as the runner for `Kerchunk`.



## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |
| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |
| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |

- **Time to learn**: 45 minutes
---

## Getting to Know The Data

`gridMET` is a high-resolution daily meteorological dataset covering CONUS (the contiguous United States) from 1979 to 2023. It is produced by the Climatology Lab at UC Merced. In this example, we create a virtual Zarr dataset of a derived variable, Burning Index (`burning_index_g`).

### Examine a Single File

In [None]:
import xarray as xr

ds = xr.open_dataset(
    "http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc"
)

#### Plot the Dataset

In [None]:
ds.sel(day="2021-08-01").burning_index_g.plot()

## Create a File Pattern

To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which describes all of our input URLs. The dataset spans 1979 through 2023, with one file per year.

To speed up this example, we `prune` the pattern so that it keeps only the first two entries in the `FilePattern`.

In [None]:
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

years = list(range(1979, 2022 + 1))


time_dim = ConcatDim("time", keys=years)


def format_function(time):
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"


pattern = FilePattern(format_function, time_dim, file_type="netcdf4")


pattern = pattern.prune()

pattern
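
To make the effect of the format function and of pruning concrete, here is a minimal plain-Python sketch that expands the same URL template without `pangeo-forge-recipes`. The names `urls` and `pruned_urls` are illustrative only, and the slice is an assumption that mirrors the "first two entries" behavior described above rather than a call into the library.

```python
# Plain-Python sketch of the URL expansion; pangeo-forge-recipes is not needed here.
years = list(range(1979, 2022 + 1))


def format_function(time):
    # Same URL template as the FilePattern cell above
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"


# One URL per year, in the order the keys were given
urls = [format_function(year) for year in years]

# Assumption: keeping the first two entries mimics what pruning does to the pattern
pruned_urls = urls[:2]

print(f"{len(urls)} input files in total")
for url in pruned_urls:
    print(url)
```

Working with two files keeps the example fast while still exercising the step that combines references from more than one file.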

## Create a Location For Output
We write to local storage for this example, but the reference file could also be shared via cloud storage. 

In [None]:
target_root = "references"
store_name = "Pangeo_Forge"
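
As a rough sketch of where the output is expected to land, the snippet below joins the two settings with the output file name used in the pipeline further down (`reference.json`). The layout `target_root/store_name/output_file_name` is an assumption about how the writer arranges its output, and `expected_path` is an illustrative name, so check the directory after the pipeline has run.

```python
from pathlib import PurePosixPath

target_root = "references"
store_name = "Pangeo_Forge"
output_file_name = "reference.json"  # matches the value passed to the pipeline below

# Assumed layout: <target_root>/<store_name>/<output_file_name>
expected_path = PurePosixPath(target_root) / store_name / output_file_name
print(expected_path)
```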

## Build the Pangeo-Forge Beam Pipeline

Next, we chain together several transforms to create a Pangeo-Forge Apache Beam pipeline.
Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be run in the following cell.

The steps are as follows:
1. Create a starting collection from our input file pattern.
2. Pass the items of that file pattern to `OpenWithKerchunk`, which creates references for each file.
3. Combine the per-file references into a single reference file and write it with `WriteCombinedReference`.

Just like Kerchunk, you can specify the reference file type as either `.json` or `.parquet`.

Note: You can add additional processing steps to this pipeline.
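
To illustrate how `|` chaining composes steps, and how an extra step slots in between two others, here is a toy sketch in plain Python. It is not how Apache Beam is implemented; the `Step` class and the three step names are hypothetical and exist only for this illustration.

```python
class Step:
    """Toy processing step; chaining with | builds a new composed step."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run this step first, then feed its output to the next step
        return Step(lambda items: other.func(self.func(items)))

    def __call__(self, items):
        return self.func(items)


# Hypothetical stand-ins for "open each file", "extra processing", and "combine"
open_step = Step(lambda names: [f"refs({name})" for name in names])
extra_step = Step(lambda refs: [ref.upper() for ref in refs])
combine_step = Step(lambda refs: " + ".join(refs))

# The extra step is inserted simply by adding it to the chain
toy_pipeline = open_step | extra_step | combine_step
result = toy_pipeline(["bi_1979.nc", "bi_1980.nc"])
print(result)
```

The point of the design is that each step only needs to know its own input and output, so steps can be added or removed without touching their neighbors.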


In [None]:
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenWithKerchunk, WriteCombinedReference

transforms = (
    # Create a beam PCollection from our input file pattern
    beam.Create(pattern.items())
    # Open with Kerchunk and create references for each file
    | OpenWithKerchunk(file_type=pattern.file_type)
    # Use Kerchunk's `MultiZarrToZarr` functionality to combine and
    # then write references. Note: Setting the correct concat_dims
    # and identical_dims is important.
    | WriteCombinedReference(
        target_root=target_root,
        store_name=store_name,
        output_file_name="reference.json",
        concat_dims=["day"],
        identical_dims=["lat", "lon", "crs"],
    )
)

In [None]:
%%time

with beam.Pipeline() as p:
    p | transforms
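
For orientation, a Kerchunk JSON reference file maps Zarr keys either to inline metadata strings or to `[url, offset, length]` triples that point at byte ranges in the original files. The sketch below builds a tiny reference dictionary by hand to show that layout. The chunk key, byte offset, and byte length are hypothetical values chosen for illustration and are not taken from the real output of this pipeline.

```python
import json

toy_reference = {
    "version": 1,
    "refs": {
        # Inline metadata is stored as a JSON string
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Chunk reference: [source URL, byte offset, byte length]
        "burning_index_g/0.0.0": [
            "http://www.northwestknowledge.net/metdata/data/bi_1979.nc",
            8192,  # hypothetical byte offset
            4096,  # hypothetical byte length
        ],
    },
}

# Chunk references are the entries whose value is a list rather than a string
chunk_refs = {
    key: value
    for key, value in toy_reference["refs"].items()
    if isinstance(value, list)
}

serialized = json.dumps(toy_reference, indent=2)
print(serialized)
```

Because only these small pointers are stored, the reference file stays far smaller than the NetCDF files it describes, and the original data is never copied.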