Streaming Data with the Open Science Data Federation (OSDF)

Have you ever wondered why accessing large-scale scientific data is so complicated, while streaming from a huge catalog of movies on Netflix is so simple?

Each data repository has its own website or a set of unique tools for accessing data. Users are often encouraged to download datasets locally and then do local computations, as repositories prioritize long-term storage and preservation rather than fast or distributed access.

How does Netflix do it without making you download the whole movie ahead of time? They leverage a content distribution network (CDN), which caches copies of the most popular movies at opportune locations on the Internet closer to users. They also let you stream your favorite shows so you can start watching while later sections of the show are still downloading.

The OSDF, an NSF-funded infrastructure providing a CDN for science, makes this kind of streaming possible for scientific data. It is connected to popular open science repositories and has hardware embedded across US and international networks and at large computing sites.

This cookbook provides examples of using OSDF streaming to power science use cases in the earth sciences.

Do First, Understand Later

How Do You Use the OSDF?

The service is powered by the same protocol as the web, HTTPS. Thus, the simplest use case is to download an object by using the browser.

Click on this link:

https://osdf-director.osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt

If a new tab opened with the text “Hello, World” – congratulations, you used the OSDF!
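
The same request can be made from a script. The following is a minimal sketch using only Python's standard library; the helper name `fetch_text` and its `open_url` parameter are illustrative and not part of any OSDF tooling.

```python
import urllib.request

# URL taken from the link above.
TEST_URL = (
    "https://osdf-director.osg-htc.org"
    "/ospool/uc-shared/public/OSG-Staff/validation/test.txt"
)


def fetch_text(url, open_url=urllib.request.urlopen):
    """Download an object over HTTPS and return its body as text.

    `open_url` is an illustrative hook so the function can be exercised
    without network access; by default it is the standard urlopen, which
    follows any HTTP redirects automatically.
    """
    with open_url(url) as response:
        return response.read().decode("utf-8")


# print(fetch_text(TEST_URL))  # uncomment to run (needs network access)
```

Because the OSDF speaks plain HTTPS, any HTTP client should work the same way; nothing here is specific to Python.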

The OSDF is often used within computing workflows, where downloads happen as part of a script. For this, there is a command-line client, pelican. Try running the following:

pelican object get osdf:///routeviews/chicago/route-views.chicago/bgpdata/2025.03/RIBS/rib.20250319.0400.bz2 ./

rib.20250319.0400.bz2 23.03 MiB / 72.45 MiB [=====>--------------] 0s ] 0.00 b/s

Depending on the speed of your Internet connection, you may see a progress bar as the download proceeds.

Congratulations, you’re now the proud owner of about 72 MiB of Internet routing data!
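
Since downloads often happen inside a script, here is a hedged sketch of calling the same command from Python. It assumes the pelican client is installed and on your PATH; the function names `pelican_get_command` and `download` are made up for this illustration.

```python
import shutil
import subprocess

# Object name copied from the command above.
RIB_URL = (
    "osdf:///routeviews/chicago/route-views.chicago/bgpdata/"
    "2025.03/RIBS/rib.20250319.0400.bz2"
)


def pelican_get_command(source, destination="./"):
    """Build the argument list for `pelican object get SOURCE DESTINATION`."""
    return ["pelican", "object", "get", source, destination]


def download(source, destination="./"):
    """Run the pelican client; raises if it is missing or exits non-zero."""
    if shutil.which("pelican") is None:
        raise FileNotFoundError("the pelican client is not on PATH")
    subprocess.run(pelican_get_command(source, destination), check=True)


# download(RIB_URL)  # uncomment to fetch the object (needs network access)
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems if an object name ever contains spaces.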

For both of these cases, we downloaded the entire object. What happens if the dataset contains data for the entire planet but you are only interested in the state of Nebraska? It’s more effective to stream the subset. For that, we will use the Pelican Python library; this library will be used throughout the remaining chapters of this cookbook.
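
One way to picture streaming a subset is an HTTP range request, which asks the server for only the bytes you need. The sketch below uses the Python standard library rather than the Pelican Python library, and the helper names are illustrative. Whether a particular OSDF server honors the `Range` header is an assumption here, so the code checks for the 206 (Partial Content) status before trusting the result.

```python
import urllib.request


def range_header(start, length):
    """Return the HTTP header asking for `length` bytes from offset `start`.

    HTTP byte ranges are inclusive on both ends, hence the `- 1`.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    return {"Range": f"bytes={start}-{start + length - 1}"}


def read_range(url, start, length, open_url=urllib.request.urlopen):
    """Fetch only part of an object instead of downloading all of it.

    Assumption: the server supports range requests. A server that does
    answers with status 206; one that does not may answer 200 and send
    the whole object, so that case is rejected here.
    """
    request = urllib.request.Request(url, headers=range_header(start, length))
    with open_url(request) as response:
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        return response.read()
```

For example, `range_header(100, 50)` asks for bytes 100 through 149. In practice you rarely compute offsets by hand; file-like interfaces issue reads of this kind for you as an analysis touches different parts of an object.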

About the OSDF

You don’t need to know how Netflix is built to press “play”. Similarly, you don’t need to understand the guts of the OSDF to use it in your science. However, a few key concepts are useful!

OSDF Infrastructure: The map below shows the distributed pieces of the OSDF:

Each “O” on the map is an origin; the origin service connects an existing repository to the OSDF service, making some datasets available and protecting the repository from overload. Origins are typically placed near where the data lives; for example, the origin for the NCAR Geoscience Data Exchange (GDEX) is in the same datacenter as the GDEX.

Each “C” is a cache. The cache makes temporary copies of objects upon access so, on subsequent accesses, the object comes from the cache and not from the repository. This reduces the load on the repository and, ideally, increases scalability.
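
The effect of a cache on repository load can be illustrated with a toy model. This is not how Pelican's caches are implemented (real caches also evict copies over time, which this sketch never does), and the object name is hypothetical; it only shows that repeated reads reach the origin once.

```python
class ToyCache:
    """Toy model of a cache in front of an origin (not Pelican's code)."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin
        self._copies = {}
        self.origin_requests = 0

    def get(self, name):
        if name not in self._copies:
            # Cache miss: go to the origin once and keep a copy.
            self.origin_requests += 1
            self._copies[name] = self._fetch_from_origin(name)
        # Cache hit (or freshly filled): serve the stored copy.
        return self._copies[name]


# A stand-in origin holding one hypothetical object.
origin = {"/example/data.nc": b"payload"}
cache = ToyCache(origin.__getitem__)

for _ in range(3):
    cache.get("/example/data.nc")

print(cache.origin_requests)  # prints 1: three reads, one origin request
```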

Unified Namespace: The OSDF provides a unified namespace for all available objects. Each repository receives a unique prefix (the IceCube experiment’s data is available from /icecube; NCAR’s GDEX is available from /ncar/gdex) and the object can be referenced from within the prefix.

From our RouteViews example above, we were interested in accessing the object named chicago/route-views.chicago/bgpdata/2025.03/RIBS/rib.20250319.0400.bz2. Since the prefix for RouteViews is /routeviews, the entire OSDF name is:

osdf:///routeviews/chicago/route-views.chicago/bgpdata/2025.03/RIBS/rib.20250319.0400.bz2
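
The naming rule can be sketched in a few lines of Python. The helper names are illustrative. The three-slash `osdf:///` form matches the pelican command shown earlier (the host part is empty), and the mapping from an OSDF name to an HTTPS URL at the director is an assumption inferred from the browser link at the top of this notebook.

```python
from urllib.parse import urlsplit

# Director host taken from the browser link earlier in this notebook.
DIRECTOR = "https://osdf-director.osg-htc.org"


def osdf_url(prefix, object_name):
    """Join a repository prefix and an object name into an OSDF URL."""
    path = "/" + prefix.strip("/") + "/" + object_name.lstrip("/")
    return "osdf://" + path


def director_url(url):
    """Assumed mapping from an osdf:// URL to an HTTPS URL at the director."""
    parts = urlsplit(url)
    if parts.scheme != "osdf":
        raise ValueError(f"not an OSDF URL: {url}")
    return DIRECTOR + parts.path


name = "chicago/route-views.chicago/bgpdata/2025.03/RIBS/rib.20250319.0400.bz2"
print(osdf_url("/routeviews", name))
```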

“Objects” vs “Files”: You may have noticed that this notebook refers to downloading/streaming “objects” instead of “files”. What’s the difference, and why does OSDF bother making this distinction?

Both objects and files are ways for computers to store data, and in practice, the earth science calculations in this cookbook use them the same way—regardless of where the data comes from.

The key difference is the way we typically think about accessing or retrieving that data: When you open files like Word documents or images on your computer, you probably click through folders or directories to find them. But when you’re working with data over the internet, that folder-based structure doesn’t always apply.

In the OSDF, an object is simply a piece of data that can be shared, like a file—but without needing to think about where it’s stored or how it’s organized on someone else’s computer. “Object” is a more flexible term that works better when data is stored in large systems across many locations.

Finding My Objects

How do you find the object you’re interested in?

Typically, dataset providers connected to the OSDF provide a search, data catalog, or STAC catalog publishing OSDF-style URLs. You can determine this from the provider’s website; additionally, OSDF maintains a list of known links you can peruse.

Explore this cookbook: This cookbook provides examples of how to use the OSDF to access:

NCAR’s Geoscience Data Exchange.
AWS’s OpenData Program.
The Envistor platform at Florida International University.

Summary

In this notebook, we gave a basic introduction to the OSDF and Pelican.

What’s next?

In the next notebook, we will give a more in-depth overview of PelicanFS, the Pelican FSSpec client, along with a few usage examples.
