15. Cloud-Native Data#
This page is still being written, but in the meantime, the Cloud Native Geo group hosts probably the definitive guide to cloud-optimised datasets.
15.1. “Cloud Native”?#
Don’t be scared off by the phrase ‘cloud-native’ - this isn’t some complex arcane magic like creating and hosting a dynamic STAC database. For the most part, it simply means choosing a different file format. When a file in this format is hosted at a web address (e.g. https://data_website.com/cloudoptimisedfile.tif), software (not only Python packages but even QGIS!) can stream it directly over the internet. Because of the way the file’s metadata is organised, we don’t have to download the entire file: the software downloads only the little bit of data we need. We made use of this in the ice velocity tutorial, when we pointed rioxarray at a very large ITS_LIVE mosaic of the whole of Greenland but downloaded only the section around Kangerlussuaq we wanted, in a matter of seconds!
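Here is a minimal sketch of that pattern with rioxarray - the URL is the placeholder from above, and the bounding-box coordinates are invented example values in the raster's CRS:

import rioxarray

# Any Cloud-Optimised GeoTIFF hosted over HTTP(S) works this way
url = "https://data_website.com/cloudoptimisedfile.tif"

# Opening is lazy: only the file's metadata is fetched at this point
da = rioxarray.open_rasterio(url)

# Clipping to a bounding box (in the raster's CRS) means only the
# overlapping portion of the file is ever downloaded
# (these coordinates are invented placeholders)
subset = da.rio.clip_box(minx=-34.0, miny=68.0, maxx=-32.0, maxy=69.0)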
If you plan on sharing your research data, this is of course a fantastic thing - immediate cloud accessibility with little-to-no effort on your end. However, even if you don’t, it’s still a great move, especially if you have relatively large files: cloud-native formats generally offer better compression and faster loading than legacy formats.
15.2. Recommended file types#
15.2.1. Vector Datasets: GeoParquet#
.shp/.gpkg → .geoparquet. I recommend moving across for anything beyond small files (tens of MBs and up): compression is extremely good, and spatial indexing makes for much faster read times. When I was playing with the PGC ArcticDEM index file, read-and-clip times went from ~1 minute to just seconds merely by changing the file format - and that was before spatial indexing was properly included in the format (prior to v1.1.0)!
import geopandas as gpd

# Load an example GeoDataFrame from any existing vector file
# (gpd.datasets.get_path was removed in GeoPandas 1.0,
#  so point this at a file of your own)
world = gpd.read_file("natural_earth_countries.shp")

# Write and read a GeoPackage
world.to_file("world.gpkg", layer="countries", driver="GPKG")
gdf = gpd.read_file("world.gpkg", layer="countries")

# Write and read a GeoParquet
world.to_parquet("world.parquet")
gdf = gpd.read_parquet("world.parquet")
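The bounding-box column added in GeoParquet 1.1 is what enables the read-and-clip speedup mentioned above. A small sketch of a filtered read, assuming GeoPandas >= 1.0 (the coordinates are placeholder values):

# Read only the features intersecting a bounding box; with a
# GeoParquet 1.1 file, rows outside the box can be skipped entirely
# (bbox argument available from GeoPandas 1.0; placeholder coordinates)
subset = gpd.read_parquet("world.parquet", bbox=(-10.0, 50.0, 2.0, 60.0))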
15.2.2. Raster Datasets: Cloud-Optimised GeoTIFFs#
.tif → , well, .tif. No excuse for not using a cloud-optimised format here: better compression and faster read times regardless of whether you’re reading from online or offline sources.
import rioxarray

# Open a raster
da = rioxarray.open_rasterio("example.tif")

# Write as a normal GeoTIFF with ZSTD compression
da.rio.to_raster(
    "example_normal.tif",
    compress="ZSTD",
)

# Write as a Cloud-Optimised GeoTIFF with ZSTD compression
da.rio.to_raster(
    "example_cog.tif",
    driver="COG",
    compress="ZSTD",
)
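Reading the result back lazily is just as easy. A minimal sketch, assuming Dask is installed (the chunks argument returns a lazily evaluated, chunked array):

# Open the COG lazily in 512x512 chunks (requires Dask);
# no pixel data is read until it is actually accessed
cog = rioxarray.open_rasterio("example_cog.tif", chunks={"x": 512, "y": 512})

# Slicing and computing reads only the internal tiles that overlap
window = cog.isel(x=slice(0, 512), y=slice(0, 512)).compute()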
15.2.3. Multidimensional Datasets: Zarr#
.nc/.h5 → .zarr. Possibly the biggest advantage here, as .zarr makes compression straightforward: netCDF4 does support compression, but it’s rarely enabled in practice, and I’ve seen plenty of examples of real science data being shared as .mat files instead - I don’t blame them, as the compression savings can be significant! .zarr is the future here, although I have yet to fully use it in my own research. It can reportedly be loaded in GUI software such as QGIS, but I still need to confirm for myself how mature that support is.
import xarray as xr

# Create an example dataset (xr.tutorial downloads a sample file;
# requires the pooch package)
ds = xr.tutorial.open_dataset("air_temperature").isel(time=slice(0, 10))

# Write and read NetCDF
ds.to_netcdf("air_temperature.nc")
ds_nc = xr.open_dataset("air_temperature.nc")

# Write and read Zarr
ds.to_zarr("air_temperature.zarr", mode="w")
ds_zarr = xr.open_zarr("air_temperature.zarr")
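To make the compression point above concrete, here is a sketch of enabling it explicitly in both formats. It assumes the variable is named "air" (as in the tutorial dataset) and the zarr-python v2 / numcodecs API on the Zarr side:

import numcodecs

# netCDF4 *can* compress - it just has to be asked explicitly
ds.to_netcdf(
    "air_temperature_compressed.nc",
    encoding={"air": {"zlib": True, "complevel": 4}},
)

# Zarr with an explicit ZSTD compressor (zarr-python v2 / numcodecs API;
# the encoding keyword changed in zarr-python v3)
ds.to_zarr(
    "air_temperature_zstd.zarr",
    mode="w",
    encoding={"air": {"compressor": numcodecs.Blosc(cname="zstd", clevel=3)}},
)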