Raster indicator conventions

Raster data

Raster data comprises a matrix of cells, or pixels, organized into a grid of rows and columns where each cell contains a value representing information such temperature, precipitation, wind speed etc. More generally each pixel can itself represent an array of values and the structure becomes a multi-dimensional array. Raster data can arise:
- from the data products of Earth Observing (EO) instruments with pixel-based sensor arrays, or
- from models that discretize space into pixels (e.g. finite element models), or
- from data derived from other rasterized data.

Why raster data

Hazard indicators very commonly make use of EO datasets and the outputs of climate models, hence the prevalence of raster data. It is important also to note that the pixels are typically square (in some coordinate reference system), as opposed to hexagonal etc: if the intent is to make data available without re-interpolation, handing such square pixels is important. Data that is not raster tends to be that defining a region e.g. a flooded area or a protected area; such data can be left as shapes or can also be converted to rasters (which may be convenient if such data is relatively rare).

Compressed, chunked N-dimensional arrays

Raster data sets can be large. For example the circumference of the Earth is 40,075 km, so a global raster with 1 km resolution at the Equator may already have something like 40,000 × 20,000 pixels — and note that some flood model data sets are 5 m or even 1 m resolution. For this reason, raster data is often chunked and compressed. ‘Chunking’ means that the raster grid is split into smaller grids and these smaller grids are compressed. This is efficient if the intent is to access one specific region without having to read and decompress the entire data set. The problem of dealing with raster data is often then one of dealing with compressed, chunked, N-dimensional arrays.

Zarr format

In Physrisk, hazard indicator raster data is stored using the Zarr format (see https://zarr.readthedocs.io/en/stable/). Cloud Optimized GeoTIFF (https://cogeo.org/) is another format with much the same facility for dealing with chunked, compressed N-dimensional data and indeed there are other choices. Zarr was ultimately chosen for its ability to store natively hierarchical multidimensional array data in a cloud-optimized way and for the flexibility of its back-ends stores. Notably, Zarr chunks are stored in separate files or objects which means that Zarr arrays can be written in parallel.

In the next sections, we discuss the conventions for storing raster hazard indicator data.

Overview of convention

Note: in the following, the terminology is that of Xarray data structure documentation; see https://docs.xarray.dev/en/latest/user-guide/data-structures.html.

Three-dimensional arrays

Hazard indicators are stored as three-dimensional arrays. The array dimensions are \((z, y, x)\) where \(y\) is the spatial \(y\) dimension (‘latitude’ or ‘y’), \(x\) is the spatial \(x\) dimension (‘longitude’ or ‘x’) and \(z\) is a non-spatial dimension. This dimension can be given a descriptive name, for example ‘return_period’ or ‘threshold’ but by default it is called ‘index’ and we call it the ‘index’ [1] dimension henceforth.

This index dimension takes on different meanings according to the type of data being stored, commonly:

Return periods, e.g. in the case of acute hazards. For example the coordinate labels might be 5, 50, 100, 200 years, the indicator value being be maximum sustained wind speed for that return period.
Threshold values, e.g. in the case of chronic hazards. For example the coordinate labels might be 20°, 30°, 40°, 50°, the indicator value being the mean number of days with maximum daily temperature greater than the threshold.

Climate scenario and projected time (i.e. future year) are not dimensions, but rather different data sets. The rationale for this is discussed later.

Attributes versus Xarray convention

Xarray has its own convention for saving Zarr data, which is used when calling the to_zarr method. In this convention, the Xarray-type coordinates (we could also say ‘NetCDF-style’) are saved as separate arrays. In other words, the array is actually a Zarr group, with arrays defining the coordinates and one for the data itself, i.e. a structure like:

/path/to/array/
├── return_period
│   └── 0
├── latitude
│   └── 0
├── longitude
│   └── 0
├── indicator
│   └── 0.0.0
│   └── 0.0.1
...

In this case, the index dimension is called ‘return_period’. The coordinate arrays return_period, latitude and longitude are stored explicitly. indicator contains the actual data. This approach is explicit, but there are downsides, especially for large rasters.

In Physrisk, mainly driven by performance considerations, special attributes are added to the indicator array [2] to provide the same information. The conventions can be combined. That is, as long as the indicator array has these attributes it can be read stand-alone by Physrisk, but the array can still be read ‘natively’ by Xarray.

The labels of the index dimension are given by the <index dimension name>_values attribute of the hazard indicator array, e.g. return_period_values. A display name of the dimension can be given by index dimension name>_name, e.g. return_period_name.

For example, in the first case above return_period_values is given by the array [1, 50, 100, 200]. and return_period_name is return period. In addition, the units of the index co-ordinates can be specified in the

The spatial coordinates are defined by a:
- Coordinate reference system
- Affine transform (see, e.g. https://affine.readthedocs.io/en/latest/#affine)

The coordinates of the array are specified by the set of attributes:

crs: the coordinate reference system (e.g. "EPSG:4326").
dimensions: the dimension names. The order is important the order is ["<index dimension name>", "<spatial y dimension name>, "spatial x dimension name>"] e.g. ["return_period", "latitude", "longitude"]
transform_mat3x3: an array [a, b, c, d, e, f] specifying the affine transform, following the conventions of the Affine library. This transforms points specified in the coordinate reference system to points in pixel space.
<index dimension name>_name: name of the index coordinate: the non-spatial dimension.
<index dimension name>_values: the labels corresponding to each of the integer indices of the non-spatial dimension.
<index dimension name>_units: the units of the index dimension co-ordinate values.

The Zarr default is to use the ‘C’ ordering convention, i.e. the value of array data at data[k, j, i] is at address \(i + N_i \times j + N_i \times N_j \times k\). That is, value [i, j, k] and [i, k, k + 1] are adjacent in memory. This tends to give the best compression ratios in various important cases, for example flood depths at different return periods for a region with zero flood depth below a certain threshold — large regions of the chunk will be zero.

Relationship with Xarray

XArray xarray.DataArray defines a property coords: a dict-like container of arrays (coordinates) that label each point (e.g., one-dimensional arrays of numbers, datetime objects or strings). That is, the coordinates of the three dimensions are given by one-dimensional arrays. This means that to create an xrray.DataArray, one must create these arrays from the Zarr attributes. This is currently done in the hazard repo, although the functionality is sufficiently useful that it will be added to Physrisk also.

See https://github.com/os-climate/hazard/blob/main/src/hazard/utilities/xarray_utilities.py

Problems with using the Xarray Zarr convention in Physrisk

As mentioned above, Xarray provides native support for writing to Zarr and it is always desirable to follow existing conventions! However, the xrray.DataArray is written as a Zarr group with the coordinates as separate Zarr arrays. That is the co-ordinates are expanded in full and must be read prior to indexing the Zarr array. Moreover, if an affine transform is given, the indices of a pixel can be calculated directly from this transform: it is not necessary to perform a look-up from the coordinate arrays (presumably by bisection). At time of writing we have the convention that a hazard indicator array is a plain Zarr array (i.e. not a group), with the additional attributes above, but we allow both conventions to be used (and try to make it straight forward to convert from one format to another).

An aside on ‘C’-like ordering

[Numpy arrays can help to clarify the point]

[22]:

import numpy as np

print("Numpy also has 'C' ordering by default")
a = np.arange(24)
b = a.reshape([2, 3, 4])
print(b)
print(
    f"Element [0, 2, 3] = {b[0, 2, 3]} is next to element [0, 2, 2] = {b[0, 2, 2]} in memory"
)

Numpy also has 'C' ordering by default
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Element [0, 2, 3] = 11 is next to element [0, 2, 2] = 10 in memory

Performance

In physical climate risk calculations a common use case is that we have a portfolio of assets and we want to retrieve hazard indicator values for each asset location, perhaps with some buffer or other shape around the location, but perhaps treating assets as point-like. How well do chunked, compressed arrays do in such cases? First of all, the performance depends on two important factors:
- how the data is chunked, and
- how the data is stored.
In the Physrisk reference application, data is stored in Amazon S3 using the S3Map as a store (see https://zarr.readthedocs.io/en/stable/api/storage.html). We generally find this to be good enough for typical needs. To expand on ‘typical’, a large portfolio example might have somewhere in the range 100,000—1,000,000 assets, and it is desirable to retrieve the data in order 10s of seconds. But what are the performance
bottlenecks here?

As an aside, it should be emphasized that there are other use-cases to consider, in particular performing operations on the entire data set, perhaps using Dask, rather than looking up points. There are also considerations other than performance, in particular the convenience and cost-effectiveness of S3. But to come back to the performance trade-off in more detail, consider what is happening when a request for 1 million assets is made:

Using the affine transform the required pixels are identified together with the Zarr chunks containing the pixels. This is fast.
The compressed chunks are downloaded from S3. Thanks to the async functionality of S3FileSystem, the chunks are downloaded in parallel, but this can still be a time-consuming step. Chunk size affects the performance. If the chunks are too large then too much unnecessary data is transferred (for this specific use-case); too small and compression ratios can decrease, overall transfer speed may decrease and the number of chunks becomes large (more on this below). By default Physrisk uses, 1000×1000 spatial pixels in each chunk. An important point to note is that for physical risk calculations usually all values along the ‘index’ dimension are required. Therefore these are always in the same chunk. Also, note that this step depends on the connection speed of the compute instance (e.g. Pod) to the S3 storage. For example, when connecting to S3 from a laptop using a home internet connection halfway across the world transfer times will be slow compared to to a Pod hosted on EC2 in the same AWS Region (as for the Physrisk reference application). There is therefore a benefit in using the Physrisk API to access hazard data rather than getting this directly from S3.
The chunks are decompressed and the specific pixel values are looked up. The decompression can also be somewhat costly in time.

If this default set up does not meet the performance needs, what can be done?

Choice of store

Aside from adjusting chunk size, performance can be improved by moving from S3 storage to some form of SSD-backed storage, using a different Zarr store, and thereby decreasing the time for step 2 above. Most simply, DirectoryStore could be used, but another common approach is to use a database to hold Zarr chunks binary data. Zarr stores exist for a range of databases like LMDB, SQLite and MongoDB, plus distributed in-memory stores such as Redis. As an aside, it is interesting to note that TileDB is another example of a database that indexes compressed, chunked multi-dimensional raster data.

Benchmarking is a challenge as times depend on factors such as the clustering of the assets, sparseness of the data and resolution. To give an idea, we queried points across Europe for a high resolution flood data set (5m resolution). The data was in chunks of 6×1000×1000 (there being 6 return periods). A DirectoryStore was used with the data on SSD. The time to retrieve 1,000,000 points was about 25 seconds.

Move away from chunked, compressed data entirely

The Zarr storage of compressed, chunked data works well across a number of use-cases, but if the intent is solely to look up data from a spatial index as quickly as possible, then a database keyed on a spatial index might be preferred, albeit at the expense of other features. Members of the Physrisk project have been investigating this using H3-based indexing and databases such as DuckDB with promising results.

When a single file is desirable

It can be inefficient to store a large number of small chunks as separate files or objects when using cloud storage such as S3, and indeed typical file systems. That is, it can be preferable to store objects at more coarse granularity than chunks. Cloud Optimized GeoTIFFs take this to an extreme, in that a single object is chunked. However, while this can be read in parallel, it cannot be written in parallel in S3. Zarr similarly has a ZipStore with the same restriction.

The Zarr Sharding Codec, in development, might ultimately be the best solution. For now in Physrisk, should number of chunk objects become an issue, use of ZipStore is preferred at the expense of writing chunks in parallel. It is most likely that this comes up in cases where Zarr arrays are used for map tiles, a subject we will come on to.