Raster indicator conventions
Raster data
Why raster data
Hazard indicators very commonly make use of EO datasets and the outputs of climate models, hence the prevalence of raster data. It is important also to note that the pixels are typically square (in some coordinate reference system), as opposed to hexagonal etc: if the intent is to make data available without re-interpolation, handing such square pixels is important. Data that is not raster tends to be that defining a region e.g. a flooded area or a protected area; such data can be left as shapes or can also be converted to rasters (which may be convenient if such data is relatively rare).
Compressed, chunked N-dimensional arrays
Raster data sets can be large. For example the circumference of the Earth is 40,075 km, so a global raster with 1 km resolution at the Equator may already have something like 40,000 × 20,000 pixels — and note that some flood model data sets are 5 m or even 1 m resolution. For this reason, raster data is often chunked and compressed. ‘Chunking’ means that the raster grid is split into smaller grids and these smaller grids are compressed. This is efficient if the intent is to access one specific region without having to read and decompress the entire data set. The problem of dealing with raster data is often then one of dealing with compressed, chunked, N-dimensional arrays.
Zarr format
In Physrisk, hazard indicator raster data is stored using the Zarr format (see https://zarr.readthedocs.io/en/stable/). Cloud Optimized GeoTIFF (https://cogeo.org/) is another format with much the same facility for dealing with chunked, compressed N-dimensional data and indeed there are other choices. Zarr was ultimately chosen for its ability to store natively hierarchical multidimensional array data in a cloud-optimized way and for the flexibility of its back-ends stores. Notably, Zarr chunks are stored in separate files or objects which means that Zarr arrays can be written in parallel.
In the next sections, we discuss the conventions for storing raster hazard indicator data.
Overview of convention
Note: in the following, the terminology is that of Xarray data structure documentation; see https://docs.xarray.dev/en/latest/user-guide/data-structures.html.
Three-dimensional arrays
Hazard indicators are stored as three-dimensional arrays. The array dimensions are \((z, y, x)\) where \(y\) is the spatial \(y\) dimension (‘latitude’ or ‘y’), \(x\) is the spatial \(x\) dimension (‘longitude’ or ‘x’) and \(z\) is a non-spatial dimension. This dimension can be given a descriptive name, for example ‘return_period’ or ‘threshold’ but by default it is called ‘index’ and we call it the ‘index’ [1] dimension henceforth.
This index dimension takes on different meanings according to the type of data being stored, commonly:
Return periods, e.g. in the case of acute hazards. For example the coordinate labels might be 5, 50, 100, 200 years, the indicator value being be maximum sustained wind speed for that return period.
Threshold values, e.g. in the case of chronic hazards. For example the coordinate labels might be 20°, 30°, 40°, 50°, the indicator value being the mean number of days with maximum daily temperature greater than the threshold.
Climate scenario and projected time (i.e. future year) are not dimensions, but rather different data sets. The rationale for this is discussed later.
Attributes versus Xarray convention
Xarray has its own convention for saving Zarr data, which is used when calling the to_zarr method. In this convention, the Xarray-type coordinates (we could also say ‘NetCDF-style’) are saved as separate arrays. In other words, the array is actually a Zarr group, with arrays defining the coordinates and one for the data itself, i.e. a structure like:
/path/to/array/
├── return_period
│ └── 0
├── latitude
│ └── 0
├── longitude
│ └── 0
├── indicator
│ └── 0.0.0
│ └── 0.0.1
...
In this case, the index dimension is called ‘return_period’. The coordinate arrays return_period, latitude and longitude are stored explicitly. indicator contains the actual data. This approach is explicit, but there are downsides, especially for large rasters.
In Physrisk, mainly driven by performance considerations, special attributes are added to the indicator array [2] to provide the same information. The conventions can be combined. That is, as long as the indicator array has these attributes it can be read stand-alone by Physrisk, but the array can still be read ‘natively’ by Xarray.
The labels of the index dimension are given by the <index dimension name>_values attribute of the hazard indicator array, e.g. return_period_values. A display name of the dimension can be given by index dimension name>_name, e.g. return_period_name.
For example, in the first case above return_period_values is given by the array [1, 50, 100, 200]. and return_period_name is return period. In addition, the units of the index co-ordinates can be specified in the
The coordinates of the array are specified by the set of attributes:
crs: the coordinate reference system (e.g."EPSG:4326").dimensions: the dimension names. The order is important the order is["<index dimension name>", "<spatial y dimension name>, "spatial x dimension name>"]e.g.["return_period", "latitude", "longitude"]transform_mat3x3: an array[a, b, c, d, e, f]specifying the affine transform, following the conventions of the Affine library. This transforms points specified in the coordinate reference system to points in pixel space.<index dimension name>_name: name of the index coordinate: the non-spatial dimension.<index dimension name>_values: the labels corresponding to each of the integer indices of the non-spatial dimension.<index dimension name>_units: the units of the index dimension co-ordinate values.
The Zarr default is to use the ‘C’ ordering convention, i.e. the value of array data at data[k, j, i] is at address \(i + N_i \times j + N_i \times N_j \times k\). That is, value [i, j, k] and [i, k, k + 1] are adjacent in memory. This tends to give the best compression ratios in various important cases, for example flood depths at different return periods for a region with zero flood depth below a certain threshold — large regions of the chunk will be zero.
Relationship with Xarray
XArray xarray.DataArray defines a property coords: a dict-like container of arrays (coordinates) that label each point (e.g., one-dimensional arrays of numbers, datetime objects or strings). That is, the coordinates of the three dimensions are given by one-dimensional arrays. This means that to create an xrray.DataArray, one must create these arrays from the Zarr attributes. This is currently done in the hazard repo, although the functionality is sufficiently useful that it will
be added to Physrisk also.
See https://github.com/os-climate/hazard/blob/main/src/hazard/utilities/xarray_utilities.py
Problems with using the Xarray Zarr convention in Physrisk
As mentioned above, Xarray provides native support for writing to Zarr and it is always desirable to follow existing conventions! However, the xrray.DataArray is written as a Zarr group with the coordinates as separate Zarr arrays. That is the co-ordinates are expanded in full and must be read prior to indexing the Zarr array. Moreover, if an affine
transform is given, the indices of a pixel can be calculated directly from this transform: it is not necessary to perform a look-up from the coordinate arrays (presumably by bisection). At time of writing we have the convention that a hazard indicator array is a plain Zarr array (i.e. not a group), with the additional attributes above, but we allow both conventions to be used (and try to make it straight forward to convert from one format to another).
An aside on ‘C’-like ordering
[Numpy arrays can help to clarify the point]
[22]:
import numpy as np
print("Numpy also has 'C' ordering by default")
a = np.arange(24)
b = a.reshape([2, 3, 4])
print(b)
print(
f"Element [0, 2, 3] = {b[0, 2, 3]} is next to element [0, 2, 2] = {b[0, 2, 2]} in memory"
)
Numpy also has 'C' ordering by default
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
Element [0, 2, 3] = 11 is next to element [0, 2, 2] = 10 in memory
Performance
S3Map as a store (see https://zarr.readthedocs.io/en/stable/api/storage.html). We generally find this to be good enough for typical needs. To expand on ‘typical’, a large portfolio example might have somewhere in the range 100,000—1,000,000 assets, and it is desirable to retrieve the data in order 10s of seconds. But what are the performance
bottlenecks here?As an aside, it should be emphasized that there are other use-cases to consider, in particular performing operations on the entire data set, perhaps using Dask, rather than looking up points. There are also considerations other than performance, in particular the convenience and cost-effectiveness of S3. But to come back to the performance trade-off in more detail, consider what is happening when a request for 1 million assets is made:
Using the affine transform the required pixels are identified together with the Zarr chunks containing the pixels. This is fast.
The compressed chunks are downloaded from S3. Thanks to the async functionality of
S3FileSystem, the chunks are downloaded in parallel, but this can still be a time-consuming step. Chunk size affects the performance. If the chunks are too large then too much unnecessary data is transferred (for this specific use-case); too small and compression ratios can decrease, overall transfer speed may decrease and the number of chunks becomes large (more on this below). By default Physrisk uses, 1000×1000 spatial pixels in each chunk. An important point to note is that for physical risk calculations usually all values along the ‘index’ dimension are required. Therefore these are always in the same chunk. Also, note that this step depends on the connection speed of the compute instance (e.g. Pod) to the S3 storage. For example, when connecting to S3 from a laptop using a home internet connection halfway across the world transfer times will be slow compared to to a Pod hosted on EC2 in the same AWS Region (as for the Physrisk reference application). There is therefore a benefit in using the Physrisk API to access hazard data rather than getting this directly from S3.The chunks are decompressed and the specific pixel values are looked up. The decompression can also be somewhat costly in time.
If this default set up does not meet the performance needs, what can be done?
Choice of store
Aside from adjusting chunk size, performance can be improved by moving from S3 storage to some form of SSD-backed storage, using a different Zarr store, and thereby decreasing the time for step 2 above. Most simply, DirectoryStore could be used, but another common approach is to use a database to hold Zarr chunks binary data. Zarr stores exist for a range of databases like LMDB, SQLite and MongoDB, plus distributed in-memory stores such as Redis. As an aside, it is interesting to note that
TileDB is another example of a database that indexes compressed, chunked multi-dimensional raster data.
Benchmarking is a challenge as times depend on factors such as the clustering of the assets, sparseness of the data and resolution. To give an idea, we queried points across Europe for a high resolution flood data set (5m resolution). The data was in chunks of 6×1000×1000 (there being 6 return periods). A DirectoryStore was used with the data on SSD. The time to retrieve 1,000,000 points was about 25 seconds.
Move away from chunked, compressed data entirely
The Zarr storage of compressed, chunked data works well across a number of use-cases, but if the intent is solely to look up data from a spatial index as quickly as possible, then a database keyed on a spatial index might be preferred, albeit at the expense of other features. Members of the Physrisk project have been investigating this using H3-based indexing and databases such as DuckDB with promising results.
When a single file is desirable
It can be inefficient to store a large number of small chunks as separate files or objects when using cloud storage such as S3, and indeed typical file systems. That is, it can be preferable to store objects at more coarse granularity than chunks. Cloud Optimized GeoTIFFs take this to an extreme, in that a single object is chunked. However, while this can be read in parallel, it cannot be written in parallel in S3. Zarr similarly has a ZipStore with the same restriction.
The Zarr Sharding Codec, in development, might ultimately be the best solution. For now in Physrisk, should number of chunk objects become an issue, use of ZipStore is preferred at the expense of writing chunks in parallel. It is most likely that this comes up in cases where Zarr arrays are used for map tiles, a subject we will come on to.