Title: | Read Zarr Files in R |
---|---|
Description: | The Zarr specification defines a format for chunked, compressed, N-dimensional arrays. Its design allows efficient access to subsets of the stored array, and supports both local and cloud storage systems. Rarr aims to implement this specification in R with minimal reliance on external tools or libraries. |
Authors: | Mike Smith [aut, cre] |
Maintainer: | Mike Smith <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.7.0 |
Built: | 2024-11-05 11:19:20 UTC |
Source: | https://github.com/grimbough/Rarr |
Compress and write a single chunk
.compress_and_write_chunk( input_chunk, chunk_path, compressor = use_zlib(), data_type_size, is_base64 = FALSE )
input_chunk |
Array containing the chunk data to be compressed. Will be converted to a raw vector before compression. |
chunk_path |
Character string giving the path to the chunk that should be written. |
compressor |
A "compressor" function that returns a list giving the details of the compression tool to apply. See compressors for more details. |
data_type_size |
An integer giving the size of the original datatype. This is passed to the blosc algorithm, which seems to need it to achieve any compression. |
is_base64 |
When dealing with Py_unicode strings we convert them to base64 strings for storage in our intermediate R arrays. This argument indicates if base64 is in use, because the conversion to raw in .as_raw should be done differently for base64 strings vs other types. |
Returns TRUE if writing is successful. Mostly called for the side-effect of writing the compressed chunk to disk.
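The convert, compress and write sequence described above can be sketched in base R. This is only an illustration under stated assumptions: it uses `memCompress()` as a stand-in for the package's compressor handling, assumes 4-byte little-endian integers, and all variable names are hypothetical rather than taken from the package internals.

```r
## A sketch only: base R stand-ins for the steps described above.
chunk <- array(1:24, dim = c(2, 3, 4))

## convert the chunk to a raw vector (4-byte little-endian integers assumed)
raw_chunk <- writeBin(as.vector(chunk), raw(), size = 4L, endian = "little")

## compress the raw bytes with R's built-in zlib support
compressed <- memCompress(raw_chunk, type = "gzip")

## write the compressed bytes to a temporary chunk file
chunk_path <- tempfile()
writeBin(compressed, chunk_path)

## read the bytes back so they can be compared with what was written
on_disk <- readBin(chunk_path, what = "raw", n = file.size(chunk_path))
```

Decompressing `on_disk` with `memDecompress(on_disk, type = "gzip")` should recover `raw_chunk`.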
Create a string of the form x[idx[[1]], idx[[2]]] <- y for an array x where the number of dimensions is variable.
.create_replace_call(x_name, idx_name, idx_length, y_name)
x_name |
Name of the object to have items replaced |
idx_name |
Name of the list containing the indices |
idx_length |
Length of the list specified in |
y_name |
Name of the object containing the replacement items |
A character vector of length one containing the replacement commands. This is expected to be passed to parse() |> eval().
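A minimal sketch of how such a replacement string could be built and evaluated is shown below. The helper `make_replace_call()` is hypothetical and written only for illustration; it is not the package's internal function.

```r
## Hypothetical helper, for illustration only.
make_replace_call <- function(x_name, idx_name, idx_length, y_name) {
  ## build "idx[[1]], idx[[2]], ..." for however many dimensions there are
  idx <- paste0(idx_name, "[[", seq_len(idx_length), "]]", collapse = ", ")
  sprintf("%s[%s] <- %s", x_name, idx, y_name)
}

cmd <- make_replace_call("x", "idx", 2, "y")

## objects referred to by name in the generated command
x <- matrix(0, nrow = 3, ncol = 3)
idx <- list(1:2, 3)
y <- c(7, 8)

## parse and evaluate the command, modifying x in place
eval(parse(text = cmd))
```

Generating the call as text avoids writing a separate branch for each possible number of dimensions.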
R has internal decompression tools for zlib, bz2 and lzma compression. We use external libraries bundled with the package for blosc and lz4 decompression.
.decompress_chunk(compressed_chunk, metadata)
compressed_chunk |
Raw vector holding the compressed bytes for this chunk. |
metadata |
List produced by |
An array with the number of dimensions specified in the Zarr metadata. In most cases it will have the same size as the Zarr chunk, however in the case of edge chunks, which overlap the extent of the array, the returned chunk will be smaller.
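The built-in decompression tools mentioned above are exposed in base R through `memDecompress()`. The sketch below shows one way a compressor identifier might be mapped onto them. The function `decompress_with_base_r()` and the identifier strings are assumptions made for illustration, and no claim is made that the output is byte-compatible with chunks written by other Zarr implementations.

```r
## Hypothetical mapping from a compressor id to a memDecompress() type.
decompress_with_base_r <- function(compressed_chunk, compressor_id) {
  type <- switch(compressor_id,
    zlib = "gzip",
    bz2 = "bzip2",
    lzma = "xz",
    stop("no built-in decompressor for '", compressor_id, "'")
  )
  memDecompress(compressed_chunk, type = type)
}

## round-trip some bytes through each built-in codec
original <- as.raw(rep(1:10, times = 20))
types <- c(zlib = "gzip", bz2 = "bzip2", lzma = "xz")
round_trip_ok <- vapply(names(types), function(id) {
  compressed <- memCompress(original, type = types[[id]])
  identical(decompress_with_base_r(compressed, id), original)
}, logical(1))
```

Identifiers without a base R equivalent, such as blosc or lz4, fall through to the error branch in this sketch.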
When a chunk is decompressed it is returned as a vector of raw bytes. This function uses the array metadata to select how to convert the bytes into the final datatype and then converts the resulting output into an array of the appropriate dimensions, including re-ordering if the original data is in row-major order.
.format_chunk(decompressed_chunk, metadata, alt_chunk_dim)
decompressed_chunk |
Raw vector holding the decompressed bytes for this chunk. |
metadata |
List produced by |
alt_chunk_dim |
The dimensions of the array that should be created from
this chunk. Normally this will be the same as the chunk shape in
|
A list of length 2. The first element is the formatted chunk data. The second is an integer of length 1, indicating if warnings were encountered when converting types.
If "chunk_data" is larger than the space remaining in the destination array, i.e. it contains overflowing elements, these will be trimmed when the chunk is returned to read_data().
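The conversion from raw bytes to a typed array, including the re-ordering needed for row-major data, can be illustrated with base R. This is a sketch assuming 4-byte little-endian integers and a 2 x 3 chunk; the variable names are illustrative and the package's own conversion code may differ.

```r
## raw bytes standing in for a decompressed chunk of six 4-byte integers
raw_bytes <- writeBin(1:6, raw(), size = 4L, endian = "little")

## convert the bytes to the target R type
values <- readBin(raw_bytes, what = "integer", n = 6L, size = 4L,
                  endian = "little")

chunk_dim <- c(2L, 3L)

## column-major data can be given its dimensions directly
col_major <- array(values, dim = chunk_dim)

## row-major data is filled using the reversed dimensions, then permuted
row_major <- aperm(array(values, dim = rev(chunk_dim)))
```

Both results have dimensions 2 x 3, but in `row_major` the values run along the rows while in `col_major` they run down the columns.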
This is a modified version of paws.storage:::get_credentials(). It is included to prevent using the ::: operator. Look at that function if things stop working.
.get_credentials(credentials)
credentials |
Content stored at |
A credentials list to be reinserted into a paws.storage s3 object.
If no valid credentials are found this function will error, which is expected and is caught by .check_credentials.
Taken from https://zarr.readthedocs.io/en/stable/spec/v2.html#logical-storage-paths
.normalize_array_path(path)
path |
Character vector of length 1 giving the path to be normalised. |
A character vector of length 1 containing the normalised path.
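The logical path rules in the linked section of the Zarr v2 specification (convert backslashes, collapse repeated slashes, strip leading and trailing slashes, reject "." and ".." segments) can be sketched as follows. The function `normalize_logical_path()` is hypothetical and only illustrates those rules; the package function may handle file system paths and URLs differently.

```r
## Hypothetical illustration of the specification's normalisation steps.
normalize_logical_path <- function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  path <- gsub("\\", "/", path, fixed = TRUE) # backslash to forward slash
  path <- gsub("/+", "/", path)               # collapse repeated slashes
  path <- sub("^/", "", path)                 # strip a leading slash
  path <- sub("/$", "", path)                 # strip a trailing slash
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  if (any(segments %in% c(".", ".."))) {
    stop("path must not contain '.' or '..' segments")
  }
  path
}

normalised <- normalize_logical_path("\\foo//bar/")
```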
Parse the data type encoding string
.parse_datatype(typestr)
typestr |
The datatype encoding string. This is in the Numpy array typestr format. |
A list of length 4 containing the details of the data type.
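A NumPy typestr such as "<i4" combines a byte-order character, a type character and a size in bytes. The sketch below splits such a string into those parts. The function `parse_typestr()` and the names of the returned list elements are assumptions for illustration, and they do not necessarily match the four elements returned by the package function.

```r
## Hypothetical parser for typestr values of the form "<i4" or ">f8".
parse_typestr <- function(typestr) {
  pattern <- "^([<>|])([a-zA-Z])([0-9]+)$"
  m <- regmatches(typestr, regexec(pattern, typestr))[[1]]
  if (length(m) != 4L) {
    stop("unrecognised typestr: ", typestr)
  }
  list(
    endian = switch(m[2], "<" = "little", ">" = "big", "|" = NA_character_),
    base_type = m[3],
    nbytes = as.integer(m[4])
  )
}

int_type <- parse_typestr("<i4")
dbl_type <- parse_typestr(">f8")
```

A typestr without an explicit size does not match the pattern used here and would need separate handling.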
These functions select a compression tool and its settings when writing a Zarr file.
use_blosc(cname = "lz4")
use_zlib(level = 6L)
use_gzip(level = 6L)
use_bz2(level = 6L)
use_lzma(level = 9L)
use_lz4()
use_zstd(level = 3)
cname |
Blosc is a 'meta-compressor' providing access to several compression algorithms. This argument defines which compression tool should be used. Valid options are: 'lz4', 'lz4hc', 'blosclz', 'zstd', 'zlib', 'snappy'. |
level |
Specify the compression level to use. The range of possible
values is dependent on the compression tool being used. For example, for
|
A list containing the details of the selected compression tool. This will be written to the .zarray metadata when the Zarr array is created.
## define 2 compression filters for blosc (using snappy) and bzip2 (level 5)
blosc_with_snappy_compression <- use_blosc(cname = "snappy")
bzip2_compression <- use_bz2(level = 5)

## create an example array to write to a file
x <- array(runif(n = 1000, min = -10, max = 10), dim = c(10, 20, 5))

## write the array to two files using each compression filter
blosc_path <- tempfile()
bzip2_path <- tempfile()
write_zarr_array(
  x = x, zarr_array_path = blosc_path, chunk_dim = c(2, 5, 1),
  compressor = blosc_with_snappy_compression
)
write_zarr_array(
  x = x, zarr_array_path = bzip2_path, chunk_dim = c(2, 5, 1),
  compressor = bzip2_compression
)

## the contents of the two arrays should be the same
identical(read_zarr_array(blosc_path), read_zarr_array(bzip2_path))

## the size of the files on disk are not the same
sum(file.size(list.files(blosc_path, full.names = TRUE)))
sum(file.size(list.files(bzip2_path, full.names = TRUE)))
Create an (empty) Zarr array
create_empty_zarr_array( zarr_array_path, dim, chunk_dim, data_type, order = "F", compressor = use_zlib(), fill_value, nchar, dimension_separator = "." )
zarr_array_path |
Character vector of length 1 giving the path to the new Zarr array. |
dim |
Dimensions of the new array. Should be a numeric vector with the same length as the number of dimensions. |
chunk_dim |
Dimensions of the array chunks. Should be a numeric vector
with the same length as the |
data_type |
Character vector giving the data type of the new array.
Currently this is limited to standard R data types. Valid options are:
"integer", "double", "character". You can also use the analogous NumPy
formats: "<i4", "<f8", "|S". If this argument isn't provided the
|
order |
Define the layout of the bytes within each chunk. Valid options are 'column', 'row', 'F' & 'C'. 'column' or 'F' will specify "column-major" ordering, which is how R arrays are arranged in memory. 'row' or 'C' will specify "row-major" order. |
compressor |
What (if any) compression tool should be applied to the
array chunks. The default is to use |
fill_value |
The default value for uninitialized portions of the array. Does not have to be provided, in which case the default for the specified data type will be used. |
nchar |
For |
dimension_separator |
The character used to separate the dimensions in the names of the chunk files. Valid options are limited to "." and "/". |
If successful returns (invisibly) TRUE. However this function is primarily called for the side effect of initialising a Zarr array location and creating the .zarray metadata.
write_zarr_array(), update_zarr_array()
new_zarr_array <- file.path(tempdir(), "temp.zarr")
create_empty_zarr_array(new_zarr_array,
  dim = c(10, 20), chunk_dim = c(2, 5),
  data_type = "integer"
)
Determine the size of chunk in bytes
get_chunk_size(datatype, dimensions)
datatype |
A list of details for the array datatype. Expected to be
produced by |
dimensions |
A list containing the dimensions of the chunk. Expected
to be found in a list produced by |
An integer giving the size of the chunk in bytes
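For fixed-width data types the uncompressed size of a chunk is the number of elements multiplied by the size of one element. The sketch below shows that arithmetic. The function `chunk_size_bytes()` and its arguments are hypothetical simplifications and do not mirror the datatype list the package function expects.

```r
## Hypothetical helper: bytes per element multiplied by elements per chunk.
chunk_size_bytes <- function(nbytes, chunk_dim) {
  as.integer(nbytes * prod(unlist(chunk_dim)))
}

## a 2 x 5 x 1 chunk of 4-byte integers
int_chunk <- chunk_size_bytes(4L, list(2, 5, 1))

## a 10 x 5 chunk of 8-byte doubles
dbl_chunk <- chunk_size_bytes(8L, list(10, 5))
```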
Determine the size of chunk in bytes after decompression
get_decompressed_chunk_size(datatype, dimensions)
datatype |
A list of details for the array datatype. Expected to be
produced by |
dimensions |
A list containing the dimensions of the chunk. Expected
to be found in a list produced by |
An integer giving the size of the chunk in bytes
Read the .zarray metadata file associated with a Zarr array
read_array_metadata(path, s3_client = NULL)
path |
A character vector of length 1. This provides the path to a Zarr array or group of arrays. This can either be on a local file system or on S3 storage. |
s3_client |
A list representing an S3 client. This should be produced
by |
A list containing the array metadata
Read a single Zarr chunk
read_chunk( zarr_array_path, chunk_id, metadata, s3_client = NULL, alt_chunk_dim = NULL )
zarr_array_path |
A character vector of length 1, giving the path to the Zarr array |
chunk_id |
A numeric vector or single data.frame row with length equal to the number of dimensions of a chunk. |
metadata |
List produced by |
s3_client |
Object created by |
alt_chunk_dim |
The dimensions of the array that should be created from
this chunk. Normally this will be the same as the chunk shape in
|
A list of length 2. The entries should be named "chunk_data" and "warning". The first is an array containing the decompressed chunk values, the second is an integer indicating whether there were any overflow warnings generated while reading the chunk into an R datatype.
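In the Zarr v2 layout each chunk is stored under a key formed by joining its grid position in every dimension with the dimension separator. The sketch below builds such a key, assuming the chunk identifier holds zero-based grid positions as in the specification. The helper `chunk_file_name()` is hypothetical, and whether the package's chunk_id uses exactly this convention is an assumption.

```r
## Hypothetical helper: join zero-based chunk grid positions into a key.
chunk_file_name <- function(chunk_id, dimension_separator = ".") {
  stopifnot(dimension_separator %in% c(".", "/"))
  paste(as.integer(chunk_id), collapse = dimension_separator)
}

flat_key <- chunk_file_name(c(0, 2, 1))
nested_key <- chunk_file_name(c(0, 2, 1), dimension_separator = "/")

## the full location of the chunk below a (hypothetical) array directory
chunk_path <- file.path("example.zarr", flat_key)
```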
Read a Zarr array
read_zarr_array(zarr_array_path, index, s3_client)
zarr_array_path |
Path to a Zarr array. A character vector of length 1. This can either be a location on a local file system or the URI to an array in S3 storage. |
index |
A list of the same length as the number of dimensions in the
Zarr array. Each entry in the list provides the indices in that dimension
that should be read from the array. Setting a list entry to |
s3_client |
Object created by |
An array with the same number of dimensions as the input array. The extent of each dimension will correspond to the length of the values provided to the index argument.
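Reading a subset efficiently depends on working out which chunks contain the requested indices. The sketch below shows one way that lookup could be done for one-based R indices and a regular chunk grid. The function `chunks_needed()` is hypothetical, does not handle NULL entries in the index list, and is not taken from the package.

```r
## Hypothetical helper: zero-based grid positions of the chunks that hold
## the requested (one-based) indices in each dimension.
chunks_needed <- function(index, chunk_dim) {
  per_dim <- Map(function(i, cd) unique((i - 1L) %/% cd), index, chunk_dim)
  ## every combination of the per-dimension chunk positions
  unname(as.matrix(expand.grid(per_dim)))
}

## rows 1 to 10 and column 3 of an array stored in 4 x 5 chunks
needed <- chunks_needed(index = list(1:10, 3), chunk_dim = c(4, 5))
```

Here the requested rows span three chunks and the single column falls in the first chunk, so `needed` has three rows, one per chunk to be read.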
## Using a local file provided with the package
## This array has 3 dimensions
z1 <- system.file("extdata", "zarr_examples", "row-first",
  "int32.zarr",
  package = "Rarr"
)

## read the entire array
read_zarr_array(zarr_array_path = z1)

## extract values for first 10 rows, all columns, first slice
read_zarr_array(zarr_array_path = z1, index = list(1:10, NULL, 1))

## using a Zarr file hosted on Amazon S3
## This array has a single dimension with length 576
z2 <- "https://power-analysis-ready-datastore.s3.amazonaws.com/power_901_constants.zarr/lon/"

## read the entire array
read_zarr_array(zarr_array_path = z2)

## read alternating elements
read_zarr_array(zarr_array_path = z2, index = list(seq(1, 576, 2)))
Special case fill values (NaN, Inf, -Inf) are encoded as strings in the Zarr metadata. R will create arrays of type character if these are defined and the chunk isn't present on disk. This function updates the fill value to be R's representation of these special values, so numeric arrays are created. A "null" fill value implies no missing values. We set this to NA as you can't create an array of type NULL in R. It should have no impact if there are really no missing values.
update_fill_value(metadata)
metadata |
A list containing the array metadata. This should normally
be generated by running |
Returns a list with the same structure as the input. The returned list will be identical to the input, unless the fill_value entry was one of: NULL, "NaN", "Infinity" or "-Infinity".
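The substitution described above can be sketched as a small function operating on a metadata list. The function `fix_fill_value()` is a hypothetical stand-in written for illustration and is not the package's implementation.

```r
## Hypothetical stand-in for the fill value substitution described above.
fix_fill_value <- function(metadata) {
  special <- list("NaN" = NaN, "Infinity" = Inf, "-Infinity" = -Inf)
  fv <- metadata$fill_value
  if (is.null(fv)) {
    ## a null fill value is represented as NA
    metadata$fill_value <- NA
  } else if (is.character(fv) && length(fv) == 1L && fv %in% names(special)) {
    ## replace the string encoding with R's numeric representation
    metadata$fill_value <- special[[fv]]
  }
  metadata
}

nan_meta <- fix_fill_value(list(fill_value = "NaN"))
inf_meta <- fix_fill_value(list(fill_value = "-Infinity"))
null_meta <- fix_fill_value(list(fill_value = NULL))
zero_meta <- fix_fill_value(list(fill_value = 0))
```

Any other fill value, such as the 0 in `zero_meta`, passes through unchanged.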
Update (a subset of) an existing Zarr array
update_zarr_array(zarr_array_path, x, index)
zarr_array_path |
Character vector of length 1 giving the path to the Zarr array that is to be modified. |
x |
The R array (or object that can be coerced to an array) that will be written to the Zarr array. |
index |
A list with the same length as the number of dimensions of the target array. This argument indicates which elements in the target array should be updated. |
The function is primarily called for the side effect of writing to disk. Returns (invisibly) TRUE if the array is successfully updated.
## first create a new, empty, Zarr array
new_zarry_array <- file.path(tempdir(), "new_array.zarr")
create_empty_zarr_array(
  zarr_array_path = new_zarry_array, dim = c(20, 10),
  chunk_dim = c(10, 5), data_type = "double"
)

## create a matrix smaller than our Zarr array
small_matrix <- matrix(runif(6), nrow = 3)

## insert the matrix into the first 3 rows, 2 columns of the Zarr array
update_zarr_array(new_zarry_array, x = small_matrix, index = list(1:3, 1:2))

## reading back a slightly larger subset,
## we can see only the top left corner has been changed
read_zarr_array(new_zarry_array, index = list(1:5, 1:5))
Write an R array to Zarr
write_zarr_array( x, zarr_array_path, chunk_dim, order = "F", compressor = use_zlib(), fill_value, nchar, dimension_separator = "." )
x |
The R array (or object that can be coerced to an array) that will be written to the Zarr array. |
zarr_array_path |
Character vector of length 1 giving the path to the new Zarr array. |
chunk_dim |
Dimensions of the array chunks. Should be a numeric vector
with the same length as the |
order |
Define the layout of the bytes within each chunk. Valid options are 'column', 'row', 'F' & 'C'. 'column' or 'F' will specify "column-major" ordering, which is how R arrays are arranged in memory. 'row' or 'C' will specify "row-major" order. |
compressor |
What (if any) compression tool should be applied to the
array chunks. The default is to use |
fill_value |
The default value for uninitialized portions of the array. Does not have to be provided, in which case the default for the specified data type will be used. |
nchar |
For character arrays this parameter gives the maximum length of
the stored strings. If this argument is not specified the array provided to
|
dimension_separator |
The character used to separate the dimensions in the names of the chunk files. Valid options are limited to "." and "/". |
The function is primarily called for the side effect of writing to disk. Returns (invisibly) TRUE if the array is successfully written.
new_zarr_array <- file.path(tempdir(), "integer.zarr")
x <- array(1:50, dim = c(10, 5))
write_zarr_array(
  x = x, zarr_array_path = new_zarr_array,
  chunk_dim = c(2, 5)
)
When reading a Zarr array using read_zarr_array() it is necessary to know its shape and size. zarr_overview() can be used to get a quick overview of the array shape and contents, based on the .zarray metadata file each array contains.
zarr_overview(zarr_array_path, s3_client, as_data_frame = FALSE)
zarr_array_path |
A character vector of length 1. This provides the path to a Zarr array or group of arrays. This can either be on a local file system or on S3 storage. |
s3_client |
A list representing an S3 client. This should be produced
by |
as_data_frame |
Logical determining whether the Zarr array details
should be printed to screen ( |
The function currently prints the following information to the R console:
array path
array shape and size
chunk and size
the number of chunks
the datatype of the array
codec used for data compression (if any)
If given the path to a group of arrays the function will attempt to print the details of all sub-arrays in the group.
If as_data_frame = FALSE the function invisibly returns TRUE if successful. However it is primarily called for the side effect of printing details of the Zarr array(s) to the screen. If as_data_frame = TRUE then a data.frame containing details of the array is returned.
## Using a local file provided with the package
z1 <- system.file("extdata", "zarr_examples", "row-first",
  "int32.zarr",
  package = "Rarr"
)

## print an overview of the array
zarr_overview(zarr_array_path = z1)

## using a file on S3 storage
## don't run this on the BioC Linux build - it's very slow there
is_BBS_linux <- nzchar(Sys.getenv("IS_BIOC_BUILD_MACHINE")) &&
  Sys.info()["sysname"] == "Linux"
if (!is_BBS_linux) {
  z2 <- "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0101A/13457539.zarr/1"
  zarr_overview(z2)
}
ZarrArray: function to create a new object of class ZarrArray.
ZarrArray(zarr_array_path)
zarr_array_path |
Path to a Zarr array. A character vector of length 1. This can either be a location on a local file system or the URI to an array in S3 storage. |
Object of class ZarrArray
Write array data to a Zarr backend via DelayedArray's RealizationSink machinery.