One of the advantages of using HDF5 is that data stored on disk can be compressed, reducing both the space required to store them and the time needed to read those data. This data compression is applied as part of the HDF5 “filter pipeline” that modifies data during I/O operations. HDF5 includes several filter algorithms as standard, and the version of the HDF5 library found in Rhdf5lib is additionally compiled with support for the deflate and szip compression filters which rely on third-party compression libraries. Collectively HDF5 refer to these as the “internal” filters. It is possible to use any combination of these (including none) when writing data using rhdf5. The default filter pipeline is shown in Figure @ref(fig:filter-pipeline).
This pipeline approach has been designed so that filters can be chained together (as in the diagram above) or easily substituted for alternative filters. This allows tailoring the compression approach to best suit the data or application.
It may be case that for a specific usecase an alternative, third-party, compression algorithm would be the most appropriate to use. Such filters, which are not part of the standard HDF5 distribution, are referred to as “external” filters. In order to allow their use without requiring either the HDF5 library or applications to be built with support for all possible filters HDF5 is able to use dynamically loaded filters. These are compiled independently from the HDF5 library, but are available to an application at run time.
This package currently distributes external HDF5 filters employing bzip2, LZF, VBZ, Zstandard and the Blosc meta-compressor. In total rhdf5filters provides access to 111 compression filters than can be applied to HDF5 datasets. The full list of filters currently provided by the package is:
rhdf5filters is principally designed to be used via the rhdf5 package, where several functions are able to utilise the compression filters. For completeness those functions are described here and are also documented in the rhdf5 vignette.
The function h5createDataset()
within rhdf5 takes
the argument filter
which specifies which compression
filter should be used when a new dataset is created.
Also available in rhdf5 are
the functions H5Pset_bzip2()
, `H5Pset_lzf()
and H5Pset_blosc()
. These are not part of the standard HDF5
interface, but are modeled on the H5Pset_deflate()
function
and allow the bzip2, lzf and blosc filters to
be set on dataset create property lists.
As long as rhdf5filters is installed, rhdf5 will be able to transparently read data compressed using any of the filters available in the package without requiring any action on your part.
The dynamic loading design of the HDF5 compression filters means that
you can use the versions distributed with rhdf5filters
with other applications, including other R packages that interface HDF5
as well as external applications not written in R e.g. HDFVIEW. The
function hdf5_plugin_path()
will return the location of in
your packages library where the compiled plugins are stored. You can
pass this location to the HDF5 function H5PLprepend()
or
you can the set the environment variable HDF5_PLUGIN_PATH
,
and other applications will be able to dynamically load the compression
plugins found there if needed.
## [1] "/tmp/RtmpdVy6Ec/Rinst7b769a313b3/rhdf5filters/lib"
The next example demonstrates how the filters distributed by rhdf5filters can be used by external applications to decompress data. Do do this we’ll use the version of h5dump installed on the system2 and a file distributed with this package that has been compressed using the blosc filter.
## blosc compressed file
blosc_file <- system.file("h5examples/h5ex_d_blosc.h5",
package = "rhdf5filters")
Now we use system2()
to call the system version of
h5dump and capture the output, which is then printed
below. The most important parts to note are the FILTERS
section, which shows the dataset was indeed compressed with
blosc, and DATA
, where the error shows that
h5dump is currently unable to read the dataset.
h5dump_out <- system2('h5dump',
args = c('-p', '-d /dset', blosc_file),
stdout = TRUE, stderr = TRUE)
cat(h5dump_out, sep = "\n")
## HDF5 "rhdf5filters/h5examples/h5ex_d_blosc.h5" {
## DATASET "/dset" {
## DATATYPE H5T_IEEE_F32LE
## DATASPACE SIMPLE { ( 30, 10, 20 ) / ( 30, 10, 20 ) }
## STORAGE_LAYOUT {
## CHUNKED ( 10, 10, 20 )
## SIZE 3347 (7.171:1 COMPRESSION)
## }
## FILTERS {
## USER_DEFINED_FILTER {
## FILTER_ID 32001
## COMMENT blosc
## PARAMS { 2 2 4 8000 4 1 0 }
## }
## }
## FILLVALUE {
## FILL_TIME H5D_FILL_TIME_IFSET
## VALUE H5D_FILL_VALUE_DEFAULT
## }
## ALLOCATION_TIME {
## H5D_ALLOC_TIME_INCR
## }
## DATA {h5dump error: unable to print data
##
## }
## }
## }
Next we set HDF5_PLUGIN_PATH
to the location where
rhdf5filters
has stored the filters and re-run the call to h5dump.
Printing the output3 no longer returns an error in the
DATA
section, indicating that the blosc filter
plugin was found and used by h5dump.
## set environment variable to hdf5filter location
Sys.setenv("HDF5_PLUGIN_PATH" = rhdf5filters::hdf5_plugin_path())
h5dump_out <- system2('h5dump',
args = c('-p', '-d /dset', '-w 50', blosc_file),
stdout = TRUE, stderr = TRUE)
## find the data entry and print the first few lines
DATA_line <- grep(h5dump_out, pattern = "DATA \\{")
cat( h5dump_out[ (DATA_line):(DATA_line+2) ], sep = "\n" )
## DATA {
## (0,0,0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
## (0,0,11): 11, 12, 13, 14, 15, 16, 17, 18,
In order to compile the compression filters
rhdf5filters needs to link them against libraries for
the relevant compression tools. These libraries include
bzip2
, blosc
and zstd
. By default
rhdf5filters will look for system installations of
those libraries and link against them. If they are not found then
versions that are bundled with the package will be compiled and used
instead. It is possible to use system libraries for some filters and the
bundled versions for others - rhdf5filters should be
able to determine this at compile time.
If you wish to force the use of the bundled libraries, perhaps to
ensure linking against the static versions, this can be done by
providing the configure argument "--with-bundled-libraries"
during package installation e.g.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.49 maketools_1.3.1 cachem_1.1.0
## [7] rhdf5filters_1.15.5 knitr_1.48 htmltools_0.5.8.1
## [10] rmarkdown_2.28 buildtools_1.0.0 lifecycle_1.0.4
## [13] cli_3.6.3 sass_0.4.9 jquerylib_0.1.4
## [16] compiler_4.4.1 highr_0.11 sys_3.4.3
## [19] tools_4.4.1 evaluate_1.0.1 bslib_0.8.0
## [22] yaml_2.3.10 BiocManager_1.30.25 jsonlite_1.8.9
## [25] rlang_1.1.4