The HDF5Array
package provides an on-disk representation
of large datasets without the need to load them into memory. Convenient
lazy evaluation operations allow the user to manipulate such large data
files based on metadata. The DelayedMatrix
class in the
DelayedArray
package provides a way to connect to a large
matrix that is stored on disk.
First, we create a small matrix for constructing the
DelayedMatrix
class.
We add rownames and column names to the matrix object for
compatibility with the MultiAssayExperiment
representation.
rownames(smallMatrix) <- paste0("GENE", seq_len(nrow(smallMatrix)))
colnames(smallMatrix) <- paste0("SampleID", seq_len(ncol(smallMatrix)))
Here we use the DelayedArray
constructor function to
create a DelayedMatrix
object.
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "DelayedArray"
## <50000 x 20> DelayedMatrix object of type "double":
## SampleID1 SampleID2 SampleID3 ... SampleID19 SampleID20
## GENE1 0.9960797 -0.1041617 0.2461355 . -2.0213702 2.2489391
## GENE2 -2.0629847 -0.2044650 1.2206311 . 0.6711273 -0.1406421
## GENE3 -0.7335276 1.2532517 0.3961783 . -0.5459943 -1.6890098
## GENE4 1.1534776 -0.4280339 -0.3796324 . -0.2904418 0.4954862
## GENE5 0.8964756 -0.7677761 -0.8130828 . -1.3989442 1.3746295
## ... . . . . . .
## GENE49996 0.02020769 1.11061289 -1.31179039 . -0.45233351 -0.07888681
## GENE49997 -1.35561221 0.10941767 -0.26420969 . -0.31641511 1.55691009
## GENE49998 1.23651178 -0.69187329 -0.01801316 . 1.15877074 -0.13309451
## GENE49999 -1.03600688 0.40751683 -2.56037562 . 0.13556957 2.39801692
## GENE50000 -0.53226157 -0.49318574 0.67838906 . -0.51100724 2.11980180
## [1] 50000 20
Finally, the rhdf5
package stores dimnames
in a standard location.
In order to make use of this functionality, we would use
writeHDF5Array
with the with.dimnames
argument:
testh5 <- tempfile(fileext = ".h5")
writeHDF5Array(smallMatrix, filepath = testh5, name = "smallMatrix",
with.dimnames = TRUE)
## <50000 x 20> HDF5Matrix object of type "double":
## SampleID1 SampleID2 SampleID3 ... SampleID19 SampleID20
## GENE1 0.9960797 -0.1041617 0.2461355 . -2.0213702 2.2489391
## GENE2 -2.0629847 -0.2044650 1.2206311 . 0.6711273 -0.1406421
## GENE3 -0.7335276 1.2532517 0.3961783 . -0.5459943 -1.6890098
## GENE4 1.1534776 -0.4280339 -0.3796324 . -0.2904418 0.4954862
## GENE5 0.8964756 -0.7677761 -0.8130828 . -1.3989442 1.3746295
## ... . . . . . .
## GENE49996 0.02020769 1.11061289 -1.31179039 . -0.45233351 -0.07888681
## GENE49997 -1.35561221 0.10941767 -0.26420969 . -0.31641511 1.55691009
## GENE49998 1.23651178 -0.69187329 -0.01801316 . 1.15877074 -0.13309451
## GENE49999 -1.03600688 0.40751683 -2.56037562 . 0.13556957 2.39801692
## GENE50000 -0.53226157 -0.49318574 0.67838906 . -0.51100724 2.11980180
To see the file structure we use h5ls
:
## group name otype dclass dim
## 0 / .smallMatrix_dimnames H5I_GROUP
## 1 /.smallMatrix_dimnames 1 H5I_DATASET STRING 50000
## 2 /.smallMatrix_dimnames 2 H5I_DATASET STRING 20
## 3 / smallMatrix H5I_DATASET FLOAT 50000 x 20
Note that a large matrix from an HDF5 file can also be loaded using
the HDF5ArraySeed
and DelayedArray
functions.
hdf5Data <- HDF5ArraySeed(file = testh5, name = "smallMatrix")
newDelayedMatrix <- DelayedArray(hdf5Data)
class(newDelayedMatrix)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
## <50000 x 20> HDF5Matrix object of type "double":
## SampleID1 SampleID2 SampleID3 ... SampleID19 SampleID20
## GENE1 0.9960797 -0.1041617 0.2461355 . -2.0213702 2.2489391
## GENE2 -2.0629847 -0.2044650 1.2206311 . 0.6711273 -0.1406421
## GENE3 -0.7335276 1.2532517 0.3961783 . -0.5459943 -1.6890098
## GENE4 1.1534776 -0.4280339 -0.3796324 . -0.2904418 0.4954862
## GENE5 0.8964756 -0.7677761 -0.8130828 . -1.3989442 1.3746295
## ... . . . . . .
## GENE49996 0.02020769 1.11061289 -1.31179039 . -0.45233351 -0.07888681
## GENE49997 -1.35561221 0.10941767 -0.26420969 . -0.31641511 1.55691009
## GENE49998 1.23651178 -0.69187329 -0.01801316 . 1.15877074 -0.13309451
## GENE49999 -1.03600688 0.40751683 -2.56037562 . 0.13556957 2.39801692
## GENE50000 -0.53226157 -0.49318574 0.67838906 . -0.51100724 2.11980180
DelayedMatrix
with
MultiAssayExperiment
A DelayedMatrix
alone conforms to the
MultiAssayExperiment
API requirements. Shown below, the
DelayedMatrix
can be put into a named list
and
passed into the MultiAssayExperiment
constructor
function.
## DataFrame with 20 rows and 3 columns
## assay primary colname
## <factor> <character> <character>
## 1 smallMatrix SampleID1 SampleID1
## 2 smallMatrix SampleID2 SampleID2
## 3 smallMatrix SampleID3 SampleID3
## 4 smallMatrix SampleID4 SampleID4
## 5 smallMatrix SampleID5 SampleID5
## ... ... ... ...
## 16 smallMatrix SampleID16 SampleID16
## 17 smallMatrix SampleID17 SampleID17
## 18 smallMatrix SampleID18 SampleID18
## 19 smallMatrix SampleID19 SampleID19
## 20 smallMatrix SampleID20 SampleID20
## DataFrame with 20 rows and 0 columns
SummarizedExperiment
with DelayedMatrix
backendA more information rich DelayedMatrix
can be created
when used in conjunction with the SummarizedExperiment
class and it can even include rowRanges
. The flexibility of
the MultiAssayExperiment
API supports classes with minimal
requirements. Additionally, this SummarizedExperiment
with
the DelayedMatrix
backend can be part of a bigger
MultiAssayExperiment
object. Below is a minimal example of
how this would work:
## <50000 x 20> DelayedMatrix object of type "double":
## SampleID1 SampleID2 SampleID3 ... SampleID19 SampleID20
## GENE1 0.9960797 -0.1041617 0.2461355 . -2.0213702 2.2489391
## GENE2 -2.0629847 -0.2044650 1.2206311 . 0.6711273 -0.1406421
## GENE3 -0.7335276 1.2532517 0.3961783 . -0.5459943 -1.6890098
## GENE4 1.1534776 -0.4280339 -0.3796324 . -0.2904418 0.4954862
## GENE5 0.8964756 -0.7677761 -0.8130828 . -1.3989442 1.3746295
## ... . . . . . .
## GENE49996 0.02020769 1.11061289 -1.31179039 . -0.45233351 -0.07888681
## GENE49997 -1.35561221 0.10941767 -0.26420969 . -0.31641511 1.55691009
## GENE49998 1.23651178 -0.69187329 -0.01801316 . 1.15877074 -0.13309451
## GENE49999 -1.03600688 0.40751683 -2.56037562 . 0.13556957 2.39801692
## GENE50000 -0.53226157 -0.49318574 0.67838906 . -0.51100724 2.11980180
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] HDF5SE: SummarizedExperiment with 50000 rows and 20 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Additional scenarios are currently in development where an
HDF5Matrix
is hosted remotely. Many opportunities exist
when considering on-disk and off-disk representations of data with
MultiAssayExperiment
.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] HDF5Array_1.35.2 rhdf5_2.51.2
## [3] DelayedArray_0.33.3 SparseArray_1.7.2
## [5] S4Arrays_1.7.1 abind_1.4-8
## [7] Matrix_1.7-1 survminer_0.5.0
## [9] ggpubr_0.6.0 ggplot2_3.5.1
## [11] survival_3.8-3 UpSetR_1.4.0
## [13] RaggedExperiment_1.31.1 MultiAssayExperiment_1.33.5
## [15] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [17] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [19] IRanges_2.41.2 S4Vectors_0.45.2
## [21] BiocGenerics_0.53.3 generics_0.1.3
## [23] MatrixGenerics_1.19.1 matrixStats_1.5.0
## [25] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3
## [4] compiler_4.4.2 vctrs_0.6.5 reshape2_1.4.4
## [7] stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.3
## [10] fastmap_1.2.0 backports_1.5.0 XVector_0.47.2
## [13] labeling_0.4.3 KMsurv_0.1-5 rmarkdown_2.29
## [16] markdown_1.13 UCSC.utils_1.3.0 purrr_1.0.2
## [19] xfun_0.50 cachem_1.1.0 jsonlite_1.8.9
## [22] rhdf5filters_1.19.0 Rhdf5lib_1.29.0 broom_1.0.7
## [25] R6_2.5.1 bslib_0.8.0 stringi_1.8.4
## [28] car_3.1-3 jquerylib_0.1.4 Rcpp_1.0.14
## [31] knitr_1.49 zoo_1.8-12 R.utils_2.12.3
## [34] BiocBaseUtils_1.9.0 splines_4.4.2 R.cache_0.16.0
## [37] tidyselect_1.2.1 yaml_2.3.10 ggtext_0.1.2
## [40] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [43] withr_3.0.2 evaluate_1.0.3 xml2_1.3.6
## [46] survMisc_0.5.6 pillar_1.10.1 BiocManager_1.30.25
## [49] carData_3.0-5 munsell_0.5.1 commonmark_1.9.2
## [52] scales_1.3.0 xtable_1.8-4 glue_1.8.0
## [55] maketools_1.3.1 tools_4.4.2 sys_3.4.3
## [58] data.table_1.16.4 ggsignif_0.6.4 buildtools_1.0.0
## [61] grid_4.4.2 tidyr_1.3.1 colorspace_2.1-1
## [64] GenomeInfoDbData_1.2.13 Formula_1.2-5 cli_3.6.3
## [67] km.ci_0.5-6 dplyr_1.1.4 gtable_0.3.6
## [70] R.methodsS3_1.8.2 rstatix_0.7.2 R.rsp_0.46.0
## [73] sass_0.4.9 digest_0.6.37 farver_2.1.2
## [76] htmltools_0.5.8.1 R.oo_1.27.0 lifecycle_1.0.4
## [79] httr_1.4.7 gridtext_0.1.5