The MicrobiomeBenchamrkData
package provides access to a
collection of datasets with biological ground truth for benchmarking
differential abundance methods. The datasets are deposited on Zenodo: https://doi.org/10.5281/zenodo.6911026
## Install BioConductor if not installed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
## Release version (not yet in Bioc, so it doesn't work yet)
BiocManager::install("MicrobiomeBenchmarkData")
## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData")
All sample metadata is merged into a single data frame and provided as a data object:
data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |>
discard(~any(is.na(.x))) |>
head()
knitr::kable(sample_metadata)
dataset | sample_id | body_site | library_size | pmid | study_condition | sequencing_method |
---|---|---|---|---|---|---|
HMP_2012_16S_gingival_V13 | 700103497 | oral_cavity | 5356 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700106940 | oral_cavity | 4489 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097304 | oral_cavity | 3043 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700099015 | oral_cavity | 2832 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097644 | oral_cavity | 2815 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097247 | oral_cavity | 6333 | 22699609 | control | 16S |
Currently, there are 6 datasets available through the
MicrobiomeBenchmarkData. These datasets are accessed through the
getBenchmarkData
function.
If no arguments are provided, the list of available datasets is printed on screen and a data.frame is returned with the description of the datasets:
dats <- getBenchmarkData()
#> 1 HMP_2012_16S_gingival_V13
#> 2 HMP_2012_16S_gingival_V35
#> 3 HMP_2012_16S_gingival_V35_subset
#> 4 HMP_2012_WMS_gingival
#> 5 Stammler_2016_16S_spikein
#> 6 Ravel_2011_16S_BV
#>
#> Use vignette('datasets', package = 'MicrobiomeBenchmarkData') for a detailed description of the datasets.
#>
#> Use getBenchmarkData(dryrun = FALSE) to import all of the datasets.
dats
#> Dataset Dimensions Body.site
#> 1 HMP_2012_16S_gingival_V13 33127 x 311 Gingiva
#> 2 HMP_2012_16S_gingival_V35 17949 x 311 Gingiva
#> 3 HMP_2012_16S_gingival_V35_subset 892 x 76 Gingiva
#> 4 HMP_2012_WMS_gingival 235 x 16 Gingiva
#> 5 Stammler_2016_16S_spikein 247 x 394 Stool
#> 6 Ravel_2011_16S_BV 4036 x 17 Vagina
#> Contrasts
#> 1 Subgingival vs Supragingival plaque.
#> 2 Subgingival vs Supragingival plaque.
#> 3 Subgingival vs Supragingival plaque.
#> 4 Subgingival vs Supragingival plaque.
#> 5 Pre-ASCT (allogeneic stem cell transplantation) vs 14 days after treatment.
#> 6 Healthy vs bacterial vaginosis
#> Biological.ground.truth
#> 1 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 2 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 3 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 4 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 5 Same bacterial loads of the spike-in bacteria across all samples: Salinibacter ruber (extreme halophilic), Rhizobium radiobacter (found in soils and plants), and Alicyclobacillus acidiphilu (thermo-acidophilic).
#> 6 Decrease of Lactobacillus and increase of bacteria isolated during bacterial vaginosis in samples with high Nugent scores (bacterial vaginosis).
In order to import a dataset, the getBenchmarkData
function must be used with the name of the dataset as the first argument
(x
) and the dryrun
argument set to
FALSE
. The output is a list vector with the dataset
imported as a TreeSummarizedExperiment object.
tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment
#> dim: 892 76
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#> variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULL
Several datasets can be imported simultaneously by giving the names of the different datasets in a character vector:
list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_taxonomy_tree.newick'
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
If all of the datasets must to be imported, this can be done by
providing the dryrun = FALSE
argument alone.
mbd <- getBenchmarkData(dryrun = FALSE)
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> adding rname 'https://zenodo.org/record/6911027/files/Ravel_2011_16S_BV_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/Ravel_2011_16S_BV_taxonomy_table.tsv'
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> adding rname 'https://zenodo.org/record/6911027/files/Stammler_2016_16S_spikein_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/Stammler_2016_16S_spikein_taxonomy_table.tsv'
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#> $ HMP_2012_16S_gingival_V13 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Ravel_2011_16S_BV :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Stammler_2016_16S_spikein :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
The biological annotations of each taxa are provided as a column in
the rowData
slot of the TreeSummarizedExperiment.
## In the case, the column is named as taxon_annotation
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#> kingdom phylum class order
#> <character> <character> <character> <character>
#> OTU_97.31247 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44487 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34979 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34572 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.42259 Bacteria Firmicutes Bacilli Lactobacillales
#> ... ... ... ... ...
#> OTU_97.44294 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45429 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44375 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45365 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45307 Bacteria Firmicutes Bacilli Lactobacillales
#> family genus taxon_annotation
#> <character> <character> <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ... ... ... ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobic
The datasets are cached so they’re only downloaded once. The cache
and all of the files contained in it can be removed with the
removeCache
function.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] purrr_1.0.2 MicrobiomeBenchmarkData_1.9.0
#> [3] TreeSummarizedExperiment_2.15.0 Biostrings_2.75.1
#> [5] XVector_0.47.0 SingleCellExperiment_1.29.1
#> [7] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [9] GenomicRanges_1.59.0 GenomeInfoDb_1.43.1
#> [11] IRanges_2.41.1 S4Vectors_0.45.2
#> [13] BiocGenerics_0.53.3 generics_0.1.3
#> [15] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [17] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] xfun_0.49 bslib_0.8.0 lattice_0.22-6
#> [4] yulab.utils_0.1.8 vctrs_0.6.5 tools_4.4.2
#> [7] curl_6.0.1 parallel_4.4.2 RSQLite_2.3.8
#> [10] tibble_3.2.1 fansi_1.0.6 blob_1.2.4
#> [13] pkgconfig_2.0.3 Matrix_1.7-1 dbplyr_2.5.0
#> [16] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.2
#> [19] treeio_1.31.0 codetools_0.2-20 htmltools_0.5.8.1
#> [22] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
#> [25] lazyeval_0.2.2 yaml_2.3.10 tidyr_1.3.1
#> [28] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
#> [31] BiocParallel_1.41.0 DelayedArray_0.33.2 cachem_1.1.0
#> [34] abind_1.4-8 nlme_3.1-166 tidyselect_1.2.1
#> [37] digest_0.6.37 dplyr_1.1.4 maketools_1.3.1
#> [40] fastmap_1.2.0 grid_4.4.2 cli_3.6.3
#> [43] SparseArray_1.7.2 magrittr_2.0.3 S4Arrays_1.7.1
#> [46] utf8_1.2.4 ape_5.8 withr_3.0.2
#> [49] filelock_1.0.3 UCSC.utils_1.3.0 bit64_4.5.2
#> [52] rmarkdown_2.29 httr_1.4.7 bit_4.5.0
#> [55] memoise_2.0.1 evaluate_1.0.1 knitr_1.49
#> [58] BiocFileCache_2.15.0 rlang_1.1.4 Rcpp_1.0.13-1
#> [61] DBI_1.2.3 glue_1.8.0 tidytree_0.4.6
#> [64] BiocManager_1.30.25 jsonlite_1.8.9 R6_2.5.1
#> [67] fs_1.6.5 zlibbioc_1.52.0