The MicrobiomeBenchamrkData package provides access to a
collection of datasets with biological ground truth for benchmarking
differential abundance methods. The datasets are deposited on Zenodo: https://doi.org/10.5281/zenodo.6911026
## Install BioConductor if not installed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
## Release version (not yet in Bioc, so it doesn't work yet)
BiocManager::install("MicrobiomeBenchmarkData")
## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData") All sample metadata is merged into a single data frame and provided as a data object:
data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |>
discard(~any(is.na(.x))) |>
head()
knitr::kable(sample_metadata)| dataset | sample_id | body_site | library_size | pmid | study_condition | sequencing_method |
|---|---|---|---|---|---|---|
| HMP_2012_16S_gingival_V13 | 700103497 | oral_cavity | 5356 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700106940 | oral_cavity | 4489 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700097304 | oral_cavity | 3043 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700099015 | oral_cavity | 2832 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700097644 | oral_cavity | 2815 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700097247 | oral_cavity | 6333 | 22699609 | control | 16S |
Currently, there are 6 datasets available through the
MicrobiomeBenchmarkData. These datasets are accessed through the
getBenchmarkData function.
If no arguments are provided, the list of available datasets is printed on screen and a data.frame is returned with the description of the datasets:
dats <- getBenchmarkData()
#> 1 HMP_2012_16S_gingival_V13
#> 2 HMP_2012_16S_gingival_V35
#> 3 HMP_2012_16S_gingival_V35_subset
#> 4 HMP_2012_WMS_gingival
#> 5 Stammler_2016_16S_spikein
#> 6 Ravel_2011_16S_BV
#>
#> Use vignette('datasets', package = 'MicrobiomeBenchmarkData') for a detailed description of the datasets.
#>
#> Use getBenchmarkData(dryrun = FALSE) to import all of the datasets.dats
#> Dataset Dimensions Body.site
#> 1 HMP_2012_16S_gingival_V13 33127 x 311 Gingiva
#> 2 HMP_2012_16S_gingival_V35 17949 x 311 Gingiva
#> 3 HMP_2012_16S_gingival_V35_subset 892 x 76 Gingiva
#> 4 HMP_2012_WMS_gingival 235 x 16 Gingiva
#> 5 Stammler_2016_16S_spikein 247 x 394 Stool
#> 6 Ravel_2011_16S_BV 4036 x 17 Vagina
#> Contrasts
#> 1 Subgingival vs Supragingival plaque.
#> 2 Subgingival vs Supragingival plaque.
#> 3 Subgingival vs Supragingival plaque.
#> 4 Subgingival vs Supragingival plaque.
#> 5 Pre-ASCT (allogeneic stem cell transplantation) vs 14 days after treatment.
#> 6 Healthy vs bacterial vaginosis
#> Biological.ground.truth
#> 1 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 2 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 3 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 4 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 5 Same bacterial loads of the spike-in bacteria across all samples: Salinibacter ruber (extreme halophilic), Rhizobium radiobacter (found in soils and plants), and Alicyclobacillus acidiphilu (thermo-acidophilic).
#> 6 Decrease of Lactobacillus and increase of bacteria isolated during bacterial vaginosis in samples with high Nugent scores (bacterial vaginosis).In order to import a dataset, the getBenchmarkData
function must be used with the name of the dataset as the first argument
(x) and the dryrun argument set to
FALSE. The output is a list vector with the dataset
imported as a TreeSummarizedExperiment object.
tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment
#> dim: 892 76
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#> variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULLSeveral datasets can be imported simultaneously by giving the names of the different datasets in a character vector:
list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_taxonomy_tree.newick'
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slotsIf all of the datasets must to be imported, this can be done by
providing the dryrun = FALSE argument alone.
mbd <- getBenchmarkData(dryrun = FALSE)
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> adding rname 'https://zenodo.org/record/6911027/files/Ravel_2011_16S_BV_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/Ravel_2011_16S_BV_taxonomy_table.tsv'
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> adding rname 'https://zenodo.org/record/6911027/files/Stammler_2016_16S_spikein_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/Stammler_2016_16S_spikein_taxonomy_table.tsv'
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#> $ HMP_2012_16S_gingival_V13 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Ravel_2011_16S_BV :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Stammler_2016_16S_spikein :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slotsThe biological annotations of each taxa are provided as a column in
the rowData slot of the TreeSummarizedExperiment.
## In the case, the column is named as taxon_annotation
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#> kingdom phylum class order
#> <character> <character> <character> <character>
#> OTU_97.31247 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44487 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34979 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34572 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.42259 Bacteria Firmicutes Bacilli Lactobacillales
#> ... ... ... ... ...
#> OTU_97.44294 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45429 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44375 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45365 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45307 Bacteria Firmicutes Bacilli Lactobacillales
#> family genus taxon_annotation
#> <character> <character> <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ... ... ... ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobicThe datasets are cached so they’re only downloaded once. The cache
and all of the files contained in it can be removed with the
removeCache function.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] purrr_1.2.2 MicrobiomeBenchmarkData_1.15.1
#> [3] TreeSummarizedExperiment_2.21.0 Biostrings_2.81.3
#> [5] XVector_0.53.0 SingleCellExperiment_1.35.1
#> [7] SummarizedExperiment_1.43.0 Biobase_2.73.1
#> [9] GenomicRanges_1.65.0 Seqinfo_1.3.0
#> [11] IRanges_2.47.2 S4Vectors_0.51.3
#> [13] BiocGenerics_0.59.7 generics_0.1.4
#> [15] MatrixGenerics_1.25.0 matrixStats_1.5.0
#> [17] BiocStyle_2.41.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr2_1.2.2 xfun_0.58 bslib_0.11.0
#> [4] lattice_0.22-9 vctrs_0.7.3 tools_4.6.0
#> [7] yulab.utils_0.2.4 curl_7.1.0 parallel_4.6.0
#> [10] RSQLite_3.53.1 tibble_3.3.1 blob_1.3.0
#> [13] pkgconfig_2.0.3 Matrix_1.7-5 dbplyr_2.5.2
#> [16] lifecycle_1.0.5 compiler_4.6.0 treeio_1.37.0
#> [19] codetools_0.2-20 htmltools_0.5.9 sys_3.4.3
#> [22] buildtools_1.0.0 sass_0.4.10 yaml_2.3.12
#> [25] lazyeval_0.2.3 pillar_1.11.1 crayon_1.5.3
#> [28] jquerylib_0.1.4 tidyr_1.3.2 BiocParallel_1.47.0
#> [31] DelayedArray_0.39.3 cachem_1.1.0 abind_1.4-8
#> [34] nlme_3.1-169 tidyselect_1.2.1 digest_0.6.39
#> [37] dplyr_1.2.1 maketools_1.3.2 fastmap_1.2.0
#> [40] grid_4.6.0 cli_3.6.6 SparseArray_1.13.2
#> [43] magrittr_2.0.5 S4Arrays_1.13.0 ape_5.8-1
#> [46] withr_3.0.2 filelock_1.0.3 rappdirs_0.3.4
#> [49] bit64_4.8.2 rmarkdown_2.31 bit_4.6.0
#> [52] otel_0.2.0 memoise_2.0.1 evaluate_1.0.5
#> [55] knitr_1.51 BiocFileCache_3.3.0 rlang_1.2.0
#> [58] Rcpp_1.1.1-1.1 DBI_1.3.0 glue_1.8.1
#> [61] tidytree_0.4.7 BiocManager_1.30.27 jsonlite_2.0.0
#> [64] R6_2.6.1 fs_2.1.0