MicrobiomeBenchmarkData

Introduction

The MicrobiomeBenchmarkData package provides access to a collection of datasets with biological ground truth for benchmarking differential abundance methods. The datasets are deposited on Zenodo: https://doi.org/10.5281/zenodo.6911026

Installation

## Install BiocManager if not installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Release version (from Bioconductor)
BiocManager::install("MicrobiomeBenchmarkData")

## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData") 
library(MicrobiomeBenchmarkData)
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
library(purrr)

Sample metadata

All sample metadata is merged into a single data frame and provided as a data object:

data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |> 
    discard(~any(is.na(.x))) |> 
    head()
knitr::kable(sample_metadata)
| dataset | sample_id | body_site | library_size | pmid | study_condition | sequencing_method |
|---|---|---|---|---|---|---|
| HMP_2012_16S_gingival_V13 | 700103497 | oral_cavity | 5356 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700106940 | oral_cavity | 4489 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700097304 | oral_cavity | 3043 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700099015 | oral_cavity | 2832 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700097644 | oral_cavity | 2815 | 22699609 | control | 16S |
| HMP_2012_16S_gingival_V13 | 700097247 | oral_cavity | 6333 | 22699609 | control | 16S |
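
The discard() step keeps only the columns without missing values, that is, the variables recorded for every sample. As a rough illustration, the same filter can be sketched in base R on a small made-up data frame. The toy object and its values are hypothetical and are not taken from sampleMetadata.

```r
# Hypothetical toy metadata: 'age' is missing for one sample.
toy <- data.frame(
    sample_id = c("s1", "s2", "s3"),
    body_site = c("oral_cavity", "oral_cavity", "stool"),
    age = c(34, NA, 51)
)

# Keep only the columns with no missing values; this mirrors
# purrr::discard(~any(is.na(.x))) applied to a data frame.
complete_cols <- Filter(function(col) !any(is.na(col)), toy)
names(complete_cols)
```

Only sample_id and body_site survive here, because age contains an NA. Columns that are specific to a subset of the datasets are dropped from the real metadata in the same way.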

Accessing datasets

Currently, there are 6 datasets available through the MicrobiomeBenchmarkData package. These datasets are accessed with the getBenchmarkData function.

Access a single dataset

To import a dataset, call the getBenchmarkData function with the name of the dataset as the first argument (x) and the dryrun argument set to FALSE. The output is a list with one element per requested dataset, each imported as a TreeSummarizedExperiment object, so [[1]] extracts the single dataset requested here.

tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_subset_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment 
#> dim: 892 76 
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#>   variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULL
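
Once imported, the object can be queried with the usual accessors. The sketch below assumes that the SummarizedExperiment package is installed (it appears among the attached packages in the session information) and that the dataset can be downloaded or is already cached. The names counts, sample_data, and lib_sizes are illustrative only.

```r
library(MicrobiomeBenchmarkData)

tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]

# Count matrix: taxa in rows, samples in columns
counts <- SummarizedExperiment::assay(tse, 'counts')

# Sample metadata, one row per sample
sample_data <- SummarizedExperiment::colData(tse)

# Total number of counts per sample
lib_sizes <- colSums(counts)
summary(lib_sizes)
```

The assay name 'counts' comes from the assays(1) line of the printed object. The sample metadata columns are the ones listed under colData names.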

Access a few datasets

Several datasets can be imported simultaneously by giving the names of the different datasets in a character vector:

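dataset_names <- c(
    'HMP_2012_16S_gingival_V35',
    'HMP_2012_16S_gingival_V35_subset',
    'HMP_2012_WMS_gingival'
)
list_tse <- getBenchmarkData(dataset_names, dryrun = FALSE)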
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V35_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_WMS_gingival_taxonomy_tree.newick'
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#>  $ HMP_2012_16S_gingival_V35       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_WMS_gingival           :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots

Access all of the datasets

To import all of the datasets, call getBenchmarkData with the dryrun = FALSE argument alone.

mbd <- getBenchmarkData(dryrun = FALSE)
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_taxonomy_table.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/HMP_2012_16S_gingival_V13_taxonomy_tree.newick'
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> adding rname 'https://zenodo.org/record/6911027/files/Ravel_2011_16S_BV_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/Ravel_2011_16S_BV_taxonomy_table.tsv'
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> adding rname 'https://zenodo.org/record/6911027/files/Stammler_2016_16S_spikein_count_matrix.tsv'
#> adding rname 'https://zenodo.org/record/6911027/files/Stammler_2016_16S_spikein_taxonomy_table.tsv'
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#>  $ HMP_2012_16S_gingival_V13       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_WMS_gingival           :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ Ravel_2011_16S_BV               :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ Stammler_2016_16S_spikein       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
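
Because the result is a named list, the datasets can be summarized in one pass. The following is a minimal sketch, assuming all six datasets can be downloaded or are already cached. dataset_sizes is an illustrative name, not part of the package.

```r
library(MicrobiomeBenchmarkData)

mbd <- getBenchmarkData(dryrun = FALSE)

# One row per dataset: number of taxa (rows) and samples (columns)
dataset_sizes <- data.frame(
    n_taxa = vapply(mbd, nrow, numeric(1)),
    n_samples = vapply(mbd, ncol, numeric(1))
)
dataset_sizes
```

The row names of dataset_sizes are the dataset names, since vapply keeps the names of the list.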

Annotations for each taxon are included in rowData

The biological annotation of each taxon is provided as a column in the rowData slot of the TreeSummarizedExperiment.

## In this case, the column is named taxon_annotation
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#>                  kingdom      phylum       class           order
#>              <character> <character> <character>     <character>
#> OTU_97.31247    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.44487    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.34979    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.34572    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.42259    Bacteria  Firmicutes     Bacilli Lactobacillales
#> ...                  ...         ...         ...             ...
#> OTU_97.44294    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45429    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.44375    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45365    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45307    Bacteria  Firmicutes     Bacilli Lactobacillales
#>                        family         genus      taxon_annotation
#>                   <character>   <character>           <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ...                       ...           ...                   ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobic
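
The printed DataFrame shows only the first and last rows, so a tabulation gives a better overview of the annotation. The sketch below assumes the annotation column is named taxon_annotation, as in this dataset, and that the dataset can be downloaded or is already cached. annotations and annotation_counts are illustrative names.

```r
library(MicrobiomeBenchmarkData)

tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]

# Extract the annotation column as a plain vector
annotations <- SummarizedExperiment::rowData(tse)$taxon_annotation

# Number of taxa per annotation; taxa without an annotation are counted as NA
annotation_counts <- table(annotations, useNA = 'ifany')
annotation_counts
```

Using useNA = 'ifany' keeps unannotated taxa visible, so the counts always add up to the number of rows in the object.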

Cache

The datasets are cached so they’re only downloaded once. The cache and all of the files contained in it can be removed with the removeCache function.

removeCache()

Session information

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] purrr_1.0.2                     MicrobiomeBenchmarkData_1.9.0  
#>  [3] TreeSummarizedExperiment_2.15.0 Biostrings_2.75.1              
#>  [5] XVector_0.47.0                  SingleCellExperiment_1.29.1    
#>  [7] SummarizedExperiment_1.37.0     Biobase_2.67.0                 
#>  [9] GenomicRanges_1.59.0            GenomeInfoDb_1.43.1            
#> [11] IRanges_2.41.1                  S4Vectors_0.45.2               
#> [13] BiocGenerics_0.53.3             generics_0.1.3                 
#> [15] MatrixGenerics_1.19.0           matrixStats_1.4.1              
#> [17] BiocStyle_2.35.0               
#> 
#> loaded via a namespace (and not attached):
#>  [1] xfun_0.49               bslib_0.8.0             lattice_0.22-6         
#>  [4] yulab.utils_0.1.8       vctrs_0.6.5             tools_4.4.2            
#>  [7] curl_6.0.1              parallel_4.4.2          RSQLite_2.3.8          
#> [10] tibble_3.2.1            fansi_1.0.6             blob_1.2.4             
#> [13] pkgconfig_2.0.3         Matrix_1.7-1            dbplyr_2.5.0           
#> [16] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 compiler_4.4.2         
#> [19] treeio_1.31.0           codetools_0.2-20        htmltools_0.5.8.1      
#> [22] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
#> [25] lazyeval_0.2.2          yaml_2.3.10             tidyr_1.3.1            
#> [28] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
#> [31] BiocParallel_1.41.0     DelayedArray_0.33.2     cachem_1.1.0           
#> [34] abind_1.4-8             nlme_3.1-166            tidyselect_1.2.1       
#> [37] digest_0.6.37           dplyr_1.1.4             maketools_1.3.1        
#> [40] fastmap_1.2.0           grid_4.4.2              cli_3.6.3              
#> [43] SparseArray_1.7.2       magrittr_2.0.3          S4Arrays_1.7.1         
#> [46] utf8_1.2.4              ape_5.8                 withr_3.0.2            
#> [49] filelock_1.0.3          UCSC.utils_1.3.0        bit64_4.5.2            
#> [52] rmarkdown_2.29          httr_1.4.7              bit_4.5.0              
#> [55] memoise_2.0.1           evaluate_1.0.1          knitr_1.49             
#> [58] BiocFileCache_2.15.0    rlang_1.1.4             Rcpp_1.0.13-1          
#> [61] DBI_1.2.3               glue_1.8.0              tidytree_0.4.6         
#> [64] BiocManager_1.30.25     jsonlite_1.8.9          R6_2.5.1               
#> [67] fs_1.6.5                zlibbioc_1.52.0