| Title: | Curated Metagenomic Data of the Human Microbiome |
|---|---|
| Description: | The curatedMetagenomicData package provides standardized, curated human microbiome data for novel analyses. It includes gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples collected from different body sites. The bacterial, fungal, and archaeal taxonomic abundances for each sample were calculated with MetaPhlAn3, and metabolic functional potential was calculated with HUMAnN3. The manually curated sample metadata and standardized metagenomic data are available as (Tree)SummarizedExperiment objects. |
| Authors: | Lucas Schiffer [aut, cre] (ORCID: <https://orcid.org/0000-0003-3628-0326>), Levi Waldron [aut], Edoardo Pasolli [ctb], Jennifer Wokaty [ctb], Sean Davis [ctb], Audrey Renson [ctb], Chloe Mirzayi [ctb], Paolo Manghi [ctb], Samuel Gamboa-Tuz [ctb], Marcel Ramos [ctb], Valerie Obenchain [ctb], Kelly Eckenrode [ctb], Nicola Segata [ctb], Tuomas Borman [ctb] (ORCID: <https://orcid.org/0000-0002-8563-8884>), Sehyun Oh [ctb], Yoon-Ji Jung [ctb], NCI [fnd] (GrantNo.: R01CA230551) |
| Maintainer: | Lucas Schiffer <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 3.21.1 |
| Built: | 2026-05-16 09:35:26 UTC |
| Source: | https://github.com/waldronlab/curatedMetagenomicData |
To access curated metagenomic data, users will use curatedMetagenomicData()
after "shopping" the sampleMetadata data.frame for resources they are
interested in. The dryrun argument allows users to perfect a query prior to
returning resources. When dryrun = TRUE, matched resources will be printed
before they are returned invisibly as a character vector. When
dryrun = FALSE, a list of resources containing
SummarizedExperiment
and/or
TreeSummarizedExperiment
objects, each with corresponding sample metadata, is returned. Multiple
resources can be returned simultaneously and if there is more than one date
corresponding to a resource, the most recent one is selected automatically.
Finally, if a relative_abundance resource is requested and counts = TRUE,
relative abundance proportions will be multiplied by read depth and rounded
to the nearest integer.
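The counts conversion described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the matrix, the read depths, and all object names are hypothetical, and the abundances are assumed to be proportions that sum to 1 within each sample.

```r
# Hypothetical relative abundance matrix: rows are taxa, columns are samples.
# Values are assumed to be proportions that sum to 1 within each sample.
relative_abundance <- matrix(
    c(0.50, 0.30, 0.20,
      0.10, 0.65, 0.25),
    nrow = 3,
    dimnames = list(
        c("taxon_a", "taxon_b", "taxon_c"),
        c("sample_1", "sample_2")
    )
)

# Hypothetical read depth (total reads) for each sample.
read_depth <- c(sample_1 = 1000, sample_2 = 2000)

# Multiply each column by its sample's read depth, then round to integers.
approximate_counts <- round(sweep(relative_abundance, 2, read_depth, `*`))

approximate_counts
```

Because of the rounding step, the resulting values approximate read counts rather than reproduce them exactly.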
curatedMetagenomicData(pattern, dryrun = TRUE, counts = FALSE, rownames = "long")
pattern |
regular expression pattern to look for in the titles of
resources available in curatedMetagenomicData; |
dryrun |
if |
counts |
if |
rownames |
the type of |
Above, "resources" refers to resources that exist in Bioconductor's
ExperimentHub service. In the context of curatedMetagenomicData, these are
study-level (sparse) matrix objects used to create
SummarizedExperiment
and/or
TreeSummarizedExperiment
objects that are ultimately returned as the list of resources. Only the
gene_families dataType (see returnSamples) is stored as a sparse matrix
in ExperimentHub – this has no practical consequences for users and is done
to optimize storage. When searching for "resources", users will use the
study_name value from the sampleMetadata data.frame.
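How a pattern selects resources, and how the most recent date wins when a resource has several, can be illustrated without touching ExperimentHub. The sketch below is hypothetical: the titles, their dates, and the "date.study_name.dataType" layout are assumptions made for illustration, and the selection logic is a plain base R stand-in rather than the package's implementation.

```r
# Hypothetical resource titles; the "date.study_name.dataType" layout is an
# assumption made for illustration.
titles <- c(
    "2021-03-31.AsnicarF_2017.relative_abundance",
    "2021-10-14.AsnicarF_2017.relative_abundance",
    "2021-03-31.AsnicarF_2021.relative_abundance",
    "2021-03-31.LiJ_2014.relative_abundance"
)

# Keep the titles matching the regular expression, as a dry run would list.
matched <- grep("AsnicarF_20.+.relative_abundance", titles, value = TRUE)

# Split each title into its date and the remaining resource name.
resource_date <- as.Date(sub("\\..*$", "", matched))
resource_name <- sub("^[^.]+\\.", "", matched)

# For each resource name, keep only the title with the most recent date.
groups <- split(seq_along(matched), resource_name)
latest <- unlist(lapply(groups, function(i) {
    matched[i][which.max(as.numeric(resource_date[i]))]
}), use.names = FALSE)

latest
```

Here three titles match the pattern, but only two are kept, because the two AsnicarF_2017 entries differ only by date.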
if dryrun = TRUE, a character vector of resource names is returned
invisibly; if dryrun = FALSE, a list of resources is returned
mergeData, returnSamples, sampleMetadata
curatedMetagenomicData("AsnicarF_20.+")
curatedMetagenomicData("AsnicarF_2017.relative_abundance", dryrun = FALSE)
curatedMetagenomicData("AsnicarF_20.+.relative_abundance", dryrun = FALSE, counts = TRUE)
To merge the list elements returned from curatedMetagenomicData into a
single
SummarizedExperiment or
TreeSummarizedExperiment
object, users will use mergeData() provided elements are the same
dataType (see returnSamples). This is useful for analysis across entire
studies (e.g. meta-analysis); however, when doing analysis across individual
samples (e.g. mega-analysis) returnSamples is preferable.
mergeData(mergeList)
mergeList |
a |
Internally, mergeData() must full join assays and rowData slots of each
SummarizedExperiment or
TreeSummarizedExperiment
object (colData is merged slightly more efficiently by row binding). While
dplyr methods are used for maximum efficiency, users should be aware that
memory requirements can be large when merging many list elements.
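The difference between the full join and the row binding can be shown on toy data. This is a minimal sketch with hypothetical objects, and it uses base R `merge()` in place of the dplyr methods mentioned above. How the package treats features absent from a study is not stated here, so the sketch simply leaves them as `NA`.

```r
# Two hypothetical assays (features in rows, samples in columns) that share
# only some of their features.
assay_one <- data.frame(feature = c("f1", "f2"), sample_1 = c(10, 20))
assay_two <- data.frame(feature = c("f2", "f3"), sample_2 = c(5, 15))

# Full join on the feature identifier: every feature from either assay is
# kept, and samples lacking a feature get NA in this sketch.
merged_assay <- merge(assay_one, assay_two, by = "feature", all = TRUE)

# Hypothetical sample metadata, one row per sample, combined by row binding.
col_data_one <- data.frame(sample_id = "sample_1", study_name = "study_a")
col_data_two <- data.frame(sample_id = "sample_2", study_name = "study_b")
merged_col_data <- rbind(col_data_one, col_data_two)

merged_assay
merged_col_data
```

The joined assay has one row for every feature seen in any input, which is why memory use grows as more list elements with differing features are merged.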
when mergeList elements are of dataType (see returnSamples)
relative_abundance, a
TreeSummarizedExperiment
object is returned; otherwise, a
SummarizedExperiment
object is returned
curatedMetagenomicData, returnSamples
curatedMetagenomicData("LiJ_20.+.marker_abundance", dryrun = FALSE) |>
    mergeData()
curatedMetagenomicData("LiJ_20.+.pathway_abundance", dryrun = FALSE) |>
    mergeData()
curatedMetagenomicData("LiJ_20.+.relative_abundance", dryrun = FALSE) |>
    mergeData()
To return samples across studies, users will use returnSamples() along with
the sampleMetadata data.frame subset to include only desired samples and
metadata. The subset sampleMetadata data.frame will be used to get the
desired resources, mergeData will be used to merge them, and the subset
sampleMetadata data.frame will be used again to subset the
SummarizedExperiment or
TreeSummarizedExperiment
object to include only desired samples and metadata.
returnSamples(sampleMetadata, dataType, counts = FALSE, rownames = "long")
sampleMetadata |
the sampleMetadata |
dataType |
the data type to be returned; one of the following:
|
counts |
if |
rownames |
the type of |
At present, curatedMetagenomicData resources exist only as entire studies, which potentially requires getting many resources for a limited number of samples. Furthermore, because it is necessary to use mergeData internally, the same caveats detailed under Details in mergeData apply here.
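The cost of study-level storage is easier to see on a toy example. The sketch below uses hypothetical metadata and a hypothetical merged assay, with base R standing in for the package's internals: it shows that every study contributing even one wanted sample must be fetched in full before the unwanted samples are dropped.

```r
# Hypothetical stand-in for the sampleMetadata data.frame.
metadata <- data.frame(
    study_name = c("study_a", "study_a", "study_b", "study_c"),
    sample_id = c("s1", "s2", "s3", "s4"),
    body_site = c("stool", "skin", "stool", "skin")
)

# Step 1: subset the metadata to the desired samples.
wanted <- metadata[metadata$body_site == "stool", ]

# Step 2: every study with at least one wanted sample must be fetched whole.
studies_to_fetch <- unique(wanted$study_name)

# Hypothetical merged assay covering all samples of the fetched studies.
merged <- matrix(
    1:6,
    nrow = 2,
    dimnames = list(c("f1", "f2"), c("s1", "s2", "s3"))
)

# Step 3: keep only the columns for the wanted samples.
subset_assay <- merged[, wanted$sample_id, drop = FALSE]

studies_to_fetch
colnames(subset_assay)
```

In this example two whole studies are retrieved and merged in order to return two samples.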
when dataType = "relative_abundance", a
TreeSummarizedExperiment
object is returned; otherwise, a
SummarizedExperiment
object is returned
sampleMetadata |>
    dplyr::filter(age >= 18) |>
    dplyr::filter(!base::is.na(alcohol)) |>
    dplyr::filter(body_site == "stool") |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("relative_abundance")
Manually curated sample metadata for all samples in curatedMetagenomicData.
sampleMetadata
An object of class data.frame with 22588 rows and 141 columns.