Title: | Integrating Multi-modal Single Cell Experiment datasets |
---|---|
Description: | SingleCellMultiModal is an ExperimentHub package that serves multiple datasets obtained from GEO and other sources and represents them as MultiAssayExperiment objects. We provide several multi-modal datasets including scNMT, 10X Multiome, seqFISH, CITEseq, SCoPE2, and others. The scope of the package is to provide data for benchmarking and analysis. To cite, use the 'citation' function and see <https://doi.org/10.1371/journal.pcbi.1011324>. |
Authors: | Marcel Ramos [aut, cre] , Ricard Argelaguet [aut], Al Abadi [ctb], Dario Righelli [aut], Christophe Vanderaa [ctb], Kelly Eckenrode [aut], Ludwig Geistlinger [aut], Levi Waldron [aut] |
Maintainer: | Marcel Ramos |
License: | Artistic-2.0 |
Version: | 1.17.3 |
Built: | 2024-09-16 05:53:14 UTC |
Source: | https://github.com/waldronlab/SingleCellMultiModal |
The SingleCellMultiModal package provides a convenient and user-friendly representation of multi-modal data from projects such as 'scNMT' for mouse gastrulation.
Maintainer: Marcel Ramos
Authors:
Ricard Argelaguet
Dario Righelli
Kelly Eckenrode
Ludwig Geistlinger
Levi Waldron
Other contributors:
Al Abadi [contributor]
Christophe Vanderaa [contributor]
Useful links:
Report bugs at https://github.com/waldronlab/SingleCellMultiModal/issues
help(package = "SingleCellMultiModal")
addCTLabels
addCTLabels( cd, out, outname, ct, mkrcol = "markers", ctcol = "celltype", overwrite = FALSE, verbose = TRUE )
cd |
the |
out |
list data structure returned by |
outname |
character indicating the name of the out data structure |
ct |
character indicating the celltype to assign in the |
mkrcol |
character indicating the cd column to store the markers
indicated by |
ctcol |
character indicating the column in cd to store the cell type
indicated by |
overwrite |
logical indicating if the cell types have to be overwritten without checking if detected barcodes were already assigned to other celltypes |
verbose |
logical for having informative messages during the execution |
an updated version of the cd DataFrame
The CITEseq function assembles data on-the-fly from 'ExperimentHub' to provide a MultiAssayExperiment container. The 'DataType' argument provides access to the available datasets associated with the package.
CITEseq( DataType = c("cord_blood", "peripheral_blood"), modes = "*", version = "1.0.0", dry.run = TRUE, filtered = FALSE, verbose = TRUE, DataClass = c("MultiAssayExperiment", "SingleCellExperiment"), ... )
DataType |
character(1) indicating the identifier of the dataset to retrieve. (default "cord_blood") |
modes |
character() The assay types or modes of data to obtain; these include scADT and scRNA-seq data by default. |
version |
character(1) Currently, only version '1.0.0' is available. |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
filtered |
logical(1) indicating if the returned dataset needs to have filtered cells. See Details for additional information about the filtering process. |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
DataClass |
either MultiAssayExperiment or SingleCellExperiment data classes can be returned (default MultiAssayExperiment) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
CITEseq data combine single-cell transcriptomics with measurements of about a hundred cell surface proteins.
Available datasets are:
cord_blood: a dataset of single cells of cord blood as provided in Stoeckius et al. (2017).
scRNA_Counts - Stoeckius scRNA-seq gene count matrix
scADT - Stoeckius antibody-derived tags (ADT) data
peripheral_blood: a dataset of single cells of peripheral blood as provided in Mimitou et al. (2019). Two conditions are provided: controls (CTRL) and Cutaneous T-cell Lymphoma (CTCL). Build an appropriate modes regular expression to subselect the dataset modes.
scRNA - Mimitou scRNA-seq gene count matrix
scADT - Mimitou antibody-derived tags (ADT) data
scHTO - Mimitou Hashtag Oligo (HTO) data
TCRab - Mimitou T-cell Receptors (TCR) alpha and beta available through the object metadata.
TCRgd - Mimitou T-cell Receptors (TCR) gamma and delta available through the object metadata.
If the 'filtered' parameter is 'FALSE' (default), the 'colData' of the returned object contains multiple columns of 'logicals' indicating the cells to be discarded. If 'filtered' is 'TRUE', the 'discard' column is used to filter the cells. Column 'adt.discard' indicates the cells to be discarded based on the ADT assay. Column 'mito.discard' indicates the cells to be discarded based on the RNA assay and mitochondrial genes. Column 'discard' combines the previous columns with an 'OR' operator. Note that for the 'peripheral_blood' dataset these three columns are computed and returned separately for the 'CTCL' and 'CTRL' conditions. In this case the additional 'discard' column combines the 'discard.CTCL' and 'discard.CTRL' columns with an 'OR' operator. Cell filtering has been computed for the 'cord_blood' and 'peripheral_blood' datasets following section 12.3 of the Advanced Single-Cell Analysis with Bioconductor book. The executed code can be found in the CITEseq_filtering.R script of this package.
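The way the 'discard' column is derived can be illustrated with a small base R sketch. This is not the package's filtering code: the table below is a made-up stand-in for the 'colData', and only the column names 'adt.discard', 'mito.discard' and 'discard' come from the description above.

```r
# Toy stand-in for the colData of a CITEseq object; all values are invented.
cd <- data.frame(
  barcode      = c("cell1", "cell2", "cell3", "cell4"),
  adt.discard  = c(FALSE, TRUE,  FALSE, TRUE),
  mito.discard = c(FALSE, FALSE, TRUE,  TRUE),
  stringsAsFactors = FALSE
)

# 'discard' flags a cell when either quality check fails (logical OR).
cd$discard <- cd$adt.discard | cd$mito.discard

# Filtering keeps only the cells that are not flagged.
kept <- cd[!cd$discard, ]
kept$barcode
```

Keeping the individual flags next to the combined one makes it possible to see which check caused a cell to be dropped.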
A single cell multi-modal MultiAssayExperiment or informative 'data.frame' when 'dry.run' is 'TRUE'. When 'DataClass' is 'SingleCellExperiment' an object of this class is returned with an RNA assay as main experiment and other assay(s) as 'AltExp(s)'.
Dario Righelli
Stoeckius et al. (2017), Mimitou et al. (2019)
mae <- CITEseq(DataType="cord_blood", dry.run=FALSE)
experiments(mae)
Shows the cells/barcodes in two different plots (scatter and density), dividing the space into four quadrants defined by the two thresholds given as input parameters. The x- and y-axes represent the two ADTs given as input. It returns a list with one element for each quadrant, each with barcodes and percentage (see Value section for details).
getCellGroups(mat, adt1 = "CD19", adt2 = "CD3", th1 = 0.2, th2 = 0)
mat |
matrix of counts or clr transformed counts for ADT data in CITEseq |
adt1 |
character indicating the name of the marker to plot on the x-axis (default is CD19). |
adt2 |
character indicating the name of the marker to plot on the y-axis (default is CD3). |
th1 |
numeric indicating the threshold for the marker on the x-axis (default is 0.2). |
th2 |
numeric indicating the threshold for the marker on the y-axis (default is 0). |
Helps with manual gating for cell type identification with CITEseq or similar data, given cell markers. Once two informative markers for a cell type have been identified, the user adjusts the thresholds to identify the cell populations defined by high (+) or low (-) levels of the selected pair of markers (ADTs).
a list of four elements, one for each quadrant into which the thresholds divide the plotting space, in order I, II, III, IV, corresponding respectively to the +/+, +/-, -/+, -/- combinations for the pair of selected ADTs. Each element of the list contains two objects, one with the list of detected barcodes and one indicating the percentage of barcodes falling into that quadrant.
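The quadrant logic can be sketched in base R as follows. This is an illustration only, not the implementation of getCellGroups: the helper name quadrant_gate, the element names bc and prc, the assumption that ADTs are rows and cells are columns, and the treatment of values exactly on a threshold as negative are all choices made for this example. No plots are drawn.

```r
# Hypothetical helper: split cells into four sign combinations of two markers.
# Assumes 'mat' has ADTs as rows and cell barcodes as column names.
quadrant_gate <- function(mat, adt1, adt2, th1, th2) {
  x <- mat[adt1, ]
  y <- mat[adt2, ]
  groups <- list(
    "+/+" = x >  th1 & y >  th2,
    "+/-" = x >  th1 & y <= th2,
    "-/+" = x <= th1 & y >  th2,
    "-/-" = x <= th1 & y <= th2
  )
  # For each combination, report the barcodes and their share of all cells.
  lapply(groups, function(sel) {
    list(bc = colnames(mat)[sel], prc = 100 * sum(sel) / ncol(mat))
  })
}

# Invented CLR-like values for four cells.
mat <- matrix(
  c(1.0,  0.9, -0.5, -0.3,
    0.5, -0.2,  0.8, -0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("CD19", "CD3"), paste0("cell", 1:4))
)

res <- quadrant_gate(mat, adt1 = "CD19", adt2 = "CD3", th1 = 0.2, th2 = 0)
res[["+/-"]]$bc
```

Naming the elements by sign combination rather than by quadrant number avoids any ambiguity about how the quadrants are ordered.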
GTseq assembles data on-the-fly from ExperimentHub to provide a MultiAssayExperiment container. The DataType argument provides access to the mouse_embryo_8_cell dataset as obtained from Macaulay et al. (2015). Protocol information for this dataset is available from Macaulay et al. (2016). See references.
GTseq( DataType = "mouse_embryo_8_cell", modes = "*", version = "1.0.0", dry.run = TRUE, verbose = TRUE, ... )
DataType |
character(1) Indicates study that produces this type of data (default: 'mouse_embryo_8_cell') |
modes |
character() A wildcard / glob pattern of modes, such as
|
version |
character(1). Currently, only version '1.0.0'. |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
G&T-seq is a combination of Picoplex amplified gDNA sequencing (genome) and SMARTSeq2 amplified cDNA sequencing (transcriptome) of the same cell. For more information, see Macaulay et al. (2015).
mouse_embryo_8_cell: this dataset was filtered for bad cells as specified in Macaulay et al. (2015).
genomic - integer copy numbers as detected from scDNA-seq
transcriptomic - raw read counts as quantified from scRNA-seq
A single cell multi-modal MultiAssayExperiment or informative data.frame when dry.run is TRUE
The MultiAssayExperiment metadata includes the original function call and the data version requested.
https://www.ebi.ac.uk/ena/browser/view/PRJEB9051
Macaulay et al. (2015) G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat Methods, 12:519–22.
Macaulay et al. (2016) Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq. Nat Protoc, 11:2081–103.
SingleCellMultiModal-package
GTseq()
The 'ontomap' function provides a mapping of all the cell names across all the data sets or for a specified data set.
ontomap(dataset = c("scNMT", "scMultiome", "SCoPE2", "CITEseq", "seqFISH"))
dataset |
'character()' One of the existing functions within the package. If missing, a map of all cell types in each function will be provided. |
Note that 'CITEseq' does not have any cell annotations; therefore, no entries are present in the 'ontomap'.
A 'data.frame' of metadata with cell types and ontologies
ontomap(dataset = "scNMT")
Managing data downloads is important to save disk space and avoid re-downloading data files. This can be done effortlessly via the integrated BiocFileCache system.
scmmCache(...)
setCache( directory = tools::R_user_dir("SingleCellMultiModal", "cache"), verbose = TRUE, ask = interactive() )
removeCache(accession)
... |
For |
directory |
character(1) The file location where the cache is located.
Once set, future downloads will go to this folder. See |
verbose |
Whether to print descriptive messages |
ask |
logical(1) (default TRUE when |
accession |
character(1) A single string indicating the accession number of the study |
The directory / option of the cache location
Get the directory location of the cache. It will prompt the user to create a cache if not already created. A specific directory can be used via setCache.
Specify the directory location of the data cache. By default, it will go into the user's home and package name directory as given by R_user_dir (default: varies by system e.g., for Linux: '$HOME/.cache/R/SingleCellMultiModal').
Some files may become corrupt when downloading; this function allows the user to delete the tarball associated with a study number in the cache.
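To see where the default cache location resolves on a given system, the following base R sketch may help. It only inspects paths and does not call any function of this package; falling back to the R_user_dir location when the "scmmCache" option is unset is an assumption made for this example, and tools::R_user_dir requires R 4.0.0 or later.

```r
# Default location as given by R_user_dir (varies by operating system).
default_dir <- tools::R_user_dir("SingleCellMultiModal", which = "cache")

# Prefer the "scmmCache" option when it is set; otherwise use the default.
# The fallback behaviour here is an assumption of this sketch.
cache_dir <- getOption("scmmCache", default = default_dir)

cache_dir
dir.exists(cache_dir)  # FALSE until a cache has been created
```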
getOption("scmmCache")
scmmCache()
10x Genomics Multiome technology enables simultaneous profiling of the transcriptome (using 3’ gene expression) and epigenome (using ATAC-seq) from single cells to deepen our understanding of how genes are expressed and regulated across different cell types. Data prepared by Ricard Argelaguet.
scMultiome( DataType = "pbmc_10x", modes = "*", version = "1.0.0", format = c("MTX", "HDF5"), dry.run = TRUE, verbose = TRUE, ... )
DataType |
character(1) Indicates study that produces this type of data (default: 'pbmc_10x') |
modes |
character() A wildcard / glob pattern of modes, such as
|
version |
character(1) Either version '1.0.0' or '2.0.0' depending on data version required (default '1.0.0'). See version section. |
format |
Either MTX or HDF5 data format (default MTX) |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
Users are able to choose from either an MTX or HDF5 file format as the internal data representation. The MTX (Matrix Market) format allows users to load a sparse dgCMatrix representation. Choosing HDF5 gives users a sparse HDF5Array class object.
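As a rough illustration of what the MTX representation looks like, the sketch below writes a small sparse matrix to a Matrix Market file and reads it back as a dgCMatrix using the Matrix package. It is independent of scMultiome and its data; the toy counts and dimension names are invented, and the HDF5 path is not shown.

```r
library(Matrix)

# Small invented count matrix stored in compressed sparse column form.
counts <- sparseMatrix(
  i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 2, 7),
  dims = c(3, 2),
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2))
)
class(counts)

# Round trip through a Matrix Market (.mtx) file.
mtx_file <- tempfile(fileext = ".mtx")
writeMM(counts, mtx_file)
restored <- as(readMM(mtx_file), "CsparseMatrix")

# The Matrix Market format does not store dimension names, so restore them.
dimnames(restored) <- dimnames(counts)
restored
```

Only non-zero entries are stored, which keeps memory use low for sparse single-cell count data.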
pbmc_10x: 10K Peripheral Blood Mononuclear Cells provided by the 10x Genomics website. Cell quality control filters are available in the object colData together with the celltype annotation labels.
A 10X PBMC MultiAssayExperiment object
scMultiome(DataType = "pbmc_10x", modes = "*", dry.run = TRUE)
scNMT assembles data on-the-fly from ExperimentHub to provide a MultiAssayExperiment container. The DataType argument provides access to the mouse_gastrulation dataset as obtained from Argelaguet et al. (2019; DOI: 10.1038/s41586-019-1825-8). Pre-processing code can be seen at https://github.com/rargelaguet/scnmt_gastrulation. Protocol information for this dataset is available at Clark et al. (2018). See the vignette for the full citation.
scNMT( DataType = "mouse_gastrulation", modes = "*", version = "1.0.0", dry.run = TRUE, verbose = TRUE, ... )
DataType |
character(1) Indicates study that produces this type of data (default: 'mouse_gastrulation') |
modes |
character() A wildcard / glob pattern of modes, such as
|
version |
character(1) Either version '1.0.0' or '2.0.0' depending on data version required (default '1.0.0'). See version section. |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
scNMT is a combination of RNA-seq (transcriptome) and an adaptation of Nucleosome Occupancy and Methylation sequencing (NOMe-seq, the methylome and chromatin accessibility) technologies. For more information, see Clark et al. (2018), DOI: 10.1038/s41467-018-03149-4.
mouse_gastrulation: this dataset provides cell quality control filters in the object colData starting from version 2.0.0. Additionally, cell type annotations are provided through the lineage colData column.
rna - RNA-seq
acc_ - chromatin accessibility
met_ - DNA methylation
cgi - CpG islands
CTCF - footprints of CTCF binding
DHS - DNase Hypersensitive Sites
genebody - gene bodies
p300 - p300 binding sites
promoter - gene promoters
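The effect of a wildcard / glob pattern in the modes argument can be sketched with base R. This is an assumption about how such patterns behave in general, not the package's own matching code: the helper select_modes is hypothetical, and the mode names are illustrative combinations of the prefixes and genomic contexts listed above.

```r
# Illustrative mode names built from the prefixes and contexts listed above.
available_modes <- c("rna", "acc_cgi", "acc_promoter", "met_cgi", "met_promoter")

# Hypothetical helper: keep the modes matching any of the glob patterns.
select_modes <- function(patterns, modes) {
  # glob2rx() translates a glob such as "acc*" into a regular expression.
  rx <- paste(utils::glob2rx(patterns), collapse = "|")
  modes[grepl(rx, modes)]
}

select_modes("*", available_modes)                 # every mode
select_modes(c("acc*", "met*"), available_modes)   # accessibility and methylation
select_modes("rna", available_modes)               # exact name only
```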
Special thanks to Al J Abadi for preparing the published data in time for the 2020 BIRS Workshop, see the link here: https://github.com/BIRSBiointegration/Hackathon/tree/master/scNMT-seq
A single cell multi-modal MultiAssayExperiment or informative data.frame when dry.run is TRUE
Version '1.0.0' of the scNMT mouse_gastrulation dataset includes all of the above mentioned assay technologies with filtering of cells based on quality control metrics. Version '2.0.0' contains all of the cells without the QC filter and does not contain CTCF binding footprints or p300 binding sites.
The MultiAssayExperiment metadata includes the original function call and the data version requested.
http://ftp.ebi.ac.uk/pub/databases/scnmt_gastrulation/
Argelaguet et al. (2019)
SingleCellMultiModal-package
scNMT(DataType = "mouse_gastrulation", modes = "*", version = "1.0.0", dry.run = TRUE)
SCoPE2 assembles data on-the-fly from ExperimentHub to provide a MultiAssayExperiment container. The DataType argument provides access to the SCoPE2 dataset as provided by Specht et al. (2020; DOI: http://dx.doi.org/10.1101/665307). The article provides more information about the data acquisition and pre-processing.
SCoPE2( DataType = "macrophage_differentiation", modes = "*", version = "1.0.0", dry.run = TRUE, verbose = TRUE, ... )
DataType |
character(1) Indicates study that produces this type of data (default: 'macrophage_differentiation') |
modes |
character() A wildcard / glob pattern of modes, such as
|
version |
character(1), currently only version '1.0.0' is available |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
The SCoPE2 study combines scRNA-seq (transcriptome) and single-cell proteomics.
macrophage_differentiation: the cells are monocytes that undergo macrophage differentiation. No annotation is available for the transcriptome data, but batch and cell type annotations are available for the proteomics data in the celltype colData column. The transcriptomics and proteomics data were not measured from the same cells but from a distinct set of cell cultures. Low-quality cells have already been filtered out of this dataset.
scRNAseq1 - single-cell transcriptome (batch 1)
scRNAseq2 - single-cell transcriptome (batch 2)
scp - single-cell proteomics
A single cell multi-modal MultiAssayExperiment or informative data.frame when dry.run is TRUE
All files are linked from the slavovlab website https://scope2.slavovlab.net/docs/data
Specht, Harrison, Edward Emmott, Aleksandra A. Petelski, R. Gray Huffman, David H. Perlman, Marco Serra, Peter Kharchenko, Antonius Koller, and Nikolai Slavov. 2020. “Single-Cell Proteomic and Transcriptomic Analysis of Macrophage Heterogeneity.” bioRxiv. https://doi.org/10.1101/665307.
SingleCellMultiModal-package
SCoPE2(DataType = "macrophage_differentiation", modes = "*", version = "1.0.0", dry.run = TRUE)
The seqFISH function assembles data on-the-fly from 'ExperimentHub' to provide a MultiAssayExperiment container. The 'DataType' argument provides access to the available datasets associated with the package.
seqFISH( DataType = "mouse_visual_cortex", modes = "*", version, dry.run = TRUE, verbose = TRUE, ... )
DataType |
character(1) indicating the identifier of the dataset to retrieve. (default "mouse_visual_cortex") |
modes |
character() The assay types or modes of data to obtain; these include seq-FISH and scRNA-seq data by default. |
version |
character(1) Either version '1.0.0' or '2.0.0' depending on data version required (default '1.0.0'). See version section. |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
seq-FISH data are a combination of single cell spatial coordinates and transcriptomics for a few hundred genes. seq-FISH data can be combined, for example, with scRNA-seq data to unveil multiple aspects of cellular behaviour based on their spatial organization and transcription.
Available datasets are:
mouse_visual_cortex: combination of seq-FISH data as obtained from Zhu et al. (2018) and scRNA-seq data as obtained from Tasic et al. (2016). Version 1.0.0 returns the full scRNA-seq data matrix, while version 2.0.0 returns the processed and subsetted scRNA-seq data matrix (produced for the Mathematical Frameworks for Integrative Analysis of Emerging Biological Data Types 2020 Workshop). The returned seqFISH data are always the processed ones for the same workshop. Additionally, cell type annotations are available in the 'colData' through the 'class' column in the seqFISH 'assay'.
scRNA_Counts - Tasic scRNA-seq gene count matrix
scRNA_Labels - Tasic scRNA-seq cell labels
seqFISH_Coordinates - Zhu seq-FISH spatial coordinates
seqFISH_Counts - Zhu seq-FISH gene counts matrix
seqFISH_Labels - Zhu seq-FISH cell labels
A MultiAssayExperiment of seq-FISH data
Dario Righelli <dario.righelli <at> gmail.com>
seqFISH(DataType = "mouse_visual_cortex", modes = "*", version = "2.0.0", dry.run = TRUE)
Combine multiple single cell modalities into one using the input of the individual functions.
SingleCellMultiModal( DataTypes, modes = "*", versions = "1.0.0", dry.run = TRUE, verbose = TRUE, ... )
DataTypes |
character() A vector of data types as indicated in each
individual function by the |
modes |
list() A list or CharacterList of modes for each data type where each element corresponds to one data type. |
versions |
character() A vector of versions for each DataType. By
default, version |
dry.run |
logical(1) Whether to return the dataset names before actual download (default TRUE) |
verbose |
logical(1) Whether to show the dataset currently being (down)loaded (default TRUE) |
... |
Additional arguments passed on to the ExperimentHub-class constructor |
A multi-modality MultiAssayExperiment
The metadata in the MultiAssayExperiment contains the original function call used to generate the object (labeled as call), a call_map which provides traceability of technology functions to DataType prefixes, and lastly, R version information as version.
SingleCellMultiModal(c("mouse_gastrulation", "pbmc_10x"), modes = list(c("acc*", "met*"), "rna"), versions = c("2.0.0", "1.0.0"), dry.run = TRUE, verbose = TRUE )
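The example call supplies one modes element and one version per data type. A base R sketch of that element-wise pairing is shown below; it only arranges the inputs for inspection and does not reproduce how the package dispatches to the individual functions. The requests list and its field names are hypothetical.

```r
# Inputs taken from the example call.
data_types <- c("mouse_gastrulation", "pbmc_10x")
modes      <- list(c("acc*", "met*"), "rna")
versions   <- c("2.0.0", "1.0.0")

# This sketch expects one modes element and one version per data type.
stopifnot(
  length(modes) == length(data_types),
  length(versions) == length(data_types)
)

# Pair the arguments position by position; names come from data_types.
requests <- Map(
  function(dt, md, ver) list(DataType = dt, modes = md, version = ver),
  data_types, modes, versions
)

requests$mouse_gastrulation$modes
requests$pbmc_10x$version
```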