This data set consists of about 10K Peripheral Blood Mononuclear Cells (PBMCs) derived from a single healthy donor. It is available from the 10x Genomics website.
Provided are the RNA expression counts quantified at the gene level
and the chromatin accessibility levels quantified at the peak level.
Here we provide the default peaks called by the CellRanger software. If
you want to explore other peak definitions or chromatin accessibility
quantifications (at the promoter level, etc.), you have download the
fragments.tsv.gz
file from the 10x Genomics website.
The user can see the available dataset by using the default options
## Working on: pbmc_atac_se.rds
## Working on: pbmc_atac.mtx.gz
## Working on: pbmc_rna_se.rds
## Working on: pbmc_rna.mtx.gz
## Working on: pbmc_atac,
## pbmc_rna
## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## Working on: pbmc_atac,
## pbmc_rna
## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## Working on: pbmc_colData
## Working on: pbmc_sampleMap
## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
There are two assays: rna
and atac
, stored
as SingleCellExperiment
objects
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] atac: SingleCellExperiment with 108344 rows and 10032 columns
## [2] rna: SingleCellExperiment with 36549 rows and 10032 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
where the cells are the same in both assays:
Columns:
Lymphoid
or
Myeloid
originThe cells have not been QC-ed, choosing a minimum number of genes/peaks per cell depends is left to you! In addition, there are further quality control criteria that you may want to apply, including mitochondrial coverage, fraction of reads overlapping ENCODE Blacklisted regions, Transcription start site enrichment, etc. See suggestions below for software that can perform a semi-automated quality control pipeline
## DataFrame with 6 rows and 6 columns
## nCount_RNA nFeature_RNA nCount_ATAC nFeature_ATAC
## <integer> <integer> <integer> <integer>
## AAACAGCCAAGGAATC 8380 3308 55582 13878
## AAACAGCCAATCCCTT 3771 1896 20495 7253
## AAACAGCCAATGCGCT 6876 2904 16674 6528
## AAACAGCCAGTAGGTG 7614 3061 39454 11633
## AAACAGCCAGTTTACG 3633 1691 20523 7245
## AAACAGCCATCCAGGT 7782 3028 22412 8602
## celltype broad_celltype
## <character> <character>
## AAACAGCCAAGGAATC naive CD4 T cells Lymphoid
## AAACAGCCAATCCCTT memory CD4 T cells Lymphoid
## AAACAGCCAATGCGCT naive CD4 T cells Lymphoid
## AAACAGCCAGTAGGTG naive CD4 T cells Lymphoid
## AAACAGCCAGTTTACG memory CD4 T cells Lymphoid
## AAACAGCCATCCAGGT non-classical monocy.. Myeloid
The RNA expression consists of 36,549 genes and 10,032 cells, stored
using the dgCMatrix
sparse matrix format
## [1] 36549 10032
## [1] "atac" "rna"
Let’s do some standard dimensionality reduction plot:
sce.rna <- experiments(mae)[["rna"]]
# Normalisation
sce.rna <- logNormCounts(sce.rna)
# Feature selection
decomp <- modelGeneVar(sce.rna)
hvgs <- rownames(decomp)[decomp$mean>0.01 & decomp$p.value <= 0.05]
sce.rna <- sce.rna[hvgs,]
# PCA
sce.rna <- runPCA(sce.rna, ncomponents = 25)
# UMAP
set.seed(42)
sce.rna <- runUMAP(sce.rna, dimred="PCA", n_neighbors = 25, min_dist = 0.3)
plotUMAP(sce.rna, colour_by="celltype", point_size=0.5, point_alpha=1)
The ATAC expression consists of 108,344 peaks and 10,032 cells:
## [1] 108344 10032
Let’s do some standard dimensionality reduction plot. Note that
scATAC-seq data is sparser than scRNA-seq, almost binary. The log
normalisation + PCA approach that scater
implements for
scRNA-seq is not a good strategy for scATAC-seq data. Topic modelling or
TFIDF+SVD are a better strategy. Please see the package recommendations
below.
sce.atac <- experiments(mae)[["atac"]]
# Normalisation
sce.atac <- logNormCounts(sce.atac)
# Feature selection
decomp <- modelGeneVar(sce.atac)
hvgs <- rownames(decomp)[decomp$mean>0.25]
sce.atac <- sce.atac[hvgs,]
# PCA
sce.atac <- runPCA(sce.atac, ncomponents = 25)
# UMAP
set.seed(42)
sce.atac <- runUMAP(sce.atac, dimred="PCA", n_neighbors = 25, min_dist = 0.3)
plotUMAP(sce.atac, colour_by="celltype", point_size=0.5, point_alpha=1)
These are my personal recommendations of R-based analysis software:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.35.0 ggplot2_3.5.1
## [3] scran_1.35.0 scuttle_1.17.0
## [5] HDF5Array_1.35.1 rhdf5_2.51.0
## [7] DelayedArray_0.33.2 SparseArray_1.7.2
## [9] S4Arrays_1.7.1 abind_1.4-8
## [11] Matrix_1.7-1 RaggedExperiment_1.31.0
## [13] SingleCellExperiment_1.29.1 SingleCellMultiModal_1.19.1
## [15] MultiAssayExperiment_1.33.1 SummarizedExperiment_1.37.0
## [17] Biobase_2.67.0 GenomicRanges_1.59.1
## [19] GenomeInfoDb_1.43.1 IRanges_2.41.1
## [21] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [23] generics_0.1.3 MatrixGenerics_1.19.0
## [25] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gridExtra_2.3 formatR_1.14
## [4] rlang_1.1.4 magrittr_2.0.3 RcppAnnoy_0.0.22
## [7] compiler_4.4.2 RSQLite_2.3.8 png_0.1-8
## [10] vctrs_0.6.5 pkgconfig_2.0.3 SpatialExperiment_1.17.0
## [13] crayon_1.5.3 fastmap_1.2.0 dbplyr_2.5.0
## [16] magick_2.8.5 XVector_0.47.0 labeling_0.4.3
## [19] utf8_1.2.4 rmarkdown_2.29 ggbeeswarm_0.7.2
## [22] UCSC.utils_1.3.0 UpSetR_1.4.0 purrr_1.0.2
## [25] bit_4.5.0 xfun_0.49 bluster_1.17.0
## [28] zlibbioc_1.52.0 cachem_1.1.0 beachmat_2.23.1
## [31] jsonlite_1.8.9 blob_1.2.4 rhdf5filters_1.19.0
## [34] Rhdf5lib_1.29.0 BiocParallel_1.41.0 irlba_2.3.5.1
## [37] parallel_4.4.2 cluster_2.1.6 R6_2.5.1
## [40] bslib_0.8.0 limma_3.63.2 jquerylib_0.1.4
## [43] Rcpp_1.0.13-1 knitr_1.49 BiocBaseUtils_1.9.0
## [46] igraph_2.1.1 tidyselect_1.2.1 viridis_0.6.5
## [49] yaml_2.3.10 codetools_0.2-20 curl_6.0.1
## [52] plyr_1.8.9 lattice_0.22-6 tibble_3.2.1
## [55] withr_3.0.2 KEGGREST_1.47.0 evaluate_1.0.1
## [58] BiocFileCache_2.15.0 ExperimentHub_2.15.0 Biostrings_2.75.1
## [61] pillar_1.9.0 BiocManager_1.30.25 filelock_1.0.3
## [64] BiocVersion_3.21.1 munsell_0.5.1 scales_1.3.0
## [67] glue_1.8.0 metapod_1.15.0 maketools_1.3.1
## [70] tools_4.4.2 AnnotationHub_3.15.0 BiocNeighbors_2.1.0
## [73] sys_3.4.3 ScaledMatrix_1.15.0 locfit_1.5-9.10
## [76] buildtools_1.0.0 grid_4.4.2 colorspace_2.1-1
## [79] edgeR_4.5.0 AnnotationDbi_1.69.0 GenomeInfoDbData_1.2.13
## [82] beeswarm_0.4.0 BiocSingular_1.23.0 vipor_0.4.7
## [85] cli_3.6.3 rsvd_1.0.5 rappdirs_0.3.3
## [88] fansi_1.0.6 viridisLite_0.4.2 dplyr_1.1.4
## [91] uwot_0.2.2 gtable_0.3.6 sass_0.4.9
## [94] digest_0.6.37 ggrepel_0.9.6 dqrng_0.4.1
## [97] farver_2.1.2 rjson_0.2.23 memoise_2.0.1
## [100] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [103] statmod_1.5.0 mime_0.12 bit64_4.5.2