Package 'TCGAutils' reference manual

Title:	TCGA utility functions for data management
Description:	A suite of helper functions for checking and manipulating TCGA data, including data obtained from the curatedTCGAData experiment package. These functions aim to make working with TCGA data simpler and more manageable. Exported functions import data from flat files into Bioconductor objects, convert row annotations, and translate identifiers via the GDC API.
Authors:	Marcel Ramos [aut, cre], Lucas Schiffer [aut], Sean Davis [ctb], Levi Waldron [aut]
Maintainer:	Marcel Ramos <[email protected]>
License:	Artistic-2.0
Version:	1.27.6
Built:	2025-02-15 03:00:26 UTC
Source:	https://github.com/waldronlab/TCGAutils

TCGAutils: Helper functions for working with TCGA and MultiAssayExperiment data

Description

TCGAutils is a toolbox for working with TCGA-specific datasets. It allows the user to manipulate and translate TCGA barcodes, conveniently convert a list of data files to a GRangesList, and take datasets from GISTIC and return a SummarizedExperiment class object. The package also provides functions for working with data from the curatedTCGAData experiment data package, including convenience functions for extracting subtype metadata and adding clinical data to existing MultiAssayExperiment objects.

Author(s)

Maintainer: Marcel Ramos [email protected] (ORCID)

Authors:

Lucas Schiffer
Levi Waldron

Other contributors:

Sean Davis [contributor]

Utilities for working with HUMAN genome builds

Description

A few functions are available to search for build versions, either from NCBI or UCSC.

translateBuild: translates between UCSC and NCBI build versions
extractBuild: use grep patterns to find the first build within the string input
uniformBuilds: replace build occurrences below a threshold level of occurrence with the alternative build
correctBuild: Ensure that the build annotation is correct based on the NCBI/UCSC website. If not, use translateBuild with the indicated 'style' input
isCorrect: Check to see if the build is exactly as annotated

Usage

translateBuild(from, to = c("UCSC", "NCBI"))

correctBuild(build, style = c("UCSC", "NCBI"))

isCorrect(build, style = c("UCSC", "NCBI"))

extractBuild(string, build = c("UCSC", "NCBI"))

uniformBuilds(builds, cutoff = 0.2, na = c("", "NA"))

Arguments

`from`	character() A vector of build versions typically from `genome()` (e.g., "37"). The build vector must be homogeneous (i.e., `length(unique(x)) == 1L`).
`to`	character(1) The name of the desired build version (either "UCSC" or "NCBI"; default: "UCSC")
`build`	A vector of build version names (default UCSC, NCBI)
`style`	character(1) The annotation style, either 'UCSC' or 'NCBI'
`string`	A single character string
`builds`	A character vector of builds
`cutoff`	numeric(1L) An inclusive threshold tolerance value for missing values and translating builds that are below the threshold
`na`	character() The values to be considered as missing (default: c("", "NA"))

Details

The correctBuild function takes the input and ensures that the style specified matches the input. Otherwise, it will return the correct style for use with seqlevelsStyle. Currently, the function does not support patched builds (e.g., 'GRCh38.p13'). Build names are taken from the website: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/
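
The correspondence between UCSC and NCBI build names can be pictured as a small lookup table. The following is a minimal base R sketch under stated assumptions, not the package's implementation: `build_table` and `sketch_translate` are hypothetical names, and only three well-known build pairs are listed.

```r
# Hypothetical lookup table of equivalent human genome build names
build_table <- data.frame(
    UCSC = c("hg18", "hg19", "hg38"),
    NCBI = c("NCBI36", "GRCh37", "GRCh38"),
    stringsAsFactors = FALSE
)

# Translate build names to the requested style; NA when no match is found
sketch_translate <- function(from, to = c("UCSC", "NCBI")) {
    to <- match.arg(to)
    other <- setdiff(c("UCSC", "NCBI"), to)
    build_table[[to]][match(from, build_table[[other]])]
}

sketch_translate("GRCh37", "UCSC")
sketch_translate("hg38", "NCBI")
# A patched build name has no exact match in this table
sketch_translate("GRCh38.p13", "UCSC")
```

In this sketch an unmatched name, such as a patched build, simply yields NA, which mirrors the limitation on patched builds noted above.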

Value

translateBuild: A character vector of translated genome builds

extractBuild: A character string of the build information available

uniformBuilds: A character vector of builds where all builds are identical, i.e., `identical(length(unique(build)), 1L)`

correctBuild: A character string of the 'corrected' build name

isCorrect: A logical indicating if the build is exactly as annotated
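
The thresholding idea behind uniformBuilds can be sketched in base R. This is an illustration only and may differ from the package's actual logic: `sketch_uniform` is a hypothetical helper that relabels the minority build with the majority label when its proportion is at or below the cutoff, and it leaves missing values as NA.

```r
# Hypothetical helper: harmonize a vector that mixes two build labels
sketch_uniform <- function(builds, cutoff = 0.2, na = c("", "NA")) {
    builds[builds %in% na] <- NA_character_
    counts <- table(builds)                    # NA values are not counted
    if (length(counts) < 2L)
        return(builds)
    props <- setNames(as.numeric(counts) / sum(counts), names(counts))
    minor <- names(props)[which.min(props)]
    major <- names(props)[which.max(props)]
    # Inclusive threshold: relabel the minority build when it is rare enough
    if (props[[minor]] <= cutoff)
        builds[!is.na(builds) & builds == minor] <- major
    builds
}

buildvec <- rep(c("GRCh37", "hg19"), times = c(5, 1))
sketch_uniform(buildvec)
```

Here "hg19" makes up one sixth of the vector, which is below the default cutoff of 0.2, so the result contains a single build label.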

Examples


translateBuild("GRCh35", "UCSC")


correctBuild("grch38", "NCBI")
correctBuild("hg19", "NCBI")


isCorrect("GRCh38", "NCBI")

isCorrect("hg19", "UCSC")


extractBuild(
"SCENA_p_TCGAb29and30_SNP_N_GenomeWideSNP_6_G05_569110.nocnv_grch38.seg.txt"
)


buildvec <- rep(c("GRCh37", "hg19"), times = c(5, 1))
uniformBuilds(buildvec)

navec <- c(rep(c("GRCh37", "hg19"), times = c(5, 1)), "NA")
uniformBuilds(navec)

Clinical dataset names in TCGA

Description

A dataset of names for each of the TCGA cancer codes available. These names were obtained from the clinical datasets returned by getFirehoseData. They serve to subset the current datasets provided by curatedTCGAData.

Usage

data("clinicalNames")

Format

A CharacterList of names for 33 cancer codes

Value

The clinical dataset column names in TCGA as provided by the RTCGAToolbox

Helper functions for managing MultiAssayExperiment from curatedTCGAData

Description

Additional helper functions for cleaning and uncovering metadata within a downloaded MultiAssayExperiment from curatedTCGAData.

Usage

getSubtypeMap(multiassayexperiment)

getClinicalNames(diseaseCode)

TCGAsplitAssays(multiassayexperiment, sampleCodes = NULL, exclusive = FALSE)

sampleTables(multiassayexperiment, vial = FALSE)

Arguments

`multiassayexperiment`	A `MultiAssayExperiment` object
`diseaseCode`	A TCGA cancer code (e.g., "BRCA")
`sampleCodes`	character (default NULL) A string of sample type codes (refer to `data(sampleTypes)`; `TCGAsplitAssays` section)
`exclusive`	logical (default FALSE) Whether to return only assays that contain all codes in `sampleCodes`
`vial`	(logical default FALSE) whether to display vials in the table output

Details

Note that for getSubtypeMap, the column of in-data variable names may need to go through make.names to be found in the colData of the MultiAssayExperiment.
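
As a small illustration of that note, base R's make.names turns free-text labels into syntactically valid names. The variable names below are hypothetical and only stand in for the kind of values a subtype map might contain.

```r
# Hypothetical in-data variable names as they might appear in a subtype map
vars <- c("Histology", "MGMT promoter status", "G-CIMP methylation")

# Spaces and hyphens are replaced by dots
valid <- make.names(vars)
valid
```

The converted names can then be compared against the column names of the colData to find which variables are present.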

Value

getSubtypeMap: A data.frame with explanatory names and their in-data variable names. They may not be present for all cancer types.
getClinicalNames: A vector of common variable names that may be found across several cancer disease codes.

getSubtypeMap

provides a two column data.frame with interpreted names and in-data variable names. 'Name' usually refers to the colData row names a.k.a. the patientID.

getClinicalNames

provides a vector of common variable names that exist in the colData DataFrame of a curatedTCGAData MultiAssayExperiment object. These variables are directly obtained from the BroadFirehose clinical data (downloaded with getFirehoseData) and tend to be present across cancer disease codes.

TCGAsplitAssays

Separates samples by indicated sample codes into different assays in a MultiAssayExperiment. Refer to the sampleTypes data object for a list of available codes. This operation generates n times the number of assays based on the number of sample codes entered. By default, all assays will be split by samples present in the data.
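
The grouping by sample code can be illustrated with base R alone. This sketch is not the implementation of TCGAsplitAssays; it only shows where the two-digit sample type code sits in a full barcode (characters 14 and 15) and how barcodes group by that code.

```r
barcodes <- c(
    "TCGA-B0-5117-11A-01D-1421-08",
    "TCGA-B0-5094-11A-01D-1421-08",
    "TCGA-E9-A295-10A-01D-A16D-09",
    "TCGA-AA-3678-01A-01R-0905-07"
)

# Characters 14-15 of a full barcode hold the two-digit sample type code
codes <- substr(barcodes, 14L, 15L)

# Group the barcodes by sample type code
by_code <- split(barcodes, codes)
by_code
```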

sampleTables

Display all the available samples in each of the assays

Examples


library(curatedTCGAData)

gbm <- curatedTCGAData("GBM", c("RPPA*", "CNA*"), version = "2.0.1", FALSE)

getSubtypeMap(gbm)

sampleTables(gbm)

TCGAsplitAssays(gbm, c("01", "10"))

getClinicalNames("COAD")

TCGA Cancer Disease Codes Table

Description

A dataset for obtaining the cancer codes in TCGA for about 13 different types of cancers.

Usage

data("diseaseCodes")

Format

A data frame with 37 rows and 4 variables:

Study.Abbreviation: Disease Code used in TCGA
Available: Cancer datasets available via curatedTCGAData
SubtypeData: Subtype curation data available via curatedTCGAData
Study.Name: The full length study name (i.e., type of cancer)

Value

The TCGA diseaseCodes table

Source

https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations

Obtain minimum necessary names for the creation of a GRangesList object

Description

This function attempts to match chromosome, start position, end position and strand names in the given character vector. Modified helper from the GenomicRanges package.

Usage

findGRangesCols(
  df_colnames,
  seqnames.field = c("seqnames", "seqname", "chromosome", "chrom", "chr",
    "chromosome_name", "seqid", "om"),
  start.field = "start",
  end.field = c("end", "stop"),
  strand.field = "strand",
  ignore.strand = FALSE
)

Arguments

`df_colnames`	A `character` vector of names in a dataset
`seqnames.field`	A `character` vector of the chromosome name
`start.field`	A `character` vector that indicates the column name of the start positions of ranged data
`end.field`	A `character` vector that indicates the end position of ranged data
`strand.field`	A `character` vector of the column name that indicates the strand type
`ignore.strand`	logical (default FALSE) whether to ignore the strand field in the data

Value

Index positions vector indicating columns with appropriate names
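
The kind of name matching involved can be sketched with base R. This is a simplified illustration under stated assumptions, not the function's actual matching rules or output format: `find_cols_sketch` is a hypothetical helper that lowercases the names and looks for a few common patterns.

```r
# Hypothetical helper: locate range-defining columns by name
find_cols_sketch <- function(df_colnames) {
    lower <- tolower(df_colnames)
    c(
        seqnames = match(TRUE,
            lower %in% c("seqnames", "chromosome", "chrom", "chr")),
        start = grep("^start", lower)[1L],
        end = grep("^(end|stop)", lower)[1L],
        strand = match("strand", lower)
    )
}

myDataColNames <- c("Start_position", "End_position", "strand",
    "chromosome", "num_probes", "segment_mean")
idx <- find_cols_sketch(myDataColNames)
idx
```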

Examples

myDataColNames <- c("Start_position", "End_position", "strand",
                 "chromosome", "num_probes", "segment_mean")
findGRangesCols(myDataColNames)

Create a sampleMap from an experiment list and phenoData dataframe

Description

This function helps create a sampleMap in preparation for building a MultiAssayExperiment object. This is especially useful when the sample identifiers are not very different, as in the case of TCGA barcodes. An idConverter function can be provided to truncate such sample identifiers and obtain patient identifiers.
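
For TCGA barcodes, a typical converter keeps the first 12 characters, which identify the participant. The sketch below builds a sampleMap-like table by hand in base R; it does not call generateMap, and the barcodes and the `to_patient` helper are hypothetical.

```r
# Hypothetical assay column names (full TCGA barcodes)
assay_cols <- list(
    CNV = c("TCGA-B0-5117-11A-01D-1421-08", "TCGA-B0-5094-11A-01D-1421-08"),
    RNA = c("TCGA-B0-5117-01A-01R-1421-07")
)

# The first 12 characters of a barcode identify the patient
to_patient <- function(x) substr(x, 1L, 12L)

# Columns follow the MultiAssayExperiment sampleMap layout
sample_map <- data.frame(
    assay = rep(names(assay_cols), lengths(assay_cols)),
    primary = to_patient(unlist(assay_cols, use.names = FALSE)),
    colname = unlist(assay_cols, use.names = FALSE),
    stringsAsFactors = FALSE
)
sample_map
```

A function such as `to_patient` is the sort of value one would pass as idConverter so that sample identifiers line up with the row names of the colData.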

Usage

generateMap(
  experiments,
  colData,
  idConverter = identity,
  sampleCol,
  patientCol,
  ...
)

Arguments

`experiments`	A named `list` of experiments compatible with the `MultiAssayExperiment` API
`colData`	A `data.frame` of clinical data with patient identifiers as rownames
`idConverter`	A function to be used against the sample or specimen identifiers to match those in the rownames of the `colData` (default `identity`)
`sampleCol`	A single string indicating the sample identifiers column in the colData dataset
`patientCol`	A single string indicating the patient identifiers in colData, "row.names" extracts the colData row names
`...`	Additional arguments to pass to the 'idConverter' function.

Value

A DataFrame class object of mapped samples and patient identifiers including assays

Author(s)

M. Ramos, M. Morgan, L. Schiffer

Examples

## Minimal example
expList <- list(assay1 = matrix(1:6, ncol = 2L,
        dimnames = list(paste0("feature", 1:3), c("A-J", "B-J"))),
    assay2 = matrix(1:4, ncol = 2,
        dimnames = list(paste0("gene", 1:2), c("A-L", "B-L"))))

## Mock colData
myPheno <- data.frame(var1 = c("Yes", "No"), var2 = c("High", "Low"),
    row.names = c("a", "b"))

## A look at the identifiers
vapply(expList, colnames, character(2L))
rownames(myPheno)

## Use 'idConverter' to correspond sample names to patient identifiers
generateMap(expList, myPheno,
    idConverter = function(x) substr(tolower(x), 1L, 1L))

Find the file names used in RTCGAToolbox

Description

Part of this function is from the RTCGAToolbox. It aims to extract the file name used inside of the getFirehoseData function. The arguments of the function parallel those in the getFirehoseData function. It is only available for select data types.

Usage

getFileName(
  disease,
  runDate = "20160128",
  dataType = c("CNASNP", "CNVSNP", "CNAseq", "CNACGH", "Mutation")
)

Arguments

`disease`	The TCGA cancer disease code, e.g., "COAD"
`runDate`	The single `string` used in the `getFirehoseData` function (default "20160128")
`dataType`	A single character vector (default "CNASNP") indicating the data type for which to get the source file name

Value

A single character file name

Examples


getFileName("COAD", dataType = "CNASNP")

Translate study identifiers from barcode to UUID and vice versa

Description

These functions allow the user to enter a character vector of identifiers and use the GDC API to translate from TCGA barcodes to Universally Unique Identifiers (UUID) and vice versa. These relationships are not one-to-one. Therefore, a data.frame is returned for all inputs. The UUID to TCGA barcode translation only applies to file and case UUIDs. Two-way UUID translation is available from 'file_id' to 'case_id' and vice versa. Please double check any results before using these features for analysis. Case / submitter identifiers are translated by default, see the from_type argument for details. All identifiers are converted to lower case.

Usage

UUIDtoBarcode(id_vector, from_type = c("case_id", "file_id", "aliquot_ids"))

UUIDtoUUID(id_vector, to_type = c("case_id", "file_id"))

barcodeToUUID(barcodes)

filenameToBarcode(filenames, slides = FALSE)

UUIDhistory(id, endpoint = .HISTORY_ENDPOINT)

Arguments

`id_vector`	character() A vector of UUIDs corresponding to either files or cases (default assumes case_ids)
`from_type`	character(1) One of `case_id`, `file_id`, or `aliquot_ids` indicating the type of `id_vector` entered (default `"case_id"`)
`to_type`	character(1) The desired UUID type to obtain, can either be `"case_id"` (default) or `"file_id"`
`barcodes`	character() A vector of TCGA barcodes
`filenames`	`character()` A vector of file names usually obtained from a `GenomicDataCommons` query
`slides`	`logical(1L)` DEPRECATED: Whether the provided file names correspond to slides typically with an `.svs` extension. Note: The barcodes returned correspond 1:1 with the `filename` inputs. Always triple check the output against the Genomic Data Commons Data Portal by searching the file name and comparing the associated "Entity ID" with the `submitter_id` given by the function.
`id`	character(1) A UUID whose history of versions is sought
`endpoint`	character(1) Generally a constant pertaining to the location of the history api endpoint. This argument rarely needs to change.

Details

Based on the file UUID supplied, the appropriate entity_id (TCGA barcode) is returned. In previous versions of the package, the 'end_point' parameter would require the user to specify what type of barcode needed. This is no longer supported as entity_id returns the appropriate one.

When providing slide file names, the function will only work if all the provided files are slide files with an .svs extension.

Value

Generally, a data.frame of identifier mappings

UUIDhistory: A data.frame containing a list of associated UUIDs for the given input along with file_change status, data_release versions, etc.

Author(s)

Sean Davis, M. Ramos

Examples

## Translate UUIDs >> TCGA Barcode

uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
"5ca9fa79-53bc-4e91-82cd-5715038ee23e",
"b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")

UUIDtoBarcode(uuids, from_type = "file_id")

UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")

UUIDtoBarcode("d85d8a17-8aea-49d3-8a03-8f13141c163b", "aliquot_ids")

## Translate file UUIDs >> case UUIDs

uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
"5ca9fa79-53bc-4e91-82cd-5715038ee23e",
"b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")

UUIDtoUUID(uuids)

## Translate TCGA Barcode >> UUIDs

fullBarcodes <- c("TCGA-B0-5117-11A-01D-1421-08",
"TCGA-B0-5094-11A-01D-1421-08",
"TCGA-E9-A295-10A-01D-A16D-09")

sample_ids <- TCGAbarcode(fullBarcodes, sample = TRUE)

barcodeToUUID(sample_ids)

participant_ids <- c("TCGA-CK-4948", "TCGA-D1-A17N",
"TCGA-4V-A9QX", "TCGA-4V-A9QM")

barcodeToUUID(participant_ids)

library(GenomicDataCommons)

### Query CNV data and get file names

cnv <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-COAD" &
        data_category == "Copy Number Variation" &
        data_type == "Copy Number Segment"
    ) |>
    results(size = 6)

filenameToBarcode(cnv$file_name)

### Query slides data and get file names

slides <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-BRCA" &
        cases.samples.sample_type == "Primary Tumor" &
        data_type == "Slide Image" &
        experimental_strategy == "Diagnostic Slide"
    ) |>
    results(size = 3)

filenameToBarcode(slides$file_name, slides = TRUE)

## Get the version history of a BAM file in TCGA-KIRC
UUIDhistory("0001801b-54b0-4551-8d7a-d66fb59429bf")

This function imputes assay values inside a `MultiAssayExperiment`

Description

This function allows the user to enter a MultiAssayExperiment and impute all the NA values inside its assays.

Usage

imputeAssay(multiassayexperiment, i = 1, ...)

Arguments

`multiassayexperiment`	A `MultiAssayExperiment` with genes in the rows, samples in the columns
`i`	A numeric, logical, or character vector indicating the assays to perform imputation on (default 1L)
`...`	Arguments passed on to `impute::impute.knn`

data: An expression matrix with genes in the rows, samples in the columns
k: Number of neighbors to be used in the imputation (default=10)
rowmax: The maximum percent missing data allowed in any row (default 50%). Any rows with more than rowmax% missing are imputed using the overall mean per sample.
colmax: The maximum percent missing data allowed in any column (default 80%). If any column has more than colmax% missing data, the program halts and reports an error.
maxp: The largest block of genes imputed using the knn algorithm inside impute.knn (default 1500); larger blocks are divided by two-means clustering (recursively) prior to imputation. If maxp=p, only knn imputation is done.
rng.seed: The seed used for the random number generator (default 362436069) for reproducibility.
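
The rowmax rule can be illustrated on a toy matrix with base R. This is a sketch of the rule as described above, not of the internals of impute::impute.knn; the matrix values are made up, and rowmax is expressed as a fraction (0.5).

```r
# Toy matrix: genes in rows, samples in columns
x <- matrix(
    c(1, NA, NA, NA,
      2,  4,  6,  8,
      4,  6,  8, 10),
    nrow = 3L, byrow = TRUE,
    dimnames = list(paste0("gene", 1:3), paste0("s", 1:4))
)

rowmax <- 0.5
frac_missing <- rowMeans(is.na(x))
sparse_rows <- frac_missing > rowmax       # gene1 has 3 of 4 values missing
col_means <- colMeans(x, na.rm = TRUE)     # overall mean per sample

# Fill only the missing cells of the sparse rows with the sample means
for (r in which(sparse_rows)) {
    miss <- is.na(x[r, ])
    x[r, miss] <- col_means[miss]
}
x
```

Rows at or below the rowmax threshold would instead be handled by the nearest-neighbor step, which this sketch does not attempt to reproduce.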

Value

A MultiAssayExperiment with imputed assay values

Examples


example(getSubtypeMap)

## convert data to matrix and add as experiment
gbm <-
  c(gbm, RPPA_matrix = data.matrix(assay(gbm[["GBM_RPPAArray-20160128"]])))

imputeAssay(gbm, i = "RPPA_matrix")

Make a GRangesList from TCGA Copy Number data

Description

makeGRangesListFromCopyNumber allows the user to convert objects of class data.frame or S4Vectors::DataFrame to a GRangesList. It includes additional features specific to TCGA data, such as Hugo symbols, probe numbers, segment means, and UCSC build (if available).
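
The role of the sample identifier column can be pictured with a plain base R split. The table below is hypothetical, with column names chosen for illustration, and the sketch stops short of creating GRanges objects.

```r
# Hypothetical segmented copy number table covering two samples
seg <- data.frame(
    Sample = c("S1", "S1", "S2"),
    Chromosome = c("1", "2", "1"),
    Start = c(100L, 500L, 150L),
    End = c(400L, 900L, 300L),
    Num_Probes = c(10L, 25L, 8L),
    Segment_Mean = c(0.12, -0.40, 0.05),
    stringsAsFactors = FALSE
)

# One element per sample, analogous to one set of ranges per list element
by_sample <- split(seg, seg[["Sample"]])
n_segments <- vapply(by_sample, nrow, integer(1L))
n_segments
```

In this picture, "Sample" plays the part of split.field, and each list element corresponds to one element of the resulting GRangesList.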

Usage

makeGRangesListFromCopyNumber(
  df,
  split.field,
  names.field = "Hugo_Symbol",
  ...
)

Arguments

`df`	A `data.frame` or `DataFrame` class object. `list` class objects are coerced to `data.frame` or `DataFrame`.
`split.field`	A `character` vector of length one indicating the column to be used as sample identifiers
`names.field`	A `character` vector of length one indicating the column to be used as names for each of the ranges in the data
`...`	Additional arguments to pass on to GenomicRanges::makeGRangesListFromDataFrame

Value

A GRangesList class object

Examples

library(GenomicDataCommons)

manif <- files() |>
    filter(~ cases.project.project_id == "TCGA-COAD" &
        data_type == "Copy Number Segment") |>
    manifest(size = 1)

fname <- gdcdata(manif$id)

barcode <- UUIDtoBarcode(names(fname), from_type = "file_id")
barcode <- barcode[["associated_entities.entity_submitter_id"]]

cndata <- read.delim(fname[[1L]], nrows = 10L)

cngrl <- makeGRangesListFromCopyNumber(cndata, split.field = "GDC_Aliquot",
    keep.extra.columns = TRUE)

names(cngrl) <- barcode
GenomeInfoDb::genome(cngrl) <- extractBuild(fname[[1L]])
cngrl

Read exon-level expression files and create a `GRangesList`

Description

This function serves to read exon-level expression data. It works for exon quantification (raw counts and RPKM) and junction quantification (raw counts) file paths and represents such data as a GRangesList. The data files can be downloaded via the Genomic Data Commons (GDC) Legacy Archive.

Usage

makeGRangesListFromExonFiles(
  filepaths,
  sampleNames = NULL,
  fileNames = basename(filepaths),
  getBarcodes = TRUE,
  rangesColumn = "exon",
  nrows = Inf
)

Arguments

`filepaths`	character() vector of file paths containing TCGA exon data usually obtained from the GDC
`sampleNames`	character() vector of TCGA barcodes to be used as names for the `GRangesList` output (default NULL)
`fileNames`	character() vector of file names as downloaded from the Genomic Data Commons Legacy archive (default `basename(filepaths)`)
`getBarcodes`	logical(1). Whether to query the GDC API with the `filenameToBarcode` and obtain the TCGA barcodes from the file names (default TRUE); see details.
`rangesColumn`	character(1). The name of the column in the data containing the ranges information (default "exon"); see details.
`nrows`	numeric(1). The number of rows to return from each of the files read in (all rows by default; default Inf)

Details

The rangesColumn name in the GDC data files is usually "exon" but can be changed with the rangesColumn argument, if different. To avoid programmatically obtaining TCGA barcodes from the GDC API, set the getBarcodes to FALSE. When getBarcodes is set to FALSE, the file names are used to name the elements of the GRangesList output.
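
To give a sense of what the ranges column holds, the sketch below parses a few range strings with base R. It assumes entries of the form "chr:start-end:strand"; that format and the example values are assumptions for illustration, and the code is not the parsing used by the function.

```r
# Hypothetical values of the "exon" column
exon <- c("chr1:11874-12227:+", "chr1:12595-12721:+", "chr2:38814-41627:-")

# Split into chromosome, coordinates, and strand
parts <- strsplit(exon, ":", fixed = TRUE)
coords <- strsplit(vapply(parts, `[`, character(1L), 2L), "-", fixed = TRUE)

ranges <- data.frame(
    seqnames = vapply(parts, `[`, character(1L), 1L),
    start = as.integer(vapply(coords, `[`, character(1L), 1L)),
    end = as.integer(vapply(coords, `[`, character(1L), 2L)),
    strand = vapply(parts, `[`, character(1L), 3L),
    stringsAsFactors = FALSE
)
ranges
```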

Value

A GRangesList object

Author(s)

M. Ramos

Examples


## Load example file found in package
pkgDir <- system.file("extdata", package = "TCGAutils", mustWork = TRUE)
exonFile <- list.files(pkgDir, pattern = "cation\\.txt$", full.names = TRUE)

filePrefix <- "unc.edu.32741f9a-9fec-441f-96b4-e504e62c5362.1755371."

## Add actual file name manually (due to Windows OS restriction)
makeGRangesListFromExonFiles(exonFile,
    fileNames = paste0(filePrefix, basename(exonFile)),
    sampleNames = "TCGA-AA-3678-01A-01R-0905-07")

Take a MultiAssayExperiment and include curated variables

Description

This function works on the colData of a MultiAssayExperiment object to merge curated variable columns or other clinical variables that the user would like to add. It is recommended that the user run the scripts in the MultiAssayExperiment.TCGA repository that build the "enhanced" type of data, but this is not necessary if using different clinical data. Please see the repository's README for more information.

Usage

mergeColData(MultiAssayExperiment, colData)

Arguments

`MultiAssayExperiment`	A `MultiAssayExperiment` object
`colData`	A `DataFrame` or `data.frame` to merge with clinical data in the `MultiAssayExperiment` object

Value

A MultiAssayExperiment object

Examples


library(MultiAssayExperiment)

mergeColData(MultiAssayExperiment(), S4Vectors::DataFrame())

OncoPrint for TCGA Mutation Assays

Description

OncoPrint for TCGA Mutation Assays

Usage

oncoPrintTCGA(
  multiassayexperiment,
  matchassay = "*_Mutation-*",
  variantCol = "Variant_Classification",
  brewerPal = "Set3",
  ntop = 25,
  incl.thresh = 0.01,
  rowcol = "Hugo_Symbol"
)

Arguments

`multiassayexperiment`	A `MultiAssayExperiment`, usually from `curatedTCGAData`
`matchassay`	character(1) The name of the assay containing mutation data; this can be a pattern (e.g., "*_Mutation-*", the default)
`variantCol`	character(1) The name of the metadata column containing the mutation categories, usually "Variant_Classification" in TCGA
`brewerPal`	character(1) The name of the `RColorBrewer::brewer.pal` palette, (default: "Set3")
`ntop`	integer(1) The number of the top N genes for displaying based on per-sample mutation frequency
`incl.thresh`	double(1) The inclusion threshold for empirical mutations, mutations less frequent than this value will not be included
`rowcol`	character(1) The name of the column in the metadata to annotate the rows with either "Hugo_Symbol" (default) or

Value

An oncoPrint plot of mutations

Examples


library(curatedTCGAData)

acc <- curatedTCGAData("ACC", "Mutation", version = "1.1.38", FALSE)

oncoPrintTCGA(acc)

Barcode Sample Type Table

Description

A dataset that contains the mappings for sample codes in the TCGA barcodes.

Usage

data("sampleTypes")

Format

A data frame with 19 rows and 3 variables:

Code: Two digit code number found in the barcode
Definition: Long name for the sample type
Short.Letter.Code: Letter code for the sample type

Value

The TCGA sampleTypes table

Source

https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes

Functions to convert row annotations to ranges and RaggedExperiment to RangedSummarizedExperiment

Description

This group of functions will convert row annotations, given as either gene symbols or miRNA symbols, to row ranges based on the database resources in the 'TxDb' and 'org.Hs' packages. It will also simplify the representation of RaggedExperiment objects to RangedSummarizedExperiment.

Usage

simplifyTCGA(obj, keep.assay = FALSE, unmapped = TRUE)

symbolsToRanges(obj, keep.assay = FALSE, unmapped = TRUE)

mirToRanges(obj, keep.assay = FALSE, unmapped = TRUE)

CpGtoRanges(obj, keep.assay = FALSE, unmapped = TRUE)

qreduceTCGA(obj, keep.assay = FALSE, suffix = "_simplified")

Arguments

`obj`	A `MultiAssayExperiment` object obtained from `curatedTCGAData`
`keep.assay`	logical (default FALSE) Whether to keep the `SummarizedExperiment` assays that have been converted to `RangedSummarizedExperiment`
`unmapped`	logical (default TRUE) Include an assay of data that was not able to be mapped in reference database
`suffix`	character (default "_simplified") A character string to append to the newly modified assay for `qreduceTCGA`.

Details

The original SummarizedExperiment containing either gene symbol or miR annotations is replaced or supplemented by a RangedSummarizedExperiment for those that could be mapped to GRanges, and optionally another SummarizedExperiment for annotations that could not be mapped to GRanges.

Value

A MultiAssayExperiment with any gene expression, miRNA, copy number, and mutations converted to RangedSummarizedExperiment objects

qreduceTCGA

Using TxDb.Hsapiens.UCSC.hg19.knownGene as the reference, qreduceTCGA reduces the data by applying either the weightedmean or nonsilent function (see below) to non-mutation or mutation data, respectively. Internally, it uses RaggedExperiment::qreduceAssay() to reduce the ranges to the gene-level.

qreduceTCGA will update genome(x) based on the NCBI reference annotation which includes the patch number, e.g., GRCh37.p14, as provided by the seqlevelsStyle setter, seqlevelsStyle(gn) <- "NCBI". qreduceTCGA uses the NCBI genome annotation as the default reference.

nonsilent <- function(scores, ranges, qranges)
    any(scores != "Silent")

RaggedExperiment mutation objects become a genes by patients RangedSummarizedExperiment object containing '1' if there is a non-silent mutation somewhere in the gene, and '0' otherwise as obtained from the Variant_Classification column in the data.
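As a minimal illustration in base R, the reducer can be called directly on made-up classification values (hypothetical data; in practice it is applied through RaggedExperiment::qreduceAssay()). A gene is flagged as soon as one overlapping mutation is not "Silent":

```r
nonsilent <- function(scores, ranges, qranges)
    any(scores != "Silent")

## Hypothetical Variant_Classification values for mutations overlapping a gene
gene_a <- c("Silent", "Missense_Mutation")
gene_b <- c("Silent", "Silent")

## 'ranges' and 'qranges' are not used by this reducer, so NULL is passed here
flag_a <- as.integer(nonsilent(gene_a, NULL, NULL))
flag_b <- as.integer(nonsilent(gene_b, NULL, NULL))
c(gene_a = flag_a, gene_b = flag_b)
```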

weightedmean <- function(scores, ranges, qranges) {
    isects <- GenomicRanges::pintersect(ranges, qranges)
    sum(scores * BiocGenerics::width(isects)) /
        sum(BiocGenerics::width(isects))
}

"CNA" and "CNV" segmented copy number are reduced using a weighted mean in the rare cases of overlapping (non-disjoint) copy number regions.
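The arithmetic behind the weighted mean can be sketched in base R. The coordinates and scores below are hypothetical, and the intersection widths are computed by hand with pmin() and pmax() rather than with GenomicRanges::pintersect(), so this is only an approximation of what the reducer does internally:

```r
## Hypothetical copy number segments (1-based, closed intervals) and scores
seg_start <- c(101, 201)
seg_end   <- c(250, 500)
scores    <- c(0.2, -0.4)

## Hypothetical gene (query) range
gene_start <- 151
gene_end   <- 550

## Width of each segment's intersection with the gene
isect_width <- pmin(seg_end, gene_end) - pmax(seg_start, gene_start) + 1

## Width-weighted mean of the segment scores
wmean <- sum(scores * isect_width) / sum(isect_width)
wmean
```

Here the two segments cover 100 and 300 bases of the gene, so the longer segment dominates and the weighted mean is (0.2 * 100 - 0.4 * 300) / 400 = -0.25.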

These functions rely on TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db to map to the 'hg19' (NCBI GRCh37) build. Use the liftOver procedure for datasets that are provided against a different reference genome (usually 'hg18'). See an example in the vignette.

Author(s)

L. Waldron

Examples


library(curatedTCGAData)
library(GenomeInfoDb)

accmae <-
    curatedTCGAData(diseaseCode = "ACC",
    assays = c("CNASNP", "Mutation", "miRNASeqGene", "GISTICT"),
    version = "1.1.38",
    dry.run = FALSE)

## update genome annotation
rex <- accmae[["ACC_Mutation-20160128"]]

## Translate build to "hg19"
tgenome <- vapply(genome(rex), translateBuild, character(1L))
genome(rex) <- tgenome

accmae[["ACC_Mutation-20160128"]] <- rex

simplifyTCGA(accmae)

Parse data from TCGA barcode

Description

This function returns the specified snippet of information obtained from the TCGA barcode.

Usage

TCGAbarcode(
  barcodes,
  participant = TRUE,
  sample = FALSE,
  portion = FALSE,
  plate = FALSE,
  center = FALSE,
  index = NULL
)

Arguments

`barcodes`	A character vector of TCGA barcodes
`participant`	Logical (default TRUE) participant identifier chunk
`sample`	Logical (default FALSE) includes the numeric sample code of the barcode and the vial letter
`portion`	Logical (default FALSE) includes the portion and analyte codes of the barcode
`plate`	Logical (default FALSE) returns the plate value
`center`	Logical (default FALSE) returns a matrix with the plate and center codes
`index`	An optional numeric vector indicating barcode positions when split by the delimiter (i.e., hyphen '-'). For example, an index of `c(1, 2)` corresponds to 'TCGA-ZZ' in `TCGA-ZZ-A1A1`.
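The positional meaning of `index` can be illustrated with a base R sketch. It does not call TCGAbarcode(); `pick_chunks` is a hypothetical helper written only for this illustration:

```r
barcode <- "TCGA-B0-5117-11A-01D-1421-08"

## Split on the hyphen delimiter; each element is one barcode position
chunks <- strsplit(barcode, "-", fixed = TRUE)[[1L]]

## Hypothetical helper: keep the requested positions and re-join them
pick_chunks <- function(chunks, index)
    paste(chunks[index], collapse = "-")

pick_chunks(chunks, c(1, 2))    # "TCGA-B0"
pick_chunks(chunks, 1:3)        # "TCGA-B0-5117", the participant identifier
```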

Value

A character vector or data matrix of TCGA barcode information

Author(s)

M. Ramos

Examples

barcodes <- c("TCGA-B0-5117-11A-01D-1421-08",
"TCGA-B0-5094-11A-01D-1421-08",
"TCGA-E9-A295-10A-01D-A16D-09")

## Patient identifiers
TCGAbarcode(barcodes)

## Sample identifiers
TCGAbarcode(barcodes, sample = TRUE)

Extract biospecimen data from the TCGA barcode

Description

This function uses the full TCGA barcode to return a data frame of the data pertinent to laboratory variables such as vials, portions, analytes, plates and the center.

Usage

TCGAbiospec(barcodes)

Arguments

`barcodes`	A character vector of TCGA barcodes

Value

A data frame with sample type, sample code, portion, plate, and center columns.

Author(s)

M. Ramos

Examples

example("TCGAbarcode")
TCGAbiospec(barcodes)

Select primary tumors from TCGA datasets

Description

Tumor selection is decided using the sampleTypes data. For 'LAML' datasets, the primary tumor code used is "03"; otherwise, "01" is used.

Usage

TCGAprimaryTumors(multiassayexperiment)

Arguments

`multiassayexperiment`	A `MultiAssayExperiment` with TCGA data as obtained from `curatedTCGAData::curatedTCGAData()`

Value

A MultiAssayExperiment containing only primary tumor samples

Examples


example(getSubtypeMap)

TCGAprimaryTumors(gbm)

Select samples from barcodes from lookup table

Description

The TCGA barcode contains several pieces of information which can be parsed by the TCGAbarcode function. To select a specific type of sample, pass the appropriate codes from the lookup table to the sampleCodes argument. See the lookup table in data("sampleTypes"). Barcode inputs can be a character vector or a CharacterList object.

Usage

TCGAsampleSelect(barcodes, sampleCodes)

Arguments

`barcodes`	Either a TCGA barcode vector or CharacterList containing patient identifiers, sample, portion, plate, and center codes.
`sampleCodes`	Either a character or numeric vector of TCGA sample codes. See the `sampleTypes` dataset.

Value

A logical vector or LogicalList of the same length as 'barcodes' indicating sample type matches
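The matching logic can be sketched in base R for the character-vector case. This sketch does not call TCGAsampleSelect(); `select_codes` is a hypothetical helper, and zero-padding numeric codes is an assumption made here so that a numeric 1 matches the code "01":

```r
barcodes <- c("TCGA-B0-5117-11A-01D-1421-08",
    "TCGA-B0-5094-11A-01D-1421-08",
    "TCGA-E9-A295-10A-01D-A16D-09")

## The fourth hyphen-delimited chunk holds the sample code and vial, e.g. "11A"
sample_chunk <- vapply(
    strsplit(barcodes, "-", fixed = TRUE), `[`, character(1L), 4L
)
sample_code <- substr(sample_chunk, 1L, 2L)

## Hypothetical helper: numeric codes are zero-padded before matching
select_codes <- function(sample_code, sampleCodes) {
    if (is.numeric(sampleCodes))
        sampleCodes <- sprintf("%02d", as.integer(sampleCodes))
    sample_code %in% sampleCodes
}

## One logical value per barcode
select_codes(sample_code, c(11, 1))
```

The result has the same length as `barcodes`, with TRUE for the two barcodes carrying sample code "11" and FALSE for the one carrying "10".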

Examples


example("TCGAbarcode")
TCGAsampleSelect(barcodes, c(11, 01))

Minimize the number of variables in colData

Description

This function removes variables that have a large fraction of missing values and variables that match the given keywords.

Usage

trimColData(
  multiassayexperiment,
  maxNAfrac = 0.2,
  keystring = c("portion", "analyte")
)

Arguments

`multiassayexperiment`	A `MultiAssayExperiment` object with `colData`
`maxNAfrac`	(numeric default 0.2) A decimal between 0 and 1 indicating the fraction of NA values allowed per column
`keystring`	(character) A vector of keywords to match and remove variables

Value

A MultiAssayExperiment object
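The two filters can be sketched on a plain data frame in base R. This is an illustration under stated assumptions, not the package implementation: `clin` is a made-up table standing in for colData(), `trim_columns` is a hypothetical helper, and both the "drop if either condition holds" rule and the `<=` comparison are assumptions:

```r
## Hypothetical clinical table standing in for colData()
clin <- data.frame(
    age = c(50, 61, NA, 47, 58, 66),
    stage = c(NA, NA, NA, "II", "III", "I"),
    portion_weight = c(1.2, 0.9, 1.1, 1.0, 1.3, 0.8),
    sex = c("F", "M", "F", "F", "M", "M")
)

## Hypothetical helper; the exact comparison used by trimColData() may differ
trim_columns <- function(df, maxNAfrac = 0.2,
    keystring = c("portion", "analyte"))
{
    ## Fraction of missing values in each column
    fracNA <- colMeans(is.na(df))
    ## Columns whose names match any of the keywords
    hasKey <- grepl(paste(keystring, collapse = "|"), names(df))
    df[, fracNA <= maxNAfrac & !hasKey, drop = FALSE]
}

kept <- names(trim_columns(clin))
kept
```

In this made-up table, `stage` is dropped because half of its values are missing and `portion_weight` is dropped because its name matches a keyword, leaving `age` and `sex`.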

Examples


example(getSubtypeMap)

(gbm_trimmed <- trimColData(gbm))

head(colData(gbm_trimmed))[1:5]
