Back to Article
Integrative Analysis with TCGA Data
Download Source

Integrative Analysis with TCGA Data

Analysis of Mutation, Transcription and Methylation Data from The Cancer Genome Atlas (TCGA)

In [1]:
library(knitr)
Warning: package 'knitr' was built under R version 4.3.3
knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  dpi = 180,
  echo = TRUE
)

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.3.2
Warning: package 'tidyr' was built under R version 4.3.2
Warning: package 'readr' was built under R version 4.3.2
Warning: package 'purrr' was built under R version 4.3.2
Warning: package 'dplyr' was built under R version 4.3.2
Warning: package 'stringr' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_minimal())

Introduction

The Cancer Genome Atlas (TCGA) is a massive cancer genomics project compiling high-throughput multi-omic data on dozens of cancer types for public access.

We are gonna use the curatedTCGAData package to manipulate locally to multiple high-throughput datasets from the project. The package provides access to TCGA data that has been curated and stored as a MultiAssayExperiment object on the Bioconductor ExperimentHub.

First, let’s load the packages needed.

In [2]:
library(curatedTCGAData)
Loading required package: MultiAssayExperiment
Warning: package 'MultiAssayExperiment' was built under R version 4.3.1
Loading required package: SummarizedExperiment
Warning: package 'SummarizedExperiment' was built under R version 4.3.1
Loading required package: MatrixGenerics
Warning: package 'MatrixGenerics' was built under R version 4.3.1
Loading required package: matrixStats
Warning: package 'matrixStats' was built under R version 4.3.2

Attaching package: 'matrixStats'
The following object is masked from 'package:dplyr':

    count

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Warning: package 'BiocGenerics' was built under R version 4.3.1

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:lubridate':

    intersect, setdiff, union
The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors

Attaching package: 'S4Vectors'
The following objects are masked from 'package:lubridate':

    second, second<-
The following objects are masked from 'package:dplyr':

    first, rename
The following object is masked from 'package:tidyr':

    expand
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges

Attaching package: 'IRanges'
The following object is masked from 'package:lubridate':

    %within%
The following objects are masked from 'package:dplyr':

    collapse, desc, slice
The following object is masked from 'package:purrr':

    reduce
The following object is masked from 'package:grDevices':

    windows
Loading required package: GenomeInfoDb
Warning: package 'GenomeInfoDb' was built under R version 4.3.2
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
library(TCGAutils)
Warning: package 'TCGAutils' was built under R version 4.3.2
library(MultiAssayExperiment)

Download the Data

To download the data we need to use curatedTCGADatafunction. The first argument is a four letter disease (cancer) code (A complete list of disease codes used by the TCGA project are available on the NCI Genomic Data Commons website), the second argument is a vector of data types we want to download. We need to specify dry.run = FALSE to download the data.

In this specific case, we are gonna work with RNA-Seq data, mutation data and methylation data from Rectum Adenocarcinoma (READ). The clinical data is included by default.

In [3]:

readData = curatedTCGAData("READ", 
                           c("RNASeq2GeneNorm", "Mutation", "Methylation_methyl450"), 
                           dry.run = FALSE, version = "2.1.1")
Working on: READ_Mutation-20160128
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
require("RaggedExperiment")
Warning: package 'RaggedExperiment' was built under R version 4.3.1
Working on: READ_RNASeq2GeneNorm-20160128
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
Working on: READ_Methylation_methyl450-20160128
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
require("rhdf5")
Warning: package 'rhdf5' was built under R version 4.3.2
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
Loading required package: HDF5Array
Warning: package 'HDF5Array' was built under R version 4.3.2
Loading required package: DelayedArray
Warning: package 'DelayedArray' was built under R version 4.3.1
Loading required package: Matrix
Warning: package 'Matrix' was built under R version 4.3.2

Attaching package: 'Matrix'
The following object is masked from 'package:S4Vectors':

    expand
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loading required package: S4Arrays
Warning: package 'S4Arrays' was built under R version 4.3.2
Loading required package: abind

Attaching package: 'S4Arrays'
The following object is masked from 'package:abind':

    abind
The following object is masked from 'package:base':

    rowsum
Loading required package: SparseArray
Warning: package 'SparseArray' was built under R version 4.3.2

Attaching package: 'DelayedArray'
The following object is masked from 'package:purrr':

    simplify
The following objects are masked from 'package:base':

    apply, scale, sweep

Attaching package: 'HDF5Array'
The following object is masked from 'package:rhdf5':

    h5ls
Working on: READ_colData-20160128
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
Working on: READ_sampleMap-20160128
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
Working on: READ_metadata-20160128
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
harmonizing input:
  removing 1903 sampleMap rows not in names(experiments)
  removing 2 colData rownames not in sampleMap 'primary'

readData # this is a MultiAssayExperiment object
A MultiAssayExperiment object of 3 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 3:
 [1] READ_Mutation-20160128: RaggedExperiment with 22075 rows and 69 columns
 [2] READ_RNASeq2GeneNorm-20160128: SummarizedExperiment with 18115 rows and 177 columns
 [3] READ_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 106 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Review the Clinical Metadata

We can see which patients have data for each assay. The assay column gives the experiment type, the primary column gives the unique patient ID and the colname gives the sample ID used as a identifier within a given experiment.

In [4]:
sampleMap(readData)
DataFrame with 352 rows and 3 columns
                                  assay      primary                colname
                               <factor>  <character>            <character>
1   READ_Methylation_methyl450-20160128 TCGA-AF-2687 TCGA-AF-2687-01A-02D..
2   READ_Methylation_methyl450-20160128 TCGA-AF-2690 TCGA-AF-2690-01A-02D..
3   READ_Methylation_methyl450-20160128 TCGA-AF-2693 TCGA-AF-2693-01A-02D..
4   READ_Methylation_methyl450-20160128 TCGA-AF-3911 TCGA-AF-3911-01A-01D..
5   READ_Methylation_methyl450-20160128 TCGA-AF-4110 TCGA-AF-4110-01A-02D..
...                                 ...          ...                    ...
348       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02G        TCGA-AG-A02G-01
349       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02N        TCGA-AG-A02N-01
350       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02X        TCGA-AG-A02X-01
351       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A032        TCGA-AG-A032-01
352       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A036        TCGA-AG-A036-01

Not all patients have data for all assays, and some of them can have multiple data entries for one or more experiment type. This may correspond to multiple biopsies or matched tumor and normal samples from an individual patient.

In [5]:
sampleMap(readData) |> 
  as_tibble() |> 
  pull(primary) |> 
  table() |> 
  table()

  1   2   3   4 
  5 147   7   8 

We can see the metadata of the patients with colData. Note that there are more than 2000 columns of data per patient (not necessarily complete).

In [6]:
clin = colData(readData) |> 
  as_tibble()
dim(clin)
[1]  167 2260
head(colnames(clin), 10) 
 [1] "patientID"             "years_to_birth"        "vital_status"         
 [4] "days_to_death"         "days_to_last_followup" "tumor_tissue_site"    
 [7] "pathologic_stage"      "pathology_T_stage"     "pathology_N_stage"    
[10] "pathology_M_stage"    

As an example, for rectum adenocarcinoma, we can see the tumor stage.

In [7]:
clin |> 
  pull(pathology_T_stage) |> 
  table()

 t1  t2  t3  t4 t4a t4b 
  9  28 114   5   8   1 

Stage T4 have subgroups. To simplify the analysis, let’s combine all T4 tumors.

In [8]:
clin <- clin |> 
  mutate(t_stage = case_when(
    pathology_T_stage %in% c("t4","t4a","t4b") ~ "t4",
    .default = pathology_T_stage
  ))

clin$t_stage |> 
  table()

 t1  t2  t3  t4 
  9  28 114  14 

Also, we can see the vital status (alive=0, deceased=1)

In [9]:
clin$vital_status |> 
  table()

  0   1 
139  28 

Or combine tumor status and vital status.

In [10]:
table(clin$t_stage, clin$vital_status)
    
      0  1
  t1  9  0
  t2 24  4
  t3 96 18
  t4  9  5

Analyzing Mutation Data

Let’s begin analyzing the mutation data. Below is the code to retrieve the mutation data.

In [11]:
mut_data = readData[[1]]

mut_data
class: RaggedExperiment 
dim: 22075 69 
assays(34): Hugo_Symbol Entrez_Gene_Id ... COSMIC_Gene Drug_Target
rownames: NULL
colnames(69): TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10
  ... TCGA-AG-A032-01 TCGA-AG-A036-01
colData names(0):

From the inspection of the sample IDs we can see that the mutation colnames match to the primary column from he clinical data.

In [12]:
mut_sample_ids = colnames(mut_data)
head(mut_sample_ids)
[1] "TCGA-AF-2689-01A-01W-0831-10" "TCGA-AF-2691-01A-01W-0831-10"
[3] "TCGA-AF-2692-01A-01W-0831-10" "TCGA-AF-3400-01A-01W-0831-10"
[5] "TCGA-AF-3913-01"              "TCGA-AG-3574-01A-01W-0831-10"
head(clin$patientID)
[1] "TCGA-AF-2687" "TCGA-AF-2689" "TCGA-AF-2690" "TCGA-AF-2691" "TCGA-AF-2692"
[6] "TCGA-AF-2693"

We need to manipulate these by substracting 12 characters.

In [13]:
mut_sample_ids <- mut_sample_ids |> 
  stringr::str_sub(1,12)

all(mut_sample_ids %in% clin$patientID)
[1] TRUE

Is important to note that the data stored in assay(mut_data) is difficult to work with because is a sparse matrix that has a row for each GRanges with a mutation in at least one sample.

In [14]:
assay(mut_data)[1:3,1:3]
                      TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10
X:54241715-54241716:+ "WNK3"                       NA                          
7:111899511:+         "IFRD1"                      NA                          
7:113309556:+         "PPP1R3A"                    NA                          
                      TCGA-AF-2692-01A-01W-0831-10
X:54241715-54241716:+ NA                          
7:111899511:+         NA                          
7:113309556:+         NA                          
# Is a sparse matrix

assay(mut_data)[1,] |> 
  table(useNA="ifany")

WNK3 <NA> 
   1   68 

We can get more information if we look at the mutation information for each patient.

In [15]:
mut_assay = mut_data@assays

mut_assay # GRangesList
GRangesList object of length 69:
$`TCGA-AF-2689-01A-01W-0831-10`
GRanges object with 64 ranges and 34 metadata columns:
       seqnames            ranges strand | Hugo_Symbol Entrez_Gene_Id
          <Rle>         <IRanges>  <Rle> | <character>    <character>
   [1]        X 54241715-54241716      + |        WNK3          65267
   [2]        7         111899511      + |       IFRD1           3475
   [3]        7         113309556      + |     PPP1R3A           5506
   [4]        7         128146325      + |     FAM71F1          84691
   [5]        7         156447624      + |        NOM1          64434
   ...      ...               ...    ... .         ...            ...
  [60]       10          50616059      + |       OGDHL          55753
  [61]       10          96343335      + |       HELLS           3070
  [62]        5         139173166      + |        PSD2          84249
  [63]        5         147800932      + |      FBXO38          81545
  [64]        5          54365356      + |        GZMK           3003
             Center  NCBI_Build Variant_Classification Variant_Type
        <character> <character>            <character>  <character>
   [1] hgsc.bcm.edu          36        Frame_Shift_Ins          INS
   [2] hgsc.bcm.edu          36      Missense_Mutation          SNP
   [3] hgsc.bcm.edu          36      Missense_Mutation          SNP
   [4] hgsc.bcm.edu          36                 Silent          SNP
   [5] hgsc.bcm.edu          36      Missense_Mutation          SNP
   ...          ...         ...                    ...          ...
  [60] hgsc.bcm.edu          36                 Silent          SNP
  [61] hgsc.bcm.edu          36                 Silent          SNP
  [62] hgsc.bcm.edu          36      Nonsense_Mutation          SNP
  [63] hgsc.bcm.edu          36                 Silent          SNP
  [64] hgsc.bcm.edu          36      Missense_Mutation          SNP
       Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2    dbSNP_RS
            <character>       <character>       <character> <character>
   [1]                -                 -                 T       novel
   [2]                A                 A                 G       novel
   [3]                T                 T                 C       novel
   [4]                G                 G                 A       novel
   [5]                A                 A                 G       novel
   ...              ...               ...               ...         ...
  [60]                G                 G                 A       novel
  [61]                T                 T                 C       novel
  [62]                C                 C                 T       novel
  [63]                G                 G                 A       novel
  [64]                T                 T                 A       novel
       dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1
            <character>                 <character>            <character>
   [1]          unknown      TCGA-AF-2689-10A-01W..                      -
   [2]          unknown      TCGA-AF-2689-10A-01W..                      A
   [3]          unknown      TCGA-AF-2689-10A-01W..                      T
   [4]          unknown      TCGA-AF-2689-10A-01W..                      G
   [5]          unknown      TCGA-AF-2689-10A-01W..                      A
   ...              ...                         ...                    ...
  [60]          unknown      TCGA-AF-2689-10A-01W..                      G
  [61]          unknown      TCGA-AF-2689-10A-01W..                      T
  [62]          unknown      TCGA-AF-2689-10A-01W..                      C
  [63]          unknown      TCGA-AF-2689-10A-01W..                      G
  [64]          unknown      TCGA-AF-2689-10A-01W..                      T
       Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2
                  <character>              <character>              <character>
   [1]                      -                        -                        T
   [2]                      A                        A                        G
   [3]                      T                        T                        C
   [4]                      G                        .                        .
   [5]                      A                        A                        G
   ...                    ...                      ...                      ...
  [60]                      G                        .                        .
  [61]                      T                        .                        .
  [62]                      C                        C                        T
  [63]                      G                        G                        A
  [64]                      T                        T                        A
       Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2
                         <character>                   <character>
   [1]                             -                             -
   [2]                             A                             A
   [3]                             T                             T
   [4]                             .                             .
   [5]                             A                             A
   ...                           ...                           ...
  [60]                             .                             .
  [61]                             .                             .
  [62]                             C                             C
  [63]                             G                             G
  [64]                             T                             T
       Verification_Status Validation_Status Mutation_Status Sequencing_Phase
               <character>       <character>     <character>      <character>
   [1]             Unknown             Valid         Somatic          Phase_I
   [2]             Unknown             Valid         Somatic          Phase_I
   [3]             Unknown             Valid         Somatic          Phase_I
   [4]             Unknown           Unknown         Somatic          Phase_I
   [5]             Unknown             Valid         Somatic          Phase_I
   ...                 ...               ...             ...              ...
  [60]             Unknown           Unknown         Somatic          Phase_I
  [61]             Unknown           Unknown         Somatic          Phase_I
  [62]             Unknown             Valid         Somatic          Phase_I
  [63]             Unknown             Valid         Somatic          Phase_I
  [64]             Unknown             Valid         Somatic          Phase_I
       Sequence_Source Validation_Method       Score    BAM_file   Sequencer
           <character>       <character> <character> <character> <character>
   [1]         Capture               454                               SOLID
   [2]         Capture               454                               SOLID
   [3]         Capture               454                               SOLID
   [4]         Capture                 .                               SOLID
   [5]         Capture               454                               SOLID
   ...             ...               ...         ...         ...         ...
  [60]         Capture                 .                               SOLID
  [61]         Capture                 .                               SOLID
  [62]         Capture               454                               SOLID
  [63]         Capture               454                               SOLID
  [64]         Capture               454                               SOLID
       TranscriptID        Exon     ChromChange    AAChange COSMIC_Codon
        <character> <character>     <character> <character>  <character>
   [1] NM_001002838      exon23 c.4999_5000insA   p.L1667fs            .
   [2]    NM_001550      exon10        c.A1043G     p.E348G            .
   [3]    NM_002711       exon2         c.A838G     p.T280A            .
   [4]    NM_032599       exon3         c.G639A     p.T213T            .
   [5]    NM_138400       exon5        c.A1651G     p.T551A            .
   ...          ...         ...             ...         ...          ...
  [60] NM_001143996      exon18        c.C2286T     p.S762S            .
  [61]    NM_018063      exon18        c.T2061C     p.D687D            .
  [62]    NM_032289       exon3         c.C460T     p.Q154X            .
  [63]    NM_205836      exon21        c.G3327A    p.R1109R            .
  [64]    NM_002104       exon5         c.T640A     p.S214T            .
       COSMIC_Gene Drug_Target
       <character> <character>
   [1]           .           .
   [2]           .           .
   [3]           .           .
   [4]           .           .
   [5]           .           .
   ...         ...         ...
  [60]           .           .
  [61]           .           .
  [62]           .           .
  [63]           .           .
  [64]           .           .
  -------
  seqinfo: 24 sequences from NCBI36 genome; no seqlengths

...
<68 more elements>
mut_assay |> class()
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"
length(mut_assay)
[1] 69

Let’s inspect the data from the first patient. We can see from the metadata information the Hugo Symbol, mutation status and predicted effect of each mutation at variant classification.

In [16]:
mut_assay[[1]]
GRanges object with 64 ranges and 34 metadata columns:
       seqnames            ranges strand | Hugo_Symbol Entrez_Gene_Id
          <Rle>         <IRanges>  <Rle> | <character>    <character>
   [1]        X 54241715-54241716      + |        WNK3          65267
   [2]        7         111899511      + |       IFRD1           3475
   [3]        7         113309556      + |     PPP1R3A           5506
   [4]        7         128146325      + |     FAM71F1          84691
   [5]        7         156447624      + |        NOM1          64434
   ...      ...               ...    ... .         ...            ...
  [60]       10          50616059      + |       OGDHL          55753
  [61]       10          96343335      + |       HELLS           3070
  [62]        5         139173166      + |        PSD2          84249
  [63]        5         147800932      + |      FBXO38          81545
  [64]        5          54365356      + |        GZMK           3003
             Center  NCBI_Build Variant_Classification Variant_Type
        <character> <character>            <character>  <character>
   [1] hgsc.bcm.edu          36        Frame_Shift_Ins          INS
   [2] hgsc.bcm.edu          36      Missense_Mutation          SNP
   [3] hgsc.bcm.edu          36      Missense_Mutation          SNP
   [4] hgsc.bcm.edu          36                 Silent          SNP
   [5] hgsc.bcm.edu          36      Missense_Mutation          SNP
   ...          ...         ...                    ...          ...
  [60] hgsc.bcm.edu          36                 Silent          SNP
  [61] hgsc.bcm.edu          36                 Silent          SNP
  [62] hgsc.bcm.edu          36      Nonsense_Mutation          SNP
  [63] hgsc.bcm.edu          36                 Silent          SNP
  [64] hgsc.bcm.edu          36      Missense_Mutation          SNP
       Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2    dbSNP_RS
            <character>       <character>       <character> <character>
   [1]                -                 -                 T       novel
   [2]                A                 A                 G       novel
   [3]                T                 T                 C       novel
   [4]                G                 G                 A       novel
   [5]                A                 A                 G       novel
   ...              ...               ...               ...         ...
  [60]                G                 G                 A       novel
  [61]                T                 T                 C       novel
  [62]                C                 C                 T       novel
  [63]                G                 G                 A       novel
  [64]                T                 T                 A       novel
       dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1
            <character>                 <character>            <character>
   [1]          unknown      TCGA-AF-2689-10A-01W..                      -
   [2]          unknown      TCGA-AF-2689-10A-01W..                      A
   [3]          unknown      TCGA-AF-2689-10A-01W..                      T
   [4]          unknown      TCGA-AF-2689-10A-01W..                      G
   [5]          unknown      TCGA-AF-2689-10A-01W..                      A
   ...              ...                         ...                    ...
  [60]          unknown      TCGA-AF-2689-10A-01W..                      G
  [61]          unknown      TCGA-AF-2689-10A-01W..                      T
  [62]          unknown      TCGA-AF-2689-10A-01W..                      C
  [63]          unknown      TCGA-AF-2689-10A-01W..                      G
  [64]          unknown      TCGA-AF-2689-10A-01W..                      T
       Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2
                  <character>              <character>              <character>
   [1]                      -                        -                        T
   [2]                      A                        A                        G
   [3]                      T                        T                        C
   [4]                      G                        .                        .
   [5]                      A                        A                        G
   ...                    ...                      ...                      ...
  [60]                      G                        .                        .
  [61]                      T                        .                        .
  [62]                      C                        C                        T
  [63]                      G                        G                        A
  [64]                      T                        T                        A
       Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2
                         <character>                   <character>
   [1]                             -                             -
   [2]                             A                             A
   [3]                             T                             T
   [4]                             .                             .
   [5]                             A                             A
   ...                           ...                           ...
  [60]                             .                             .
  [61]                             .                             .
  [62]                             C                             C
  [63]                             G                             G
  [64]                             T                             T
       Verification_Status Validation_Status Mutation_Status Sequencing_Phase
               <character>       <character>     <character>      <character>
   [1]             Unknown             Valid         Somatic          Phase_I
   [2]             Unknown             Valid         Somatic          Phase_I
   [3]             Unknown             Valid         Somatic          Phase_I
   [4]             Unknown           Unknown         Somatic          Phase_I
   [5]             Unknown             Valid         Somatic          Phase_I
   ...                 ...               ...             ...              ...
  [60]             Unknown           Unknown         Somatic          Phase_I
  [61]             Unknown           Unknown         Somatic          Phase_I
  [62]             Unknown             Valid         Somatic          Phase_I
  [63]             Unknown             Valid         Somatic          Phase_I
  [64]             Unknown             Valid         Somatic          Phase_I
       Sequence_Source Validation_Method       Score    BAM_file   Sequencer
           <character>       <character> <character> <character> <character>
   [1]         Capture               454                               SOLID
   [2]         Capture               454                               SOLID
   [3]         Capture               454                               SOLID
   [4]         Capture                 .                               SOLID
   [5]         Capture               454                               SOLID
   ...             ...               ...         ...         ...         ...
  [60]         Capture                 .                               SOLID
  [61]         Capture                 .                               SOLID
  [62]         Capture               454                               SOLID
  [63]         Capture               454                               SOLID
  [64]         Capture               454                               SOLID
       TranscriptID        Exon     ChromChange    AAChange COSMIC_Codon
        <character> <character>     <character> <character>  <character>
   [1] NM_001002838      exon23 c.4999_5000insA   p.L1667fs            .
   [2]    NM_001550      exon10        c.A1043G     p.E348G            .
   [3]    NM_002711       exon2         c.A838G     p.T280A            .
   [4]    NM_032599       exon3         c.G639A     p.T213T            .
   [5]    NM_138400       exon5        c.A1651G     p.T551A            .
   ...          ...         ...             ...         ...          ...
  [60] NM_001143996      exon18        c.C2286T     p.S762S            .
  [61]    NM_018063      exon18        c.T2061C     p.D687D            .
  [62]    NM_032289       exon3         c.C460T     p.Q154X            .
  [63]    NM_205836      exon21        c.G3327A    p.R1109R            .
  [64]    NM_002104       exon5         c.T640A     p.S214T            .
       COSMIC_Gene Drug_Target
       <character> <character>
   [1]           .           .
   [2]           .           .
   [3]           .           .
   [4]           .           .
   [5]           .           .
   ...         ...         ...
  [60]           .           .
  [61]           .           .
  [62]           .           .
  [63]           .           .
  [64]           .           .
  -------
  seqinfo: 24 sequences from NCBI36 genome; no seqlengths
mut_assay[[1]]$Hugo_Symbol
 [1] "WNK3"     "IFRD1"    "PPP1R3A"  "FAM71F1"  "NOM1"     "TRIM73"  
 [7] "C7orf51"  "DIDO1"    "DNMT1"    "CALR"     "SIN3B"    "ZNF569"  
[13] "SIGLEC12" "ZNF160"   "NLRP4"    "ENPP2"    "TRPA1"    "MMP16"   
[19] "GPR89A"   "SLC9A11"  "PAX7"     "PLA2G5"   "SLC26A9"  "PIK3R3"  
[25] "LPPR4"    "BCL9L"    "PRDM11"   "PEX3"     "DCDC2"    "KRT20"   
[31] "TP53"     "CDH8"     "SF3B3"    "SLC9A10"  "SLC7A14"  "NLGN1"   
[37] "MYRIP"    "CYP8B1"   "TGM4"     "COL7A1"   "P2RX7"    "KRAS"    
[43] "TPH2"     "ANO4"     "UBR1"     "LBXCOR1"  "PDLIM5"   "ODZ1"    
[49] "SMARCA1"  "CNKSR2"   "RBM10"    "RBM3"     "HUWE1"    "CYLC1"   
[55] "ATP6V1E2" "ASTN2"    "TAF1L"    "SGCG"     "GBF1"     "OGDHL"   
[61] "HELLS"    "PSD2"     "FBXO38"   "GZMK"    
mut_assay[[1]]$Mutation_Status |> 
  table()

Somatic 
     64 
mut_assay[[1]]$Variant_Classification |> 
  table()

  Frame_Shift_Ins Missense_Mutation Nonsense_Mutation            Silent 
                1                39                 5                19 

Now, is kind of a trouble to inspect manually each patient. So, lets get all mutation information from Hugo symbol and Variant classification for all the patients.

In [17]:
var_class_df = mapply(function(sample_id, mutation_assay){
  
  d = mcols(mutation_assay)[,c("Hugo_Symbol","Variant_Classification")] |> 
    as.data.frame()
  
  colnames(d) = c("symbol","variant_class")
  
  d$patientID = sample_id
  
  return(d)
  
}, sample_id=mut_sample_ids, mutation_assay = mut_assay,SIMPLIFY = F, USE.NAMES = F)


var_class_df = do.call(rbind, var_class_df)


head(var_class_df)
   symbol     variant_class    patientID
1    WNK3   Frame_Shift_Ins TCGA-AF-2689
2   IFRD1 Missense_Mutation TCGA-AF-2689
3 PPP1R3A Missense_Mutation TCGA-AF-2689
4 FAM71F1            Silent TCGA-AF-2689
5    NOM1 Missense_Mutation TCGA-AF-2689
6  TRIM73 Nonsense_Mutation TCGA-AF-2689

We can visualize the most common mutated genes genes in rectum adenocarcinoma

In [18]:

p <- var_class_df |> 
  as_tibble() |> 
  group_by(symbol,variant_class) |> 
  summarise(n = n()) |> 
  arrange(desc(n)) |> 
  ungroup() |> 
  slice_max(order_by = n, n = 20) |> 
  ungroup() |> 
  mutate(symbol = fct_reorder(symbol,n)) |> 
  ggplot(aes(n, symbol,fill=variant_class)) +
  geom_col() +
  facet_wrap(~variant_class) +
  paletteer::scale_fill_paletteer_d("awtools::a_palette") +
  labs(x = "Samples with specific mutation", y="Gene Symbol", fill="Variant Class",
       title="Top 20 Samples with READ")
`summarise()` has grouped output by 'symbol'. You can override using the
`.groups` argument.

path <- "images/"
ggsave(paste0(path,"samples_with_mut.jpeg"), device = "jpeg")
Saving 7 x 5 in image

Samples with specific mutation per gene ## Linking mutations and tumor stage

Now that we are familiarized with the mutation data, we can link mutations to the tumor stage from the patients with rectum adenocarcinoma.

We can filter the clinical data with the patients that have mutation data.

In [19]:
index <- which( (var_class_df$patientID |> unique()) %in% clin$patientID)

clin_and_mutation = clin[index,]

head(clin_and_mutation) 
# A tibble: 6 × 2,261
  patientID    years_to_birth vital_status days_to_death days_to_last_followup
  <chr>                 <int>        <int>         <int>                 <int>
1 TCGA-AF-2687             57            0            NA                  1427
2 TCGA-AF-2689             41            1          1201                    NA
3 TCGA-AF-2690             76            1           524                    NA
4 TCGA-AF-2691             48            0            NA                  1309
5 TCGA-AF-2692             54            0            NA                   412
6 TCGA-AF-2693             75            0            NA                  1155
# ℹ 2,256 more variables: tumor_tissue_site <chr>, pathologic_stage <chr>,
#   pathology_T_stage <chr>, pathology_N_stage <chr>, pathology_M_stage <chr>,
#   gender <chr>, date_of_initial_pathologic_diagnosis <int>,
#   days_to_last_known_alive <int>, radiation_therapy <chr>,
#   histological_type <chr>, tumor_stage <chr>, residual_tumor <chr>,
#   number_of_lymph_nodes <int>, ethnicity <chr>, admin.bcr <chr>,
#   admin.day_of_dcc_upload <int>, admin.disease_code <chr>, …

We can count the number of genes with mutations per patient and then make a left_join to the clinical data to plot the mutated genes per tumor stage.

In [20]:

df = var_class_df |> 
  group_by(patientID) |> 
  summarise(n = n())

p <- clin_and_mutation |> 
  left_join(df, by = join_by(patientID)) |> 
  ggplot(aes(t_stage,log(n), fill=t_stage))+
  geom_boxplot() +
  geom_jitter(width = 0.05, colour="red2") +
  paletteer::scale_fill_paletteer_d("awtools::a_palette")+
  labs(x = "Tumor stage", y="Number of mutated genes in log scale")

path <- "images/"
ggsave(paste0(path,"number_mut_genes_per_tumor_stage.jpeg"), device = "jpeg")
Saving 7 x 5 in image
Warning: Removed 20 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_point()`).

Number of mutated genes per tumor stage

Number of mutated genes per tumor stage

Also, we can focus on a specific gene, e.g. TP53.

In [21]:

mut_on_tp53 <- var_class_df |> 
  dplyr::filter(symbol == "TP53") 


p <-clin_and_mutation |> 
  dplyr::filter(patientID %in%  mut_on_tp53$patientID) |> 
  dplyr::select(t_stage, patientID) |> 
  group_by(t_stage) |> 
  summarise(n = n()) |> 
  # mutate(t_stage = fct_reorder(t_stage,n)) |> 
  ggplot(aes(n, t_stage, fill=t_stage, label=n)) +
  geom_col() +
  geom_label(fill="white")  +
  paletteer::scale_fill_paletteer_d("awtools::a_palette")+
  labs(y = "Tumor Stage", x="N° of Mutations", fill="Tumor Stage", title="Number of Mutations on Gene TP53")

path <- "images/"
ggsave(paste0(path,"number_mut_tp53_per_tumor_stage.jpeg"), device = "jpeg")
Saving 7 x 5 in image

Number of mutations on gene TP53 per tumor stage

Number of mutations on gene TP53 per tumor stage

Linking expression and tumor stage

We also can link the expression data (RNA-Seq) with the tumor stage data from the clinical metadata.

Check the TCGA RNA-Seq protocol online
In [22]:
rnaseq = readData[[2]]

rnaseq
class: SummarizedExperiment 
dim: 18115 177 
metadata(3): filename build platform
assays(1): ''
rownames(18115): A1BG A1CF ... ZZZ3 psiTPTE22
rowData names(0):
colnames(177): TCGA-AF-2687-01 TCGA-AF-2689-11 ... TCGA-AG-A032-01
  TCGA-AG-A036-01
colData names(0):

As before, we need to shorten the sample IDs so they can match to the clinical data.

In [23]:
colnames(rnaseq) = colnames(rnaseq) |> 
  stringr::str_sub(1,12)

clin <- clin |> as.data.frame()
rownames(clin) <- clin$patientID

index <- which(colnames(rnaseq) %in% rownames(clin))

colData(rnaseq) = as(clin[colnames(rnaseq),],"DataFrame")


idx <- is.na(colData(rnaseq)$t_stage)

rnaseq <- rnaseq[,!idx]

Then, we can use the limma package to proceed with the differential expression analysis. In this case, we’ll ise the tumor stage as a variable to explain the expression of the genes.

In [24]:
library(limma)
Warning: package 'limma' was built under R version 4.3.1

Attaching package: 'limma'
The following object is masked from 'package:BiocGenerics':

    plotMA
mm = model.matrix(~t_stage, data=colData(rnaseq))
f1 = lmFit(assay(rnaseq), mm)
Warning: Partial NA coefficients for 5 probe(s)
ef1 = eBayes(f1)
topTable(ef1) |> 
  arrange(desc(AveExpr))
Removing intercept from test coefficients
           t_staget2   t_staget3  t_staget4     AveExpr         F      P.Value
PAM       -0.5791836  0.08813521  0.4984790 10.94216940  8.309141 3.409613e-05
PAIP2     -0.3240061  0.18551765  0.0958645 10.29398797  9.056896 1.331550e-05
SPP1      -0.3352086  1.56339367  2.6409082  9.61423581  9.316731 9.621982e-06
NIPSNAP3A -0.9988818 -0.35051776 -0.6085837  9.38910637  8.712515 2.051165e-05
ASPN      -0.7366657  0.75340984  1.9537316  7.45952050  8.323019 3.350393e-05
OLR1      -0.2557879  1.39542897  2.1822156  4.97144475  8.531715 2.597133e-05
SEMA5B     1.3576218  1.27252058  1.9271042  4.60337761  8.091237 4.490926e-05
CPSF4L     1.0629502 -0.17984587 -1.4274556  0.70036223  8.955085 2.311056e-05
GRM5       1.3027501 -0.14596612 -0.7395609  0.06134539 10.268851 7.301656e-06
C1orf223   0.7830779 -0.40735736 -0.7062534 -0.35136703  9.681632 1.743477e-05
           adj.P.Val
PAM       0.06860899
PAIP2     0.06719153
SPP1      0.06719153
NIPSNAP3A 0.06719153
ASPN      0.06860899
OLR1      0.06719153
SEMA5B    0.07597995
CPSF4L    0.06719153
GRM5      0.06719153
C1orf223  0.06719153

Let’s visualize two of the most expressed genes.

In [25]:
par(mfrow=c(1,2))
boxplot(split(assay(rnaseq)["PAM",], rnaseq$t_stage), main="PAM")    # higher expression in lower t_stage
boxplot(split(assay(rnaseq)["PAIP2",], rnaseq$t_stage), main="PAIP2")

Linking methylation and expression

Finally, we can use the methylation data with the expression data. This is important because methylated cytosines of the DNA change the expression of the genes.

Some patients have methylation data for multiples tissue types. This information is encoded in the fourth component of the sample names. The code 01A correspond to primary tumor samples and the code 11A correspond to normal tissue. We’ll keep the primary tumor samples.

In [26]:
methyl <- readData[[3]]

idx <- colnames(methyl) |> 
  str_split(pattern = "-") |> 
  map(4) |> 
  enframe() |> 
  unnest(value)

idx <- which(idx$value == "01A")

methyl = methyl[,idx]

methyl
class: SummarizedExperiment 
dim: 485577 95 
metadata(0):
assays(1): counts
rownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873
rowData names(3): Gene_Symbol Chromosome Genomic_Coordinate
colnames(95): TCGA-AF-2687-01A-02D-1734-05 TCGA-AF-2690-01A-02D-1734-05
  ... TCGA-G5-6572-01A-11D-1828-05 TCGA-G5-6641-01A-11D-1828-05
colData names(0):

As before, let’s truncate the names of the sample to match the clinical data.

In [27]:
colnames(methyl) <- colnames(methyl) |> 
  str_sub(start = 1,end = 12)

We can add the clinical data to the methyl object and count the number of patients with methylation and transcription data.

In [28]:
colData(methyl) <- as(clin[colnames(methyl),],"DataFrame")


intersect(colnames(methyl), colnames(rnaseq)) |> length()
[1] 93

Let’s subset common sample names and check the methylation data as row ranges.

In [29]:
methyl_subset = methyl[,which(colnames(methyl) %in% colnames(rnaseq))]
rnaseq_subset = rnaseq[,which(colnames(rnaseq) %in% colnames(methyl))]

methyl_genes = rowData(methyl_subset)
methyl_genes
DataFrame with 485577 rows and 3 columns
           Gene_Symbol  Chromosome Genomic_Coordinate
           <character> <character>        <character>
cg00000029        RBL2          16           53468112
cg00000108     C3orf35           3           37459206
cg00000109      FNDC3B           3          171916037
cg00000165          NA           1           91194674
cg00000236       VDAC3           8           42263294
...                ...         ...                ...
rs9363764           NA          NA                  0
rs939290            NA          NA                  0
rs951295            NA          NA                  0
rs966367            NA          NA                  0
rs9839873           NA          NA                  0

This function takes a gene symbol and returns a scatter plot showing the relationship between 3 different sites near that gene and gene expression.

In [30]:

me_rna_cor = function(sym, mpick = 3){
    require(GGally)
    # subset methylation data to first mpick methylation sites for given gene symbol
    methyl_ind = which(methyl_genes$Gene_Symbol == sym)
    if (length(methyl_ind) > mpick){    
        methyl_ind = methyl_ind[1:mpick]
    }
    methyl_dat = assay(methyl_subset)[methyl_ind,]    # subset to selected methylation sites

    # subset expression data to selected gene symbol
    expr_ind = which(rownames(rnaseq_subset) == sym)    
    expr_dat = assay(rnaseq_subset)[expr_ind,]

    # combine methylation and expression data as data frame
    combined_dat = as(t(methyl_dat), "DataFrame")
    combined_dat$expr = expr_dat

    # plot pairs and calculate correlation coefficients between methylation and expression
    ggpairs(combined_dat) |> print()
    sapply(1:mpick, function(i){
      
        cor_to <- as.numeric(combined_dat[,mpick+1])
        
        df <- data.frame(v1= as.numeric(combined_dat[,i]),
                         v2 = cor_to)
        
        df <- df |> na.omit()
      
        cor(df[,1], df[,2])
    })
}

me_rna_cor("TAC1", mpick=3)
Loading required package: GGally
Warning: package 'GGally' was built under R version 4.3.2
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_density()`).

[1] 0.0410095 0.1457715 0.0856926