Analysis of RNA-Seq Data

Author

Piero Palacios Bernuy

Published

March 7, 2024

Abstract

This document is part of a series of the analysis of Omics data. Especifically, here is showed how to analyze bulk RNA-Seq data with Bioconductor packages. Also, it’s showcased how to make plots of the RNA data in the context of differentially gene expression and gene-sets.

Keywords

RNA, gene expression, gene sets, ontologies, network

1 Introduction

The analysis of RNA-Seq data involves two distinct parts. The first one needs the use of servers or HPC (high performance computers) and has to do with quality control. pre-processing, alignment-mapping (usually with the STAR software) and counting (can be done using RSEM software). The second one is called downstream analysis and this part involves differential gene expression, gene sets analysis, etc.

Due to the lack of facility to use a server or hpc the first part of the RNA-Seq analysis won’t be done but, you can use the rnaseq pipeline from nfcore. In my experience, the best combination is to use fastq + STAR + RSEM combination of software for this part of the analysis.

With respect to the second part of the analysis, on this document we’ll see how to analyze the airway data set which is a treatment of dexametasome on specific cell lines. Specifically, differentially gene expression, gene sets enrichment analysis and network analysis will be performed.

2 Data & Methods

We are gonna use the airway dataset from the airway package. Part of the analysis will be: - Differential gene expression using DESeq2. - Gene set analysis with fgsea from Cluster profiler. - Network analysis with networkx.

2.1 Data Loading

For all the analysis the Himes et al. (2014) dataset will be used. Also, the Love, Huber, and Anders (2014) package will be used for the differential gene expression step.

data("airway")

dds <- DESeqDataSet(se = airway, design = ~ cell + dex)
Source: Article Notebook

2.2 Pre-filtering

keep <- rowSums(counts(dds)>= 10) >= 3

dds <- dds[keep,]

2.2.1 Re-leveling Factors

dds$dex <- factor(dds$dex, levels = c("untrt","trt"))
dds$dex <- relevel(dds$dex, ref = "untrt")
dds$dex <- droplevels(dds$dex)

2.3 Quality Control of Samples

We can see how the variance is stabilized with the rlog transformation.

rld <- rlog(dds)

hex_df<-data.frame(Means=rowMeans(assay(rld)),
                   Sds=rowSds(assay(rld)))

gghex<-hex_df |> 
  ggplot(aes(Means,Sds))+
  geom_hex(alpha=0.8,bins=40)+
  guides(fill=guide_colourbar(title = "Counts"))+
  labs(x="Sorted mean of normalized counts per gene",
       y="Standard deviation of normalized counts per gene")+
  theme_minimal()+
  geom_smooth(aes(Means,Sds),colour="red",linewidth=0.5) +
  paletteer::scale_fill_paletteer_c("ggthemes::Green-Gold")

bslib::card(full_screen = T, bslib::card_title("Stabilized Variance"), plotly::ggplotly(gghex))
Stabilized Variance

Also, is important to check if appears some structure in the sample to sample distances plot.

dds<-estimateSizeFactors(dds)
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]

sampleDists <- dist(t(assay(rld)))

sampleDistMatrix <- as.matrix(sampleDists)

ggheat<-heatmaply(sampleDistMatrix,
            row_side_colors = colData(rld)[,c("dex","cell")],
            row_dend_left = F,colors = viridis::magma(n=256, alpha = 0.8, begin = 0))

bslib::card(full_screen = T, bslib::card_title("Sample-Sample Distance"), ggheat)
Sample-Sample Distance

2.4 Differential Gene Expression (DGE)

Let’s calculate differential expressed genes using shrinked log2 fold changes with threshold of 1 and a FDR of 0.05.

dds <- DESeq(dds)
res <- results(dds, lfcThreshold = 1, alpha = 0.05, test = "Wald")

res.lfc <- lfcShrink(dds, type = "apeglm", lfcThreshold = 1, coef = 5)

We can see the MA plots as a general view of up and down regulated genes.

a<-plotMA(res.lfc,alpha=0.05,returnData=T)

a$symbol<-rownames(a)
a$label<-NA
a$label[a$isDE == TRUE] <- a$symbol[a$isDE == TRUE]

p <- ggplot(a, aes(log10(mean), lfc, colour = isDE)) +
    geom_point(alpha = 0.2, size = 0.8) +
    geom_hline(aes(yintercept = 0), col = "red", alpha = 0.6) +
    geom_rug() +
    geom_hline(yintercept = c(-1, 1), color = "midnightblue") +
    scale_colour_manual(
      name = paste0("FDR = ", 0.05),
      values = c(mnsl("10PB 5/8"), mnsl("5BG 7/8")),
      labels = c("no DE", "DE")
    ) +
    labs(
      y = "Shrunken log2foldchange",
      x = "Mean of normalized counts (log10 scale)",
      title = "Bland-Altman Plot"
    )

p

2.4.1 Top 20 Expressed Genes

As a general view, we can select the top 20 expressed genes ordered by the number of counts.

df<-heatmaply::heatmaply(assay(rld)[select,],
                         col_side_colors=colData(rld)[,c("dex","cell")],
                         colors = viridis::magma(n=256, alpha = 0.8, begin = 0))

bslib::card(df, full_screen = T)

2.5 Manipulating Annotation with a SQLite DB

For this part of the analysis, is important to use annotation databases, in this case we are gonna use the human h19 genome annotation from the Carlson (2023) package.

k<-keys(org.Hs.eg.db)


a <- AnnotationDbi::select(
        org.Hs.eg.db,
        keys = k,
        keytype = "ENTREZID",
        columns = c("SYMBOL", "GENENAME")
      )
anno_df <- data.frame(
  gene_id = rownames(dds),
  gene_name = mapIds(
    org.Hs.eg.db,
    keys = rownames(dds),
    column = "SYMBOL",
    keytype = "ENSEMBL"
  ),
  gene_entrez = mapIds(
    org.Hs.eg.db,
    keys = rownames(dds),
    keytype = "ENSEMBL",
    column = "ENTREZID"
  ),
  stringsAsFactors = FALSE,
  row.names = rownames(dds)
)
res.lfc$SYMBOL <-
      anno_df$gene_name[match(rownames(res.lfc), anno_df$gene_id)]

2.6 Construction of Ranks for fgsea

res.lfc_for_ranks <-
      res.lfc%>%
      as.data.frame()%>%
      dplyr::select(SYMBOL,log2FoldChange)%>%
      na.omit()%>%
      distinct()%>%
      group_by(SYMBOL)%>%
      summarize(log2FoldChange=mean(log2FoldChange))
    
res.lfc_ranks <-
      deframe(res.lfc_for_ranks)

2.7 Data Manipulation for GeneTonic

For this part of the process, the Wu et al. (2021) package will be used to make the gene sets enrichment analysis. Is important to mention that for this analysis we are gonna analyze only the biological process ontology,

ranks_entrez<-bitr(names(res.lfc_ranks),fromType = "SYMBOL",toType = "ENTREZID",OrgDb = org.Hs.eg.db)
entrez_res.lfc_ranks<-res.lfc_ranks[ranks_entrez$SYMBOL]

# names(res.lfc_ranks)<-ranks_entrez$ENTREZID
names(entrez_res.lfc_ranks) <- ranks_entrez$ENTREZID

res.lfc_fgsea_GOclu_BP <-
      gseGO(
        geneList = sort(entrez_res.lfc_ranks, decreasing = T),
        OrgDb = org.Hs.eg.db,
        ont = "BP",
        pvalueCutoff = 1,
        verbose = F
      )
res.lfc_fgsea_GOclu_BP <-
      setReadable(res.lfc_fgsea_GOclu_BP,
                  "org.Hs.eg.db",
                  keyType = "ENTREZID")

res.lfc_fgsea_GOclu_BP_shaked <-
  shake_gsenrichResult(res.lfc_fgsea_GOclu_BP)
res.lfc_fgsea_GOclu_BP_shaked_aggr <-
  get_aggrscores(
    res_enrich = res.lfc_fgsea_GOclu_BP_shaked,
    res_de = res.lfc,
    annotation_obj = anno_df,
    aggrfun = mean
  )

gtl_res_GOclu_BP <- GeneTonic_list(
  dds = dds,
  res_de = res,
  res_enrich = res.lfc_fgsea_GOclu_BP_shaked_aggr,
  annotation_obj = anno_df
)

2.7.1 Summary Heatmap of Genesets

p<-gs_summary_heat(gtl = gtl_res_GOclu_BP,
                    n_gs=15)

p

2.7.2 GeneSets - Genes Graph

GeneTonic package have a lot of interactive plots that we can use to dig into our data. An example is a network of Gene sets and genes from which we are gonna perform a network analysis.

# number_gs <- gtl_res_GOclu_BP$res_enrich$gs_id |> unique() |> length()

ggs <- ggs_graph(gtl = gtl_res_GOclu_BP, n_gs = 500)
ggs2 <- ggs_graph(gtl = gtl_res_GOclu_BP, n_gs = 20)

data <- toVisNetworkData(ggs)
data2 <- toVisNetworkData(ggs2)

data2$nodes <- data2$nodes %>%
  mutate(group = nodetype, nodetype = NULL) %>%
  dplyr::select(-color)

p <- visNetwork(nodes = data2$nodes, edges = data2$edges) %>%
  visIgraphLayout() %>%
  visOptions(
    highlightNearest = list(
      enabled = TRUE,
      degree = 1,
      hover = TRUE
    ),
    nodesIdSelection = TRUE
  ) %>%
  visPhysics(stabilization = FALSE) %>%
  visGroups(groupname = "GeneSet", color = "#48ABC7") %>%
  visGroups(groupname = "Feature", color = "#8BE64AD0")

htmltools::tagList(list(p))

2.8 igraph Object Manipulation

In this part of the analysis, we are gonna make a network analysis with Hagberg, Schult, and Swart (2008) networkx library from python. But, before that we need to manipulate the data to have a proper input.

Let’s build a bipartite undirected adjacency matrix.

\[ A = \begin{pmatrix} 0_{r,r} & B\\ B^{T} & 0_{s,s} \end{pmatrix} \]

V(ggs)$group <- data$nodes$nodetype

V(ggs)$type_bipartite <- as.integer(V(ggs)$nodetype == "Feature")


adj_matrix_bipartite <- as.matrix(as_adjacency_matrix(ggs, sparse = FALSE)) |> as.data.frame()
import pandas as pd
import numpy as np
import networkx as nx
import seaborn as sns 
from matplotlib import pyplot as plt

Checking the degree distribution of the graph.

graph = r.adj_matrix_bipartite

G = nx.from_pandas_adjacency(graph, create_using=nx.Graph)

degree_sequence = sorted((i[1] for i in G.degree), reverse=True)

sns.kdeplot(degree_sequence)
sns.rugplot(degree_sequence)
plt.show()

In this plot, we can see that there are two sub graphs with more than 120 connected components. We are gonna work with these.

2.8.1 Computing HITS (Hubs and Authorities) Scores.

In the context of bioinformatics, applying the concept of hubs and authorities to analyze gene sets and individual genes can offer insightful perspectives on the functional importance and interaction dynamics within genetic networks.

Gene sets or genes with high hub scores may represent critical regulatory functions or be involved in central biological processes. They could be key to understanding disease mechanisms, identifying potential therapeutic targets, or uncovering fundamental aspects of cellular function.

Gene sets or genes identified as high authorities could be central to multiple biological pathways, making them potential biomarkers for diseases or targets for therapeutic intervention. Their importance in various processes makes them subjects of interest for further research and analysis.

h, a = nx.algorithms.link_analysis.hits(G, max_iter=1000000)

df_hubs = pd.DataFrame.from_dict(h, orient="index").sort_values(by=[0], ascending=False)
df_authorities = pd.DataFrame.from_dict(a, orient="index").sort_values(by=[0], ascending=False)
df_a <- py$df_hubs |> 
  rename("Hubs Centrality"="0") |> 
  slice_head(n = 50)
  

df_b <- py$df_authorities |> 
  rename("Authorities Centrality" = "0") |> 
  slice_head(n = 50)

htmltools::tagList(list(DT::datatable(df_a,extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  ))))
htmltools::tagList(list(DT::datatable(df_b,extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  ))))

We can inspect the log fold change and p-values of some genes and gene sets with high scores.

# a gene

query <- res.lfc[which(res.lfc$SYMBOL=="EPHB2"),] |> rownames()

res[which(rownames(res)==query),]
log2 fold change (MLE): dex trt vs untrt 
Wald test p-value: dex trt vs untrt 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat      pvalue
                <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSG00000133216   344.728       -1.77321  0.159709  -4.84133 1.28972e-06
                       padj
                  <numeric>
ENSG00000133216 0.000143652
# a gene set

query <- OlsSearch("cell projection morphogenesis", exact = T, ontology = "GO")
query <- olsSearch(query)

qterms <- as(query,"Terms")

termOntology(qterms)
GO:0048858 
      "go" 
termNamespace(qterms)
          GO:0048858 
"biological_process" 
termLabel(qterms)
                     GO:0048858 
"cell projection morphogenesis" 
termDesc(qterms)
                                                                                        GO:0048858 
"The process in which the anatomical structures of a cell projection are generated and organized." 

Also, we can see the network colored by these scores.

import matplotlib.colors as mcolors

def draw(G, pos, measures, measure_name, norm=False):
    
    nodes = nx.draw_networkx_nodes(G, pos, node_size=10, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=measures.keys())
    if norm == True:
      nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1, base=10))
    # labels = nx.draw_networkx_labels(G, pos, font_size=0.4)
    edges = nx.draw_networkx_edges(G, pos)

    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()
    
pos = nx.spring_layout(G)

draw(G, pos, h, 'DiGraph HITS Hubs')

draw(G, pos, a, 'DiGraph HITS Authorities')

Detecting communities within the network can helps us understand better the underlying biological process.

from community import community_louvain


partitions = community_louvain.best_partition(G)
communities = [partitions.get(node) for node in G.nodes()]
nx.set_node_attributes(G, partitions, name='community')


colors = [G._node[n]['community'] for n in G._node]
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
ax.axis('off')
(0.0, 1.0, 0.0, 1.0)
spring_pos = nx.spring_layout(G, seed=123)
n = nx.draw_networkx(G,
                     spring_pos,
                     ax=ax,
                     node_size=30,
                     with_labels=False,
                     node_color=colors)
plt.show()


df = pd.DataFrame.from_dict(partitions, orient="index")

We can see the clusters and their respective members:

df = py$df 
df = df |> 
  rename("Cluster" = "0") |> 
  mutate(Nodes = rownames(df)) |> 
  as_tibble()


df |> 
    group_by(Cluster) |> 
    summarise(n = n())
# A tibble: 12 × 2
   Cluster     n
     <dbl> <int>
 1       0    14
 2       1   347
 3       2   291
 4       3   224
 5       4   337
 6       5   372
 7       6     8
 8       7    16
 9       8   393
10       9    71
11      10    98
12      11   436
htmltools::tagList(list(DT::datatable(df, extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  ))))

Also, we can focus on the most connected component of the network

components = nx.connected_components(G)
largest_component = max(components, key=len)

print(f'The largest component have: {len(largest_component)}')
The largest component have: 1580
H = G.subgraph(largest_component)

partition = community_louvain.best_partition(H)
communities = [partition.get(node) for node in H.nodes()]

nx.set_node_attributes(H, partition, name='community')

draw(H, spring_pos, partition,'Most Connected Component Clustered by Louvain')

If instead of analysis a bipartite network we want to analyze only gene sets or only genes we need to create two matrices: genesets_coupling matrix and gene_co_expression matrix.

V(ggs)$group <- data$nodes$nodetype

# ontologies <- V(ggs)[group == "GeneSet"]
# genes <- V(ggs)[group == "Feature"]

adj_matrix_full <- as.matrix(as_adjacency_matrix(ggs, sparse = FALSE))

ontology_indices <- match(V(ggs)[group == "GeneSet"], V(ggs))
gene_indices <- match(V(ggs)[group == "Feature"], V(ggs))

adj_matrix_ontology_genes <- adj_matrix_full[ontology_indices, gene_indices] 


ontology_coupling <- adj_matrix_ontology_genes %*% t(adj_matrix_ontology_genes)

ontology_coupling <- ontology_coupling |> as.data.frame()

2.9 Gene Sets Similarity

Gene sets similarity can be thought of as the similarity between gene sets based on their shared genes. The strength of the similarity can be measured by the number of genes shared between gene sets.

From a matrix \(G_{mxn}\), with m gene sets and n genes, when can compute the gene set similarity matrix \(S_{mxm}\) (where each element \(s_{ij}\) represents the number of of shared genes between gene sets i and j) as followed:

\[ S = G*G^T \]

graph = r.ontology_coupling

G = nx.from_pandas_adjacency(graph, create_using=nx.Graph)
G.remove_edges_from(nx.selfloop_edges(G))
degree_centrality = nx.degree_centrality(G)
nx.set_node_attributes(G, degree_centrality, name='degree_centrality')

betweenness_centrality = nx.betweenness_centrality(G, normalized=True)
nx.set_node_attributes(G, betweenness_centrality, name='betweenness_centrality')

eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000000)
nx.set_node_attributes(G, eigenvector_centrality, name='eigenvector_centrality')


def graph_to_dataframe(graph):
   node_data = {node: graph.nodes[node] for node in graph.nodes()}
   df = pd.DataFrame.from_dict(node_data, orient='index')
   return df
 

df_genesets_centralities = graph_to_dataframe(G)

pos = nx.spring_layout(G)
draw(G, pos, nx.get_node_attributes(G,name="degree_centrality"),"Degree Centrality")

draw(G, pos, nx.get_node_attributes(G,name="betweenness_centrality"),"Betweenness Centrality")

draw(G, pos, nx.get_node_attributes(G,name="eigenvector_centrality"),"Eigenvector Centrality")

df_c <- py$df_genesets_centralities

df_c <- df_c |> 
  arrange(desc(betweenness_centrality))

htmltools::tagList(list(DT::datatable(df_c, extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  ))))

We can sort the table by eigenvector centrality to see which gene sets are the most important in the network

query <- OlsSearch("detoxification of inorganic compound", exact = T, ontology = "GO")
query <- olsSearch(query)

qterms <- as(query,"Terms")

termOntology(qterms)
GO:0061687 
      "go" 
termNamespace(qterms)
          GO:0061687 
"biological_process" 
termLabel(qterms)
                            GO:0061687 
"detoxification of inorganic compound" 
termDesc(qterms)
                                                                                                                                                                                                                            GO:0061687 
"Any process that reduces or removes the toxicity of inorganic compounds. These include transport of such compounds away from sensitive areas and to compartments or complexes whose purpose is sequestration of inorganic compounds." 

2.10 Gene Co-involvement

Gene co-involvement can be interpreted as the co-involvement of genes in multiple gene sets. The strength of co-involvement can be measured by the number of gene sets in which two genes are both involved.

From a matrix \(G_{mxn}\), with m gene sets and n genes, when can compute the gene set similarity matrix \(C_{mxm}\) (where each element \(c_{ij}\) represents the number of of shared genes between gene sets i and j) as followed:

\[ C = G^T*G \]

genes_coapprearence <- t(adj_matrix_ontology_genes) %*% adj_matrix_ontology_genes

genes_coapprearence <- genes_coapprearence |> as.data.frame()
graph = r.genes_coapprearence

G = nx.from_pandas_adjacency(graph, create_using=nx.Graph)
G.remove_edges_from(nx.selfloop_edges(G))
degree_centrality = nx.degree_centrality(G)
nx.set_node_attributes(G, degree_centrality, name='degree_centrality')

betweenness_centrality = nx.betweenness_centrality(G, normalized=True)
nx.set_node_attributes(G, betweenness_centrality, name='betweenness_centrality')

eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000000)
nx.set_node_attributes(G, eigenvector_centrality, name='eigenvector_centrality')


def graph_to_dataframe(graph):
   node_data = {node: graph.nodes[node] for node in graph.nodes()}
   df = pd.DataFrame.from_dict(node_data, orient='index')
   return df
 

df_genes_centralities = graph_to_dataframe(G)

pos = nx.spring_layout(G)
draw(G, pos, nx.get_node_attributes(G,name="degree_centrality"),"Degree Centrality")

draw(G, pos, nx.get_node_attributes(G,name="betweenness_centrality"),"Betweenness Centrality")

draw(G, pos, nx.get_node_attributes(G,name="eigenvector_centrality"),"Eigenvector Centrality")

df_d <- py$df_genes_centralities

df_d <- df_d |> 
  arrange(desc(betweenness_centrality))

htmltools::tagList(list(DT::datatable(df_d, extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  ))))

We can sort the table by betweenness centrality to look for genes that serve as bridges or connectors within the network, facilitating the flow of information.

query <- res.lfc[which(res.lfc$SYMBOL=="LEP"),] |> rownames()

res[which(rownames(res)==query),]
log2 fold change (MLE): dex trt vs untrt 
Wald test p-value: dex trt vs untrt 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat      pvalue
                <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSG00000174697   116.076        3.02938  0.326588   6.21389 5.16892e-10
                       padj
                  <numeric>
ENSG00000174697 8.66499e-08

3 Conclusion

  • Differential gene expression analysis was performed on airway dataset.
  • Gene sets enrichment analysis was performed on airway dataset.
  • Network analysis was performed on airway dataset.

References

Carlson, Marc. 2023. Org.hs.eg.db: Genome Wide Annotation for Human.
Hagberg, Aric A., Daniel A. Schult, and Pieter J. Swart. 2008. “Exploring Network Structure, Dynamics, and Function Using NetworkX.” In Proceedings of the 7th Python in Science Conference, edited by Gaël Varoquaux, Travis Vaught, and Jarrod Millman, 11–15. Pasadena, CA USA.
Himes, B. E., Jiang, X., Wagner, P., Hu, et al. 2014. RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE 9 (6): e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2” 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Wu, Tianzhi, Erqiang Hu, Shuangbin Xu, Meijun Chen, Pingfan Guo, Zehan Dai, Tingze Feng, et al. 2021. “clusterProfiler 4.0: A Universal Enrichment Tool for Interpreting Omics Data” 2: 100141. https://doi.org/10.1016/j.xinn.2021.100141.

Citation

BibTeX citation:
@online{palacios bernuy2024,
  author = {Palacios Bernuy, Piero},
  title = {Analysis of {RNA-Seq} {Data}},
  date = {2024-03-07},
  langid = {en},
  abstract = {This document is part of a series of the analysis of Omics
    data. Especifically, here is showed how to analyze bulk RNA-Seq data
    with Bioconductor packages. Also, it’s showcased how to make plots
    of the RNA data in the context of differentially gene expression and
    gene-sets.}
}
For attribution, please cite this work as:
Palacios Bernuy, Piero. 2024. “Analysis of RNA-Seq Data.” An Open Source Portfolio. March 7, 2024.