Analysis of Hyper-parameters on Single Cell RNA-Seq Data

scrnaseq
high-dimensional
statistics
regression
Published

May 8, 2024

Code
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import scanpy as sc
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
import plotly.figure_factory as ff

Getting the Data

We will analyze three types of hyper-parameters: the perplexity of t-SNE, the number of clusters chosen by an unsupervised method (and how this choice affects the quality of the selected features), and how the number of principal components (PCs) affects the t-SNE embedding.

To perform these three tasks we will work with real data. Specifically, we will use scRNA-Seq data from a brain sample (GSM6900730). The data is available on this link.

In our case, let’s write some basic web scraping code to get the data.

Code
# download the data for the subsequent analysis
import os
import urllib.parse
from ftplib import FTP

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6900730"

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table')
links = table.find_all('a')

download_dir = os.getcwd()

for link in links:
    # Get the URL of the link
    link_url = link.get('href')
    # Check if the link contains 'ftp'
    if link_url and 'ftp' in link_url:
        # Parse the FTP URL
        link_url = urllib.parse.unquote(link_url)
        parsed_url = urllib.parse.urlparse(link_url)
        hostname = parsed_url.hostname
        path = parsed_url.path
        
        # Connect to the FTP server
        ftp = FTP(hostname)
        ftp.login()
        ftp.cwd(os.path.dirname(path))
    
        # Extract the file name from the path
        file_name = os.path.basename(path)
        local_file_path = os.path.join(download_dir, file_name)
        
        with open(local_file_path, "wb") as local_file:
            ftp.retrbinary(f"RETR {file_name}", local_file.write)
    
        ftp.quit()
        print(f"Downloaded {file_name}")
        
Downloaded GSM6900730_JLE16_B1_barcodes.tsv.gz
Downloaded GSM6900730_JLE16_B1_features.tsv.gz
Downloaded GSM6900730_JLE16_B1_matrix.mtx.gz

Now, let’s load the data (a sparse matrix) with scanpy:

Code
scdata = sc.read_10x_mtx('D:/Data Analysis Statistical Modeling and Computation in Applications/data/scrnaseq data/',
                         var_names = 'gene_symbols',
                         cache=False)

X_sc = scdata.X.toarray()
print(X_sc.shape)
(5154, 36601)

OK, now we will transform the data with log2(x + 1), due to the presence of genes with extremely high magnitudes of expression in only a few cells.

Code
X_sc_transformed = np.log2(X_sc + 1)
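
As a quick sanity check, we can compare the dynamic range of the data before and after the transform; the handful of extremely highly expressed genes should be compressed considerably (a small sketch using the arrays defined above):

Code
# compare the dynamic range of raw counts vs. log2-transformed values;
# the long right tail of the raw counts is compressed by the transform
print(f"Raw counts:  max = {X_sc.max():.0f}, 99th percentile = {np.percentile(X_sc, 99):.0f}")
print(f"log2(x + 1): max = {X_sc_transformed.max():.2f}, 99th percentile = {np.percentile(X_sc_transformed, 99):.2f}")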

We can use principal component analysis (PCA) to find how many top components are needed to explain 85% of the variability in the data.

Code
# PCA

pca = PCA()
pca_axes = pca.fit_transform(X_sc_transformed)


csum = np.cumsum(pca.explained_variance_ratio_)
plt.plot(np.arange(len(csum)), csum)  # cumulative proportion of variance
plt.xlabel('Component #')
plt.ylabel('Cumulative proportion of variance explained')
plt.title('Variance explained by # of components')
plt.show()

print(f'Variance explained 85%: {np.where(csum >= 0.85)[0][0]}')

Variance explained 85%: 1758

Let’s save the number of components in a variable. (Strictly speaking, np.where returns a 0-based index, so 1759 components reach the 85% mark; the one-component difference is negligible for what follows.)

Code
top_pca_components = np.where(csum >= 0.85)[0][0]
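
A practical aside: fitting a full PCA on a 5154 x 36601 matrix is expensive. If runtime becomes an issue, scikit-learn’s randomized solver with a capped number of components is a common shortcut (a sketch; the cap of 3000 components is an arbitrary choice that still covers the thresholds used below):

Code
# approximate PCA via randomized SVD with a fixed component budget,
# much faster than the exact solver on wide matrices
pca_fast = PCA(n_components=3000, svd_solver='randomized', random_state=0)
pca_axes_fast = pca_fast.fit_transform(X_sc_transformed)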

Analysis of Perplexities

For the first part of the analysis, we will try the following perplexities: 10, 20, 30, 40 and 50, and see how each one changes the output of the t-SNE.

Code
# helper function: fit a 2-D t-SNE and draw it on the given axes

def tsne_plotter(data, ax, per=40):

    tsne = TSNE(n_components=2, perplexity=per)

    x_tsne = tsne.fit_transform(data)

    ax.scatter(x_tsne[:, 0], x_tsne[:, 1])
    ax.set_xlabel('tSNE 1')
    ax.set_ylabel('tSNE 2')
    ax.set_title(f't-SNE Plot for log2\nTransformed Data, Perplexity: {per}')
Code
perplexities = [10,20,30,40,50]

num_plots = len(perplexities)
num_rows = 2
num_cols = 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(10,8))

for i, perplexity in enumerate(perplexities):

    row = i // num_cols
    col = i % num_cols

    tsne_plotter(pca_axes[:, 0:50], ax=axes[row, col], per=perplexity)

# hide the unused sixth panel (5 plots on a 2x3 grid)
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

We can see that increasing the perplexity makes the clusters more defined. This is consistent with the definition of the perplexity parameter in the scikit-learn documentation: “The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. The perplexity must be less than the number of samples”. We can also notice that from perplexity 30 to 50 the clusters become progressively better separated; in particular, at perplexity 50 we see six well-defined clusters, so this is the perplexity value chosen for the following analyses.
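
If we want a number to accompany the visual impression, scikit-learn exposes the final value of the t-SNE objective through the kl_divergence_ attribute. One caveat: the objective itself changes with the perplexity, so these values are only a rough diagnostic across perplexities, not a strict selection criterion (a sketch under that caveat):

Code
# fit one t-SNE per perplexity and record the final KL divergence;
# lower is better within a fixed perplexity, only indicative across them
for per in perplexities:
    tsne = TSNE(n_components=2, perplexity=per, random_state=0)
    tsne.fit_transform(pca_axes[:, 0:50])
    print(f"perplexity {per}: KL divergence = {tsne.kl_divergence_:.3f}")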

Number of Clusters Chosen from an Unsupervised Method

For this part of the analysis on the scRNA-Seq data, we will use a Gaussian mixture model (GMM) to obtain predicted clusters on the data. Then, using these clusters as “labels”, we can fit a logistic regression and see how the test score changes as we vary the number of clusters.

For this part, we will use the first 2256 principal components, which are enough to capture 90% of the variance:

Code
np.where(csum >= 0.9)[0][0]
2256

With the first 2256 PCs we reach 90% of the explained variability.

Now, let’s perform the clustering with the GMM, from 2 to 20 clusters. Let’s also plot the BIC (Bayesian Information Criterion) and the average silhouette score to see how many clusters we should select.

Code
## Clustering elbow plot GMM

n_components_range = range(2, 21)
bic_scores = []

for n_components in n_components_range:
    gmm = GaussianMixture(n_components=n_components, random_state=0, init_params='kmeans')
    gmm.fit(pca_axes[:,0:2256])
    # print(f'# comp: {n_components}')
    bic_scores.append(gmm.bic(pca_axes[:,0:2256]))

plt.plot(n_components_range, bic_scores, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('BIC Score')
plt.title('Elbow Plot for Gaussian Mixture Model')
plt.xticks(n_components_range)
plt.grid(True)

plt.show()

The BIC curve suggests that 4 or 5 clusters could be a good choice.
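
To make the elbow reading less subjective, we can look at the marginal BIC improvement from each additional component; once the differences get close to zero, adding clusters buys little (a minimal sketch using the scores computed above):

Code
# first differences of the BIC curve: large negative values mean an extra
# component still buys a substantial improvement; near zero means the
# curve has flattened
bic_gain = np.diff(bic_scores)
for k, gain in zip(list(n_components_range)[1:], bic_gain):
    print(f"{k} components: BIC change = {gain:,.0f}")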

Code
## Silhouette scores
n_components_range = range(2, 21)  # start from 2 components (the minimum required for the silhouette score)
silhouette_scores = []

for n_components in n_components_range:
    gmm = GaussianMixture(n_components=n_components, random_state=0, init_params='kmeans')
    gmm.fit(pca_axes[:, 0:2256])
    # print(f'# comp: {n_components}')
    labels = gmm.predict(pca_axes[:, 0:2256])
    silhouette_avg = silhouette_score(pca_axes[:, 0:2256], labels)
    silhouette_scores.append(silhouette_avg)

plt.plot(n_components_range, silhouette_scores, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Plot for Gaussian Mixture Model')
plt.xticks(n_components_range)
plt.grid(True)
plt.show()

The silhouette scores show that 4, 8 and 9 clusters can be good choices.
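
We can also rank the candidates directly instead of eyeballing the curve (a small sketch reusing the scores computed above):

Code
# rank the candidate cluster counts by average silhouette score
order = np.argsort(silhouette_scores)[::-1]
for idx in order[:4]:
    print(f"{list(n_components_range)[idx]} clusters: silhouette = {silhouette_scores[idx]:.3f}")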

Let’s work with 4, 5, 8 and 9 clusters and see how each one performs.

Clusters Analysis

First, let’s write a helper function to plot the t-SNE and to manage some hyper-parameters.

Code
def tsne_plotter_pca(n_clusters:int, pca_dim:int, perplexity:int = 50):

    gmm = GaussianMixture(n_components=n_clusters, random_state=0, init_params="kmeans")
    clusters_gmm = gmm.fit_predict(pca_axes[:,0:pca_dim])

    tsne = TSNE(n_components=2, perplexity=perplexity)

    x_tsne = tsne.fit_transform(pca_axes[:,0:pca_dim])

    plt.scatter(x_tsne[:,0], x_tsne[:,1], c=clusters_gmm)
    plt.xlabel('tSNE 1')
    plt.ylabel('tSNE 2')
    plt.title(f"t-SNE Plot for log2 Transformed Data with GMM\nwith {pca_dim} PC's, {n_clusters} clusters and perplexity: {perplexity}")
    plt.axis('equal')
    plt.show()

    return clusters_gmm
Code
res_train = {}
res_test = {}
Code
np.random.seed(123)
clust = 4

clusters_gmm = tsne_plotter_pca(n_clusters=clust,
                    pca_dim=2256,
                    perplexity=50)    
    
X_train, X_test, y_train, y_test = train_test_split(X_sc_transformed, clusters_gmm, test_size=0.2, random_state=42)
    
log_reg = LogisticRegressionCV(cv=10,Cs=[0.01,0.1,1,10],penalty="l1",solver="liblinear",multi_class="ovr")

log_reg.fit(X_train,y_train)

res_train["Cluster "+str(clust)] = log_reg.score(X_train,y_train)
res_test["Cluster "+str(clust)] = log_reg.score(X_test,y_test)

print(f"Train score: {log_reg.score(X_train,y_train)}")
print(f"Test score: {log_reg.score(X_test,y_test)}")

Train score: 1.0
Test score: 0.9903006789524733
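
Since the goal is also to judge the quality of the selected features, note that the L1 penalty drives most coefficients to exactly zero, so we can count how many genes each one-vs-rest classifier actually uses (a sketch using the model just fitted):

Code
# with penalty="l1", most coefficients are exactly zero;
# count the genes kept by each one-vs-rest classifier
genes_per_class = np.sum(log_reg.coef_ != 0, axis=1)
for cls, n_genes in zip(log_reg.classes_, genes_per_class):
    print(f"cluster {cls}: {n_genes} genes with non-zero coefficients")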
Code
np.random.seed(123)
clust = 5

clusters_gmm = tsne_plotter_pca(n_clusters=clust,
                    pca_dim=2256,
                    perplexity=50)    
    
X_train, X_test, y_train, y_test = train_test_split(X_sc_transformed, clusters_gmm, test_size=0.2, random_state=42)
    
log_reg = LogisticRegressionCV(cv=10,Cs=[0.01,0.1,1,10],penalty="l1",solver="liblinear",multi_class="ovr")

log_reg.fit(X_train,y_train)

res_train["Cluster "+str(clust)] = log_reg.score(X_train,y_train)
res_test["Cluster "+str(clust)] = log_reg.score(X_test,y_test)

print(f"Train score: {log_reg.score(X_train,y_train)}")
print(f"Test score: {log_reg.score(X_test,y_test)}")

Train score: 0.9997574581615328
Test score: 0.9408341416100873
Code
np.random.seed(123)
clust = 8

clusters_gmm = tsne_plotter_pca(n_clusters=clust,
                    pca_dim=2256,
                    perplexity=50)    
    
X_train, X_test, y_train, y_test = train_test_split(X_sc_transformed, clusters_gmm, test_size=0.2, random_state=42)
    
log_reg = LogisticRegressionCV(cv=10,Cs=[0.01,0.1,1,10],penalty="l1",solver="liblinear",multi_class="ovr")

log_reg.fit(X_train,y_train)

res_train["Cluster "+str(clust)] = log_reg.score(X_train,y_train)
res_test["Cluster "+str(clust)] = log_reg.score(X_test,y_test)

print(f"Train score: {log_reg.score(X_train,y_train)}")
print(f"Test score: {log_reg.score(X_test,y_test)}")

Train score: 0.9997574581615328
Test score: 0.9369544131910766
Code
np.random.seed(123)
clust = 9

clusters_gmm = tsne_plotter_pca(n_clusters=clust,
                    pca_dim=2256,
                    perplexity=50)    
    
X_train, X_test, y_train, y_test = train_test_split(X_sc_transformed, clusters_gmm, test_size=0.2, random_state=42)
    
log_reg = LogisticRegressionCV(cv=10,Cs=[0.01,0.1,1,10],penalty="l1",solver="liblinear",multi_class="ovr")

log_reg.fit(X_train,y_train)

res_train["Cluster "+str(clust)] = log_reg.score(X_train,y_train)
res_test["Cluster "+str(clust)] = log_reg.score(X_test,y_test)

print(f"Train score: {log_reg.score(X_train,y_train)}")
print(f"Test score: {log_reg.score(X_test,y_test)}")

Train score: 0.9980596652922629
Test score: 0.933074684772066
Code
plt.plot([4,5,8,9], list(res_train.values()), c="r", label="Train")
plt.plot([4,5,8,9], list(res_test.values()), c="b", label="Test")
plt.xlabel('Number of clusters')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

We can see that the best train and test scores were achieved with 4 clusters.
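
As an aside, the four cells above run the same pipeline with a different cluster count each time; for reference, a compact loop version would look like this (a sketch that mirrors the cells above, not a re-run of them):

Code
# same pipeline as the four cells above, expressed as a single loop
np.random.seed(123)
for clust in [4, 5, 8, 9]:
    clusters_gmm = tsne_plotter_pca(n_clusters=clust, pca_dim=2256, perplexity=50)
    X_train, X_test, y_train, y_test = train_test_split(
        X_sc_transformed, clusters_gmm, test_size=0.2, random_state=42)
    log_reg = LogisticRegressionCV(cv=10, Cs=[0.01, 0.1, 1, 10], penalty="l1",
                                   solver="liblinear", multi_class="ovr")
    log_reg.fit(X_train, y_train)
    res_train["Cluster " + str(clust)] = log_reg.score(X_train, y_train)
    res_test["Cluster " + str(clust)] = log_reg.score(X_test, y_test)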

Now, working only with four clusters, we can see the effect on the clustering of choosing a different number of PCs to build the t-SNE plot.

Effect of PCs on t-SNE

We can choose the number of PCs based on the percentage of variance explained. Let’s work with three values: 50%, 70% and 90%.

Code
percentages_to_test = [0.5, 0.7, 0.9]

pcs_to_test = [np.where(csum >= i)[0][0] for i in percentages_to_test]

pcs_to_test
[90, 811, 2256]
Code
np.random.seed(123)

for pcas in pcs_to_test:
    
    tsne_plotter_pca(n_clusters=4, pca_dim=pcas, perplexity=50)
    


In practice, the analysis of scRNA-Seq data involves additional steps before and after the dimension reduction: for example, filtering and normalizing the counts, validating the clusters with specific marker genes (cell types have a characteristic expression of certain genes, so we can identify the cell type behind each cluster from its gene expression), and downstream analyses such as differential gene expression per cell type, gene-set analysis per cell type, and more advanced analyses like cell-cell communication.
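
For instance, a typical scanpy preprocessing block looks like the sketch below; the thresholds are common starting points from the scanpy tutorials, not values tuned for this dataset, and the path is a placeholder:

Code
# typical scanpy preprocessing: QC filtering, depth normalization,
# log transform, and selection of highly variable genes
adata = sc.read_10x_mtx('path/to/scrnaseq data/', var_names='gene_symbols')
sc.pp.filter_cells(adata, min_genes=200)      # drop near-empty barcodes
sc.pp.filter_genes(adata, min_cells=3)        # drop genes seen in very few cells
sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
sc.pp.log1p(adata)                            # log(x + 1) transform
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]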