Likelihood Ratio Test, Regression and Gradient Descent

statistics
regression
Published

May 2, 2024

Likelihood Ratio Test

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sc
import seaborn as sns

The file gamma-ray.csv contains a small quantity of data collected from the Compton Gamma Ray Observatory, a satellite launched by NASA in 1991 (http://cossc.gsfc.nasa.gov/). For each of 100 sequential time intervals of variable lengths (given in seconds), the number of gamma rays originating in a particular area of the sky was recorded.

For this analysis, we would like to check the assumption that the emission rate is constant.

First, let’s check the distribution of the number of gamma rays:

Code
df = pd.read_csv("gamma-ray.csv", sep=",")

sns.set_theme(style="darkgrid")
sns.displot(df,x="count")

The number of these gamma rays is discrete and non-negative. We can assume that the gamma rays emerge independently of one another and at a constant rate within each time interval. Under these assumptions, a Poisson model is a reasonable choice: each observation \(G_i\) follows a Poisson distribution with mean equal to the rate \(\lambda_i\) times the time interval \(t_i\).

\[G_i \sim Poisson(\lambda_i * t_i)\]

The hypothesis that we want to test is that the emission rate is constant, so:

\(H_0: \lambda_0 = \lambda_1 = ... = \lambda_{99}\)

And the alternative hypothesis is that the emission rates are different:

\(H_A: \lambda_i \neq \lambda_j\) for some i and j

The most plausible value for \(\lambda\) under the null hypothesis is the maximum likelihood estimate (MLE).

Given the Poisson distribution, the derivation goes as follows:

\[f(G_0,G_1,...,G_{99} | \lambda) = \prod_{i=0}^{99} \frac{e^{-\lambda*t_i}*(\lambda*t_i)^{G_i}}{G_i!}\]

\[ln(f) = -\lambda \sum_{i=0}^{99} t_i + ln(\lambda) \sum_{i=0}^{99} G_i + ln(\prod_{i=0}^{99} t_i^{G_i}) - ln(\prod_{i=0}^{99} G_i!)\]

Then, we differentiate with respect to the parameter \(\lambda\) and set the derivative to 0 to find the optimal \(\lambda\).

\[0 = -\sum_{i=0}^{99}t_i + \sum_{i=0}^{99} \frac{G_i}{\lambda}\]

\[\hat{\lambda} = \frac{\sum_{i=0}^{99} G_i}{\sum_{i=0}^{99} t_i}\]

Code
lambda_null = df["count"].sum() / df["seconds"].sum()

lambda_null
0.0038808514969907496
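
As a quick sanity check, we can also recover this estimate numerically by minimizing the negative log-likelihood under \(H_0\) (dropping the terms that do not depend on \(\lambda\)). This is a minimal sketch; the search bounds are arbitrary and just need to contain the optimum.

Code
from scipy.optimize import minimize_scalar

# Negative log-likelihood under H0 (a single rate lambda), without the constant terms
def neg_llh_null(lam, counts, seconds):
  return lam * seconds.sum() - np.log(lam) * counts.sum()

res = minimize_scalar(neg_llh_null, bounds=(1e-9, 1.0), method="bounded",
                      args=(df["count"], df["seconds"]))
res.x  # should be close to lambda_null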

Similarly, we can calculate a plausible value for the alternative hypothesis with MLE.

\[ln(f) = -\sum_{i=0}^{99} \lambda_i*t_i + \sum_{i=0}^{99} G_i*ln(\lambda_i) + ln(\prod_{i=0}^{99} t_i^{G_i}) - ln(\prod_{i=0}^{99} G_i!)\]

We take the partial derivative with respect to each parameter \(\lambda_i\) and set it to 0 to find the optimal \(\lambda_i\):

\[0 = -t_i + \frac{G_i}{\lambda_i}\]

\[\hat{\lambda_i} = \frac{G_i}{t_i}\]

Code
lambdas_alt = df["count"]/df["seconds"]

lambdas_alt
0     0.000000
1     0.000000
2     0.000000
3     0.000000
4     0.009804
        ...   
95    0.025840
96    0.000000
97    0.000000
98    0.000000
99    0.000000
Length: 100, dtype: float64

Now, to test these hypotheses, we are going to use the likelihood ratio test:

\[\Lambda(x) = -2ln\left(\frac{\max_{\lambda} f(G_0,G_1,...,G_{99}| \lambda)}{\max_{\lambda_0,...,\lambda_{99}} f(G_0,G_1,...,G_{99}| \lambda_0,...,\lambda_{99})}\right)\]

which has an asymptotic \(\chi^{2}\) distribution with \(100 - 1 = 99\) degrees of freedom (the difference in the number of free parameters between the alternative and the null hypothesis).

Code
lh_null = sc.stats.poisson.pmf(k=df["count"], mu=lambda_null)

lh_alt = sc.stats.poisson.pmf(k=df["count"], mu=lambdas_alt)

llh_test = -2*np.log(lh_null.prod() / lh_alt.prod())

llh_test
104.33272117042188
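
Since the individual Poisson probabilities can be very small, the product in the code above can underflow for larger datasets. An equivalent and more numerically robust way to compute the same statistic is to sum log-pmfs instead of multiplying pmfs; a minimal sketch:

Code
# Equivalent computation on the log scale: sum of log-pmfs instead of a product of pmfs
log_lh_null = sc.stats.poisson.logpmf(k=df["count"], mu=lambda_null).sum()
log_lh_alt = sc.stats.poisson.logpmf(k=df["count"], mu=lambdas_alt).sum()

llh_test_log = -2 * (log_lh_null - log_lh_alt)
llh_test_log  # should match llh_test above up to floating point error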

We can plot the pdf of the chi-squared distribution with 99 degrees of freedom:

Code
plot_Xs = np.arange(50,150,0.1)
plt.plot(plot_Xs, sc.stats.chi2.pdf(plot_Xs, 99))
plt.show()

And find the value of the test statistic that would give a p-value of 0.05:

Code
# We can calculate the Lambda that would give a p-value of 0.05 by using the inverse survival function
sc.stats.chi2.isf(0.05, 99)
123.22522145336181

Calculating our p-value:

Code
pvalue = sc.stats.chi2.sf(llh_test, 99)
print(pvalue)
0.337398546433923

Since the p-value (≈0.34) is well above 0.05, we cannot reject the null hypothesis: the emission rate of gamma rays appears to be constant.

Regression and Gradient Descent

Regression

We are going to use a dataset from General Motors on mortality. The data come from 59 US cities and were collected to study the contribution of air pollution to mortality.

The dependent variable is the age-adjusted mortality (Mortality). The data also include variables measuring climate characteristics (JanTemp, JulyTemp, RelHum, Rain), demographic characteristics of the cities (Educ, Dens, NonWhite, WhiteCollar, Pop, House, Income), and variables recording the pollution potential of three different air pollutants (HC, NOx, SO2).

Let’s load the data:

Code
df = pd.read_csv("mortality.csv", sep=",")
df.head()
City Mortality JanTemp JulyTemp RelHum Rain Educ Dens NonWhite WhiteCollar Pop House Income HC NOx SO2
0 Akron, OH 921.87 27 71 59 36 11.4 3243 8.8 42.6 660328 3.34 29560 21 15 59
1 Albany-Schenectady-Troy, NY 997.87 23 72 57 35 11.0 4281 3.5 50.7 835880 3.14 31458 8 10 39
2 Allentown, Bethlehem,PA-NJ 962.35 29 74 54 44 9.8 4260 0.8 39.4 635481 3.21 31856 6 6 33
3 Atlanta, GA 982.29 45 79 56 47 11.1 3125 27.1 50.2 2138231 3.41 32452 18 8 24
4 Baltimore, MD 1071.29 35 77 55 43 9.6 6441 24.4 43.7 2199531 3.44 32368 43 38 206

Let’s write two helper functions: one that adds the intercept column to the data and one that performs the ordinary least squares (OLS) estimation based on the formula:

\[\hat{\beta}=(X^{T}X)^{-1}X^{T}y\]

Code
def add_intercept(X: np.array):
  return np.concatenate((np.ones_like(X[:,:1]), X), axis=1)

def OLS(X: np.array, y: np.array):
  return np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
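
As a side note, forming \((X^{T}X)^{-1}\) explicitly can be numerically fragile when the columns of X are nearly collinear. A least squares solver gives the same coefficients without the explicit inverse; a minimal sketch (the name OLS_lstsq is just illustrative):

Code
def OLS_lstsq(X: np.array, y: np.array):
  # Solve the least squares problem directly instead of inverting X^T X
  beta, *_ = np.linalg.lstsq(X, y, rcond=None)
  return beta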

We also need to declare the design matrix X and the dependent variable y and convert them to numpy arrays.

Code
y = df["Mortality"]
X = df.loc[:,"JanTemp":"SO2"]

# Convert to numpy arrays (the intercept column is added later with add_intercept)

X = X.to_numpy()
y = y.to_numpy()

Now, we can calculate the betas:

Code
betas = OLS(add_intercept(X), y)

betas
array([ 1.40003471e+03, -1.44112034e+00, -2.95025005e+00,  1.36009551e-01,
        9.69507928e-01, -1.10466624e+01,  4.71706516e-03,  5.30271368e+00,
       -1.49231882e+00,  3.40228581e-06, -3.80416445e+01, -4.25186359e-04,
       -6.71247478e-01,  1.17850768e+00,  8.45829777e-02])

Great! This was an easy way to compute the OLS regression. We can also do the same with sklearn (note that, because we pass the intercept column while sklearn also fits its own intercept, the coefficient for that extra column comes out as 0):

Code
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

reg.fit(add_intercept(X),y)

print(reg.intercept_, reg.coef_)
1400.0347123466809 [ 0.00000000e+00 -1.44112034e+00 -2.95025005e+00  1.36009551e-01
  9.69507928e-01 -1.10466624e+01  4.71706516e-03  5.30271368e+00
 -1.49231882e+00  3.40228581e-06 -3.80416445e+01 -4.25186359e-04
 -6.71247478e-01  1.17850768e+00  8.45829777e-02]

Gradient Descent

We can also solve the regression task via optimization, using the gradient descent algorithm. Let’s write some helper functions to implement it.
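
Concretely, the loss we minimize is the residual sum of squares, and each iteration moves \(\beta\) a small step against its gradient (this matches the comments in the code below):

\[L(\beta) = (y - X\beta)^{T}(y - X\beta)\]

\[\nabla L(\beta) = -2X^{T}(y - X\beta)\]

\[\beta^{(k+1)} = \beta^{(k)} - \eta \, \nabla L(\beta^{(k)})\]

where \(\eta\) is the step size (the step_size argument in the code).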

Code
def loss_fn(beta, X, y):
  # (y - X beta)^T (y - X beta)
  return np.sum(np.square(y - X.dot(beta)))

def loss_grad(beta, X, y):
  # -2*(y - X beta)^T X
  return -2*(y - X.dot(beta)).T.dot(X)

def gradient_step(beta, step_size, X, y):
  loss, grads = loss_fn(beta, X, y), loss_grad(beta, X, y)
  
  beta = beta - step_size * grads.T
  return loss, beta

def gradient_descent(X, y, step_size, precision, max_iter=10000, warn_max_iter=True):
  beta = np.zeros_like(X[0])
  
  losses = [] # Array for recording the value of the loss over the iterations.
  graceful = False
  for _ in range(max_iter):
    beta_last = beta # Save last values of beta for later stopping criterion
    loss, beta = gradient_step(beta, step_size, X, y)
    losses.append(loss)
    # Use the euclidean norm of the difference between the new beta and the old beta as a stopping criteria
    if np.sqrt(np.sum(np.square((beta - beta_last)))) < precision:
      graceful = True
      break
  if not graceful and warn_max_iter:
    print("Reached max iterations.")
  return beta, np.array(losses)

Before applying gradient descent, we need to standardize the data because otherwise the loss explodes due to some outliers (you can try it without the standardization and see for yourself).

Code
# Standardize the data

X_scaled = (X - X.mean(axis=0))/X.std(axis=0, ddof=1)
Y_scaled = (y - y.mean())/y.std(ddof=1)
X_inter = add_intercept(X_scaled)

# Apply the gradient descent

beta_gd, losses = gradient_descent(X_inter, Y_scaled, 0.001, 1e-8, max_iter=10000)
print("BetaHat = {}".format(beta_gd))
print("Final loss value = {}".format(losses[-1]))
plt.plot(range(len(losses)), losses)
plt.title("Iteration vs Loss")
plt.show()

# As a sanity check, we can also do the matrix inversion with OLS
print("Matrix inverse BetaHat = {}".format(OLS(X_inter, Y_scaled)))
Reached max iterations.
BetaHat = [ 9.99200722e-19 -2.34387032e-01 -2.17491727e-01  1.17191047e-02
  1.79800652e-01 -1.50530629e-01  1.08943031e-01  7.64357603e-01
 -1.21223518e-01  8.40179286e-02 -1.11464155e-01 -3.04603568e-02
 -9.95161696e-01  8.80002697e-01  8.62630571e-02]
Final loss value = 13.809259921735789

Matrix inverse BetaHat = [ 3.21984155e-17 -2.34377091e-01 -2.17496828e-01  1.17238925e-02
  1.79754504e-01 -1.50544513e-01  1.08945825e-01  7.64348559e-01
 -1.21202349e-01  8.40325965e-02 -1.11479218e-01 -3.04687337e-02
 -9.96190618e-01  8.81043305e-01  8.61146149e-02]

Note that the betas changed because we standardized the data, which is why at the end we check against the OLS solution computed on the same standardized data.
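
If we want to compare these standardized coefficients with the unstandardized OLS fit from before, we can map them back to the original scale. A minimal sketch, assuming the same scaling convention as above (sample standard deviations with ddof=1); the variable names are just illustrative:

Code
# Undo the standardization: slope_j = beta_j_scaled * sd(y) / sd(x_j)
x_sd = X.std(axis=0, ddof=1)
y_sd = y.std(ddof=1)

slopes_original = beta_gd[1:] * y_sd / x_sd
intercept_original = y.mean() - np.sum(slopes_original * X.mean(axis=0))

print(intercept_original, slopes_original)  # should be close to `betas` from the OLS section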