Ridge Regularization in Hierarchical Linear Models: A Bayesian Interpretation
Abstract
This study presents a mathematical proof of the Bayesian interpretation of Ridge (L2) regularization in hierarchical linear models. It explicates the role of the regularization parameters in encoding prior beliefs about the model parameters, showing how they shrink the global coefficient estimates towards zero and the group-specific coefficients towards the shared global coefficients. We formalize this interpretation in terms of a loss function, illustrating its practical utility for empirical studies.
Introduction
We consider a hierarchical linear regression problem with \(p\) predictors, \(m\) groups (or products), and \(n\) observations in each group. Our goal is to predict a continuous target variable from a set of features, allowing for group-specific deviations from a global relationship.
Model Definition
Let \(y_{i,j}\) denote the target variable for group \(i\) and observation \(j\), and let \(X_{i,j}\) denote the associated \(p\)-dimensional feature vector. We define a global \(p\)-dimensional coefficient vector \(\beta\), and a group-specific \(p\)-dimensional deviation vector \(\delta_i\) for each group \(i\). Our model is defined as
\[y_{i,j} = X_{i,j}(\beta + \delta_i)^T + \epsilon_{i,j},\]
where \(\epsilon_{i,j}\) are independent error terms following a Gaussian distribution with mean \(0\) and variance \(\sigma_\epsilon^2\).
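To make the data-generating process concrete, here is a minimal simulation sketch in Python (assuming NumPy); the number of groups, dimensions, coefficient scales, and noise level are illustrative choices, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, p = 5, 200, 3      # groups, observations per group, predictors (illustrative)
sigma_eps = 0.5          # noise standard deviation (illustrative)

beta = rng.normal(size=p)                # global coefficients
delta = 0.3 * rng.normal(size=(m, p))    # group-specific deviations delta_i (row i)

X = rng.normal(size=(m, n, p))           # feature vectors X_{i,j}
# y_{i,j} = X_{i,j} (beta + delta_i)^T + eps_{i,j}
y = np.einsum("mnp,mp->mn", X, beta + delta) + sigma_eps * rng.normal(size=(m, n))

print(X.shape, y.shape)                  # (5, 200, 3) (5, 200)
```

Row \(i\) of `delta` plays the role of \(\delta_i\), so `beta + delta[i]` is the effective coefficient vector for group \(i\).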
Bayesian Interpretation
We assume Gaussian priors on the global coefficients \(\beta\) and group-specific deviations \(\delta_i\).
Global coefficients: \(\beta \sim \mathcal{N}(0, \sigma_\beta^2I)\), representing the belief that the global coefficients are centered around zero.
Group-specific deviations: \(\delta_i \sim \mathcal{N}(0, \sigma_\delta^2 I)\) for each \(i\), representing the belief that each group's coefficients should be close to the global coefficients.
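Written out explicitly (a standard expansion of these spherical Gaussian priors, included here because the squared norms are what later become the Ridge penalties), the prior densities are
\[\mathsf{P}(\beta) = \left(2\pi\sigma_\beta^2\right)^{-p/2} \exp\!\left(-\frac{\|\beta\|^2}{2\sigma_\beta^2}\right), \qquad \mathsf{P}(\delta_i) = \left(2\pi\sigma_\delta^2\right)^{-p/2} \exp\!\left(-\frac{\|\delta_i\|^2}{2\sigma_\delta^2}\right).\]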
Posterior Distribution:
Given these priors, the joint posterior distribution for the global and group-specific coefficients, given the data, is
\[\mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) \propto \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m)\, \mathsf{P}(\beta) \prod_{i=1}^{m} \mathsf{P}(\delta_i).\]
Assuming Gaussian noise, the likelihood of the data given the model is
\[\mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m) = \prod_{i=1}^{m} \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}} \exp\!\left(-\frac{\big(y_{i,j} - X_{i,j}(\beta + \delta_i)^T\big)^2}{2\sigma_\epsilon^2}\right).\]
Taking the negative logarithm of this posterior and discarding terms that do not depend on the parameters yields the optimization problem we solve; the derivation is carried out in the appendices.
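To illustrate how the posterior becomes an objective we can evaluate, here is a sketch of the (unnormalized) negative log-posterior under the Gaussian assumptions above; the function name, array shapes, and the use of scipy.stats are choices made for this example.

```python
import numpy as np
from scipy.stats import norm

def neg_log_posterior(beta, delta, X, y, sigma_eps, sigma_beta, sigma_delta):
    """Unnormalized negative log-posterior of the hierarchical model.

    X: (m, n, p) features, y: (m, n) targets,
    beta: (p,) global coefficients, delta: (m, p) group-specific deviations.
    """
    mu = np.einsum("mnp,mp->mn", X, beta + delta)               # model mean X_{i,j}(beta + delta_i)^T
    log_lik = norm.logpdf(y, loc=mu, scale=sigma_eps).sum()     # Gaussian likelihood
    log_prior = (norm.logpdf(beta, scale=sigma_beta).sum()      # beta ~ N(0, sigma_beta^2 I)
                 + norm.logpdf(delta, scale=sigma_delta).sum()) # delta_i ~ N(0, sigma_delta^2 I)
    return -(log_lik + log_prior)
```

Minimizing this function over \(\beta\) and the \(\delta_i\) gives the MAP estimate; the appendices show how it reduces to the regularized loss stated in the conclusion.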
Conclusion
The maximum a posteriori (MAP) solution to the hierarchical linear model with Gaussian priors on the global and group-specific parameters corresponds to minimizing a regularized sum of squared errors. This loss function for the regression problem can be expressed as
\[L(\beta, \delta_1, \dots, \delta_m) = \sum_{i=1}^{m}\sum_{j=1}^{n}\big(y_{i,j} - X_{i,j}(\beta + \delta_i)^T\big)^2 + \lambda_\beta \|\beta\|^2 + \lambda_\delta \sum_{i=1}^{m} \|\delta_i\|^2,\]
where the first term is the sum of squared errors, the second term is the Ridge penalty on the global coefficients \(\beta\), and the third term is the sum of Ridge penalties on the group-specific deviations \(\delta_i\).
The regularization parameters \(\lambda_\beta\) and \(\lambda_\delta\) dictate the strength of shrinkage of the global coefficients towards zero and of the group-specific coefficients towards the shared global coefficients, respectively. This study bridges the gap between frequentist and Bayesian interpretations of Ridge regularization in hierarchical models.
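For empirical work, the loss above can be minimized with a generic optimizer. The following sketch does so using scipy.optimize.minimize on arrays shaped \((m, n, p)\); the function name and the choice of L-BFGS-B are implementation details for this illustration, not a prescribed method.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hierarchical_ridge(X, y, lam_beta, lam_delta):
    """Minimize sum of squared errors + lam_beta*||beta||^2 + lam_delta*sum_i ||delta_i||^2.

    X: (m, n, p) features, y: (m, n) targets. Returns (beta, delta).
    """
    m, n, p = X.shape

    def loss(theta):
        beta, delta = theta[:p], theta[p:].reshape(m, p)
        resid = y - np.einsum("mnp,mp->mn", X, beta + delta)   # y_{i,j} - X_{i,j}(beta + delta_i)^T
        return (resid ** 2).sum() + lam_beta * (beta ** 2).sum() + lam_delta * (delta ** 2).sum()

    theta0 = np.zeros(p + m * p)                               # start from all-zero coefficients
    res = minimize(loss, theta0, method="L-BFGS-B")
    return res.x[:p], res.x[p:].reshape(m, p)
```

With lam_beta and lam_delta chosen from the prior variances as discussed in the appendices, the returned estimates coincide with the MAP estimates.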
Appendices
Appendix A -- Traditional Math
Lemma 1
We will start with the definition of the negative log-posterior distribution (which is the loss function we aim to minimize in a Maximum A Posteriori (MAP) estimation setting) and expand it.
Our goal is to maximize the posterior probability of the parameters, given the data. Using Bayes' theorem, we know that this is proportional to the product of the likelihood of the data given the parameters and the prior probability of the parameters. Thus, up to an additive constant (the log of the marginal likelihood \(\mathsf{P}(y \mid X)\), which does not depend on the parameters), the negative log-posterior distribution is
\[-\log \mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) = -\log\!\big(\mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m)\, \mathsf{P}(\beta, \delta_1, \dots, \delta_m)\big) + \text{const.}\]
This factorization follows from the chain rule for probability: the joint distribution of the data and the parameters, \(\mathsf{P}(y, \beta, \delta_1, \dots, \delta_m \mid X)\), is the product of the conditional distribution of \(y\) given \(X, \beta, \delta_1, \dots, \delta_m\) (this is the likelihood of the data) and the joint distribution of \(\beta, \delta_1, \dots, \delta_m\) (this is the prior on the parameters), and the posterior is proportional to this joint distribution. So, we rewrite the posterior probability as follows.
\[\mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) \propto \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m)\, \mathsf{P}(\beta, \delta_1, \dots, \delta_m).\]
The logarithm of a product is the sum of the logarithms. Therefore, we split the right-hand side into two parts.
\[-\log \mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) = -\log \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m) - \log \mathsf{P}(\beta, \delta_1, \dots, \delta_m) + \text{const.}\]
Now we will decompose the second part of the right-hand side. We have assumed that the global coefficients \(\beta\) and the group-specific deviations \(\delta_i\) are independent a priori, so their joint prior distribution is the product of their marginal prior distributions.
\[\mathsf{P}(\beta, \delta_1, \dots, \delta_m) = \mathsf{P}(\beta) \prod_{i=1}^{m} \mathsf{P}(\delta_i).\]
Substituting this back in, we get
\[-\log \mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) = -\log \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m) - \log \mathsf{P}(\beta) - \sum_{i=1}^{m} \log \mathsf{P}(\delta_i) + \text{const.}\]
This is the negative log-posterior distribution broken down into three parts: the negative log-likelihood of the data and the negative log-prior distributions of the global and group-specific parameters. In other words, our loss function consists of a term reflecting how well the model fits the data and terms penalizing the size of the coefficients, reflecting our prior beliefs about the parameters. Each term in this loss function corresponds to one part of our original model and prior assumptions.
Proof
We will examine how the prior assumptions translate to the regularization terms in the loss function. In the Bayesian framework, the objective is to find the maximum a posteriori (MAP) estimate of the model parameters. The MAP estimate is obtained by maximizing the posterior distribution \(\mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X)\), which, according to Bayes' theorem, is proportional to the product of the likelihood function and the prior distribution:
\[\mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) \propto \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m)\, \mathsf{P}(\beta, \delta_1, \dots, \delta_m).\]
Taking the log of the posterior distribution, we have
\[\log \mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X) = \log \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m) + \log \mathsf{P}(\beta, \delta_1, \dots, \delta_m) + \text{const.}\]
The negative of this log-posterior distribution serves as our loss function, which we aim to minimize:
\[L(\beta, \delta_1, \dots, \delta_m) = -\log \mathsf{P}(\beta, \delta_1, \dots, \delta_m \mid y, X).\]
Substituting the right-hand side from the equation above and applying Lemma 1, we get
\[L(\beta, \delta_1, \dots, \delta_m) = -\log \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m) - \log \mathsf{P}(\beta) - \sum_{i=1}^{m} \log \mathsf{P}(\delta_i) + \text{const.}\]
Now, let's break this down:
- \(-\log \mathsf{P}(y \mid X, \beta, \delta_1, \dots, \delta_m)\) corresponds to the data-fitting term in the loss function. Under the Gaussian noise assumption, this term equals, up to an additive constant, \(\frac{1}{2\sigma_\epsilon^2}\sum_i\sum_j\big(y_{i,j} - X_{i,j}(\beta + \delta_i)^T\big)^2\), i.e., it is proportional to the sum of squared residuals.
- \(-\log \mathsf{P}(\beta)\) corresponds to the regularization term on the global coefficients. We have assumed a Gaussian prior on \(\beta\), so up to an additive constant this term is proportional to \(\|\beta\|^2\). The proportionality constant is \(\frac{1}{2\sigma_\beta^2}\), which serves as the regularization parameter \(\lambda_\beta\). (Strictly, this identification treats the noise variance as a fixed scale; if the data-fitting term is written as the plain sum of squared errors for a general \(\sigma_\epsilon^2\), the correspondence becomes \(\lambda_\beta = \sigma_\epsilon^2/\sigma_\beta^2\).)
- \(-\sum_i \log \mathsf{P}(\delta_i)\) corresponds to the regularization term on the group-specific deviations. Each \(\delta_i\) also has a Gaussian prior, so up to an additive constant this term is proportional to the sum of squared norms, \(\sum_i\|\delta_i\|^2\). Similarly, the proportionality constant \(\frac{1}{2\sigma_\delta^2}\) serves as the regularization parameter \(\lambda_\delta\) (with the same caveat about the noise scale).
Substituting these terms back into the loss function, dropping the additive constants, and treating the noise variance as a fixed scale so that the data-fitting term is the plain sum of squared errors, we obtain the formula
\[L(\beta, \delta_1, \dots, \delta_m) = \sum_{i}\sum_{j}\big(y_{i,j} - X_{i,j}(\beta + \delta_i)^T\big)^2 + \lambda_\beta\|\beta\|^2 + \lambda_\delta\sum_{i}\|\delta_i\|^2.\]
So, the loss function indeed corresponds to the negative log-posterior distribution under our model and prior assumptions. Therefore, minimizing this loss function is equivalent to finding the MAP estimate under the Bayesian model. This concludes the proof.
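The equivalence can also be checked numerically. The sketch below minimizes the negative log-posterior and the regularized loss on the same simulated data and compares the two solutions; the noise variance is fixed at \(\sigma_\epsilon^2 = \tfrac{1}{2}\) so that the identification \(\lambda = \frac{1}{2\sigma^2}\) used above is exact (for a general noise variance the correspondence is \(\lambda = \sigma_\epsilon^2 / \sigma^2\), as noted earlier). All numerical values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, n, p = 4, 100, 2
sigma_eps2, sigma_beta2, sigma_delta2 = 0.5, 1.0, 0.25   # sigma_eps^2 = 1/2 so lambda = 1/(2*sigma^2) exactly

X = rng.normal(size=(m, n, p))
beta_true, delta_true = rng.normal(size=p), 0.3 * rng.normal(size=(m, p))
y = np.einsum("mnp,mp->mn", X, beta_true + delta_true) + np.sqrt(sigma_eps2) * rng.normal(size=(m, n))

def unpack(theta):
    return theta[:p], theta[p:].reshape(m, p)

def residuals(theta):
    beta, delta = unpack(theta)
    return y - np.einsum("mnp,mp->mn", X, beta + delta)

def neg_log_posterior(theta):                 # -(log likelihood + log priors), constants dropped
    beta, delta = unpack(theta)
    return ((residuals(theta) ** 2).sum() / (2 * sigma_eps2)
            + (beta ** 2).sum() / (2 * sigma_beta2)
            + (delta ** 2).sum() / (2 * sigma_delta2))

def ridge_loss(theta):                        # SSE + lambda_beta*||beta||^2 + lambda_delta*sum_i ||delta_i||^2
    beta, delta = unpack(theta)
    lam_b, lam_d = 1 / (2 * sigma_beta2), 1 / (2 * sigma_delta2)
    return (residuals(theta) ** 2).sum() + lam_b * (beta ** 2).sum() + lam_d * (delta ** 2).sum()

theta0 = np.zeros(p + m * p)
map_est = minimize(neg_log_posterior, theta0, method="L-BFGS-B").x
ridge_est = minimize(ridge_loss, theta0, method="L-BFGS-B").x
print(np.max(np.abs(map_est - ridge_est)))    # ~0, up to optimizer tolerance
```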
Appendix B -- Alternative Math
Let's start by considering the Gaussian prior we place on the global coefficients \(\beta\):
\[\beta \sim \mathcal{N}(0, \sigma_\beta^2 I).\]
If we look at the formula for a Gaussian distribution, it has the form
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),\]
where
- \(x\) is the variable.
- \(\mu\) is the mean.
- \(\sigma^2\) is the variance.
In our case \(\mu = 0\) and \(\sigma^2 = \sigma_\beta^2\). So the prior distribution \(\mathsf{P}(\beta)\) can be written as
\[\mathsf{P}(\beta) = \prod_{k=1}^{p} \frac{1}{\sqrt{2\pi\sigma_\beta^2}} \exp\!\left(-\frac{\beta_k^2}{2\sigma_\beta^2}\right) = \left(2\pi\sigma_\beta^2\right)^{-p/2} \exp\!\left(-\frac{\|\beta\|^2}{2\sigma_\beta^2}\right),\]
where \(p\) is the dimensionality of \(\beta\), since we assume all components of \(\beta\) come from the same distribution with standard deviation \(\sigma_\beta\). Taking the negative log of this gives us
\[-\log \mathsf{P}(\beta) = \frac{p}{2}\log\!\left(2\pi\sigma_\beta^2\right) + \frac{\|\beta\|^2}{2\sigma_\beta^2}.\]
In the context of a loss function, only the term involving \(\beta\) is relevant (because we're finding \(\beta\) to minimize the loss), so we can ignore the first term, which is constant. We're left with
\[\frac{1}{2\sigma_\beta^2}\|\beta\|^2.\]
Writing \(\lambda_\beta = \frac{1}{2\sigma_\beta^2}\), this is exactly the regularization term \(\lambda_\beta\|\beta\|^2\) in the loss function.
We can follow a similar argument for the group-specific deviations \(\delta_i\). Each \(\delta_i\) has a Gaussian prior with mean \(0\) and variance \(\sigma_\delta^2\):
\[\delta_i \sim \mathcal{N}(0, \sigma_\delta^2 I).\]
The negative log of this prior gives us the corresponding regularization term in the loss function, \(\lambda_\delta\|\delta_i\|^2\), where \(\lambda_\delta = \frac{1}{2\sigma_\delta^2}\).
So the loss function
\[L(\beta, \delta_1, \dots, \delta_m) = \sum_{i}\sum_{j}\big(y_{i,j} - X_{i,j}(\beta + \delta_i)^T\big)^2 + \lambda_\beta\|\beta\|^2 + \lambda_\delta\sum_{i}\|\delta_i\|^2\]
indeed corresponds to the negative log-posterior distribution under our model and prior assumptions, with \(\lambda_\beta\) and \(\lambda_\delta\) serving as the regularization parameters that correspond to half the inverse variances in the Gaussian priors on \(\beta\) and \(\delta_i\) (with the noise variance treated as a fixed scale, as noted in Appendix A). Minimizing this loss function is equivalent to maximizing the log-posterior distribution, which is the goal of MAP estimation in the Bayesian framework. This completes the proof.
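As a small numerical sanity check of the algebra above (an illustrative sketch using scipy.stats, not part of the proof), the negative log-density of the Gaussian prior can be compared with the closed form \(\frac{p}{2}\log(2\pi\sigma_\beta^2) + \frac{\|\beta\|^2}{2\sigma_\beta^2}\) derived earlier:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
p, sigma_beta = 4, 0.8
beta = rng.normal(size=p)

# Negative log of the independent Gaussian prior on beta, computed with scipy
neg_log_prior = -norm.logpdf(beta, loc=0.0, scale=sigma_beta).sum()

# Closed form derived above: p/2 * log(2*pi*sigma_beta^2) + ||beta||^2 / (2*sigma_beta^2)
closed_form = 0.5 * p * np.log(2 * np.pi * sigma_beta**2) + (beta**2).sum() / (2 * sigma_beta**2)

print(np.isclose(neg_log_prior, closed_form))   # True
```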