Variable Selection with I-priors


I-prior model selection can easily be done by comparing (marginal) likelihoods. We also explore a fully Bayesian alternative to model selection for linear models. A simulation study gives positive results for the I-prior for linear models in situations with multicollinearity, outperforming alternatives such as $g$-priors, independent priors (ridge regression), and Laplace/double-exponential priors (the Lasso).

Tests of Significance

For the regression model given in (1) and the function f of the form

$$f(x_i) = f_1(x_{i1}) + \dots + f_p(x_{ip}),$$

where each $f_j$ lies in an RKHS $\mathcal{F}_{\lambda_j}$, one may be interested in which of these functions contribute significantly in explaining the response. A valid way of doing this is by performing inference on the scale parameters $\lambda_1, \dots, \lambda_p$.

When these parameters are estimated by maximum marginal likelihood, the resulting estimators of the scale parameters are asymptotically normal. A straightforward test of significance can then be performed, and this also corresponds to a test of significance of the particular regression function, since each $f_j$ is of the form

$$f_j(x_{ij}) = \lambda_j \sum_{k=1}^n h(x_{ij}, x_{kj}) w_k.$$
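
As an illustration, here is a minimal sketch of such a Wald-type test (not the API of any particular package), assuming the maximum marginal likelihood estimate of a scale parameter and its standard error, e.g. from the inverse observed Fisher information, are already available; the function name and numbers are purely illustrative.

```python
# Minimal sketch: Wald-type z-test for a single scale parameter lambda_j, given
# its maximum marginal likelihood estimate and a standard error (e.g. obtained
# from the inverse observed Fisher information).
from scipy import stats

def wald_test_scale(lambda_hat: float, se_lambda: float):
    """Return the z-statistic and two-sided p-value for H0: lambda_j = 0."""
    z = lambda_hat / se_lambda
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Hypothetical numbers: estimated scale 1.8 with standard error 0.6
z, p = wald_test_scale(1.8, 0.6)
print(f"z = {z:.2f}, p = {p:.4f}")
```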

Model Comparison via Empirical Bayes Factors

Alternatively, model comparison can be done by comparing likelihoods. Again, as these models are fitted by maximum marginal likelihood, twice the difference in log-likelihoods between two nested models asymptotically follows a $\chi^2$ distribution, with degrees of freedom equal to the difference in the number of scale parameters of the models being compared. If the number of parameters is the same, then the model with the higher likelihood can simply be chosen.

Such a method of comparing marginal likelihoods can be seen as Bayesian model selection using empirical Bayes factors, where the Bayes factor comparing model $M_1$ to model $M_2$ is defined as

$$\mathrm{BF}(M_1, M_2) = \frac{\text{marginal likelihood for model } M_1}{\text{marginal likelihood for model } M_2}.$$

The word ‘empirical’ stems from the fact that the parameters are estimated via an empirical Bayes approach (maximum marginal likelihood).
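
A small sketch of both comparisons, assuming the maximized marginal log-likelihoods of the two fitted models are available; all numbers are placeholders.

```python
# Minimal sketch: likelihood ratio test and empirical Bayes factor computed
# from the maximized marginal log-likelihoods of two fitted models.
import math
from scipy import stats

def lr_test_pvalue(loglik_larger: float, loglik_smaller: float, df_diff: int) -> float:
    """Chi-squared p-value for twice the log-likelihood difference (nested models)."""
    lr_stat = 2 * (loglik_larger - loglik_smaller)
    return stats.chi2.sf(lr_stat, df=df_diff)

def log_bayes_factor(loglik_m1: float, loglik_m2: float) -> float:
    """log BF(M1, M2): difference of marginal log-likelihoods (empirical Bayes)."""
    return loglik_m1 - loglik_m2

p_val = lr_test_pvalue(-120.3, -125.9, df_diff=2)      # illustrative numbers
bf = math.exp(log_bayes_factor(-120.3, -125.9))
print(f"LRT p-value = {p_val:.4f}, BF(M1, M2) = {bf:.1f}")
```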

Bayesian Model Selection

When the number of predictors $p$ is large, enumerating the likelihoods of all possible models for comparison becomes infeasible. Greedy selection methods such as forward or backward selection do exist, but one can never hope to explore the entire model space completely with these heuristics.

We employ a fully Bayesian treatment to explore the space of models and obtain posterior estimates for model probabilities given by

$$p(M \mid y) \propto \int p(y \mid M, \theta)\, p(\theta \mid M)\, p(M)\, \mathrm{d}\theta,$$

where $M$ is a model index and $\theta$ are the parameters of the model. The model with the highest probability can be chosen, or a quantity of interest $\Delta$ estimated using model averaging techniques over a model set $\mathcal{M}$ by way of

$$\hat{\Delta} = \sum_{M \in \mathcal{M}} p(M \mid y)\, \mathrm{E}[\Delta \mid M, y].$$
Posterior model probabilities may also be calculated from Bayes factors and prior model probabilities as follows:

$$p(M \mid y) = \frac{\mathrm{BF}(M, M_k)\, p(M)}{\sum_{M' \in \mathcal{M}} \mathrm{BF}(M', M_k)\, p(M')},$$

where $M_k$ is an anchoring model; typically the null model (the model with an intercept only) or the full model is chosen.
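
The following sketch illustrates these two computations: turning (log) Bayes factors relative to an anchoring model and prior model probabilities into posterior model probabilities, then forming a model-averaged estimate. All inputs are placeholders.

```python
# Minimal sketch: posterior model probabilities from Bayes factors relative to
# an anchoring model M_k, and a model-averaged estimate of a quantity Delta.
import numpy as np

def posterior_model_probs(log_bf: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """p(M | y) proportional to BF(M, M_k) * p(M), normalised over the model set."""
    w = np.exp(log_bf - log_bf.max()) * prior     # shift by max for numerical stability
    return w / w.sum()

def model_average(post_probs: np.ndarray, delta_by_model: np.ndarray) -> float:
    """hat(Delta) = sum_M p(M | y) * E[Delta | M, y]."""
    return float(post_probs @ delta_by_model)

log_bf = np.array([0.0, 2.3, -1.1])               # log BF(M, M_k) for three models
prior = np.full(3, 1 / 3)                         # equal prior model probabilities
delta = np.array([0.8, 1.1, 0.5])                 # E[Delta | M, y] for each model
probs = posterior_model_probs(log_bf, prior)
print(probs, model_average(probs, delta))
```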

Variable Selection for Linear Models

For linear models of the form

$$(y_1, \dots, y_n)^\top \sim \mathrm{N}_n\bigg(\beta_0 1_n + \sum_{j=1}^p \beta_j X_j,\ \Psi^{-1}\bigg),$$

the prior

$$(\beta_1, \dots, \beta_p)^\top \sim \mathrm{N}_p(0, \Lambda X^\top \Psi X \Lambda)$$

is an equivalent I-prior representation of model (1) subject to (2) in the feature space of $\beta$ under the linear kernel (cf. here).
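
As a concrete illustration, here is a minimal sketch of forming this prior covariance matrix, assuming independent errors with common precision $\psi$ (so $\Psi = \psi I_n$) and a diagonal $\Lambda$ holding one scale parameter per predictor; dimensions and values are illustrative.

```python
# Minimal sketch: the I-prior covariance Lambda X' Psi X Lambda for the
# regression coefficients of a linear model, assuming iid errors with
# precision psi and one scale parameter per predictor.
import numpy as np

def iprior_cov_linear(X: np.ndarray, lam: np.ndarray, psi: float) -> np.ndarray:
    """Prior covariance of (beta_1, ..., beta_p) under the linear kernel."""
    Lam = np.diag(lam)                 # Lambda: scale parameters on the diagonal
    Psi = psi * np.eye(X.shape[0])     # error precision matrix
    return Lam @ X.T @ Psi @ X @ Lam

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))       # n = 50 observations, p = 3 predictors
cov_beta = iprior_cov_linear(X, lam=np.array([1.0, 0.5, 2.0]), psi=1.5)
print(cov_beta.shape)                   # (3, 3)
```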

Stochastic Search Methods

While posterior model probabilities could in principle be enumerated one by one, with a large $p$ it quickly becomes infeasible to evaluate all $2^p$ of them. Various stochastic search methods based on Gibbs sampling exist in the literature, whereby only models with appreciable posterior probability tend to be visited in the MCMC chain.

One such method is that of Kuo and Mallick (1998). Each model is indexed by $\gamma \in \{0,1\}^p$, with a $j$th value of zero indicating exclusion, and one indicating inclusion, of the variable $X_j$. The above model (4) is estimated via Gibbs sampling, but with the mean of $y$ revised to incorporate $\gamma$:

$$\mu_y = \beta_0 + \gamma_1 \beta_1 X_1 + \dots + \gamma_p \beta_p X_p.$$

This is done in conjunction with an I-prior on $\beta$ as in (5), together with suitable priors on $\gamma$, the intercept $\beta_0$, the scale parameters $\Lambda$, and the error precision $\Psi$.
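
To make the role of $\gamma$ concrete, here is a minimal sketch of the revised mean structure for a single Gibbs iteration, given current draws of the intercept, coefficients, and inclusion indicators; the full conditional updates for $\beta$, $\gamma$, $\Lambda$, and $\Psi$ are not shown.

```python
# Minimal sketch of the revised mean in the Kuo-Mallick scheme: given the
# current Gibbs draws of beta_0, beta and the inclusion indicators gamma, only
# variables with gamma_j = 1 contribute to the mean of y.
import numpy as np

def kuo_mallick_mean(X: np.ndarray, beta0: float,
                     beta: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """mu_y = beta_0 + sum_j gamma_j * beta_j * X_j."""
    return beta0 + X @ (gamma * beta)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))                       # illustrative data
mu = kuo_mallick_mean(X, beta0=0.3,
                      beta=np.array([1.2, -0.7, 0.4, 0.9]),
                      gamma=np.array([1, 0, 1, 0]))    # current inclusion draws
print(mu.shape)                                         # (20,)
```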

Posterior inclusion probabilities for a particular variable $X_j$ can be estimated as $\frac{1}{T} \sum_{t=1}^T \gamma_j^{(t)}$, where $\gamma_j^{(t)}$ is the $t$th MCMC sample of $\gamma_j$. This gives an indication of how often the variable is included across the models visited by the sampler.

More importantly, posterior model probabilities may be estimated by the proportion of MCMC samples in which a particular sequence $\gamma$, corresponding to a particular model $M_\gamma$, appears.
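
A short sketch of both summaries, assuming the retained $\gamma$ draws from the sampler have been collected into a $T \times p$ binary array; the draws shown are illustrative.

```python
# Minimal sketch: posterior inclusion probabilities and posterior model
# probabilities computed from T retained draws of gamma (a T x p binary array).
from collections import Counter
import numpy as np

def inclusion_probs(gamma_draws: np.ndarray) -> np.ndarray:
    """PIP of each variable: the mean of gamma_j over the MCMC draws."""
    return gamma_draws.mean(axis=0)

def model_probs(gamma_draws: np.ndarray) -> dict:
    """Proportion of draws in which each distinct gamma (model M_gamma) appears."""
    T = gamma_draws.shape[0]
    counts = Counter(tuple(row) for row in gamma_draws.astype(int).tolist())
    return {model: c / T for model, c in counts.items()}

draws = np.array([[1, 0, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [0, 0, 1]])
print(inclusion_probs(draws))      # PIPs: 0.8, 0.2, 1.0
print(model_probs(draws))          # e.g. {(1, 0, 1): 0.6, ...}
```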

Simulation Study

A study was conducted to assess the performance of the I-prior, $g$-prior, independent prior, and Lasso in choosing the correct variables across five scenarios characterised by the signal-to-noise ratio (SNR). For each scenario, out of 100 variables $X_1, \dots, X_{100}$ with pairwise correlation of about 0.5, only $s$ were selected to form the "true" model and generate the responses according to the linear model above. The SNR, expressed as a percentage, is defined as $s\%$, and the five scenarios vary the SNR from high to low: 90%, 75%, 50%, 25%, and 10%.
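
For concreteness, here is a minimal sketch of this data-generating setup, assuming equicorrelated Gaussian predictors (pairwise correlation 0.5) and unit coefficients on the $s$ "true" variables; the sample size, coefficient values, and noise level are placeholders rather than the exact settings used in the study.

```python
# Minimal sketch of the simulation setup: equicorrelated predictors and a
# response generated from s "true" variables (placeholder settings).
import numpy as np

def generate_data(n: int, p: int, s: int, rho: float = 0.5,
                  sigma: float = 1.0, seed: int = 0):
    rng = np.random.default_rng(seed)
    # X_j = sqrt(rho) * z + sqrt(1 - rho) * e_j gives pairwise correlation rho
    z = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:s] = 1.0                     # the first s variables form the "true" model
    y = X @ beta + sigma * rng.standard_normal(n)
    return y, X

y, X = generate_data(n=150, p=100, s=25)   # e.g. the SNR = 25% scenario
```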

The experiment was conducted as follows:

  1. For each scenario, generate data (y,X).
  2. Obtain the highest probability model and count the number of false choices made by each method.
  3. Repeat 1-2 100 times to obtain averages.
  4. Repeat 1-3 for each of the five scenarios.

The results are tabulated below:

Table 1: Results for I-priors.

| False choices | SNR 90% | SNR 75% | SNR 50% | SNR 25% | SNR 10% |
|---------------|---------|---------|---------|---------|---------|
| 0-2           | 0.93    | 0.92    | 0.90    | 0.79    | 0.55    |
| 3-5           | 0.07    | 0.07    | 0.10    | 0.20    | 0.27    |
| >5            | 0.00    | 0.01    | 0.00    | 0.01    | 0.18    |

Table 2: Results for g-priors.

| False choices | SNR 90% | SNR 75% | SNR 50% | SNR 25% | SNR 10% |
|---------------|---------|---------|---------|---------|---------|
| 0-2           | 0.00    | 0.00    | 0.00    | 0.78    | 0.86    |
| 3-5           | 0.00    | 0.00    | 0.00    | 0.14    | 0.03    |
| >5            | 1.00    | 1.00    | 1.00    | 0.08    | 0.01    |

Table 3: Results for independent priors.

| False choices | SNR 90% | SNR 75% | SNR 50% | SNR 25% | SNR 10% |
|---------------|---------|---------|---------|---------|---------|
| 0-2           | 0.00    | 0.00    | 0.00    | 0.44    | 1.00    |
| 3-5           | 0.00    | 0.00    | 0.00    | 0.30    | 0.00    |
| >5            | 1.00    | 1.00    | 1.00    | 0.26    | 0.00    |

Table 4: Results for the Lasso.

| False choices | SNR 90% | SNR 75% | SNR 50% | SNR 25% | SNR 10% |
|---------------|---------|---------|---------|---------|---------|
| 0-2           | 0.00    | 0.00    | 0.00    | 0.00    | 0.00    |
| 3-5           | 0.28    | 0.08    | 0.00    | 0.00    | 0.00    |
| >5            | 0.72    | 0.92    | 1.00    | 1.00    | 1.00    |
