# Regression with I-priors

This section describes choosing appropriate functions depending on the desired effect of the covariates on the response variable. The output of an I-prior model is a posterior distribution for the regression function of interest. Some computational hurdles are also described.

## A Unifying Methodology for Regression

One of the advantages of I-prior regression is that it provides a unifying methodology for a variety of statistical models. This is done by selecting one or several appropriate RKHSs for the regression problem at hand. Here we illustrate the fitting of several types of models based on the general regression model

\begin{align} \begin{gathered} y_i = f(x_i) + \epsilon_i \\ (\epsilon_1, \dots, \epsilon_n)^\top \sim \text{N}_n(0, \Psi^{-1}) \\ \end{gathered} \end{align}

and an I-prior on $f \in \mathcal{F}$ (an RKHS).

### Linear Regression

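For simple linear regression, take $\mathcal{F}$ to be the canonical linear RKHS with kernel $h_\lambda = \lambda h$, where $h(x, x') = x x'$.

For multiple regression with $p$ real-valued covariates, i.e. $x_i = (x_{i1}, \dots, x_{ip})$, assume that

\begin{align} f(x_i) = f_1(x_{i1}) + \cdots + f_p(x_{ip}), \nonumber \end{align}

where each $f_j \in \mathcal{F}_j$, the canonical linear RKHS with kernel $h_{\lambda_j}$. The kernel

\begin{align} h_\lambda = h_{\lambda_1} + \cdots + h_{\lambda_p} \nonumber \end{align}

then defines the RKHS $\mathcal{F} = \mathcal{F}_1 \oplus \cdots \oplus \mathcal{F}_p$.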

Here we have assumed that all covariates are real-valued. Any nominal-valued covariate can be treated with the Pearson kernel instead. Higher-order terms (squared, cubic, and so on) can be included by using the polynomial kernel.
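As a rough illustration of the additive linear kernel described above, the following NumPy sketch builds the kernel matrix as a scaled sum of per-covariate canonical kernels. The function names `linear_kernel` and `additive_linear_kernel` are hypothetical, the covariates are left uncentred, and the scale parameters are fixed by hand rather than estimated.

```python
import numpy as np

def linear_kernel(x, y=None):
    """Canonical linear kernel h(x, x') = x * x' for one real-valued covariate."""
    x = np.asarray(x, dtype=float)
    y = x if y is None else np.asarray(y, dtype=float)
    return np.outer(x, y)

def additive_linear_kernel(X, lam):
    """Scaled sum of per-covariate kernels: sum_j lam[j] * h(x_j, x_j')."""
    X = np.asarray(X, dtype=float)
    return sum(l * linear_kernel(X[:, j]) for j, l in enumerate(lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 observations, 3 covariates (toy data)
lam = np.array([0.5, 1.0, 2.0])    # illustrative scale parameters
H = additive_linear_kernel(X, lam)
```

The resulting matrix coincides with `X @ np.diag(lam) @ X.T`, which is the form exploited later for faster computation.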

### Smoothing Models

Smoothing models are set up as above, except that the fractional Brownian motion (fBm) kernel is used in place of the canonical linear kernel. Other choices include the squared exponential kernel, or even the polynomial kernel with degree two or greater, though the fBm kernel is much preferred for I-prior smoothing.
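For concreteness, here is a minimal sketch of an fBm kernel for a single real-valued covariate, using the standard form $h(x, x') = \tfrac{1}{2}(|x|^{2\gamma} + |x'|^{2\gamma} - |x - x'|^{2\gamma})$ with Hurst coefficient $\gamma \in (0, 1)$. The function name `fbm_kernel` and its `hurst` argument are hypothetical, and the kernel is left uncentred, which may differ from what a particular I-prior implementation does.

```python
import numpy as np

def fbm_kernel(x, y=None, hurst=0.5):
    """Uncentred fBm kernel for 1-D inputs, with Hurst coefficient `hurst`."""
    x = np.asarray(x, dtype=float).reshape(-1)
    y = x if y is None else np.asarray(y, dtype=float).reshape(-1)
    g = 2.0 * hurst
    ax = np.abs(x)[:, None] ** g                 # |x|^(2 * hurst)
    ay = np.abs(y)[None, :] ** g                 # |x'|^(2 * hurst)
    d = np.abs(x[:, None] - y[None, :]) ** g     # |x - x'|^(2 * hurst)
    return 0.5 * (ax + ay - d)

x = np.linspace(0.1, 1.0, 6)
H = fbm_kernel(x, hurst=0.5)
```

With `hurst=0.5` and positive inputs this reduces to the Brownian motion kernel $\min(x, x')$, which gives a quick sanity check.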

### Multilevel Models

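Suppose we had observations $\{(y_{ij}, x_{ij})\}$ for each unit $i$ in group $j$. We can model this using the function

\begin{align} f(x_{ij}, j) = f_1(x_{ij}) + f_2(j) + f_{12}(x_{ij}, j). \nonumber \end{align}

Here we assume a linear effect on the covariates $x_{ij}$ and a nominal effect on the groups. Thus, $f_1 \in \mathcal{F}_1$ (linear RKHS) and $f_2 \in \mathcal{F}_2$ (Pearson RKHS). Also assume that $f_{12} \in \mathcal{F}_{12}$, the so-called tensor product $\mathcal{F}_1 \otimes \mathcal{F}_2$. The kernel given by

\begin{align} h_\lambda = h_{\lambda_1} + h_{\lambda_2} + h_{\lambda_1} h_{\lambda_2} \nonumber \end{align}

defines the reproducing kernel for $\mathcal{F} = \mathcal{F}_1 \oplus \mathcal{F}_2 \oplus \mathcal{F}_{12}$. This gives results similar to a varying slopes model, while the model without the “interaction” effect gives results similar to a varying intercept model.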
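The sketch below assembles a kernel matrix of this form: a linear kernel on the covariate, a Pearson kernel on the group label, and their elementwise product for the interaction. It assumes the usual Pearson kernel $h(j, j') = \mathbb{1}[j = j'] / p_j - 1$ with $p_j$ taken as the sample proportion of group $j$. The function names and the positive scale values are illustrative only.

```python
import numpy as np

def pearson_kernel(g):
    """Pearson kernel 1[j == j'] / p_j - 1, with p_j the sample proportion of level j."""
    g = np.asarray(g).reshape(-1)
    _, codes, counts = np.unique(g, return_inverse=True, return_counts=True)
    codes = codes.reshape(-1)
    prop = counts / g.size
    same = codes[:, None] == codes[None, :]
    return same / prop[codes][:, None] - 1.0

def multilevel_kernel(x, g, lam1, lam2):
    """Main effects plus interaction: H1 + H2 + H1 * H2 (elementwise product)."""
    x = np.asarray(x, dtype=float)
    H1 = lam1 * np.outer(x, x)        # scaled linear kernel on the covariate
    H2 = lam2 * pearson_kernel(g)     # scaled Pearson kernel on the group label
    return H1 + H2 + H1 * H2

rng = np.random.default_rng(0)
g = np.repeat(np.arange(3), 4)        # 3 groups with 4 units each (toy data)
x = rng.normal(size=g.size)
H = multilevel_kernel(x, g, lam1=0.8, lam2=0.3)
```

Dropping the `H1 * H2` term gives the kernel without the interaction effect, corresponding to the varying intercept analogue.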

The I-prior method estimates only three parameters: two RKHS scale parameters and the error precision. Standard multilevel models estimate up to six (two fixed effects, namely an intercept and a slope, three variances, and one covariance), while also having to ensure that the estimated covariance matrix remains positive definite.

### Longitudinal Models

Now suppose we had observations $\{(y_{it}, x_{it})\}$ for each unit or individual $i$ measured at time $t$. We have several choices for a model, such as modelling responses over time only:

\begin{align} f(x_{it}, t) = f(t) \nonumber \end{align}

or including $x_{it}$ as an explanatory variable:

\begin{align} f(x_{it}, t) = f_1(x_{it}) + f_2(t) + f_{12}(x_{it}, t), \nonumber \end{align}

which would then be similar to the multilevel model, except that we don’t necessarily have to assume a nominal effect of time. We can instead choose either a linear or smooth effect of time by choosing the canonical or fBm RKHS respectively for $\mathcal{F}_2$, the RKHS of $f_2$.

Note that the interaction effect between time and the explanatory variable must be present to model longitudinal effects. Without it, the explanatory variable would have no time-varying effect.

### Models with Functional Covariates

By considering the functional covariates to be in a Sobolev-Hilbert space of continuous functions, we can apply either the linear kernel or fBm kernel on the discretised first differences, similar to a linear or smoothing model.

## Posterior Regression Function

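Estimation methods described in the Introduction page yield a posterior distribution for our regression function which is normally distributed with mean

\begin{align} \hat f(x) = f_0(x) + \mathbf{h}_\lambda(x)^\top \Psi H_\lambda V_y^{-1} (y - \mathbf{f}_0) \nonumber \end{align}

and variance

\begin{align} \hat\sigma_f^2(x) = \mathbf{h}_\lambda(x)^\top V_y^{-1} \mathbf{h}_\lambda(x). \nonumber \end{align}

Here $f_0$ is the prior mean function with $\mathbf{f}_0 = (f_0(x_1), \dots, f_0(x_n))^\top$, $\mathbf{h}_\lambda(x) = (h_\lambda(x, x_1), \dots, h_\lambda(x, x_n))^\top$, $H_\lambda$ is the $n \times n$ kernel matrix, and $V_y = H_\lambda \Psi H_\lambda + \Psi^{-1}$. The posterior mean can then be taken as a point estimate for $f(x)$, the evaluation of the function $f$ at a point $x$. As this has a distribution, we can also obtain credible intervals for the estimate. The equation defining $\hat\sigma_f^2(x)$ can also be used to obtain the covariance between two points $x$ and $x'$.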

The posterior mean is the same as the posterior mean for a prediction at a new point $x_\text{new}$, but the posterior variance is slightly different, since a prediction also carries the variance of its own error term.
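A minimal sketch of these posterior computations follows. It assumes the standard I-prior representation $f(x) = f_0(x) + \sum_i h_\lambda(x, x_i) w_i$ with $w \sim \text{N}_n(0, \Psi)$, so that the posterior precision of $w$ is $V_y = H_\lambda \Psi H_\lambda + \Psi^{-1}$. It further assumes i.i.d. errors, $\Psi = \psi I_n$, a constant prior mean, and hyperparameters that are already known. The function name `iprior_posterior` is hypothetical.

```python
import numpy as np

def iprior_posterior(H, psi, y, H_new=None, f0=0.0):
    """Posterior mean and variance of f at new points, assuming Psi = psi * I.

    H is the n x n scaled kernel matrix; H_new (m x n) holds kernel values
    between the new points and the training points (defaults to H).
    """
    n = H.shape[0]
    H_new = H if H_new is None else H_new
    Vy = psi * H @ H + np.eye(n) / psi            # H Psi H + Psi^{-1}
    w_hat = psi * H @ np.linalg.solve(Vy, y - f0) # posterior mean of w
    mean = f0 + H_new @ w_hat
    S = np.linalg.solve(Vy, H_new.T)              # V_y^{-1} h(x) for each new x
    var = np.sum(H_new.T * S, axis=0)             # h(x)^T V_y^{-1} h(x)
    return mean, var

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)  # toy data
H = 1.5 * np.outer(x, x)                          # scaled linear kernel, lambda = 1.5
mean, var = iprior_posterior(H, psi=100.0, y=y, f0=y.mean())
```

Approximate 95% credible intervals for $f(x)$ could then be formed as `mean ± 1.96 * np.sqrt(var)`.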

## Computational Hurdles

Computational complexity is dominated by an $n \times n$ matrix inversion, which is $O(n^3)$. In the case of Newton-based approaches, this needs to be evaluated at each Newton step. In the case of the EM algorithm, every update cycle involves such an inversion. For stochastic MCMC sampling methods, the inversion occurs in each sampling step.

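Suppose that $H_\lambda = QQ^\top$, with $Q$ an $n \times q$ matrix, is a valid low-rank decomposition. Then the inverse of $V_y = H_\lambda \Psi H_\lambda + \Psi^{-1}$,

\begin{align} V_y^{-1} = \Psi - \Psi Q \big( (Q^\top \Psi Q)^{-1} + Q^\top \Psi Q \big)^{-1} Q^\top \Psi, \nonumber \end{align}

obtained via the Woodbury matrix identity, is a much cheaper operation, especially if $q \ll n$.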
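The Woodbury step can be checked numerically. The sketch below assumes $\Psi = \psi I_n$ and a matrix $V_y = H_\lambda \Psi H_\lambda + \Psi^{-1}$ with $H_\lambda = QQ^\top$, and compares the low-rank formula against a direct inverse. The function name `vy_inverse_lowrank` is hypothetical.

```python
import numpy as np

def vy_inverse_lowrank(Q, psi):
    """Inverse of V_y = H Psi H + Psi^{-1} for H = Q Q^T and Psi = psi * I.

    Uses the Woodbury identity, so only q x q systems are solved.
    """
    n, q = Q.shape
    C = psi * (Q.T @ Q)                 # Q^T Psi Q, a q x q matrix
    inner = np.linalg.inv(C) + C        # (Q^T Psi Q)^{-1} + Q^T Psi Q
    return psi * np.eye(n) - psi**2 * Q @ np.linalg.solve(inner, Q.T)

rng = np.random.default_rng(2)
n, q, psi = 50, 3, 2.0
Q = rng.normal(size=(n, q))             # toy low-rank factor
H = Q @ Q.T
Vy = psi * H @ H + np.eye(n) / psi
direct = np.linalg.inv(Vy)              # O(n^3) reference
lowrank = vy_inverse_lowrank(Q, psi)
```

In practice one would rarely form the full $n \times n$ inverse. Applying the same formula to a vector keeps both time and memory linear in $n$.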

Storage requirements for I-prior models are $O(n^2)$.

### Canonical Kernel Matrix

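Under the canonical linear kernel, the kernel matrix is given by

\begin{align} H_\lambda = X \Lambda X^\top, \nonumber \end{align}

where $X$ is the $n \times p$ design matrix of the covariates, and $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_p)$ is the diagonal matrix of RKHS scale parameters. The rank of $H_\lambda$ is at most $p$, and typically $p \ll n$, so any algorithm for estimating the I-prior model can be done in $O(np^2)$ time instead.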
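To illustrate the saving, the sketch below solves a system in $V_y = H_\lambda \Psi H_\lambda + \Psi^{-1}$ for a linear kernel matrix $H_\lambda = X \Lambda X^\top$ without ever forming an $n \times n$ matrix. It assumes $\Psi = \psi I_n$, strictly positive scale parameters (so that $Q = X \Lambda^{1/2}$ is real), and a design matrix of full column rank. The function name `vy_solve_linear` is hypothetical.

```python
import numpy as np

def vy_solve_linear(X, lam, psi, r):
    """Compute V_y^{-1} r for H = X diag(lam) X^T and Psi = psi * I in O(n p^2)."""
    Q = X * np.sqrt(lam)                # n x p factor with Q Q^T = X diag(lam) X^T
    C = psi * (Q.T @ Q)                 # p x p, costs O(n p^2)
    inner = np.linalg.inv(C) + C
    return psi * r - psi**2 * Q @ np.linalg.solve(inner, Q.T @ r)

rng = np.random.default_rng(3)
n, p, psi = 40, 3, 2.0
X = rng.normal(size=(n, p))             # toy design matrix
lam = np.array([0.5, 1.0, 2.0])         # illustrative positive scale parameters
r = rng.normal(size=n)
out = vy_solve_linear(X, lam, psi, r)
```

The only dense objects kept in memory are $n \times p$ and $p \times p$.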

### The Nyström Method

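In other cases, such as under the fBm kernel, the matrix $H_\lambda$ is full rank. Partition the kernel matrix as

\begin{align} H_\lambda = \begin{pmatrix} A & B^\top \\ B & C \end{pmatrix}, \nonumber \end{align}

where $A$ is $m \times m$ and $B$ is $(n - m) \times m$. The Nyström method provides an approximation to $C$ by manipulating the eigenvectors and eigenvalues of $A$ together with the matrix $B$ to give

\begin{align} H_\lambda \approx \begin{pmatrix} V \\ B V U^{-1} \end{pmatrix} U \begin{pmatrix} V \\ B V U^{-1} \end{pmatrix}^\top, \nonumber \end{align}

where $U$ is the diagonal matrix containing the $m$ eigenvalues of $A$, and $V$ is the corresponding matrix of eigenvectors. An orthogonal version of this approximation is of interest, and can be achieved in $O(m^3)$. Estimation of I-prior models using the Nyström method takes $O(nm^2)$ time and $O(nm)$ storage, which is beneficial if $m \ll n$.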

There are many methods describing how to partition $H_\lambda$, but it can be done as simply as randomly sampling $m$ rows/columns without replacement to form the matrix $A$. This method works well in practice.
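A minimal sketch of the Nyström approximation with random sampling is given below. It takes an $m \times m$ block $A$ from randomly chosen rows/columns, the block $B$ of the remaining rows against the sampled columns, and forms the factor $W = (V^\top, (B V U^{-1})^\top)^\top$ from the eigendecomposition $A = V U V^\top$. The function name `nystrom` is hypothetical, the non-orthogonal version is shown, and the full kernel matrix is built here only so the approximation can be checked. A real implementation would compute just the sampled columns.

```python
import numpy as np

def nystrom(H, m, rng):
    """Return (perm, W, u) such that H[perm][:, perm] is approximated by W diag(u) W^T."""
    n = H.shape[0]
    idx = rng.choice(n, size=m, replace=False)    # sampled rows/columns
    rest = np.setdiff1d(np.arange(n), idx)
    perm = np.concatenate([idx, rest])
    A = H[np.ix_(idx, idx)]                       # m x m block
    B = H[np.ix_(rest, idx)]                      # (n - m) x m block
    u, V = np.linalg.eigh(A)                      # eigenvalues and eigenvectors of A
    W = np.vstack([V, (B @ V) / u])               # stack V on top of B V U^{-1}
    return perm, W, u

rng = np.random.default_rng(4)
n, m = 200, 20
x = np.linspace(0.01, 1.0, n)
H = np.minimum.outer(x, x)        # Brownian motion kernel (fBm with Hurst 0.5)
perm, W, u = nystrom(H, m, rng)
H_approx = (W * u) @ W.T          # n x n, formed only for checking
```

By construction the first $m$ rows and columns of the permuted matrix are reproduced exactly. Only the block $C$ is approximated, by $B A^{-1} B^\top$.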