Chapter 7
Conclusions and Further Work

Application of a cutting edge statistical methodology (INLA) to a large scale palaeoclimate reconstruction project has delivered two important research contributions. Firstly, the modelling associated with the palaeoclimate problem has been advanced. In particular, inference is now performed quickly, without recourse to MCMC. Secondly, the INLA methodology itself is challenged and extended. A method for fast cross-validation in the inverse sense is introduced. The richer models developed for the palaeoclimate problem are guided by the fast model validation procedure.

Both of these contributions are the subject of ongoing research. No claim is made to have developed a finished model for the palaeoclimate project; indeed the imperfection of the models contained in this thesis is demonstrated. Further extensions of the INLA method to cope with higher dimensions and modelling of data that are not conditionally independent is desirable.

7.1 Conclusions

The motivating pollen dataset is massively zero-inflated. In Haslett et al. (2006), this over-abundance of zero counts was dealt with via an overdispersion model. This method underestimates the mean of the response surfaces and overestimates the variance.

Zero-modified distributions are a flexible class of model that can account for zero-inflation of counts data. Typically, these models require an extra parameter to model the probability of potential presence. For spatial data, modeling of non-parametric response surfaces for zero-inflated counts data in this way requires doubling the number of latent parameters. MCMC inference may slow drastically due to this large increase in random variables.

A single-process zero-modified distribution is therefore developed that requires a single extra (hyper)parameter. This model is valid for data in which the probability of extra zeros and the abundance, given presence, are functions of a single underlying process. Justification of such a model is provided for the pollen dataset and a simple, yet flexible model for this data is constructed. Synergy of this model with the emerging INLA inference procedure is demonstrated, as is incompatibility of INLA with traditional zero-inflation models.

Multivariate counts data, constrained to sum to a total, may exhibit high correlation, even after taking the constraint into account. Modelling such dependencies is inherently difficult. Nested models may provide a solution for breaking down such dependencies in compositional counts data. The main advantage of such models is that, given the nesting structure, the joint model will disjoint-decompose exactly. This means that inference may be performed on many separate, smaller problems in parallel. More importantly, decomposition of the joint model is required in order to apply the fast INLA inference methods.

Gaussian and Laplace approximations are fitted to the posteriors for the parameters and hyperparameters of the pollen dataset. This results in a dramatic reduction in both the forward and inverse stages of the non-parametric inference.

The forward fitting stage using approximations takes approximately 40 minutes for all 32 pollen types in the nested model. For a similar, yet cruder (non-zero inflated), model MCMC runs were previously ran for up to several weeks. Cross-validation was impracticable and hyperparameters had to be fixed a-priori.

The marginals model required for use of the INLA methods may break down with large numbers of correlated taxa. This can be tested for using inverse cross-validation statistics such as the percentage of training data points that lie outside their corresponding 95% highest posterior predictive density region.

If the data are in fact nested, then this enables decomposition of the problem. This cannot be achieved without knowledge of the structure of the nesting. These nested models are a novel aspect of compositional data analysis. One such nesting structure, given by expert opinion, is applied to inference on the pollen dataset, with promising results.

Cross-validation is an important tool in model validation. A contribution in this thesis is a method for performing fast cross-validation in the inverse sense, using the INLA inference procedure. Approximating the saturated posterior for the latent variables with a multivariate normal permits fast updates to be made to correct for leaving out a single datum. This procedure has application in any spatial context where the interest is in the inverse problem and the forward model posterior is approximately Gaussian.

7.2 Further Work

This is perhaps the most important section in the thesis as there are several outstanding challenges in this project. Some of these are already the subject of preliminary investigation.

7.2.1 3 Dimensional Climate Space

The RS10 pollen dataset includes more than just two climate variables. Expert opinion in the botany community advocates using at least three in building the response surfaces.

In theory, the Gaussian approximation works in the same way in any number of dimensions; however, a second order intrinsic GMRF prior (such as used in Chapter 6) is far less sparse in 3D than in 2D. Thus, the fast sparse-matrix algorithms employed in fitting a GMRF approximation to the posterior will be disproportionally more labourious. In addition, the grid size G scales as GD where D is the dimension. Even moving up a single climate dimension from 2D to 3D can cause memory to become an issue as a single realisation of a response surface goes from taking around 50 kilobytes to around 2 megabytes for a grid with sides of length G 50. Cross-validation, involving a unique inverse predictive distribution for each datum, thus takes up 7742 times more for the RS10 pollen dataset.

Preliminary results in 3D climate space are encouraging, but inference is slow. There are further coding issues, such as how to create a buffer zone in 3 dimensions.

7.2.2 Covariates

The outlier of highest D value was found to be from the sampling location of highest altitude. Altitude measurements are in fact available for all samples. Careful modelling of the altitude as a covariate to account for its effect should eliminate this problem. There are several options for dealing with such nuisance covariates; they may be fully modelled as the climate variables are, leading to an increase in the dimensionality of the problem. A more crude treatment, such as fitting a linear smooth of the counts data to the altitudes might suffice.

7.2.3 Inference Procedures

For compatibility with fast approximate inference procedures, the inference-via-marginals model is required. This can lead to errors in the inverse problem if the model does not decompose exactly. Although the imposition of the nested structure greatly reduced these errors for the pollen dataset, the problem did not disappear altogether.

It should be noted that the nesting structure used here is based on the opinion of a single expert. Other nesting structures may well lead to a further reduction or even elimination of the dependence induced error. These structures can either be selected a-priori based on expert opinion, as here, or perhaps inferred from the data. The latter may prove to be a very interesting problem in itself.

Weighting the marginal predictive distributions with the inverse of the correlation between counts is one ad-hoc method to reduce the dependency / correlation of the response surfaces given climate. Another option might be to use the nest as specified, but only down to the lowest binary splittings.

The hyperparameters must be pre-specified at sensible values for the iterative search algorithm in the INLA method to converge to the mode. Trial and error is the current practise; this could be replaced by a crude but fast method based on the model and the data, which would further automate the inference procedure.

7.2.4 Model Validation

Two model validation summary statistics are used in the inverse sense; the number of training data that fall outside the 95% highest density region of the leave-one-out cross-validation posterior predictive distributions Δ and the mean expected squared distance to the observed climate D.

These statistics still require a scale to determine what values are significant. Monte-Carlo simulation could be used to simulate data from the fitted forward model; the test statistics Δ and D would then be calculated on this toy data. These inform what values these statistics take when the model fits the data. Repetition of this exercise should be used to build upper and lower bounds and confidence intervals for the statistics. The values pertaining to the real data would then be compared with these theoretical ranges.

Other cross-validation summary statistics should be developed for the inverse problem. Distance metrics that are commonly applied in forward cross-validation methods may be unsuitable as they frequently assume a unimodal predictive distribution. Inverse predictive distributions are commonly multimodal.

Preliminary work has begun on information criteria such as the Deviance Information Criterion for model comparison. The difficulty in the inverse setting is that the normalising constant for the inverse model is not known. In fact, unlike the forward model (likelihood), it is a function of the model parameters and thus the mean of the deviance is difficult to compute.