Application of a cutting edge statistical methodology (INLA) to a large scale palaeoclimate reconstruction project has delivered two important research contributions. Firstly, the modelling associated with the palaeoclimate problem has been advanced. In particular, inference is now performed quickly, without recourse to MCMC. Secondly, the INLA methodology itself is challenged and extended. A method for fast cross-validation in the inverse sense is introduced. The richer models developed for the palaeoclimate problem are guided by the fast model validation procedure.
Both of these contributions are the subject of ongoing research. No claim is made to have developed a finished model for the palaeoclimate project; indeed the imperfection of the models contained in this thesis is demonstrated. Further extensions of the INLA method to cope with higher dimensions and modelling of data that are not conditionally independent is desirable.
The motivating pollen dataset is massively zero-inflated. In Haslett et al. (2006), this over-abundance of zero counts was dealt with via an overdispersion model. This method underestimates the mean of the response surfaces and overestimates the variance.
Zero-modified distributions are a flexible class of model that can account for zero-inflation of counts data. Typically, these models require an extra parameter to model the probability of potential presence. For spatial data, modeling of non-parametric response surfaces for zero-inflated counts data in this way requires doubling the number of latent parameters. MCMC inference may slow drastically due to this large increase in random variables.
A single-process zero-modified distribution is therefore developed that requires a single extra (hyper)parameter. This model is valid for data in which the probability of extra zeros and the abundance, given presence, are functions of a single underlying process. Justification of such a model is provided for the pollen dataset and a simple, yet flexible model for this data is constructed. Synergy of this model with the emerging INLA inference procedure is demonstrated, as is incompatibility of INLA with traditional zero-inflation models.
Multivariate counts data, constrained to sum to a total, may exhibit high correlation, even after taking the constraint into account. Modelling such dependencies is inherently difficult. Nested models may provide a solution for breaking down such dependencies in compositional counts data. The main advantage of such models is that, given the nesting structure, the joint model will disjoint-decompose exactly. This means that inference may be performed on many separate, smaller problems in parallel. More importantly, decomposition of the joint model is required in order to apply the fast INLA inference methods.
Gaussian and Laplace approximations are fitted to the posteriors for the parameters and hyperparameters of the pollen dataset. This results in a dramatic reduction in both the forward and inverse stages of the non-parametric inference.
The forward fitting stage using approximations takes approximately 40 minutes for all 32 pollen types in the nested model. For a similar, yet cruder (non-zero inflated), model MCMC runs were previously ran for up to several weeks. Cross-validation was impracticable and hyperparameters had to be fixed a-priori.
The marginals model required for use of the INLA methods may break down with large numbers of correlated taxa. This can be tested for using inverse cross-validation statistics such as the percentage of training data points that lie outside their corresponding 95% highest posterior predictive density region.
If the data are in fact nested, then this enables decomposition of the problem. This cannot be achieved without knowledge of the structure of the nesting. These nested models are a novel aspect of compositional data analysis. One such nesting structure, given by expert opinion, is applied to inference on the pollen dataset, with promising results.
Cross-validation is an important tool in model validation. A contribution in this thesis is a method for performing fast cross-validation in the inverse sense, using the INLA inference procedure. Approximating the saturated posterior for the latent variables with a multivariate normal permits fast updates to be made to correct for leaving out a single datum. This procedure has application in any spatial context where the interest is in the inverse problem and the forward model posterior is approximately Gaussian.
This is perhaps the most important section in the thesis as there are several outstanding challenges in this project. Some of these are already the subject of preliminary investigation.
The RS10 pollen dataset includes more than just two climate variables. Expert opinion in the botany community advocates using at least three in building the response surfaces.
In theory, the Gaussian approximation works in the same way in any number of dimensions; however, a second order intrinsic GMRF prior (such as used in Chapter 6) is far less sparse in 3D than in 2D. Thus, the fast sparse-matrix algorithms employed in fitting a GMRF approximation to the posterior will be disproportionally more labourious. In addition, the grid size G scales as GD where D is the dimension. Even moving up a single climate dimension from 2D to 3D can cause memory to become an issue as a single realisation of a response surface goes from taking around 50 kilobytes to around 2 megabytes for a grid with sides of length G