cqo {VGAM}R Documentation

Fitting Constrained Quadratic Ordination (CQO)

Description

A constrained quadratic ordination (CQO; formerly called canonical Gaussian ordination or CGO) model is fitted using the quadratic reduced-rank vector generalized linear model (QRR-VGLM) framework.

Usage

cqo(formula, family, data = list(), weights = NULL, subset = NULL,
    na.action = na.fail, etastart = NULL, mustart = NULL,
    coefstart = NULL, control = qrrvglm.control(...), offset = NULL,
    method = "cqo.fit", model = FALSE, x.arg = TRUE, y.arg = TRUE,
    contrasts = NULL, constraints = NULL, extra = NULL,
    smart = TRUE, ...)

Arguments

In this documentation, M is the number of linear predictors, S is the number of responses (species). Then M=S for Poisson and binomial species data, and M=2S for negative binomial and gamma distributed species data.

formula a symbolic description of the model to be fit. The RHS of the formula is applied to each linear predictor. Different variables in each linear predictor can be chosen by specifying constraint matrices.
family a function of class "vglmff" (see vglmff-class) describing what statistical model is to be fitted. This is called a ``VGAM family function''. See CommonVGAMffArguments for general information about many types of arguments found in this type of function. Currently the following families are supported: poissonff, binomialff (logit and cloglog links available), negbinomial, gamma2, gaussianff. Sometimes special arguments are required for cqo(), e.g., binomialff(mv=TRUE). Also, quasipoissonff and quasibinomialff may or may not work.
data an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which cqo is called.
weights an optional vector or matrix of (prior) weights to be used in the fitting process. Currently, this argument should not be used.
subset an optional logical vector specifying a subset of observations to be used in the fitting process.
na.action a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ``factory-fresh'' default is na.omit.
etastart starting values for the linear predictors. It is a M-column matrix. If M=1 then it may be a vector. Currently, this argument probably should not be used.
mustart starting values for the fitted values. It can be a vector or a matrix. Some family functions do not make use of this argument. Currently, this argument probably should not be used.
coefstart starting values for the coefficient vector. Currently, this argument probably should not be used.
control a list of parameters for controlling the fitting process. See qrrvglm.control for details.
offset This argument must not be used.
method the method to be used in fitting the model. The default (and presently only) method cqo.fit uses iteratively reweighted least squares (IRLS).
model a logical value indicating whether the model frame should be assigned in the model slot.
x.arg, y.arg logical values indicating whether the model matrix and response matrix used in the fitting process should be assigned in the x and y slots. Note the model matrix is the LM model matrix.
contrasts an optional list. See the contrasts.arg of model.matrix.default.
constraints an optional list of constraint matrices. The components of the list must be named with the term it corresponds to (and it must match in character format). Each constraint matrix must have M rows, and be of full-column rank. By default, constraint matrices are the M by M identity matrix unless arguments in the family function itself override these values. If constraints is used it must contain all the terms; an incomplete list is not accepted. Constraint matrices for x_2 variables are taken as the identity matrix.
extra an optional list with any extra information that might be needed by the family function.
smart logical value indicating whether smart prediction (smartpred) will be used.
... further arguments passed into qrrvglm.control.

Details

QRR-VGLMs or constrained quadratic ordination (CQO) models are estimated here by maximum likelihood estimation. Optimal linear combinations of the environmental variables are computed, called latent variables (these appear as lv for R=1 else lv1, lv2, etc. in the output). Here, R is the rank or the number of ordination axes. Each species' response is then a regression of these latent variables using quadratic polynomials on a transformed scale (e.g., log for Poisson counts, logit for presence/absence responses). The solution is obtained iteratively in order to maximize the log-likelihood function, or equivalently, minimize the deviance.

The central formula (for Poisson and binomial species data) is given by

eta = B_1^T x_1 + A nu + sum_{m=1}^M (nu^T D_m nu) e_m

where x_1 is a vector (usually just a 1 for an intercept), x_2 is a vector of environmental variables, nu=C^T x_2 is a R-vector of latent variables, e_m is a vector of 0s but with a 1 in the mth position. The eta are a vector of linear/additive predictors, e.g., the mth element is eta_m = log(E[Y_m]) for the mth species. The matrices B_1, A, C and D_m are estimated from the data, i.e., contain the regression coefficients. The tolerance matrices satisfy T_s = -(0.5 D_s^(-1). Many important CQO details are directly related to arguments in qrrvglm.control, e.g., the argument Norrr specifies which variables comprise x_1.

Theoretically, the four most popular VGAM family functions to be used with cqo correspond to the Poisson, binomial, normal, and negative binomial distributions. The latter is a 2-parameter model. All of these are implemented, as well as the 2-parameter gamma. The Poisson is or should be catered for by quasipoissonff and poissonff, and the binomial by quasibinomialff and binomialff. Those beginning with "quasi" have dispersion parameters that are estimated for each species.

For initial values, the function .Init.Poisson.QO should work reasonably well if the data is Poisson with species having equal tolerances. It can be quite good on binary data too. Otherwise the Cinit argument in qrrvglm.control can be used.

It is possible to relax the quadratic form to an additive model. The result is a data-driven approach rather than a model-driven approach, so that CQO is extended to constrained additive ordination (CAO) when R=1. See cao for more details.

Value

An object of class "qrrvglm". Note that the slot misc has a list component called deviance.Bestof which gives the history of deviances over all the iterations.

Warning

Local solutions are not uncommon when fitting CQO models. To increase the chances of obtaining the global solution, increase the value of the argument Bestof in qrrvglm.control. For reproducibility of the results, it pays to set a different random number seed before calling cqo (the function set.seed does this). The function cqo chooses initial values for C using .Init.Poisson.QO() if Use.Init.Poisson.QO=TRUE, else random numbers.

Unless ITolerances=TRUE or EqualTolerances=FALSE, CQO is computationally expensive. It pays to keep the rank down to 1 or 2. If EqualTolerances=TRUE and ITolerances=FALSE then the cost grows quickly with the number of species and sites (in terms of memory requirements and time). The data needs to conform quite closely to the statistical model, and the environmental range of the data should be wide in order for the quadratics to fit the data well (bell-shaped response surfaces). If not, RR-VGLMs will be more appropriate because the response is linear on the transformed scale (e.g., log or logit) and the ordination is called constrained linear ordination or CLO.

Like many regression models, CQO is sensitive to outliers (in the environmental and species data), sparse data, high leverage points, multicollinearity etc. For these reasons, it is necessary to examine the data carefully for these features and take corrective action (e.g., omitting certain species, sites, environmental variables from the analysis, transforming certain environmental variables, etc.). Any optimum lying outside the convex hull of the site scores should not be trusted. Fitting a CAO is recommended first, then upon transformations etc., possibly a CQO can be fitted.

For binary data, it is necessary to have `enough' data. In general, the number of sites n ought to be much larger than the number of species S, e.g., at least 100 sites for two species. Compared to count (Poisson) data, numerical problems occur more frequently with presence/absence (binary) data. For example, if Rank=1 and if the response data for each species is a string of all absences, then all presences, then all absences (when enumerated along the latent variable) then infinite parameter estimates will occur. In general, setting ITolerances=TRUE may help.

This function was formerly called cgo. It has been renamed to reinforce a new nomenclature described in Yee (2006).

Note

By default, a rank-1 equal-tolerances QRR-VGLM model is fitted (see qrrvglm.control for the default control parameters). The latent variables are always transformed so that they are uncorrelated. By default, the argument trace is TRUE meaning a running log is printed out while the computations are taking place. This is because the algorithm is computationally expensive, therefore users might think that their computers have frozen if trace=FALSE!

The argument Bestof in qrrvglm.control controls the number of models fitted (each uses different starting values) to the data. This argument is important because convergence may be to a local solution rather than the global solution. Using more starting values increases the chances of finding the global solution. Always plot an ordination diagram (use the generic function lvplot) and see if it looks sensible. Local solutions arise because the optimization problem is highly nonlinear, and this is particularly true for CAO.

Many of the arguments applicable to cqo are common to vglm and rrvglm.control. The most important arguments are Rank, Norrr, Bestof, ITolerances, EqualTolerances, isdlv, and MUXfactor.

When fitting a 2-parameter model such as the negative binomial or gamma, it pays to set EqualTolerances=TRUE and ITolerances=FALSE. This is because numerical problems can occur when fitting the model far away from the global solution when ITolerances=TRUE. Setting the two arguments as described will slow down the computation considerably, however it is numerically more stable.

In Example 1 below, an unequal-tolerances rank-1 QRR-VGLM is fitted to the hunting spiders dataset. In Example 2 below, an equal-tolerances rank-2 QRR-VGLM is fitted to the hunting spiders dataset. The numerical difficulties encountered in fitting the rank-2 model suggests a rank-1 model is probably preferable. In Example 3 below, constrained binary quadratic ordination (in old nomenclature, constrained Gaussian logit ordination) is fitted to some simulated data coming from a species packing model. With multivariate binary responses, one must use mv=TRUE to indicate that the response (matrix) is multivariate. Otherwise, it is interpreted as a single binary response variable.

Sometime in the future, this function might handle input of the form cqo(x, y), where x and y are matrices containing the environmental and species data respectively.

Author(s)

Thomas W. Yee

References

Yee, T. W. (2004) A new technique for maximum-likelihood canonical Gaussian ordination. Ecological Monographs, 74, 685–701.

ter Braak, C. J. F. and Prentice, I. C. (1988) A theory of gradient analysis. Advances in Ecological Research, 18, 271–317.

Yee, T. W. (2006) Constrained additive ordination. Ecology, 87, 203–213.

See Also

qrrvglm.control, Coef.qrrvglm, predict.qrrvglm, rcqo, cao, uqo, rrvglm, poissonff, binomialff, negbinomial, gamma2, lvplot.qrrvglm, persp.qrrvglm, trplot.qrrvglm, vglm, set.seed, hspider.

Documentation accompanying the VGAM package at http://www.stat.auckland.ac.nz/~yee contains further information and examples.

Examples

# Example 1; Fit an unequal tolerances model to the hunting spiders data
data(hspider)
hspider[,1:6]=scale(hspider[,1:6]) # Standardize the environmental variables
set.seed(1234)
p1 = cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi, Auloalbi, 
               Pardlugu, Pardmont, Pardnigr, Pardpull, Trocterr, Zoraspin) ~
         WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
         fam=poissonff, data=hspider, Crow1positive=FALSE, ITol=FALSE)
sort(p1@misc$deviance.Bestof) # A history of all the iterations
if(deviance(p1) > 1177) stop("suboptimal fit obtained")

## Not run: 
S = ncol(p1@y) # Number of species
clr = (1:(S+1))[-7] # omits yellow
lvplot(p1, y=TRUE, lcol=clr, pch=1:S, pcol=clr, las=1) # ordination diagram
legend("topright", leg=dimnames(p1@y)[[2]], col=clr,
       pch=1:S, merge=TRUE, bty="n", lty=1:S, lwd=2)
## End(Not run)
(cp = Coef(p1))

(a = cp@lv[cp@lvOrder])  # The ordered site scores along the gradient
# Names of the ordered sites along the gradient:
rownames(cp@lv)[cp@lvOrder]
(a = (cp@Optimum)[,cp@OptimumOrder]) # The ordered optima along the gradient
a = a[!is.na(a)] # Delete the species that is not unimodal
names(a)         # Names of the ordered optima along the gradient

## Not run: 
trplot(p1, whichSpecies=1:3, log="xy", type="b", lty=1, lwd=2,
       col=c("blue","red","green"), label=TRUE) -> ii # trajectory plot
legend(0.00005, 0.3, paste(ii$species[,1], ii$species[,2], sep=" and "),
       lwd=2, lty=1, col=c("blue","red","green"))
abline(a=0, b=1, lty="dashed")

S = ncol(p1@y) # Number of species
clr = (1:(S+1))[-7] # omits yellow
persp(p1, col=clr, label=TRUE, las=1) # perspective plot
## End(Not run)

# Example 2: A rank-2 equal tolerances CQO model with Poisson data
# This example is numerically fraught.
set.seed(555)
p2 = cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi, Auloalbi,
               Pardlugu, Pardmont, Pardnigr, Pardpull, Trocterr, Zoraspin) ~
         WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
         fam=poissonff, data=hspider, Crow1positive=FALSE,
#        ITol=FALSE, EqualTol=TRUE,
         Rank=2, Bestof=1, isdlv=c(2.1,0.9))
sort(p2@misc$deviance.Bestof) # A history of all the iterations
if(deviance(p2) > 1127) stop("suboptimal fit obtained")
## Not run: 
lvplot(p2, ellips=FALSE, label=TRUE, xlim=c(-3,4),
       C=TRUE, Ccol="brown", sites=TRUE, scol="grey", 
       pcol="blue", pch="+", chull=TRUE, ccol="grey")
## End(Not run)

# Example 3: species packing model with presence/absence data
set.seed(2345)
n = 200; p = 5; S = 5
mydata = rcqo(n, p, S, fam="binomial", hiabundance=4,
              EqualTol=TRUE, ESOpt=TRUE, EqualMax=TRUE)
myform = attr(mydata, "formula")
set.seed(1234)
b1 = cqo(myform, fam=binomialff(mv=TRUE, link="cloglog"), data=mydata)
sort(b1@misc$deviance.Bestof) # A history of all the iterations
## Not run: 
lvplot(b1, y=TRUE, lcol=1:S, pch=1:S, pcol=1:S, las=1)
## End(Not run)
Coef(b1)

# Compare the fitted model with the 'truth'
cbind(truth=attr(mydata, "ccoefficients"), fitted=ccoef(b1))

[Package VGAM version 0.7-7 Index]