Comparison of logistic regression methods and discrete choice model in the selection of habitats

Based on a review of recent data analyses of resource selection by animals, and on recent suggestions that a unified statistical theory showing how resource selection can be detected and measured is lacking, the authors suggest that the concept of the resource selection function (RSF) can be the basis for developing such a theory. Discrete choice models (DCM) are reviewed and suggested as an approach to estimating the RSF when the choices of an animal or group of animals involve different sets of available resource units. The definition of the RSF requires that the resource being studied consist of discrete units. The statistical method most often used to estimate the RSF is logistic regression, but the DCM can also be used. The theory of DCM has been well developed for the analysis of data sets involving choices of products by humans, but with some modifications it is also applicable to the choice of habitat by animals. Logistic regression is compared with the DCM for one choice because the coefficient estimates of the logistic regression model include an intercept, which is not present in the DCM. The objective of this work was to compare the estimates of the RSF obtained by applying logistic regression and the DCM to a data set on habitat selection by the spotted owl (Strix occidentalis) in the northwestern United States.


Introduction
Natural resources include materials found in nature that permit a species to survive. These resources can be renewable or non-renewable. Animal populations need these resources to survive. Differential selection of available resources is one of the primary factors that allow species to co-exist, and is therefore a priority in the preservation of endangered species (Rosenzweig, 1981). Consequently, an adequate supply of natural resources is needed to sustain animal populations, and the more effectively a species selects its resources, the better.
Under certain assumptions, the population density of an animal species depends on the availability of a resource in equilibrium (Fagen, 1988). Resource selection (RS) is used in studies to identify resources critical to an animal population and to predict the occurrence of the species. Frequently, animals are monitored individually and then grouped to estimate effects at the population level. Resource selection functions (RSFs) are statistical models that require the variables under study to consist of discrete units. The theory of discrete choice models has been well developed for analyses of human choice data (Train, 2003). McDonald et al. (2006) suggest that these models may also be modified and applied to animal choices. Motivated by these authors, this study compares the discrete choice model with the logistic regression model and, in this way, their coefficient estimates. Here the RSF is compared with the exponential resource selection function (RSF).
Sci. Agric. (Piracicaba, Braz.), v.67, n.3, p.327-333, May/June 2010

Material and Methods
This study uses data on the nocturnal activities of the spotted owl (Strix occidentalis), collected in two discrete areas (Klamath and Korbel) within the property of the Green Diamond Resource Company (GDRCo) in Del Norte and Humboldt counties of northwestern California, USA.
Twenty-eight areas occupied or used by owls during the nocturnal period were identified between April 1998 and September 2000, using radio-telemetry. McDonald et al. (2006) used back-pack harness mounted radio-transmitters, and in this way it was possible to verify that five owls resided in Klamath and twenty-three in Korbel. Forty-six explanatory variables were simultaneously observed (Table 1), resulting in a total of 8,739 observations (Ryan, 2004; McDonald et al., 2006).
According to McDonald et al. (2006), applications of discrete choice models generally assume that animals make a series of choices based on a finite set of discrete habitat units, known as choice sets. Other resource selection analyses include logistic regression, which is applied to a sample of used and unused resource units and assumes that choices are made from a set of available resource units.
Discrete choice models (DCM) are usually applied in situations where n sets of resource units (i = 1, 2, ..., n) are defined as available for selection, and one unit is chosen from each of the choice sets. It is assumed that the probability of selecting the j-th unit of the i-th choice set is proportional to $\exp(\beta_1 x_{ij1} + \beta_2 x_{ij2} + \dots + \beta_p x_{ijp})$, in which $\beta_1, \dots, \beta_p$ are coefficients to be estimated and $x_{ij1}, \dots, x_{ijp}$ are the values of p covariates measured on the j-th unit of the i-th choice set. The probability of the j-th unit being selected from the i-th choice set is then:

$$
p_{ij} = \frac{\exp(\beta_1 x_{ij1} + \dots + \beta_p x_{ijp})}{\sum_{k=1}^{n_i} \exp(\beta_1 x_{ik1} + \dots + \beta_p x_{ikp})} \tag{1}
$$

Then, for S independent choices, the likelihood function is equal to the product of the probabilities of the successful choices:

$$
L = \prod_{i=1}^{S} \prod_{j=1}^{n_i} p_{ij}^{\,y_{ij}} \tag{2}
$$
in which $y_{ij} = 1$ if the j-th resource unit is chosen from choice set i and $y_{ij} = 0$ otherwise, $n_i$ is the number of units in the i-th choice set, and $p_{ij}$ is the value given by expression (1). Maximum likelihood estimates of the parameters β are obtained by maximizing L with respect to these parameters. This also provides estimates of standard errors and allows significance tests.
According to Manly et al. (2002), logistic regression can in this case be used to relate the probability of use to the variables $x_1$ to $x_p$ that are measured on the resource units.
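The choice probabilities and the likelihood described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the covariate values and the coefficients are hypothetical and are not taken from the spotted owl data, and the log of the likelihood is used because it is easier to work with numerically.

```python
import math

def choice_probabilities(units, beta):
    """Selection probability of each unit in one choice set.

    units: list of covariate vectors, one per resource unit in the set.
    beta:  list of coefficients (no intercept, as in the DCM).
    """
    scores = [sum(b * x for b, x in zip(beta, unit)) for unit in units]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(choice_sets, chosen, beta):
    """Sum over choice sets of log p_ij for the unit actually chosen."""
    ll = 0.0
    for units, j in zip(choice_sets, chosen):
        ll += math.log(choice_probabilities(units, beta)[j])
    return ll

# Hypothetical data: two choice sets with p = 2 covariates per unit.
choice_sets = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.2, 0.3], [0.9, 0.1]],
]
chosen = [0, 1]        # index of the unit selected in each choice set
beta = [0.8, -0.4]     # hypothetical coefficient values

probs = choice_probabilities(choice_sets[0], beta)
ll = log_likelihood(choice_sets, chosen, beta)
```

Maximizing `log_likelihood` over `beta` with any numerical optimizer would give the maximum likelihood estimates; the optimizer itself is left out of this sketch.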
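Logistic regression is a special case of the DCM that allows for a binary choice. The resource selection probability function (RSPF) is simply assumed to take the form

$$
w^*(x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)} \tag{3}
$$

in which $x = (x_1, \dots, x_p)$ is the vector of values of the explanatory variables X. In the case of a DCM with one choice between two units (available or used), the probability of using the resource is

$$
\frac{\exp\left(\sum_{i=0}^{p} \beta_i x_{1i}\right)}{\exp\left(\sum_{i=0}^{p} \beta_i x_{0i}\right) + \exp\left(\sum_{i=0}^{p} \beta_i x_{1i}\right)}
$$

This probability can be rewritten as

$$
\frac{\exp\left(\sum_{i=0}^{p} \beta_i x_i\right)}{1 + \exp\left(\sum_{i=0}^{p} \beta_i x_i\right)}
$$

by letting $x_i = x_{1i} - x_{0i}$, $i = 0, \dots, p$, where $x_{10} = 1$ and $x_{00} = 0$. The logistic function has the desired property of restricting the values of $w^*(x)$ to between 0 and 1. When using logistic regression with census data, the assumption made is that there are N available resource units and that it is known which of these have been used and which have not been used after a single period of selection (Manly et al., 2002).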
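As a quick numerical check of the statement that the binary DCM reduces to logistic regression, the sketch below evaluates both forms for one pair of units. It assumes the two-unit DCM probability is exp(β·x₁) / [exp(β·x₀) + exp(β·x₁)], with the leading covariate set to 1 for the used unit and 0 for the available unit so that its coefficient plays the role of the intercept. All numbers and function names are hypothetical.

```python
import math

def linear_predictor(beta, x):
    """Sum of beta_i * x_i over i = 0, ..., p."""
    return sum(b * v for b, v in zip(beta, x))

def dcm_binary_probability(beta, x_used, x_avail):
    """Two-unit discrete choice probability of selecting the used unit."""
    num = math.exp(linear_predictor(beta, x_used))
    return num / (math.exp(linear_predictor(beta, x_avail)) + num)

def logistic_probability(beta, x_diff):
    """Logistic form evaluated on the covariate differences."""
    e = math.exp(linear_predictor(beta, x_diff))
    return e / (1.0 + e)

beta = [-1.2, 0.5, 0.03]      # hypothetical; beta[0] acts as the intercept
x_used = [1.0, 2.0, 40.0]     # x_10 = 1
x_avail = [0.0, 1.0, 25.0]    # x_00 = 0
x_diff = [u - a for u, a in zip(x_used, x_avail)]

p_dcm = dcm_binary_probability(beta, x_used, x_avail)
p_logit = logistic_probability(beta, x_diff)
```

The two probabilities agree up to floating-point error, because dividing the numerator and denominator of the DCM form by exp(β·x₀) leaves only the differences x₁ − x₀.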
Another justification for using the logistic function rather than other approaches to approximate RSPF is the fact that it is widely used for other statistical analyses in biology; consequently, several computer programs are currently available to estimate these parameters.
The estimated function is then the RSF, which gives the relative probability of use of different types of resource units. Computer software packages that estimate discrete choice model parameters by maximum likelihood include SAS/Proc PHREG and the S-Plus routine COXPH (Manly et al., 2002). When a parametric model for the RS probability is used, the parameters are estimated by maximum likelihood. The quantity

$$
D = -2\log_e(L_p) + 2p \tag{4}
$$

is called the deviance, which can be used as a measure of the agreement of the model; p is the number of unknown parameters in the model to be estimated (Akaike, 1974). If $L_M$ is the maximum likelihood of the adjusted model, and $L_F$ ($\geq L_p$) is the likelihood of the model perfectly fitting the data, then $L_F = L_p$ corresponds to a Null Model (N.M).
Chi-squared tests of deviance may be used to evaluate the evidence regarding the probability of use in the study areas. Under certain distributional conditions, deviance statistics approximately follow a chi-squared distribution with the degrees of freedom (df) defined by the number of observations less the number of parameters estimated. Deviance is analogous to the sum of squares in regression and analysis of variance models.
Design II (Manly et al., 2002) was used on the spotted owl data, in which animals are identified individually and use of the resource units is measured for each individual, but availability of the resource is measured for the whole population. Sampling protocol C was used, in which the resource units used and not used are sampled independently (Manly et al., 2002).
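The quantity in expression (4) and the degrees of freedom described above are simple to compute once the maximized log-likelihood is available. The following sketch uses a hypothetical log-likelihood and parameter count. Only the sample size of 390 echoes the sample used later in this study, and the function names are illustrative.

```python
def deviance(log_lik, n_params):
    """Expression (4): D = -2 * log_e(L_p) + 2 * p."""
    return -2.0 * log_lik + 2.0 * n_params

def residual_df(n_obs, n_params):
    """Degrees of freedom: observations less estimated parameters."""
    return n_obs - n_params

log_lik = -100.0   # hypothetical maximized log-likelihood
p = 3              # hypothetical number of estimated parameters
n = 390            # number of sampled observations

D = deviance(log_lik, p)
df = residual_df(n, p)
```

Comparing `D` with a chi-squared distribution on `df` degrees of freedom would require a statistical library and is therefore not shown here.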
Logistic regression can still be used in this case. However, a special justification is needed depending on the types of samples involved. In the present case, for independent samples of used and available units, a population of available units of size N is assumed, with the i-th unit taking values $x_i = (x_{i1}, \dots, x_{ip})$ for variables $X_1$ to $X_p$ and the relative probability of use of the different resource units corresponding to:

$$
w^*(x_i) = \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \tag{5}
$$

The sampling plan is such that each available unit has a probability $P_a$ of being sampled, and each used unit has a probability $P_u$ of being sampled, with the sample of available units selected first without replacement, so that the units in this sample cannot appear in the sample of used units. In this case the probability of a unit being used and sampled is $(1 - P_a)\,w^*(x_i)\,P_u$, and the probability of a unit being in the sample of used units or in the sample of available units is given by:

$$
\mathrm{Prob}(i\text{-th unit sampled}) = P_a + (1 - P_a)\,w^*(x_i)\,P_u \tag{6}
$$

Consequently, the probability of the i-th unit being in the sample of used units, given that it was sampled, is given by:

$$
\mathrm{Prob}(i\text{-th unit used} \mid \text{sampled}) = \frac{\mathrm{Prob}(\text{used and sampled})}{\mathrm{Prob}(\text{sampled})} = \frac{(1 - P_a)\,w^*(x_i)\,P_u}{P_a + (1 - P_a)\,w^*(x_i)\,P_u} \tag{7}
$$

Given that the RSPF defined in equation (5) assumes a particular exponential form, the probability of expression (7) may also be written as:

$$
\tau(x_i) = \frac{\exp\{\log_e[(1 - P_a)P_u/P_a] + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\}}{1 + \exp\{\log_e[(1 - P_a)P_u/P_a] + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\}} \tag{8}
$$

This corresponds to an expression of logistic regression in which the parameter $\beta_0$ is modified as follows, to allow for the sampling probabilities of available and used units:

$$
\beta_0' = \log_e[(1 - P_a)P_u/P_a] + \beta_0 \tag{9}
$$

Assuming independent observations, $\tau(x_i)$ represents the probability of observing resource unit i as being used, and the probability of observing that same unit as being available is given by $1 - \tau(x_i)$. Let $y_i$ be the indicator of use or non-use of a sampled unit, so that $y_i = 0$ for a sampled unit i belonging to the sample of available units and $y_i = 1$ for a sampled unit i belonging to the sample of used units.
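The probability of observing unit i could then be written as

$$
\tau(x_i)^{y_i}\,[1 - \tau(x_i)]^{1 - y_i} \tag{10}
$$

and the logarithm of the likelihood of observing the complete sample is:

$$
\log_e L = \sum_{i} \left\{ y_i \log_e \tau(x_i) + (1 - y_i) \log_e[1 - \tau(x_i)] \right\} \tag{11}
$$

Computer programs for logistic regression can be used to estimate the coefficients $\beta_0, \beta_1, \dots, \beta_p$ of the log-linear function of expression (5).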
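A minimal sketch of this log-likelihood is given below, assuming the usual Bernoulli form in which a sampled unit contributes log τ(x_i) when it comes from the used sample (y_i = 1) and log[1 − τ(x_i)] when it comes from the available sample (y_i = 0). The data, coefficient values and function names are hypothetical.

```python
import math

def tau(beta0_prime, beta, x):
    """Logistic probability that a sampled unit is in the used sample."""
    eta = beta0_prime + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta0_prime, beta, X, y):
    """Sum of y_i*log(tau_i) + (1 - y_i)*log(1 - tau_i) over sampled units."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        t = tau(beta0_prime, beta, x_i)
        total += y_i * math.log(t) + (1 - y_i) * math.log(1.0 - t)
    return total

# Hypothetical sample: two covariates per unit; y = 1 used, y = 0 available.
X = [[0.5, 1.0], [1.5, 0.0], [0.2, 0.3], [2.0, 1.0]]
y = [1, 0, 0, 1]

ll = log_likelihood(-0.3, [0.4, 0.9], X, y)
```

Standard logistic regression software maximizes exactly this kind of function, which is why it can be reused for used/available samples even though the fitted constant is the modified intercept rather than β₀ itself.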
The fact that the logistic regression constant $\beta_0'$ takes the form of expression (9) means that if the sampling probabilities $P_u$ and $P_a$ are known, then the parameter $\beta_0$ of the RSPF in expression (5) can be estimated by subtracting the quantity

$$
\log_e[(1 - P_a)P_u/P_a] \tag{12}
$$

from the constant estimated in the logistic regression equation. If the sampling fractions are not known, then $\beta_0$ cannot be estimated; however, it is still possible to estimate the RSF,

$$
w^*(x) = \exp(\beta_1 x_1 + \dots + \beta_p x_p) \tag{13}
$$

and use this function to compare resource units. Note that the correct relative probabilities of use are obtained by substituting estimates of $\beta_0, \beta_1, \dots, \beta_p$ in the log-linear function of expressions (5) or (13). The probabilities obtained using computer programs to fit the logistic regression $\tau(x_i)$ in expression (8) are not correct estimates of the resource selection probability function $w^*(x_i)$, or of the resource selection function $w(x_i)$, since the total number of units used by the animals is not assumed to be known.
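The intercept adjustment can be illustrated as follows. The sketch assumes that the quantity to subtract is log_e[(1 − P_a)P_u/P_a], which is what expressions (6) and (7) imply when the RSPF is exponential. The function names and all numerical values are hypothetical.

```python
import math

def sampling_offset(p_used, p_avail):
    """Offset added to beta_0 by the sampling scheme: log[(1-Pa)*Pu/Pa]."""
    return math.log((1.0 - p_avail) * p_used / p_avail)

def rspf_intercept(beta0_prime, p_used, p_avail):
    """Recover beta_0 from the fitted logistic regression constant."""
    return beta0_prime - sampling_offset(p_used, p_avail)

def relative_use(beta, x_a, x_b):
    """Ratio of RSF values for two units; the intercept cancels out."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_a, x_b)))

beta0_prime = -0.3               # hypothetical fitted constant
p_used, p_avail = 0.5, 0.1       # hypothetical sampling probabilities
beta0 = rspf_intercept(beta0_prime, p_used, p_avail)

# Without known sampling fractions, units can still be compared:
ratio = relative_use([0.4, 0.9], [2.0, 1.0], [1.0, 1.0])
```

The ratio does not depend on the intercept, which is why the RSF remains usable for comparing resource units when `p_used` and `p_avail` are unknown.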

Results and Discussion
To compare the logistic regression and the discrete choice model, a random sample of 390 observations was selected from the spotted owl data with one choice. Variable selection followed the Akaike information criterion (AIC). Minitab (1997) and The R Development Core Team (2006) software were used for the logistic regression estimates, and the Fortran programming language (Fortran, 1977) was used to estimate the DCM parameters.
The fit of the binomial distribution with a logit link function for the selected variables can be seen in Figure 1. The "worm" graph in Figure 2 is a general diagnostic tool for residual analysis. The vertical axis represents the differences between the theoretical and empirical distributions. The "worm" graph should be in the form of a cord, indicating a binomial data distribution in the present case, in which consecutive points can be observed (Buuren and Fredricks, 2001). Figure 2 shows that the binomial distribution with a logit link function does not fit very well.
The parameter estimates for the logistic regression and the discrete choice model are shown in Table 2.
The estimates of the two methods differ with respect to the intercept. However, when analyzing data on animal behavior, the intercept of the logistic regression model is somewhat difficult to interpret. Here the discrete choice model with one choice is proposed for the analysis of data from animals. With a discrete choice model for resource selection, the i-th choice is described by the choice set of resource units (habitat or food) that are available to be chosen, and by the values of the variables that characterize all resource units in the choice set (e.g., vegetation type, elevation, etc.).
A comparison of models by the chi-squared test using the logistic regression and discrete choice model parameters is shown in Table 3. This table shows that the deviance of the DCM is 56.76 lower than that of logistic regression (RL), and that the AIC of the DCM is 58.76 lower than that of RL.
Note that logistic regression has one degree of freedom less than DCM since the latter does not have an intercept.
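The two differences reported from Table 3 are consistent with the DCM having one parameter fewer than logistic regression. As a hedged check, taking the deviance as $-2\log_e L$ and the AIC as the deviance plus twice the number of parameters p (an assumption about how the table entries were computed, not something stated in the table itself):

$$
\mathrm{AIC}_{\mathrm{DCM}} - \mathrm{AIC}_{\mathrm{RL}} = (D_{\mathrm{DCM}} - D_{\mathrm{RL}}) + 2\,(p_{\mathrm{DCM}} - p_{\mathrm{RL}}) = -56.76 + 2(-1) = -58.76
$$

Under that assumption, the extra 2 units in the AIC difference come entirely from the intercept that logistic regression estimates and the DCM does not.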
In Figure 3, we can see that the spotted owl visited many places, although it used few of them. The situation is more complex for the independent random [...]
In estimating each of the owl choices, it can be observed that the coefficients estimated for the logistic regression differ from the coefficients for all owl choices. The same occurred with the DCM in the estimates of choice model parameters for all owls. In the graph of the estimates of logistic regression and DCM (Figure 6), the DCM shows the better fit, and Table 3 indicates that the DCM has the better deviance and Akaike information criterion (AIC).

Conclusions
• Resource selection functions estimated by logistic regression successfully identified the resources critical to an animal population and predicted the occurrence of species in different locations.
• An adjusted logistic regression and the discrete choice [...]

Table 2 - Estimates of Logistic Regression and Discrete Choice Model (DCM) parameters for habitat selection of the spotted owl.
1 Estimated standard errors output from the fitting process.
2 The p-values shown are obtained by calculating the ratios of estimates to their standard errors and finding.

Figure 1 - Distribution of residual frequencies and QQ plot for the binomial distribution with logit link function.

Figure 2 - Worm graph of the binomial distribution with logit link function.

Figure 3 - Comparison of uses of the spotted owl (Strix occidentalis) with logistic regression and discrete choice model (DCM).

Figure 4 - a. Height of trees vs Log (Logistic Regression), b. Height of trees vs Log (Discrete Choice Model (DCM)).

Table 3 - Comparison of models by the chi-squared test.