Zero-inflated Poisson regression models for QTL mapping applied to tick-resistance in a Gyr × Holstein F2 population

Now a days, an important and interesting alternative in the control of tick-infestation in cattle is to select resistant animals, and identify the respective quantitative trait loci (QTLs) and DNA markers, for posterior use in breeding programs. The number of ticks/animal is characterized as a discrete-counting trait, which could potentially follow Poisson distribution. However, in the case of an excess of zeros, due to the occurrence of several noninfected animals, zero-inflated Poisson and generalized zero-inflated distribution (GZIP) may provide a better description of the data. Thus, the objective here was to compare through simulation, Poisson and ZIP models (simple and generalized) with classical approaches, for QTL mapping with counting phenotypes under different scenarios, and to apply these approaches to a QTL study of tick resistance in an F2 cattle (Gyr × Holstein) population. It was concluded that, when working with zero-inflated data, it is recommendable to use the generalized and simple ZIP model for analysis. On the other hand, when working with data with zeros, but not zero-inflated, the Poisson model or a data-transformation-approach, such as square-root or Box-Cox transformation, are applicable.


Introduction
In the tropics, cattle can be afflicted by various tick species and the diseases they transmit, possibly leading to significant loss in production systems. Among these, Rhipicephalus (Boophilus) microplus is outstanding. Apart from the reduction in production, infection can even be mortal to more susceptible animals.
Regarding losses caused by Boophilus microplus, Frisch et al. (2000) estimated that an animal with an average of 40 ticks per day could lose weight equivalent to 20 kg/year, whereas Furlong et al. (1996) calculated a reduction of 23% in the daily milk yield, when crossbred Holstein-Zebu cows were infested by 105 ticks, on an average. Thus, the control of infestation is of extreme importance for the dairy and beef industries.
An important and interesting alternative in the control of tick infestation in cattle is to select resistant animals, and identify those specific quantitative trait loci (QTLs) and DNA markers, which could be useful in marker-assisted selection (MAS) or introgressive strategies, as a part of inbreeding programs. Although costly and time consuming, this could be better than control by acaricides, in which misuse has given rise to tick-resistance against pesticides, as well as increased environmental contamination.
In the search for markers of tick-resistance in bovines, Martinez et al. (2006) found associations between certain BoLA class II microsatellite alleles and susceptibility to R. microplus. Furthermore, it was shown that B. indicus cattle are generally more resistant to parasite challenges than B. Taurus. Wambura et al. (1998) found that genetic variability among subspecies can be exploited when mapping populations for QTL analysis, e.g. Taurus x Indicus F2 animals.
The major concern regarding QTL mapping for tickresistance refers to its statistical analysis. As opposed to most QTL studies, which consider continuous phenotypes and normal assumptions, the number of ticks/animal is characterized as a discrete trait, more specifically as a counting variable, which could potentially follow Poisson distribution. However, if there is an excess of zeros in em-pirical phenotype distribution, for example due to several noninfected animals, zero-inflated Poisson distribution (ZIP) (Lambert, 1992) may provide a better description of the data.
Zero inflation of ticks on any specific animal may be due to either animal resistance, or simply their absence. Consequently, there are two different kinds of zeros, whose probability of occurrence must be modeled separately in QTL analysis. Although Poisson and ZIP models are potentially more appropriate statistical alternatives for QTL detection of the number of ticks per animal, it is, as yet, not clear whether these strategies are better than certain classical alternatives in data transformation.
In agreement with Cui et al. (2006), Yang (2009) andErhardt et al. (2010), although the problem of zero excess can be solved by ZIP modeling, when the assumption that trait variance is equal to its mean is not proved, i.e. in the presence of over/under dispersion, the ZIP model can be further improved by the addition of a new parameter, thereby characterizing the Generalized ZIP (GZIP) model.
In view of the above mentioned issues, the aim was to compare, by way of simulation studies, Poisson and ZIP models (simple and generalized) with classical approaches, as applied to QTL mapping, with counting phenotypes under different scenarios, and to apply these approaches to a QTL study of tick-resistance in F2 cattle (Gyr x Holstein) population.

Materials and Methods
The QTL mapping methodologies discussed here, besides being based on the regression approach described by Haley et al. (1994), were further extended to the context of generalized linear models (GLM) for count data (McCullagh and Nelder, 1989), more specifically to Poisson, Zero-Inflated Poisson (ZIP) and Generalized ZIP (GZIP) regression models. In the next few sections, the implementation of QTL regression interval mapping for approaches presenting normally distributed traits is presented, together with data transformation strategies that will be compared to the GLM methods. In addition, the simulation study and motivating data-set of tick counting in Gyr x Holstein cows are described.

QTL mapping approaches
The classical regression approach for interval mapping in outbred F 2 populations (Haley et al., 1994) can be described as: where y i is the observed trait (phenotype) relative to the ith F 2 individual, m a general mean, q i the additive QTL coefficient for the ith individual, a the additive effect of putative QTL, and e i a residual term, assumed as normally distrib-uted with mean 0 and variance s 2 , i.e., e i~N (0, s 2 ). The coefficient q i is given in terms of the probability of line-origin combination, conditional on the marker's genotypes (Haley et al., 1994), whose values were obtained by QXPAK software (Pérez-Enciso and Misztal, 2004). The likelihood ratio test (LRT) is usually employed (Baret et al., 1998) for evaluating the significance of the additive QTL effect. Here, for every 1 cM on the chromosome, maximum likelihood (ML) estimates were obtained using the simplex method proposed by Nelder and Mead (1965), whereby LRT was applied to compare the full model (1) against a reduced model with no QTL.
Given the normality assumption in model (1), it is evidently not optimal for analyzing counting data, such as the number of ticks per cow. Tilquin et al. (2001) proposed that appropriate data transformation should be used. Their results showed that, on analyzing nontransformed data, Gaussian-based methods can lose around 50% of QTL detection potential.
A flexible methodology for data transformation, when the phenotypic data is not normally distributed, as proposed by Box and Cox (1964), is given by: where z i is the transformed variable, y i the original variable, and l the transformation factor, which can be inferred using, for example, maximum likelihood techniques. Box-Cox transformation has already been successful applied to QTL mapping with continuous data traits and skewed distribution (Yang et al., 2006). In addition to the Box-Cox transformation, which includes searching for the value of l which best approximates the data to a normal distribution, here we used also the classical square root transformation, this being traditionally considered as a suitable transformation when the data follow a Poisson distribution. The square root is equivalent to a specific type of the Box-Cox transformation, where l = 1/2, given by: However, according to Yamamura (1999), the Box-Cox transformation has serious limitations, when the response-variable contains zeros. In such cases, the addition of a small constant 'c' to each observation is generally recommended, to so assure the most nearly constant variance possible. Specifically for Poisson data, the value c = 3/8 is optimal, in the sense that variance converges more rapidly as the mean increases (Ascombe, 1948). Thus, this constant was adopted for both transformation procedures presented by Eqs. (2) and (3).
Here, the analysis using data transformations were performed using TRANSREG procedure of SAS ® 9.1.3. This procedure was chosen especially for the Box-Cox transformation, as it allows simultaneous estimation of l and model fitting.
An alternative to data transformation is direct modeling of the data in its original scale using a more general model. Under this approach, we considered a Poisson regression as an extension of model (1). In this case, it is assumed that each observation y i is drawn from a Poisson distribution, whose probability function is given by: where m i is the mean of this distribution, i.e., E(Y) = m i , which, according to GLM theory (McCullagh and Nelder, 1989), can be related to model (1) by a linear predictor given, in this case, by h i = m + aq i . The relationship between the linear predictor (h i ) and the mean (m i ) is provided by a link function. In the case of a Poisson distribution, the link function is logarithmic, i.e., h i = log(m i ). Thus, to enable use of the Poisson means with the linear predict of model (1) it is necessary to obtain the inverse of the link function; i.e., m i = exp(h i ) = exp(m + aq i ).
As mentioned earlier, tick-counting data can present an excess of zeros, since a cow may have no ticks, either through being resistant (so called "true zero"), or, although susceptible, for any other reason, e.g. by chance or because no tick has touched her). To accommodate such a scenario, a Zero-Inflated Poisson (ZIP) model was also considered and compared to the previously desired modeling approach. The ZIP model can be formulated as a mixture of probability p of 'true zero' (tick-resistance) and a probability distribution of the number of ticks (which would also include zero), according to a Poisson process (Lambert, 1992). In order to define the ZIP probability density function, w i = 1 if the cow is resistant and w i = 0 if the cow is susceptible. Thus, considering w i as a latent variable, P(w i = 1) = p i and P(w i = 0) = 1 -p i , whereby: where 0 = p i = 1 and m i is the expected number of ticks on cow i, given this cow is susceptible.
In agreement with Cui et al. (2006), Cui and Yang (2009) and Erhardt et al. (2010), the ZIP model can be improved by the addition of a new parameter (f), to so describe the occurrence of under (if 0 < f < 1) or over (if f > 1) dispersion. This model is denominated Generalized ZIP (GZIP), which is given by: Poisson regression was implemented using the GENMOD procedure of SAS, using the log link-function. The ZIP model was fitted using the NLMIXED procedure, according to syntax described by Liu and Cela (2008), and the GZIP using countreg SAS Macro (Chow and Steenhard, 2009) with the statement method = 2 (Generalized Poisson) and zindep = (Name of the independent variables used in the zero-inflated model). All the SAS codes used are available, direct from the authors.

Monte Carlo study
Monte Carlo simulations were applied to investigating statistical behavior of the proposed methods. Simulations consisted of a single 45 cM long chromosome based on real data, with 5 non-evenly spaced markers (0, 14.8, 29.9, 37.4, 43.3 cM), for an F2 population with a sample-size equal to 263, mean (m) equal to 2, and QTL additive effect given by a = 0.05, 0.1, 0.2. Although the putative QTL that affects the phenotype of interest is located at 20 cM from the first marker on the linkage group, a data set with no QTL was also simulated to account for false positive and negative rates.
Simulation settings were chosen to mimic real data sets, while real pedigree structure was used to calculate the probability of line-origin combination, conditional on marker genotypes. Phenotypic data from Poisson and Zero-Inflated Poisson were generated, with two different "true-zeros" probability (0.2, 0.5), using SAS ® IML.
In each simulation scenario, 1000 Monte Carlo repetitions are performed, and five different models of analysis were tested for each Monte Carlo sample. The original count data and two different transformation approaches (Box-Cox and Square-root) were analyzed with the Gaussian model (1), whereas for the GLM approach, the Poisson (4), Zero-Inflated Poisson (5) and Generalized Zero-Inflated Poisson (6) were considered.
The following measures were used for evaluating the performance of each model: variance (VAR) of the additive effect (a) calculated from 1000 repetitions; bias of a estimates, calculated as the sum of the difference between estimated and true values divided by 1000; the mean of absolute distances (MAD), defined as the absolute values of the difference between true and estimated QTL position (the position with the highest log-likelihood); false positive rates (FPR), given by the percentage that a QTL was considered significant (p < 0.05) in any position, when the simulation did not include the additive QTL effect; and false negative rate (FNR), defined as the percentage ofa QTL not presenting a significant (p < 0.05) effect, when considered in the simulation.

Motivating data set
A population was developed by EMBRAPA -Dairy Cattle Research Center, from four Holstein sires x 28 Gyr dams to obtain F1 animals, among which five sires and 68 dams were intercrossed to form an F2 population. This population was challenged for tick infestation with no tick control until 10-14 months of age.
These infestations were carried out by placing tick larvae on the dorsal-lumbar region of the animals. Counting took place on the morning of the 21 st day after infestation, before detachment of mature female ticks. Counting was restricted to three different regions on one side of the animal. Details regarding population and experiment design can be found in Gasparin et al. (2007).

Monte Carlo study
FNRs and FPRs obtained from the simulation study are respectively presented in Tables 1 and 2.
In the case of the probability of zeros being equal to zero (Poisson distribution), the percentage of false negative rate (FNR) would be around 74% and 25%, if the QTL effect were 0.05 and 0.1, respectively, and almost none when 0.2 (Table 1). The Box-Cox transformation performed better with a slightly smaller FNR than the other approaches, although interestingly, the ZIP and GZIP models presented the best results. Thus, it seems that inclusion of the extra parameters p and f, respectively relative to perfect zero occurrence probability and under/over dispersion, bears a small penalty, if the data has Poisson distribution with no inflation, and in the absence of under/over dispersion, especially in situations with reduced QTL effect.
With moderate zero inflation, and only slight QTL effect, the ZIP and GZIP models presented slightly higher FNRs than the Poisson, and almost twice as high as the Gaussian approach, with or without data transformation. However, with greater QTL effects (0.1 and 0.2), the ZIP and GZIP models performed much better than all the other approaches.
Finally, with extreme zero inflation (P(0) = 0.5), the Poisson model performed the best in situations with small to moderate QTL effects, and the ZIP second best. Nonetheless, with greater QTL effects, the ZIP and GZIP models surpassed all the other methods.
In terms of FPR (Table 2), since defined by any significant statistical test for QTL effects, without any addi-tion, all the models presented roughly the same rates in the absence of zero inflation. However, even with moderate zero inflation, the Poisson model presented almost three times FPR than the remainder, thus indicating it to be the less conservative. Interestingly, either with or without data transformation, the Gaussian model presented very similar results to the ZIP and GZIP.
In general, the Gaussian models had higher FNRs and were even in the FPRs. The Poisson model showed the same FPRs when compared to the ZIP and GZIP models and performed a little better than this one in terms of FNRs, when the QTL effects were small. However, when the QTL effect was moderate to high, the ZIP and GZIP models overall FNRs and FPRs were smaller than with the other approaches.
Finally, when working with count-data, a previous descriptive-data analysis is always advisable. In the case of zero inflation, the ZIP and GZIP models are indicated, the latter being an alternative, when solving the problem of under/over dispersion as called for. With zero inflation, either the Poisson model or a transformation approach is the most indicated. The advantage of the latter is that one can use any available software. The disadvantage is that when using data transformation, interpretation of the results is not straightforward. 578 Silva et al. The well-known results of SQRT transformation working fine with Poisson distribution data, are here confirmed. However, in the event of not wishing to rely on a Poisson assumption, a more general data transformation (Box-Cox, for example) is advised. (Table 3) indicate that, notwithstanding the scenario, the Gaussian model without transformation presented the worst results for VAR and bias, with transformations, more precisely by Box-Cox and SQRT, assuming a better stance, whereas with assumed normal SQRT transformation, the simplest and most used for count data, performance was the best. As expected, the superiority of GLM approaches was eminently clear, especially of ZIP and GZIP models, whose performances, although practically the same as Poisson when P(0) = 0, were far superior when P(0) = 0.2 and P(0) = 0.5.

VAR, bias and MAD values
As to MAD with a small additive QTL effect (a = 0.05), values were very similar for all the models. However, with an increase in QTL effect, ZIP and GZIP performances improved substantially compared to the other models. Generally speaking, all presented good values for MAD when a = 0.2, explainable by the higher the QTL effect the higher the probability for detecting this QTL in the right position.
In summary, in view of the results (Tables 1, 2 and 3), it is possible to conclude that when there is moderate or extreme inflation of zeros, the best option is to adopt either the ZIP or GZIP models, since both are appropriate when reckoning with the simultaneous occurrence of true zeros and zeros from Poisson distribution. Furthermore, with both, final results were very similar in all scenarios, since in the simulation studies under/over dispersion was disregarded, i.e. f = 1. Thus, as could be expected, GZIP can be characterized as a ZIP model. In any case, SQRT transformation Regression models for QTL mapping 579    Box-Cox (c, d), SQRT (e, f), Poisson (g, h), ZIP (i, j) and GZIP (k,l) models. The letters inside brackets denote first and second tick-count phenotypes, respectively. The line above the graph shows the 5% threshold.
proved to be an interesting and easy alternative to QTL detection in the presence of count data.

Motivating data set
QTL profile-plots from five different models applied to first and second tick-count phenotypes in Holstein x Gyr cows, appear in Figure 1.
As there were less zeros in the data used for the first count than for the second, this could be used as a parameter for checking how the different models perform in the presence of zero-counts and real zero-inflated counts. From the set of graphs related to the first count, it can be inferred that only simple (Figure 1 i) and generalized (Figure 1 k) ZIP models had come close to the 0.5% threshold, and that this was the only model that presented a "qtl" like curve. For the second set of graphs, except for that of Box-Cox transformation, all the curves looked the same. As there apparently was a QTL around 21 cM, the ZIP models ( Figure 1 j and l) showed that the curve had reached close to the 0.5% threshold. Through being compatible with those obtained with the simulated data set, the results revealed the superior performance of ZIP models when working with counting data in the high presence of zeros. Furthermore, the ZIP and GZIP models presented similar results in both tick counts, since in the GZIP fit, the estimated f parameter came close to one, thereby indicating the absence of under/over dispersion.
In summary, we conclude that when working with nonormal distribution data, such as count data, prior descriptive analysis is always advisable. If working with zero-inflated data, it is recommended to use both simple and generalized ZIP models for data analysis. If working with data with zeros, but not zero inflated, either the Poisson model or a data transformation approach, such as square-root transformation, can be employed. Square-root transformation can be used for Poisson distribution data, although in the case of uncertainty as to this being the case, a more general transformation, as Box-Cox, should be used instead.