Detection of determinant genes and diagnostic via Item Response Theory

This work presents a method to analyze characteristics of a set of genes that can have an influence in a certain anomaly, such as a particular type of cancer. A measure is proposed with the objective of diagnosing individuals regarding the anomaly under study and some characteristics of the genes are analyzed. Maximum likelihood equations for general and particular cases are presented.


Introduction
In many practical situations, decisions have to be taken based upon individual quantities that cannot be observed directly. These quantities are referred to by latent variables that are given different names according to the areas in which they are applied: ability or proficiency in educational and psychological areas; purchasing power in marketing; life quality or predisposition to a certain disease in the biological and medical areas (see Andrade et al., (2000), Paas (1998), for example). These types of analysis are, in general, based upon the responses of a set of variables often referred to as items that comprise the measuring tool. In educational evaluations, for example, items are represented by questions in a test that might have their answers categorized as right/wrong, A/B/C/D/E with only one correct alternative or in a way where A is the least correct, and E is the most correct alternative. Other extensions are available, such as for each item a weight like 1 (right) or 0 (wrong) is attached. These types of study were, for sometime, based upon scores for each individual, that is, upon the number of items with weight one. However, this type of approach has many drawbacks mainly because it does not make a difference among the items which lead to the development of a theory based upon the items themselves and not upon the overall results, named Item Response Theory. In such a theory each item has a set of well defined characteristics that are estimated. The estimation procedure of the latent variable of an individual takes into account each one of the items of the test and reveals, for example, the level of knowledge of that individual in a certain area or his purchasing power as related to a certain product.
Some times there is more than one population being studied. For instance, in the educational area the interest can be the estimation of the average proficiencies regarding sex or geographical location.
In a similar situation, a set of genes is studied in order to appraise the predisposition of an individual related to a certain illness. A set of items (genes) are taken into account and their answers can be activated or deactivated or in the categorized form as A/B/C/D/E representing different levels of activity of the genes. Genes have peculiar characteristics that need to be incorporated into a model so that they can be evaluated. Suggestions have been advanced on the way to pinpoint genetic influences (Vanyukov and Tarder, 2000), but with some shortcomings. For example, the conclusions reached depend upon the sample chosen.

Models for Response Functions
The Item Response Theory is based upon models that represent the probability of response to an item as function of the parameters of the item and of the individual predisposition. These functions are treated as Item Response Functions (IRF) or Item Charasteristic Curve (ICC). The different models proposed in the literature depend basically upon the type of item.
For explanatory reasons we will consider that there are K populations in study and each of them has the same n genes being analyzed. The sample related to the population k is composed by N k individuals, k = 1,..., K. Following, the model used in this paper is the unidimensional logistic model of 4 parameters for each item of two categories (of the type activated/deactivated). Its expression is given by .., n, j = 1,..., N k and k = 1, 2,..., K, where U ijk is a dichotomous variable that takes on the values 1,when the individual j of the population k has gene i activated, or 0 when the gene is deactivated.
θ jk represents a predisposition of the jth individual of the population k. b i is the inactivity (or position) parameter of the gene i, measured at the same scale of the predisposition a i is the discrimination (or of inclination) parameter of the gene i c i is the probability of gene i being active for individuals with low predisposition, γ i is the probability of gene i being deactivated for individuals with high predisposition, D is a scale factor, constant and equal to 1. The 1.7 value is used when it is desired that the logistic function yield results similar to that of the normal function.
N is the number of individuals involved in the study.

Defining the Parameters of the Genes
In a general way, the proposed model is based upon the fact that predisposed individuals are more likely to have the gene i activated, and that this relation is not linear. As a matter of fact, it can be perceived from Figure 1 that the IRF has the form of "S" with inclination and displacement defined by the gene parameters. However, only a subset of genes has to satisfy this situation that occurs only when a i > 0. Chances are that some genes are deactivated in high propensity individuals, and therefore the IRF curve should have an inverted form, expressing that individuals with high propensity are less likely to get the gene activated, and this is expressed by a i < 0. When a i = 0, we have that P U c , constant for all θ, indicating that the gene i does not interfere in the occurrence of the anomaly. Parameter b i is, perhaps, the most important of the four. The greater this parameter is, less likely it is that a given individual has the gene i activated. This is a valid conclusion only for a i > 0, and the opposite is true for a i < 0.
It is safe to say that individuals with low predisposition are prone to have the gene i active, and this information is conveyed by the parameter c i . On the other hand, high predisposition individuals can also have the gene i inactive, and this information is conveyed by 1γ i . These conclusions are valid only for a i > 0, and the opposite is valid for a i < 0.

Scale of Measurement/Indetermination
Predisposition can theoretically take any real value between -∞ e +∞. Thus, it is necessary to establish an origin and a unit of measurement for defining the scale. When only one population is under study the scale of measurement can be defined in such a way as to represent the mean value and the standard deviation of the individual predispositions of the population under study. For the graphs shown earlier the scale used had a mean of 0 and a standard deviation of 1, that will be referred from now on as scale (0,1). In practice, it does not make any difference to set these or any other values. What is paramount are the existent order relations between scale points. For example, in the scale used above an individual with a predisposition of 1.2 in fact is 1.2 standard deviations above the predisposition mean. This same individual would have a predisposition of 92, and therefore would also be 1,2 standard deviations above the predisposition mean, if the scale used for this population would have been the scale (80,10).
When various populations are present, one of them can be adopted as a Reference Population, and only the scale for this population will have to be refereed. The obtained predisposition values for other populations will have to be directly compared with those of the Reference Population. One such example consists of taking healthy individuals in the Reference Population and the population with a certain anomaly as the other. Other populations can be taken into account.

Local Independence
An often used hypothesis in IRT is the local independence (or conditional independence). It states that the probability that a certain gene is active depends only on its predisposition; that is, it offers all the necessary information to determine an activation/deactivation of the gene. In 680 Tavares et al. Figure 1 -Example of a graphic of an IRF this fashion it does not mean that the quantities U kji e U kjl , i ≠ l, are independent, but given the individual predisposition θ jk they will be considered conditionally independent. However, there are models for the case when conditional independence is not met, but we have to model this possible dependence.

Parameter Estimation of the Genes and Predispositions
One of the most important stages of the IRT is the parameter estimation of the genes and/or of the individual predispositions. In some cases we can consider that the parameters of the genes are already known and what is wanted is to estimate the predispositions; in other, less common, predispositions of the individuals are known and what is wanted is the estimation of the parameter of the genes. However, the most common cases are those in which not only the parameters of the genes are to be estimated but also the individual predisposition simultaneously. In all these cases, the proposed model is assumed as true, and from the set of responses obtained for a certain number of individuals from one or more populations, parameters and/or predispositions are estimated using either likelihood or Bayesian methods. Both methods require iterative procedures involving very complex calculations and, therefore, specific computer codes. It is important to point out that, in any of these cases, the predisposition values and those of the gene parameters will all be in the same scale of measurement and therefore they can be compared.
Before outlining some points about the estimation process, some arrangements are in order. The set of genes involved in the analysis will be ordered in a fashion such that they will be represented by ζ = (ζ 1 ,..., ζ n ). Let U kj. = (U kj1 , U kj2 ,..., U kjn ) be a random vector of answers from individual j from group k; U k.. = (U k1. , U k2. ,..., U kn. ) the random vector of answers from group k and U ... = (U 1.. , U 2.. ,..., U k.. ) the whole vector of answers. In a similar fashion, observed answers will be represented by u kji , u kj. , u k.. and u ... . This notation and local independence allow us to write the probability associated with the vector of answers U kj as P u P u Generally, it is considered that the predispositions of the individuals of population k, θ jk , j = 1,..., N k , are accomplishments of a random variable θ k , with continuous distribution and probability density function g k ( ) θ η , twice differentiable, with the components of η k finite. In the case where θ k has a Normal distribution, we have η µ σ where µ k is the mean and σ k 2 the variance of the predispositions of the individuals of the population k, k = 1,..., K. This hypothesis carries a great advantage: only the parameters of the genes have to be estimated, as the likelihood will not depend on the individuals' predispositions. Therefore, the estimation is a two-stage process, where in the first only the parameters of the genes are estimated, after which these parameters are considered as known for the estimation of the predispositions.

Estimation of Gene Parameters
With the above defined notations we have determined that the marginal probability of U kj is given by where in the last inequality we use that the distribution of U kj. is not a function of parameters η k . Utilizing the independence between answers of different individuals, we can see that the associated probabilities to the vector of answers U ... as ... .
Even though the likelihood can be written as (2), the approach has often been used of Response Patterns. As we have n genes, with two possible answers for each item (0 or 1), there are S = 2 n possible response vectors (response patterns). Let r kj be the number of distinct occurrences of the answer pattern j in group k, and yet S k ≤ min(N k , S) the number of response patterns with r kj > 0. It follows that and P i represents the IRF adopted. The specific equations for each parameter of the vector ζ i = (a i , b i , c i , γ i )' can thus be obtained from above.

Application to the 4-parameter Logistic Model
For convenience, let In sum, the estimation equations for a i , b i , c i and γ i are, respectively,

Estimation of the Population Parameters
Considering the log-likelihood obtained in (3), the estimation equations for the mean predispositions and population variances are obtained by

Estimation of the Predispositions
Once the parameters of the genes are set, individual predispositions can be estimated. In addition, such predispositions can also be estimated for individuals whose data were not considered in the item parameters estimation. The usual methods for estimating the predispositions are the maximum likelihood (ML) as well as Bayesian methods such as maximum a posteriori (MAP) and the expected a posteriori (EAP).

Estimation by ML
In this case, the estimation of the predispositions is done iteratively by the Newton-Raphson algorithm maximizing the likelihood in (2), or of the equivalent form, the function The Maximum Likelihood Estimator (MLE) of θ kj is that which maximizes the likelihood, or equivalently, is the solution of the equation Note, from (5) where the last equality follows by (4), and when plugged in the respective quantities. As It follows then that the estimation Eq. (5) for θ kj , Again, this equation does not have an explicit solution for θ kj and, for this reason it is necessary for some iterative method in order to obtain the desired estimation. Following, the necessary expressions are obtained for applications of the Newton-Raphson iterative processes.
Considering $ ( ) θ kj t an estimation of θ kj in iteration t, then, in iteration t + 1 we have and ( )

Estimation by MAP
Such as in the marginal likelihood estimation, the Bayesian estimation of the predispositions is done on the second stage, considering the fixed parameters of the genes. Through the hypothesis of independence between the predispositions of different individuals, estimations can be done separately for each individual.
Let us assume that the prior distribution for θ kj , j = 1,..., N k , is Normal with known vector η µ σ k k k = ( , ) 2 of parameters. The posterior distribution for the ability of the individual j of the population k can be written as Some characteristic of g kj kj * ( ) θ can be adopted as estimator of θ kj , where the most frequently adopted are the mean and the mode. Following, we deal with how to obtain each of these characteristics.

Estimation of the mode of the posterior distribution -MAP
The Bayesian modal estimation consists in obtaining the maximum of (9). For easiness, we work with the logarithm of the posteriori log ( ) log ( , ) log ( ), As we have adopted the prior distribution Normal (µ k , σ k 2 ) for θ kj , the second portion of (10) is  (11) and (12) where h kji and H kji are given by (7) and (8), respectively.

Estimation by EAP
The Bayes expected a posteriori (EAP) consists in obtaining the expectation of the posterior distribution, that can be written as

Simulation Results
In this section we present one application of the proposed methodology in simulated data. The data were generated based on N = 5000 individuals and to n = 5 genes. The total simulation consisted of 1000 replications. The known gene parameters are presented below. All the calculations were done via a computer program developed by the authors using the computer language Ox (see Doornik, 1998) using the BFGS routine for maximization.

The Genes Parameters
In order to generate the data it was assumed that the genes parameters are those presented in Table 1. It was adopted the 4 parameter logistic model with D = 1.7. The values for parameter a (discrimination) varied from 0.8 (low discrimination) to 1.2 (high discrimination) and the values for parameters b (predisposition) varied from -0.5 to 3.0. For the parameters c it was considered only one value (0.20) and for the γ, only 0.9. It was adopted the 4 parameter logistic model with D = 1.7.
From Table 2 we see that the average estimates obtained from 1000 replicates are very accurate for all genes. We see that the estimations procedure works very well, still when we have a relatively small number of genes. Results were obtained with a larger number of genes and the results were still very good. However, we hope that estimation problems just appear when the number of genes is too small.
The Table 3 presents the standard deviations obtained from 1000 estimates. The largest values are associated with the parameters a and b. With exception of the gene 2, the values associated with parameter a are larger than those associated to b.

Concluding remarks
We have introduced a new proposal for genes and person diagnostic via Item Response Models. From a simulation study, it was shown that the models provide good estimates for several genes configurations. However, other studies and models should be proposed to allow, for example, different levels of activities of the genes. Longitudinal models, following the lines of Tavares and Andrade (2004) and Andrade and Tavares (2004), should also be considered. 684 Tavares et al.