Effects of Statistical Models and Items Difficulties on Making Trait-Level Inferences: A Simulation Study

Researchers who need to estimate the locations of individuals on continuous latent variables can rely on several statistical models described in the literature. However, weighing the costs and benefits of using one specific model over the alternatives depends on empirical information that is not always clearly available. Therefore, the aim of this simulation study was to compare the performance of seven popular statistical models in providing adequate latent trait estimates when item difficulties are targeted at the sample mean or at the tails of the latent trait distribution. Results suggested an overall tendency of models to provide more accurate estimates of true latent scores when items are targeted at the sample mean of the latent trait distribution. The Rating Scale Model, the Graded Response Model, and Weighted Least Squares Mean- and Variance-adjusted Confirmatory Factor Analysis yielded the most reliable latent trait estimates, even when applied to items that were inadequate for the sample distribution of the latent variable. These findings have important implications for some popular methodological practices in Psychology and related areas.

"Latent variable" refers to a random variable with no sample realizations immediately available for at least some cases of a database (Bollen, 2002). Latent variables mathematically represent real influences underlying observed behavior, and play an important role in investigating whether scores on a set of indicators afford inferences about underlying psychological phenomena (Borsboom, 2008). In Psychometrics, a latent trait estimate (i.e., a latent score) indicates the most likely location of an individual on a psychological dimension, given the observed pattern of responses to a set of valid items or tasks (Grice, 2001). Nevertheless, estimating the true latent score of an individual depends not only on using valid indicators, but also on a statistical model that establishes a link function between these indicators and the latent variable in question.
Many statistical models address the problem of latent variable assessment in Psychology and related areas. Below, we discuss two general statistical frameworks that encompass several commonly used psychometric approaches, namely Classical Test Theory (CTT) and Latent Variable Models (LVM). We then briefly describe a third type of methodological approach, known as Principal Components (PC).
The general formulation of Classical Test Theory (CTT) is:
$X_{ij} = T_{ij} + \varepsilon_{ij}$ (1)
Stated otherwise, the observed score X of an individual i on a test (or isolated item) j equals his or her true score T plus a random error ε. In this case, the true score T means E[X] = T, i.e., the expected raw score X for that individual, considering a hypothetical situation of infinitely repeated independent measures (Bollen, 2002; Lord & Novick, 1968). Strictly speaking, there are no latent variables in the general model of CTT, so the model consists only of a thought experiment involving repeated observable operations (Borsboom, 2005). However, it is common practice among researchers in Psychology and related areas to treat raw scores (even when resulting from a single test administration) as if they were some sort of estimate of a latent variable. Indeed, some researchers have attempted to relate, conceptually, the observed score X to a latent variable θ (Bechger, Maris, Verstralen, & Béguin, 2003).
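The thought experiment behind the true score can be made concrete with a minimal Python sketch. It simulates many hypothetical independent administrations for one person and checks that the average observed score approaches the true score. The true score, the error standard deviation, the normal error distribution and the number of replications are illustrative assumptions, not values taken from this study.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_replications, seed=0):
    """Draw X = T + e for hypothetical repeated, independent administrations."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_replications)]


# Hypothetical values chosen only for illustration.
TRUE_SCORE = 30.0
scores = simulate_observed_scores(TRUE_SCORE, error_sd=4.0, n_replications=100_000)

# The mean over many replications approximates E[X] = T.
mean_observed = statistics.fmean(scores)
print(round(mean_observed, 2))
```

A single administration, by contrast, yields only one draw of X, which is why treating a raw score as an estimate of a latent variable is an extra interpretive step rather than part of the CTT model itself.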
In contrast with CTT, Latent Variable Models (LVM) allow for empirically testing the hypothesis that the population distribution of raw scores on psychometric instruments depends on the population distribution of "unobserved" variables. A central characteristic common to all LVM is that observed data are assumed to be a function of a unidimensional or multidimensional latent structure (Borsboom, 2008). Although the link function in the syntactic formulation of the models may be linear, logistic, probit or of another type (Bollen, 1989), several LVM can be described by a simple linear combination of explanatory parameters (Mellenbergh, 1994). Despite some exceptions (see Mellenbergh, 1994), this holds true for most of the usual models in Psychometrics, such as Factor Analysis (FA) and Item Response Theory (IRT).
The FA models are widely used in Psychology (Ten Holt, Van Duijn, & Boomsma, 2010). Briefly, the FA models consist of:
$X_{ij} = v_j + a_{j1}F_{i1} + a_{j2}F_{i2} + \dots + a_{jm}F_{im} + \varepsilon_j$ (2)
That is, the observed score $X_{ij}$ of individual i on item j is a function of the combination of m factor loadings $a_j$ and factor scores F plus a random error $\varepsilon_j$, with the assumption that $\varepsilon_j$ is normally distributed (Gorsuch, 1983). The $v_j$ parameter represents an intercept, in general set to 0 for identification purposes. The FA of categorical data such as Likert scales (for instance, Weighted Least Squares Mean- and Variance-adjusted Confirmatory Factor Analysis [WLSMV]; Muthén & Muthén, 1999) adds a threshold structure to the model, parameterizing the difficulty of endorsing each category (Ferrando & Lorenzo-Seva, 2005; Takane & de Leeuw, 1987). So, generally stated, in FA an item score is explained by an item intercept (and, in some cases, category thresholds), the saturation or loading of this item on m factors, and the location of the individual on these m factors, besides random error.
While raw scores are modeled as the dependent variables in FA models, in the context of IRT the dependent variable is the conditional probability of observing a specific score on an item, given individual and item parameters. A general IRT model can be defined as:
$P(U_j = u \mid \boldsymbol{\theta}, \boldsymbol{\gamma}) = f(\boldsymbol{\theta}, \boldsymbol{\gamma})$ (3)
That is, the conditional probability of observing a score u on item j is a function f of a vector θ with one or more parameters describing the location of individuals on one or more continuous latent variables, and a vector γ containing one or more item parameters (for a more complete introduction, see Reckase, 2009). Among unidimensional IRT models suitable for polytomous items, the link function f is, in general, a logistic regression (ψ), and the item parameters are $a_j$ (discrimination) and $b_j$ (difficulty), so that:
$P(U_j = u \mid \theta_i, a_j, b_j) = \psi[a_j(\theta_i - b_j)]$ (4)
In fact, IRT models are equivalent to the FA of categorical data (Takane & de Leeuw, 1987). A minor difference between IRT and FA of categorical data derives from equation (4), which implies a logistic parameterization, whereas categorical FA (WLSMV models, for instance) implements a probit parameterization for the discrimination (factor loadings) and difficulty (thresholds) parameters (Wirth & Edwards, 2007). Three commonly used unidimensional models for estimating latent scores with polytomous items are the Graded Response Model (GRM; Samejima, 1969), the Partial Credit Model (PCM; Masters, 1982) and the Rating Scale Model (RSM; Andrich, 1978). Whereas PCM and RSM estimate the person $\theta_i$ and item $b_j$ parameters (constraining $a_j = 1$), GRM estimates the parameters $a_j$, $b_j$ and $\theta_i$ (for a more detailed explanation of the differences between models, see Embretson & Reise, 2000; and Wright & Masters, 1982).
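As an illustration of how a logistic link turns person and item parameters into response probabilities, the sketch below computes category probabilities for one five-category item under the usual cumulative formulation of the GRM, in which the probability of responding in category k or higher is a logistic function of a(θ − b_k), and each category probability is the difference between adjacent cumulative probabilities. The function names and all parameter values are hypothetical and are not taken from this study.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one graded-response item.

    thresholds: ordered b_1 < ... < b_{K-1}; categories are scored 0..K-1.
    """
    # P(X >= k) for k = 1..K-1, bracketed by P(X >= 0) = 1 and P(X >= K) = 0.
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    # P(X = k) is the difference between adjacent cumulative probabilities.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical five-category item with thresholds centered on theta = 0.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Shifting all thresholds far above or below θ concentrates the probability mass in the extreme categories, which is the situation created when item difficulties are targeted at the tails of the latent distribution.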
Finally, another commonly used model, unrelated to both CTT and LVM, is the Principal Components model (PC; Hotelling, 1933), which consists of:
$Z_i = w_1 X_{i1} + w_2 X_{i2} + \dots + w_n X_{in}$ (5)
Stated another way, the principal component score $Z_i$ for individual i consists of a linear combination of n indicators X and their respective weights w on the component. A principal component equals a weighted sum of a set of ordinal or continuous indicators, that is, an index useful for summarizing data. Factors and components are not necessarily equivalent, as components are constituted by the common variance between the indicators (which constitutes a factor), but also by their specific variance plus error variance (Gorsuch, 1997). In fact, the PC model is more appropriate for investigating formative constructs (e.g., Vyas & Kumaranayake, 2006) than psychological phenomena underlying data. Thus, like CTT, it is a model that, formally, does not include latent variables as an explanation for the data, despite often being used in this sense.
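The weighted-sum idea can be shown in a few lines of Python. The responses and weights below are made-up numbers used only to contrast a weighted component score with the unit-weighted raw sum used in CTT; in practice the weights come from an eigendecomposition of the indicator correlation matrix and the indicators are usually standardized first, which this sketch does not do.

```python
def component_score(responses, weights):
    """Weighted sum of indicators: Z_i = sum_j w_j * X_ij."""
    return sum(w * x for w, x in zip(weights, responses))


# Hypothetical responses of one person to four Likert items.
responses = [4, 2, 5, 3]

# Hypothetical component weights (not estimated from any data).
weights = [0.30, 0.20, 0.35, 0.15]

z_score = component_score(responses, weights)
# A CTT raw score is the special case in which every weight equals 1.
raw_score = component_score(responses, [1, 1, 1, 1])
print(z_score, raw_score)
```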
A theoretical and practical question is whether it makes a difference to use one particular model (from those presented above) rather than the others for estimating the latent trait levels of individuals. In this regard, evidence suggests that using CTT raw scores can systematically overestimate or underestimate the true latent trait level under some conditions (Ziegler & Ziegler, 2009). Such biases were found in studies in both the cognitive and the psychopathology areas (Reise & Waller, 2009; Stansbury, Ried, & Velozo, 2006; Ziegler & Ziegler, 2009). One study has also shown that latent scores yielded by PC are less contaminated by disturbing influences, such as social desirability, than CTT scores (Saar, Aavik, & Konstabel, 2012). Also, compared to CTT, IRT models seem to provide more accurate estimates of latent trait levels (Fraley, Waller, & Brennan, 2000; Weiss & Von Minden, 2011). Furthermore, IRT models are less prone than CTT scores to spurious interaction effects in analysis of variance (Embretson, 1996a) and linear regression (Morse, Johanson, & Griffeth, 2012). Therefore, studies using pairwise comparisons have shown substantial differences in the quality of the estimates provided by the models.
However, to our knowledge, only one previous empirical study has compared latent trait estimates obtained via several different models. Namely, using real data on psychopathology, Dumenci and Achenbach (2008) explored the relationship between latent trait estimates yielded by six statistical models (i.e., CTT, PC, exploratory FA with Maximum Likelihood estimation method [EFA-ml], confirmatory FA with Weighted Least Squares Mean- and Variance-adjusted estimation method [CFA-wlsmv], GRM and PCM). The authors found similar estimates among the CTT, PC and EFA-ml methods on the one hand, and among CFA-wlsmv, GRM and PCM on the other. Specifically, within each method group, linear relationships (R²) between estimates were near 1.00. By contrast, between groups, relationships were more of a quadratic or cubic type, with R² around .90. Therefore, the findings revealed non-negligible differences between these two clusters of models, suggesting that some of them may be more appropriate than others under empirical conditions yet to be fully explored. Nevertheless, the use of real data prevented the authors from investigating such conditions, for example by controlling for the presence or absence of items with highly dissimilar degrees of difficulty.
In this respect, Embretson (1996b) used simulated data to illustrate how test equating under CTT yields divergent (non-linearly related) estimates for the same individuals when using an "easy" and a "hard" version of a test. In fact, the presence of items with difficulties inadequate for the sample latent trait distribution may bias even the number of underlying dimensions identified when using exploratory methods such as continuous FA and PC (Aryadoust, 2009; Smith, 2009). It follows that convergence between latent trait estimates from different psychometric models may vary according to whether or not item difficulties match the sample mean of the latent trait distribution. Nevertheless, we know of no previous work that has applied statistical tests to address this empirical issue. Therefore, the aim of the present simulation study was to compare the performance of seven popular statistical models in providing adequate latent trait estimates when items are targeted at the mean level or at the tails of the latent variable distribution. To do so, we evaluated correlations between estimated and simulated true latent scores, considering three simulation conditions: (a) item difficulties targeted at the sample mean of the latent trait distribution (Condition 1); (b) item difficulties targeted at the lower tail of the latent distribution (Condition 2); and (c) item difficulties targeted at the upper tail of the latent distribution (Condition 3). In addition, we investigated the influence of sample size on the quality of the model estimates. We sought to cover a diversity of statistical models commonly used in Psychometrics, representing the perspectives of CTT, FA, IRT and PC.

Procedures of Data Simulation
Fifteen unidimensional databases were simulated, crossing three item difficulty distribution conditions (described below) with five sample sizes (N = 100, N = 200, N = 500, N = 1000 and N = 2000). For each database, 10 items representing a continuous latent variable were generated. We specified a five-point Likert scale and a discrimination (parameter a) ranging between .5 and 2.8. This allowed for items with a wide range of degrees of discrimination, according to extreme reference values listed in the literature (Baker, 2001). The purpose of these specifications was to approximate real data, in which items tend to vary in the strength of their relationship with the latent trait. Item responses were generated with the Generalized Partial Credit model (Muraki, 1992), which allows variability in both the a and b parameters of the items. The simulation was performed using the WinGen program (Han, 2007).
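The generating step can be sketched in Python as follows. This is not the WinGen implementation; it is a minimal illustration of sampling five-category responses under a generalized partial credit model, in which the probability of category k is proportional to the exponential of the cumulative sum of a(θ − b_h) over steps h ≤ k. The uniform draw of discriminations within [.5, 2.8], the normal draw of step difficulties, the seed and all function names are assumptions made for the example.

```python
import math
import random


def gpcm_category_probs(theta, a, steps):
    """Category probabilities for one item under the generalized partial credit model.

    steps: step difficulties b_1..b_{K-1}; categories are scored 0..K-1.
    """
    # Cumulative sums of a * (theta - b_h); category 0 has an empty sum (0.0).
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_gpcm_data(n_persons, n_items=10, n_categories=5, seed=1):
    """Simulate a persons-by-items matrix of ordinal responses (0..K-1)."""
    rng = random.Random(seed)
    # Latent scores: standard normal, as in the study design.
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    items = []
    for _ in range(n_items):
        a = rng.uniform(0.5, 2.8)  # assumed uniform within the reported range
        steps = [rng.gauss(0.0, 1.0) for _ in range(n_categories - 1)]  # assumed
        items.append((a, steps))
    data = [
        [rng.choices(range(n_categories), weights=gpcm_category_probs(t, a, s))[0]
         for a, s in items]
        for t in thetas
    ]
    return thetas, items, data


thetas, items, data = simulate_gpcm_data(n_persons=100)
print(data[0])
```

Because the true θ values are kept alongside the simulated responses, any scoring method applied to `data` can later be compared against `thetas`.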
For all databases, latent scores were specified to have a normal distribution with mean = 0 and standard deviation = 1. For databases of Condition 1 (items targeted at the sample mean of the latent trait distribution), b parameters were also specified to have a normal distribution (mean = 0 and standard deviation = 1). Thus, items were always located in the portion of the latent continuum with the largest number of cases and, therefore, the most useful information for estimating the locations of individuals. By contrast, for databases of Condition 2 (items targeted at the lower tail of the latent trait distribution) and Condition 3 (items targeted at the upper tail of the latent trait distribution), we created items with difficulties matching only the tails of the latent trait distribution: respectively, the lowest 20% and the highest 20% of individuals in the sample distribution. Specifically, item difficulties fell between -3.00 and -.84 for Condition 2 and between .84 and 3.00 for Condition 3, consistent with values from the z-score table.
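The ±.84 cut-offs follow directly from the standard normal distribution, as this short check with Python's standard library illustrates (the variable names are ours):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# z value below which 20% of a standard normal distribution falls.
lower_cut = standard_normal.inv_cdf(0.20)
# z value above which 20% of the distribution falls.
upper_cut = standard_normal.inv_cdf(0.80)

print(round(lower_cut, 2), round(upper_cut, 2))
```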

Data Analysis
We measured the magnitude of correspondence between latent trait estimates and true (simulated) latent scores using the Pearson correlation and the coefficient of determination (r²). We also tested for mean differences and for effects of simulation condition, sample size and statistical model on the variance shared with the true latent score using t tests, one-way ANOVA and factorial ANOVA. Estimates of true latent trait locations were obtained with the following models:
Classical Test Theory (CTT). Raw scores were computed for the 10 items of each database, assuming parallelism between items (i.e., their equivalence in terms of true scores and error variance; Graham, 2006).
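The raw-score estimate and the agreement index described above can be sketched in Python. The response matrix and the "true" latent scores are toy values invented for the example, and `pearson_r` is a plain textbook implementation rather than the routine used in the original analyses.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: 5 persons x 4 Likert items and made-up true scores.
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [5, 5, 4, 5],
]
true_theta = [-1.6, -0.7, 0.1, 0.8, 1.5]

# CTT estimate: unit-weighted sum of item scores for each person.
raw_scores = [sum(row) for row in responses]

r = pearson_r(raw_scores, true_theta)
r_squared = r ** 2  # proportion of variance shared with the true scores
print(raw_scores, round(r, 3), round(r_squared, 3))
```

The same `pearson_r` call applies unchanged to component scores, factor scores or IRT estimates, which is what makes r² a common yardstick across the seven methods.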
Principal Components (PC). As described previously, the PC method (Hotelling, 1933) provides an index that consists of a weighted sum of (continuous or ordinal) indicators. Component scores were computed using the SPSS 19.0 program, with the regression scoring method.
Maximum Likelihood Exploratory Factor Analysis (EFA-ml). The ML method applied to exploratory factor analysis (Jöreskog, 1967) is also a way to estimate the parameters described in equation (2). Despite assuming a normal and continuous distribution of the data, ML is one of the most widely used estimation methods for factor analysis (Fabrigar, Wegener, MacCallum, & Strahan, 1999). Factor scores derived from EFA-ml were computed using the SPSS 19.0 program, with the regression scoring method.
Minimum Rank Exploratory Factor Analysis (EFA-mr). The minimum rank method (ten Berge & Kiers, 1991) is one of several possibilities for estimating the parameters of the general model of equation (2). EFA-mr was developed to maximize the common variance explained by each extracted factor (ten Berge & Kiers, 1991). Factor score estimates were obtained with the software FACTOR 8.1 (Lorenzo-Seva & Ferrando, 2006), which uses a linear method developed by ten Berge, Krijnen, Wansbeek and Shapiro (1999).
Rating Scale Model (RSM). RSM (Andrich, 1978) is an IRT model for polytomous items that takes into account the purely ordinal nature of Likert scales. RSM was derived from the dichotomous Rasch (1960) model, which is often considered the best option when the goal is separability between item and person parameters (Bond & Fox, 2007; Wright, 1997). The software used was Winsteps 3.72.0 (Linacre, 1991), which provides the Joint Maximum Likelihood estimation method.
Graded Response Model (GRM). GRM (Samejima, 1969) is an IRT model suitable for polytomous items such as Likert scales, and hence takes into account the ordinal nature of the raw data being modeled. GRM allows item discrimination parameters (parameter a) to vary across items. Analyses were conducted with the ltm package (Rizopoulos, 2006) in R. The package uses the Marginal Maximum Likelihood estimation method with the Expectation-Maximization algorithm (Bock & Aitkin, 1981) to calculate model parameters. We computed latent scores via the Expected a Posteriori method.
Weighted Least Squares Mean- and Variance-adjusted Confirmatory Factor Analysis (CFA-wlsmv). Confirmatory factor analysis with the WLSMV estimation method (Muthén & Muthén, 1999) does not assume continuity or normal distribution of the data, and typically uses polychoric correlation matrices. CFA-wlsmv, therefore, takes into account the purely ordinal nature of Likert response scales. In general, CFA-wlsmv tends to provide parameter estimates closely related to those of GRM (although on a probit scale), as it estimates factor loadings (item discrimination) as well as item thresholds or intercepts (item difficulty; Ferrando & Lorenzo-Seva, 2005). Analyses were conducted with Mplus 6.0 software (Muthén & Muthén, 2010). Mplus uses the Maximum a Posteriori method to calculate factor scores for WLSMV models.

Results
We used the seven methods described in the previous section to estimate (i.e., recover) the true latent trait locations in the 15 simulated databases. Pearson correlation coefficients and shared variance (r²) measuring the relationship between estimated and true latent scores in simulation Conditions 1, 2 and 3 are shown in Table 1. Results showed a small variability in determination coefficients (Δr²) across sample sizes when item difficulties matched the latent trait distribution of the sample used, namely in Condition 1. In contrast, we observed a larger variability in determination coefficients for Condition 2 and Condition 3 (i.e., situations in which item difficulties were not targeted at the sample mean of the latent trait distribution).
For Conditions 2 and 3, the RSM, GRM and CFA-wlsmv methods showed a mean r² significantly higher than CTT, PC, EFA-ml and EFA-mr. So, consistent with the study of Dumenci and Achenbach (2008), we identified two internally consistent clusters of statistical models. Indeed, after averaging the observed r² yielded by CTT, PC, EFA-ml and EFA-mr on the one hand, and by RSM, GRM and CFA-wlsmv on the other, results showed significant and very large differences between the groups of statistical models for Condition 1, t(33) = -3.89, d = 1.32, p < .001, Condition 2, t(33) = -14.43, d = 4.97, p < .001, and Condition 3, t(33) = -10.63, d = 3.62, p < .001. Therefore, overall recovery of the true latent score was better for the group comprising RSM, GRM and CFA-wlsmv than for the group comprising CTT, PC, EFA-ml and EFA-mr. This pattern is clearly depicted in Figure 1.
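As a rough plausibility check on these effect sizes, Cohen's d can be recovered from an independent-samples t statistic through d = |t| · sqrt(1/n1 + 1/n2). The sketch below assumes group sizes of 20 (4 models × 5 sample sizes) and 15 (3 models × 5 sample sizes); these sizes are our inference from the reported 33 degrees of freedom and are not stated explicitly in the text.

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Cohen's d implied by an independent-samples t statistic (pooled SD)."""
    return abs(t) * math.sqrt(1.0 / n1 + 1.0 / n2)


# Assumed group sizes: 4 models x 5 sample sizes and 3 models x 5 sample sizes,
# which is consistent with df = 20 + 15 - 2 = 33.
N_GROUP_1, N_GROUP_2 = 20, 15

# t statistics as reported in the text.
reported_t = {"Condition 1": -3.89, "Condition 2": -14.43, "Condition 3": -10.63}

implied_d = {
    condition: cohens_d_from_t(t, N_GROUP_1, N_GROUP_2)
    for condition, t in reported_t.items()
}
for condition, d in implied_d.items():
    print(condition, round(d, 2))
```

Under these assumptions the implied values land close to the reported d of 1.32, 4.97 and 3.62; the small discrepancies may reflect rounding of t or a slightly different formula for d.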
Table 2 summarizes the performance of each model across simulation conditions and sample sizes. PC, EFA-ml and EFA-mr did not perform better than the simple sum of raw scores (CTT). By contrast, RSM, GRM and CFA-wlsmv showed overall r² means consistently higher than those obtained by the other models. Variability in the r and r² coefficients was also lower for these three models, indicating more stable estimates regardless of simulation condition.

Discussion
Our findings have theoretical and practical implications for the task of estimating the locations of individuals on continuous latent variables using unidimensional statistical models. First of all, results indicated substantial variability in the performance of the methods depending on whether the items were too easy or too difficult for the sample assessed. More specifically, we observed a reduced overall performance of models in conditions in which item difficulties matched the latent trait levels of only the lowest or highest 20% of individuals. In these conditions, we observed a greater proportion of error in the resulting estimates. It therefore bears stressing that items should always adequately match the latent trait level of the sample. Inadequate items imply low accuracy of latent trait estimates, dramatically increasing the likelihood of spurious results in analyses based on these estimates.
Second, in all simulation conditions, RSM, GRM and CFA-wlsmv provided estimates closer to the true latent scores than CTT, PC, EFA-ml and EFA-mr. These differences were very large in Conditions 2 and 3 (i.e., in which item difficulties were targeted at the tails of the sample distribution of the latent trait; d = 4.97 and 3.62, respectively). Thus, the findings suggest that the latent trait estimates of RSM, GRM and CFA-wlsmv are less affected by a possible "mismatch" between items and sample distribution. This pattern is consistent with findings from the study by Dumenci and Achenbach (2008), and held even in the ideal Condition 1, in which item difficulties were specified to be randomly distributed around the sample mean of the latent trait distribution (d = 1.32). One explanation for this difference lies in the syntactic formulation of the models. Namely, RSM, GRM and CFA-wlsmv estimate item difficulties (δ or b parameters in RSM and GRM, and τ thresholds in CFA-wlsmv), which is not true of CTT, PC, EFA-ml and EFA-mr. Including these parameters in the model, therefore, results in a greater capacity to separate variability in responses due to features of the items from variability attributable to features of the individuals who respond to them. Put another way, RSM, GRM and CFA-wlsmv do not assume items to be equally difficult. In fact, difficulty parameters typically vary across items (a desirable feature in psychometric instruments), so it may be more appropriate to use statistical models that take this variability into account. Another feature is that RSM, GRM and CFA-wlsmv do not assume ordered categorical data (such as Likert scale scores) to be continuous measures of psychological attributes possessed by individuals. Consistent with previous simulation studies (e.g., Holgado-Tello, Chacón-Moscoso, Barbero-García, & Vila-Abad, 2010), taking into account the ordinal nature of the indicators yielded a better approximation of the estimates to the real parameter values.
Normal theory ML estimation assumes continuity and normal distribution of items, which are unlikely features of discrete, ordinal indicators such as typical Likert scales with a small number of categories. Illustrating this point, a recent simulation study recommended using Robust Categorical Least Squares (RCLS) estimation instead of ML for factor analysis of items with fewer than five categories (Rhemtulla, Brosseau-Liard, & Savalei, 2012). By contrast, Rhemtulla et al. (2012) also recommended using ML rather than RCLS if the number of categories equals or exceeds five, as variables then tend to approach a continuous distribution. In light of this, we must stress that we base our results and conclusions on data with five categories, without claiming that the patterns would remain the same for items with a larger number of categories. Future studies should address this issue by testing for the interaction of number of categories, statistical model and simulation condition.
Nevertheless, our findings advise against using raw scores (CTT) or scores derived from EFA-ml, EFA-mr and PC to represent true person locations on unidimensional psychological variables assessed with items with up to five categories. Besides producing larger errors in estimates, these methods do not make it possible to detect whether the situation in question is ideal, as in Condition 1 of this study, or problematic, as in Conditions 2 and 3. Therefore, researchers, technicians and other professionals in psychology and related areas should review their practice of using raw scores (e.g., sums of Likert scale scores) as if they were proxies for latent psychological phenomena.
Another result worth mentioning is that RSM yielded estimates as precise as those of GRM and CFA-wlsmv, even without modeling item discriminations. In fact, RSM estimates only the overall difficulty δ_j of items and the specific item thresholds "δ_j + τ_k" for the Likert scale categories (Andrich, 1978). Thus, considering that GRM and CFA-wlsmv also incorporate discrimination parameters in the model, a better recovery of the true latent trait level would be expected from them than from RSM. The reason for this expectation is that including a discrimination parameter generally implies a better fit of the model to the data, as it allows the slope of the item characteristic curves to vary (Hambleton, 1994). In contrast, we observed no significant differences between estimates obtained from the RSM, GRM and CFA-wlsmv models across the three simulation conditions, even with items specified to have discrimination values ranging widely from .5 to 2.8. Although these three models yielded similar estimates, it is noteworthy that RSM imposes a smaller number of parameters on the data, which makes it a more parsimonious approach to latent trait modeling than GRM and CFA-wlsmv.

Conclusions
We need to stress some limitations of our study. First, we used a single database for each sample size in each of Conditions 1, 2 and 3. Future studies may address this shortcoming by using a larger number of databases in order to obtain a distribution of R² for each sample size within simulation conditions. This may help to obtain more precise evidence of the level of bias in the estimates of each model when items do not match the sample distribution. Second, we did not address assessment situations using multidimensional models, so we encourage authors to expand the investigation to multidimensional contexts. Third, we did not test the effect of the estimation method within each statistical model. It is possible, in this sense, that promising estimation techniques, such as Markov Chain Monte Carlo, would provide more stable estimates across several simulation conditions for RSM, GRM and CFA-wlsmv, and perhaps for other models. Fourth, we restricted our investigation to five-category indicators, so the results do not generalize to situations in which instruments comprise items scored on a scale with a larger number of categories; new simulation studies should investigate whether the differences between models still hold in this situation. Finally, researchers may be interested in further controlling for features of the latent trait distribution such as skewness and kurtosis, as some techniques, such as EFA-ml, make assumptions in this regard.
Our findings provide relevant guidelines for decision making concerning the use of psychometric models to estimate latent scores. We recommend using the latent score estimates provided by the RSM, GRM and CFA-wlsmv methods instead of traditional raw scores. In addition, we emphasize the need for researchers to base their methodological practices on sound empirical evidence concerning the performance of data analysis methods.

Figure 1. Overall r² means for the statistical model groups in the simulation conditions.

Table 1
Models' Performance in the Simulation Conditions
Note. CTT = Classical Test Theory (raw scores), PC = Principal Components, EFA-ml = Exploratory Factor Analysis with Maximum Likelihood estimation method, EFA-mr = Exploratory Factor Analysis with Minimum Rank estimation method, RSM = Rating Scale Model, GRM = Graded Response Model, CFA-wlsmv = Confirmatory Factor Analysis with Weighted Least Squares Mean- and Variance-adjusted estimation method. ΔR² = variation in the R² coefficient. b Differs from a with p < .001 in the pairwise comparison.

Table 2
Overall Results