A systematic review of validity procedures used in neuropsychological batteries

This study presents a systematic review of validity evidence for neuropsychological batteries. Studies published in international databases between 2005 and 2012 were examined. Considering the specificity of neuropsychological batteries, the aim of the study was to review the statistical analyses and procedures that have been used to validate these instruments. A total of 1,218 abstracts were read, of which 147 involved studies of neuropsychological batteries or tests that evaluated at least three cognitive processes. The full text of each article was analyzed according to publication year, focal instrument of the study, sample type, sample age range, characterization of the participants, and procedures and analyses used to provide evidence of validity. The results showed that the studies primarily analyzed patterns of convergence and divergence by correlating the instruments with other tests. Measures of reliability, such as internal consistency and test-retest reliability, were also frequently employed. To provide evidence of relationships between test scores and external criteria, the most common procedures were evaluations of sensitivity and specificity, and comparisons were made between contrasting groups. The statistical analyses frequently used were Receiver Operating Characteristic analysis, Pearson correlation, and Cronbach’s alpha. We discuss the necessity of incorporating both classic and modern psychometric procedures and presenting a broader scope of validity evidence, which would represent progress in this field. Finally, we hope our findings will help researchers better plan the validation process for new neuropsychological instruments and batteries.


Introduction
Brazilian neuropsychology researchers are increasingly interested in developing and adapting instruments based on evidence of validity (Abrisqueta-Gomez, Ostrosky-Solis, Bertolucci, & Bueno, 2008; Caldas, Zunzunegui, Freire, & Guerra, 2012; Carod-Artal, Martínez-Martin, Kummer, & Ribeiro, 2008; Carvalho, Barbosa, & Caramelli, 2010; Fonseca, Salles, & Parente, 2008; Pawlowski, Fonseca, Salles, Parente, & Bandeira, 2008). The validation process for psychological instruments includes different procedures and statistical techniques to evaluate psychometric properties (Pasquali, 2010; Urbina, 2004). Detailed procedures and techniques are supplied in the Standards for Educational and Psychological Testing (American Educational Research Association, 1999). Several statistical software programs can be used for instrument validation, as can be seen in articles and test manuals. However, the applicability of the techniques depends on the characteristics of the instrument that is being validated. Neuropsychological batteries, in particular, vary in the type and quality of the test items, the number of cognitive functions examined, and the construct measured.
Many neuropsychological instruments include tasks that evaluate different cognitive domains, and they require validation techniques distinct from those used for conventional rating scales, such as Likert-type scales. Some procedures or statistical analyses can be difficult to apply in specific situations, such as when the number of items is limited or when a large number of subjects is required but the sample is hard to access. Comparisons of neuropsychological testing research methods and specific guidelines for psychological and neuropsychological test development contribute to the refinement of interpretative, clinical, and psychometric methods (Hunsley, 2009; Brooks, Strauss, Sherman, Iverson, & Slick, 2009; Blakesley et al., 2009). Consistent with most validity frameworks and the current test standards (American Educational Research Association, 1999), tests differ with regard to the categories that are most crucial to test meaning, depending on the test's intended use (Embretson, 2007). A brief discussion of psychometric procedures that are used to provide evidence of the validity of neuropsychological assessment batteries can be found in Pawlowski, Trentini, and Bandeira (2007).
Considering the specificity of neuropsychological batteries, the aim of the present study was to review the procedures and statistical analyses that have been used to study evidence of the validity of these tests. This study can contribute to the selection of appropriate statistical techniques and inform professionals about better instrument validation procedures.

Materials and Methods
Abstracts and articles published in indexed periodicals and international databases between 2005 and 2012 were reviewed. The selected publications simultaneously considered neuropsychological assessment and validity.

Study type
The present research involved an integrated and systematic review (Fernández-Ríos & Buela-Casal, 2009). Beginning from a set of quantitative studies, the aim was to integrate information about analyses that have examined evidence of the validity of neuropsychological batteries.

Procedures
The PsycINFO and MEDLINE (EBSCO) databases were searched on May 3, 2013. The terms "neuropsychological assessment" and "validity" (key words used in Thesaurus) were used to search for abstracts published between January 2005 and December 2012. The search was conducted without publication language restrictions. A database was created with all abstract titles, and duplicates between the MEDLINE and PsycINFO databases were removed. Abstracts that involved investigations of evidence of the validity of neuropsychological batteries and assessed at least three cognitive processes were included. For example, the cognitive processes could include memory, language, and praxis (i.e., motor planning). Three independent judges classified each abstract according to the name of the instrument, study type (e.g., empirical, theoretical, or review), instrument type (e.g., battery, single task, or scale), the number of cognitive processes evaluated by the instrument, and whether the study evaluated any evidence of validity. Each abstract was read by at least two of the three judges. In case of disagreement, the abstract was evaluated by all three judges until consensus was reached. The selected abstracts were read again. Articles that assessed evidence of the validity of computerized batteries were excluded. The complete article of each selected abstract was read and classified according to the following criteria: publication year, sample type (clinical or healthy), sample age group (children, teenagers, adults, and elderly), clinical pathology (in the case of clinical samples), and procedures and statistical analyses employed to provide evidence of validity.

Information analysis
Descriptive analyses (frequency and percentage) were performed to record the publication year, focal instrument of the study, sample type, sample age range, characterization of the participants, and type of procedure and statistical analysis employed to evaluate evidence of validity.

Results
The search for articles with the simultaneous use of the key words "neuropsychological assessment" and "validity" resulted in 1,218 abstracts published in scientific journals between January 2005 and December 2012. A total of 525 abstracts were found in the PsycINFO database, and 693 abstracts were found in the MEDLINE database. Of the 693 abstracts in MEDLINE, 117 were also indexed in PsycINFO, and one repetition was found within PsycINFO itself. After these 118 duplicates were removed, 1,100 abstracts remained, of which 524 were from PsycINFO and 576 were from MEDLINE. Figure 1 presents a detailed diagram of the selected abstracts.
Only studies of neuropsychological batteries or tests that evaluated at least three cognitive processes and included tasks with face-to-face or traditional paper-and-pencil administration were analyzed. The final selection included 147 abstracts (73 in PsycINFO and 74 in MEDLINE). The distribution of the 147 abstracts by publication year is presented in Figure 2, which shows an increase in the number of studies in recent years, especially in 2010 and 2012. Because the full-text articles were unavailable for 15 abstracts, 132 articles were fully reviewed. Four abstracts were excluded because information about the analytical criteria was not present. The final review therefore included 132 full-text articles and 11 abstracts. Detailed information for all 143 full-text articles and abstracts is presented in Table 1, including year of publication, journal, authors, quantity and type of participants (clinical and control/comparison), and instrument.
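The selection flow described above can be checked with simple arithmetic. The following is a minimal sketch in Python that uses only the counts reported in this section; the variable names are illustrative, and the assumption that the four excluded abstracts come from the 15 without full text is the reading under which the reported totals agree.

```python
# Counts reported in the Results section (variable names are illustrative).
psycinfo_hits = 525
medline_hits = 693
total_hits = psycinfo_hits + medline_hits

# Duplicates: 117 MEDLINE abstracts also in PsycINFO, plus 1 repeat within PsycINFO.
cross_database_duplicates = 117
within_psycinfo_duplicates = 1
unique_abstracts = total_hits - cross_database_duplicates - within_psycinfo_duplicates

# Selected abstracts and the final corpus.
selected_abstracts = 147
without_full_text = 15
full_text_reviewed = selected_abstracts - without_full_text
excluded_abstracts = 4  # assumption: drawn from the 15 abstracts without full text
abstract_only = without_full_text - excluded_abstracts
final_corpus = full_text_reviewed + abstract_only

print(total_hits, unique_abstracts, full_text_reviewed, abstract_only, final_corpus)
```

Under this reading, the computed totals match the figures given in the text (1,218 hits, 1,100 unique abstracts, 132 full-text articles, 11 abstracts, and 143 records in the final review).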
The instrument whose psychometric properties were most often analyzed by studies in this systematic review was the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS), which was cited in 12.5% of the articles. Investigations of the psychometric properties of the Montreal Cognitive Assessment
With regard to sample type, 51.4% of the studies included both a clinical and healthy/control sample. A total of 35.4% of the studies exclusively examined clinical samples, and 13.2% focused on healthy populations. Concerning the age of the samples, 44.4% of the studies were performed with elderly participants, 22.9% included both adults and the elderly, and 18.8% exclusively included adults. Additionally, 4.2% included youth, adults, and the elderly, and 4.9% were performed exclusively with children. Only one study (.7%) included a sample of children, teenagers, and adults, and another study (.7%) exclusively analyzed teenagers. Identification of the age group was not possible in five studies (3.5%).
Concerning clinical sample pathologies, frequencies and percentages are presented in Table 2. The clinical samples were predominantly composed of individuals diagnosed with dementia (18.5%), Alzheimer's disease (17.9%), and mild cognitive impairment (16.2%). Patients with acquired brain injury and cerebrovascular diseases were also assessed in a large number of studies (15.1%).
To evaluate the validity of the batteries, the articles incorporated from one to eight distinct procedures. Most of the studies completed two (25%), three (21.5%), or just one (20.8%) procedure. Four procedures were employed by 13.2% of the studies, five were employed by 10.4%, and six were employed by 6.3%. Only 1.4% of the studies presented seven procedures, and .7% presented eight procedures. The most frequently used procedures included the evaluation of sensitivity and specificity (17.6%), correlations with other tests (15.6%), comparisons between groups (12.3%), analyses of internal consistency (12.1%), test-retest reliability (8.6%), and factor structure analysis (8.6%). The procedures are presented in Table 3.
The most frequently used statistical analyses for neuropsychological battery validation were the Receiver Operating Characteristic (ROC) analysis (13.1%), Pearson product-moment correlation coefficient (12.9%), Cronbach's alpha coefficient (10.9%), analysis of variance or covariance (9.3%), and regression analysis (7.2%). The frequencies and percentages of the statistical analyses are presented in Table 4.

Discussion
This paper presents a systematic review of studies that assessed evidence of the validity of neuropsychological assessment batteries published in international databases between 2005 and 2012. The increase in the number of papers in recent years indicates a growing scientific concern about providing evidence of the validity of neuropsychological batteries.
The main findings demonstrate that the typical procedures and statistical analyses employed in psychological test validation are also present in neuropsychological battery validation studies. Specifically, sensitivity and specificity, correlations with other tests, comparisons between groups, reliability, and factor structure analysis were commonly employed. With regard to statistical techniques, the same was observed, with a major prevalence of ROC analysis, Pearson correlation, Cronbach's alpha, analysis of variance, and regression analysis.
The assessment of the validity of instrument scores usually considers sources of evidence of construct validity (American Educational Research Association, 1999; Embretson, 2007), which are also related to content, criteria, and patterns of convergence and divergence (Urbina, 2004). In this systematic review, the studies primarily assessed different sources of validity by searching for patterns of convergence and divergence.
The main sources of patterns of convergence and divergence were correlations with other tests and measures of reliability, such as internal consistency and test-retest reliability. The pattern of correlations with other measures, considering theoretical relationships, is frequently employed by researchers as a source of evidence of construct validity (Westen & Rosenthal, 2003). In addition to correlation studies that provide additional support for the validity of an instrument, Urbina (2004) noted that an instrument should also measure the construct in a precise and reliable way in order to be valid. This is consistent with the idea of minimizing the role of external sources of validity and emphasizing internal sources of evidence to establish test meaning. Such procedures would include item design principles, domain structure, item interrelationship, and reliability (Embretson, 2007).
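The two statistics most closely tied to these sources of evidence, the Pearson correlation with another test and Cronbach's alpha for internal consistency, can be illustrated with a short computation. The sketch below is written in Python with hypothetical scores that are not taken from any reviewed study; the function names and data are assumptions made only for illustration, and sample variances (n − 1 in the denominator) are used throughout.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long score lists."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list per item, aligned by participant."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    summed_item_variance = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - summed_item_variance / variance(totals))

# Hypothetical data for six participants (illustration only).
battery_total = [21, 25, 18, 30, 27, 23]
comparison_test = [40, 47, 35, 55, 50, 44]
memory_items = [
    [3, 4, 2, 5, 4, 3],
    [2, 4, 2, 5, 5, 3],
    [3, 5, 1, 4, 4, 4],
]

print(f"convergent r = {pearson_r(battery_total, comparison_test):.2f}")
print(f"alpha = {cronbach_alpha(memory_items):.2f}")
```

In a battery, alpha would normally be computed separately for each subtest or cognitive domain, because items that measure different constructs are not expected to be internally consistent with one another.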
Consistent with this notion, factor structure and correlation among instrument subscales should also be investigated. Factor analysis can contribute to investigations of the dimensionality of a particular assessment instrument or battery or to confirm the theory that underlies the battery by considering the identified weightings of the variables (Floyd & Widaman, 1995; Schmitt, Livingston, Smernoff, Reese, Hafer, & Harris, 2010). However, a neuropsychological battery is composed of tasks or tests with an unequal number of items through which different constructs are examined. Factor analysis is therefore not recommended to assess test validity when a small number of items are present because there must be at least three variables for each dimension of an instrument to endorse the use of this technique (Brown, 2006; Fabrigar, Wegener, MacCallum, & Strahan, 1999).
Pearson product-moment correlation coefficient and Cronbach's alpha are statistical tests often used to estimate patterns of the convergence and divergence of psychological and educational instruments (Creswell, 2008). The frequent use of these techniques emphasizes the popularity of traditional procedures in the validity assessment of instruments. Few studies employed alternative models to classical test theory, such as Item Response Theory (IRT). In IRT, different properties of items are evaluated to provide more complete characterizations of the items, the instrument as a whole, and the performance of each subject. Thus, these models offer improved accuracy and precision in neuropsychological evaluation tests; however, IRT has had a limited impact on neuropsychological tests, possibly because this type of use has only been recently adopted (Thomas, 2011). The use of IRT to study the validity of neuropsychological tests could contribute to selecting the most representative items for evaluating a specific cognitive function. Item Response Theory also has the potential to identify items with superior discriminatory power in relation to specific deficits (Pedraza et al., 2009; Schultz-Larsen, Kreiner, & Lomholt, 2007).
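To make the notion of item discriminatory power concrete, the sketch below uses the two-parameter logistic (2PL) model, one common IRT model chosen here only as an illustration; the reviewed studies may have used other models. The item parameters and names are hypothetical. In this model, an item with discrimination a and difficulty b is answered correctly by a person of ability θ with probability P(θ) = 1 / (1 + exp(−a(θ − b))), and the item information is a²·P(θ)·(1 − P(θ)).

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response for ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Hypothetical items: (discrimination a, difficulty b).
items = {"low_discrimination": (0.5, 0.0), "high_discrimination": (2.0, 0.0)}

for name, (a, b) in items.items():
    for theta in (-1.0, 0.0, 1.0):
        info = item_information(theta, a, b)
        print(f"{name}: theta={theta:+.1f}, information={info:.3f}")
```

At θ = b, the information equals a²/4, so the more discriminating item is far more informative near its difficulty level. This is the sense in which IRT can help select items that best separate examinees around a clinically relevant ability range.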
Regarding criterion validity, Urbina (2004) suggests assessing the precision of decisions related to concurrent and predictive validation. Concurrent validation can be achieved by correlating test scores with the predicted criteria. By studying differences between clinical groups and controls (or healthy samples), information about the precision of concurrent validation decisions can be obtained. In the present review, the evaluation of sensitivity and specificity and comparisons between contrasting groups were most often applied as evidence of concurrent validity. For evaluations of sensitivity and specificity, an analysis of the ROC curve was frequently employed. Receiver operating characteristic curve analysis contributes to the diagnostic validation of neuropsychological instruments by evaluating, across possible cutoff scores, the ability of the instrument to identify true positives while limiting false positives in relation to a diagnosis or specific criterion (Burgueño, García-Bastos, & González-Buitrago, 1995).
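A minimal sketch of these computations is given below in Python. The scores, the cutoff, and the function names are hypothetical and serve only as an illustration; lower scores are assumed to indicate greater impairment. The area under the ROC curve (AUC) is computed through its equivalence with the probability that a randomly chosen clinical participant scores lower than a randomly chosen control, with ties counted as one half.

```python
def sensitivity_specificity(clinical, controls, cutoff):
    """Classify scores at or below the cutoff as impaired."""
    true_positives = sum(score <= cutoff for score in clinical)
    true_negatives = sum(score > cutoff for score in controls)
    return true_positives / len(clinical), true_negatives / len(controls)

def roc_auc(clinical, controls):
    """AUC as the proportion of clinical-control pairs ranked correctly."""
    concordant = sum(
        (c < h) + 0.5 * (c == h) for c in clinical for h in controls
    )
    return concordant / (len(clinical) * len(controls))

# Hypothetical screening scores (illustration only).
clinical_scores = [18, 20, 22, 25]
control_scores = [24, 26, 27, 29]

sens, spec = sensitivity_specificity(clinical_scores, control_scores, cutoff=23)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"AUC={roc_auc(clinical_scores, control_scores):.4f}")
```

With these hypothetical data, a cutoff of 23 detects three of the four clinical participants and misclassifies no controls, and 15 of the 16 clinical-control pairs are ranked correctly. Repeating the first function over all candidate cutoffs traces the ROC curve itself.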
With regard to sample types, many studies analyzed groups of patients diagnosed with dementia, Alzheimer's disease, and mild cognitive impairment. One characteristic common to these groups is memory loss as the main symptom, although these patients remain heterogeneous in other ways (Pike, Rowe, Moss, & Savage, 2008). Generally, when a patient undergoes a neuropsychological assessment, decreased memory capacity is a common complaint. The choice of these clinical groups may be related to this pattern. The predominance of studies with elderly samples, which included more than half of the articles in this review, also supports this pattern.
Regression analysis also stood out in studies that assessed an instrument's specificity, but it was most often used to analyze the effect or influence of demographic variables on neuropsychological instruments. Other studies of the effect of demographic variables also used regression analyses to compare groups from distinct regions or cultures. These studies have the potential to contribute to validity assessments of the cultural or incremental type (Mungas, Reed, Haan, & González, 2005a). Additionally, ecological validity can be assessed, which would include studies that compare patient test performance with their practical daily activities (Chaytor & Schmitter-Edgecombe, 2003; Temple, Zgaljardic, Abreu, Seale, Ostir, & Ottenbacher, 2009).
Still with regard to criterion validity, a small number of studies investigated other forms of concurrent validity, such as an item's agreement with an external variable (i.e., the absence/presence of a deficit), the detection of clinical improvement using posttreatment scales, and the prediction of other test results. Few studies analyzed predictive validity, which refers to the evaluation of future criteria. The prediction of the capacity to return to work and the ability to predict future cognitive deficits can be considered future criteria. Survival analysis or the power of an instrument to predict future outcomes, such as death or institutionalization, was also employed in one of the studies (Cruz-Oliver, Malmstrom, Allen, Tumosa, & Morley, 2012). The limited number of studies that focused on future criteria corroborates the difficulty of implementing viable predictive studies (Urbina, 2004). Despite the complexity of implementing such studies, analyses that predict such factors as the patient's prognosis or capacity to return to work are necessary and viable for assessing the validity of neuropsychological tasks.
Specifying homogeneous criteria in clinical neuropsychology samples is a difficult goal to achieve (Benedet, 2003). Validity studies often include samples with a wide range of cognitive deficits with neurological involvement. This review indicates that researchers are looking for alternative groups of patients, which is highlighted by the presence of validity studies with samples of traumatic brain injury, cerebrovascular disease or stroke, schizophrenia, and Parkinson's disease. Notably, many neurological or psychiatric disorders, such as multiple sclerosis, attention-deficit/hyperactivity disorder, and bipolar disorder, are still under investigation and do not have a determined homogeneous profile of cognitive deficits. The lack of a homogeneous pattern of deficits in such samples interferes with their viability in validity studies, which demand a solid symptom profile.
Evidence related to content validity was mentioned in a small number of the reviewed studies. Some procedures or analyses employed in item development or translation included interrater reliability, the percentage of agreement, and qualitative evaluation. Assessments of content validity could provide evidence of relevant and representative items of the different constructs that are being investigated (Urbina, 2004). One explanation for the absence of studies related to test content could be that these studies have been published in previous articles about instrument development or the original test manuals, rather than as standalone articles; the former often focus on more fundamental aspects of instrument development. Nonetheless, from a psychometric point of view, the evaluation of item representativeness remains important for ensuring the validity of neuropsychological instrument scores. This is especially true if we consider the complexity of the evaluated functions.
Some studies employed other statistical analyses or procedures to search for evidence of validity and highlighted the relevance of data completeness, scaling assumptions, targeting, and effect size. The extent to which a scale's components are completed in the target sample and the percentage of people for whom reporting a single score is possible denote data completeness. Scaling assumptions determine whether summing subscales of the instrument to create a single scale score is appropriate. Targeting evaluates whether the range of cognitive performance measured by the battery corresponds to the range of the sample (Cano et al., 2010). Effect size correlations can provide convenient and informative indices of construct validity (Westen & Rosenthal, 2003).
Considering the results of this review, the most common procedures refer to external sources of validity, such as correlations with other measures, sensitivity, and specificity. With regard to internal validity, reliability procedures are commonly employed, but few studies have emphasized item development or used modern techniques as validity procedures. In our view, a balance between external and internal sources of validity evidence could improve the psychometric quality of neuropsychological batteries. Additionally, more careful attention to item and test development according to standards from both classic and modern techniques is useful and viable for providing validity evidence for instruments that assess very diverse domains, such as neuropsychological batteries. This approach could also minimize the difficulty of studying very heterogeneous samples, such as neurological patients.
Finally, some limitations should be considered when analyzing the results of this review. First, we did not include truncation or word variation when the search was conducted. Instead, we decided to rely on established keywords from the Thesaurus. Second, we decided to exclude computerized neuropsychological batteries because they have specific characteristics that are beyond the scope of this review.
In conclusion, our study suggests that improving evidence of the validity of neuropsychological instruments is possible. Incorporating both classic and modern psychometric procedures and presenting a broader scope of validity evidence would represent progress in neuropsychological battery validation. By highlighting the most common procedures and statistical analyses employed in this context and the observed limitations, this study may help researchers better plan the validation process for new instruments in the field.
(MoCA) were found in 11.2% of the studies. The Mini-Mental State Examination (MMSE) appeared in 7% of the citations, half of which occurred in association with the MoCA. Addenbrooke's Cognitive Examination (ACE) and Addenbrooke's Cognitive Examination Revised (ACE-R) were cited in 5.6% of the articles. Other instruments that were cited were the Alzheimer's Disease Assessment Scale-Cognitive Part (ADAS-COG; 4.9%), the Consortium to Establish a Registry for Alzheimer's Disease (CERAD; 3.5%), and the Neuropsychological Assessment Battery (NAB; 4.2%).

Figure 1. Diagram of the selected abstracts. ¹Abstracts in the "Not focus of the study" category were excluded because they correspond to (1) studies in which the measuring instrument was not the main focus of analysis, (2) the cognitive assessment instruments were not the research focus, or (3) the validity of the battery was not the main focus of the study. ²Abstract excluded because the final score evaluated only one cognitive function (e.g., a battery of intelligence assessment).

Figure 2. Articles per year of publication.

Table 1. Years, journals, authors, samples, and instruments of the papers reviewed.

Columns: Journal; Authors; Sample (N and type): Clinical, Control/Healthy; Instruments
(FAB) and Mini-Mental State Examination (MMSE)

Table 2. Frequencies and percentages of clinical pathology samples in the reviewed articles.

Table 3. Frequencies and percentages of validity evidence procedures.

Table 4. Statistical analyses employed in the reviewed validity procedures.