Scale for developmental dyslexia screening: evidence of validity and reliability

Purpose: To investigate the empirical validity and reliability of a screener for risk of developmental dyslexia (DD) for use by elementary school teachers. Methods: The scale was tested with 12 teachers who answered questions about their students (95 students total, all in the third year of elementary school); the students, in turn, performed reading and writing tasks, which were used to investigate the association between screening scores and performance. The following analyses were carried out: (1) factor analysis; (2) internal consistency; (3) relationship between each scale item and the construct of interest, as measured by item response theory (IRT); (4) correlation of each scale item with external variables (reading and writing tests); and (5) the temporal stability of teachers’ evaluations. Results: The analyses showed: (1) extraction of a single factor; (2) strong internal consistency, i.e., the items in the scale are good indicators for screening of this construct; (3) monotonic items (IRT), i.e., item variability is associated with one construct; (4) moderate Spearman correlations for 11 of the 17 external variables; and (5) temporal stability, i.e., the result of screening did not vary over time. Conclusion: This study shows evidence of validity and reliability of the proposed scale in its intended use of screening for developmental dyslexia. The percentage of children at risk for developmental dyslexia, according to the scale, was approximately 9%, which is in agreement with the international literature on the prevalence of dyslexia.


INTRODUCTION
The aim of this study was to obtain evidence of validity and reliability of a scale designed to screen for symptoms of developmental dyslexia (DD), by means of empirical analysis. This self-report scale was developed with the aim of being accessible, easy to use, and easy to analyze for teachers and other professionals who work with children during literacy development. Considering the lack of instruments for DD screening in Brazilian Portuguese, it was designed to have empirical validity and utility to aid in the identification of red flags for this learning disorder.
DD is a specific learning disorder of neurobiological origin. It is defined by an unexpected difficulty in reading for the child's chronological age, intelligence quotient, and educational level, which cannot be explained by another diagnosis (1). Although the DSM-5 (1) uses the term dyslexia only as a descriptor for Specific Learning Disorder with Impairment of Reading, the term developmental dyslexia was retained for the present study because of its widespread and historical use.
Worldwide, dyslexia is estimated to affect 5% to 10% of readers. In Brazil, a prevalence of 7% would correspond to approximately 3 million dyslexic students among the 49 million students in basic education as of 2014 (2). DD is widely underdiagnosed or diagnosed late in Brazil. A recent survey of Brazilian children with dyslexia identified that 60% of them had been held back at least once and that the average age at diagnosis was between 10 and 11 years, which suggests misinformation and a lack of screening for early diagnosis (3). These children had already completed between 4 and 5 years of schooling without so much as being identified as at risk of DD, even though some red flags for this disorder are already manifest during preschool or first grade, as described in the literature (4).
Early identification of DD mitigates student absenteeism and other harmful effects of the low reading level associated with this disorder. Easy-to-use screeners for DD red flags, such as the Screener for Reading and Writing (SRW) proposed herein, can help identify children at risk. Two points that bear stressing are the role of the elementary school teacher and the wide range of potential learning difficulties in the public school system. Studies have demonstrated the reliability of teachers' capacity to judge the reading skills of their students (5) . Assessment of a child's performance as compared to that of her peers by a teacher, especially with the help of a structured instrument, can be an important strategy to address the problem of early identification of children at risk of learning disorders.
The SRW can identify children who exhibit certain symptoms and behaviors characteristic of DD. It must be noted that the SRW does not replace clinical investigation for the diagnosis of DD. The SRW was based on the structure of the SNAP-IV (Swanson, Nolan, and Pelham Rating Scale) scale used to screen for attention deficit/hyperactivity disorder (6) and on a theoretical review of the literature, focusing mainly on diagnostic manuals. After this first stage of development, the SRW was submitted to a panel of expert judges from different regions of the country for analysis. This analysis led to the exclusion and inclusion of items based on the experts' assessment, as well as changes in the wording of items to ensure intelligibility and clarity. The resulting draft was submitted to elementary school teachers for semantic analysis, to ensure that there were no distorted interpretations of the items. These steps of the instrument development process are explained elsewhere (7) .
The choice to validate the scale at the end of the third grade was based on the DSM-5 (1). Under Criteria C and D for Specific Learning Disorders, the DSM establishes that difficulties must arise during school years and cannot be a result of lack of educational opportunities. Considering these criteria and the National Common Core Curriculum (Base Nacional Comum Curricular, BNCC), which mandates that literacy must be acquired by the end of the third grade of elementary school (at which time the National Literacy Assessment is carried out), our understanding was that only from this stage onward could the diagnostic hypothesis of dyslexia be established more accurately. The third grade is a milestone of the learning process in several guidelines. It is during this year that reading difficulties are most likely to be recognized by teachers; in Brazil, it is also the first grade that a student can fail. It should be noted that the BNCC's prescriptive stance on "a certain age" is a guideline based on evidence about neurodevelopment and the optimal age for acquisition of literacy (8,9); therefore, however arbitrary, this guideline establishes a time frame within the Brazilian educational process during which a screening instrument can assist in identification of DD risk.
The SRW is unique in Brazil. Only one other scale designed to monitor aspects of socio-emotional development such as social skills, behavioral problems, and academic skills has been validated in the country: the Social Skills Rating System (SSRS), for children aged 6 to 13 (10) . To date, there is no equivalent scale to screen for signs of DD. In addition to its original nature, this scale thus addresses an unmet need for a DD screener for the Brazilian population. Making this scale available for use across the country could consolidate it as an effective, user-friendly screening instrument.

EMPIRICAL TESTING: INTERNAL CONSISTENCY, FACTOR ANALYSIS, ITEM RESPONSE THEORY, CONVERGENT VALIDITY, AND TEMPORAL STABILITY
The validity and reliability of a test can be calculated through pre-established methods (11) and subsequently used in validation of the instrument (12) . The selection and construction of the SRW items, as well as other evidence of validity, are described elsewhere (7) . The present study is limited to presenting the results of statistical evaluations based on empirical data and discussing the implications thereof.
The following analyses of validity were conducted: a) internal consistency: tests whether the variability presented by each item/task has a strong relationship with the variability of the other items, as well as with variance of the final score; b) factor analysis: tests how many behaviors the scale and its items evaluate, and the extent to which each item is a good representation of the behavior which it is intended to measure; c) item response theory: tests the ability of each item on the scale to measure the degree of ability, or skill, presented by the respondent; d) convergent validity: tests for correlation between variation in performance in tests that measure the desired skill (external variables) and variation in SRW scores; and e) temporal stability coefficient: represents the stability of measurement over time, thus estimating the measurement error of the respondent. All of these tests were performed; a detailed presentation of the methods employed is available elsewhere (7) . This paper presents the results that underlie our empirical validation of the instrument and a practical discussion of the SRW items and the signs and symptoms for which it screens.

METHODS
This study was approved by the Research Ethics Committee of the Pontifícia Universidade Católica do Rio Grande do Sul (ethical approval number 51215715.8.0000.5336). All teachers who participated and the parents or guardians of the evaluated students provided written informed consent.

Participants
The sample of children for this study was derived from a larger umbrella project (ethical approval numbers 30895614.5.0000.5336 and 13629513.0.0000.5336). Assessment of children with the SRW was performed by 12 elementary-school teachers (Table 1) who taught third grade at the six public schools attended by the students involved in the project. These schools serve as a convenience sample for the umbrella project. Overall, the 12 teachers evaluated 122 children with the SRW. Twenty-seven of these assessments were excluded because they did not meet the inclusion criteria: a) child > 25th percentile on Raven's (n = 13); b) no intelligence quotient evaluation (n = 6); c) incomplete reading and writing assessment (n = 7); and d) screener not completed by the teacher (n = 1). Therefore, the final sample consisted of 95 students (mean age 9.27 years, SD = 0.39; 52.6% female). The study was conducted during the 2015 school year. One month after the first assessment, all teachers were invited to participate in the retest stage. The teachers' response time ranged from 2 to 4 months. Ultimately, six teachers agreed to take part in the retest stage. At this stage, 30 children (mean age 9.25 years, SD = 0.39) were selected at random and reevaluated.

Instruments and procedures
The Screener for Reading and Writing (SRW) is a self-report instrument (see attached file). A four-point Likert scale is used to measure the frequency with which symptoms (listed in 16 items) manifest. The scale was delivered to the teachers with a list identifying the selected students. Teachers were instructed to respond within 15 days. The scale was scored as follows: each item marked "never" was assigned one point; "rarely", two points; "sometimes", three points; and "often/always", four points. The minimum total score, which denotes no difficulty, is 16 points; the maximum score, which denotes great difficulty, is 64 points.
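The scoring rule described above can be illustrated with a minimal sketch. The function name `srw_total` and the response labels used as dictionary keys are illustrative assumptions, not part of the published instrument; only the point values and the 16-to-64 range come from the text.

```python
# Point values for each response category, as described in the text.
POINTS = {"never": 1, "rarely": 2, "sometimes": 3, "often/always": 4}
N_ITEMS = 16  # number of SRW items


def srw_total(responses):
    """Return the total SRW score for one child's 16 item responses.

    `srw_total` is a hypothetical helper written for illustration only.
    """
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    return sum(POINTS[r.lower()] for r in responses)


print(srw_total(["never"] * 16))         # minimum total: 16
print(srw_total(["often/always"] * 16))  # maximum total: 64
```

Rejecting incomplete forms rather than summing the available items is a design choice of this sketch; it mirrors the exclusion of screeners that were not completed by the teacher.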
Reading and writing performance was assessed by means of tasks performed in groups and individually. The tasks were: a) reading aloud of words and pseudowords (13); b) evaluation of reading comprehension of expository texts (14); c) dictation (15); and d) writing fluency assessment (7).

Data analysis
Exploratory factor analysis was used to investigate the dimensional structure of the SRW. In this analysis, robust weighted least squares (RWLS) estimation was performed on the variance/covariance matrix of the scale data in order to: 1) obtain results representative of the general population, i.e., extrapolate sample data to the general population; and 2) accommodate missing values and ordinal data, since usual estimators such as maximum likelihood treat items or variables as interval data and assume that these indicators are normally distributed (16). In Likert-type scales, however, items are ordinal variables; in these cases, factor analyses using RWLS-type estimators tend to estimate the number of factors underlying the data more accurately and produce more consistent parametric estimates of factor loadings and correlations between factors (17). The following model fit indices were considered: Comparative Fit Index (CFI > 0.90), Tucker-Lewis Index (TLI > 0.90), Standardized Root Mean Square Residual (SRMR), and Root Mean Square Error of Approximation (RMSEA), the latter two with optimal (reference) values near or < 0.08. Other indices, such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), were used to compare different hypotheses of theoretical models, with lower values indicating better fit. The reliability of each factor was estimated using Cronbach's alpha coefficient, with values > 0.6 being considered adequate (18).
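As an illustration of the reliability estimate, the sketch below computes Cronbach's alpha from its standard definition (ratio of summed item variances to total-score variance). It is not the code used in the study, which relied on SPSS, Mplus, and R; the function name and the toy ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings for 5 children on 4 items (NOT study data).
toy = [
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 3, 4, 3],
    [4, 4, 4, 4],
    [2, 3, 2, 2],
]
alpha = cronbach_alpha(toy)
print(round(alpha, 3), "adequate" if alpha > 0.6 else "inadequate")
```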
In order to assess the psychometric properties of each SRW item by the item response theory (IRT), Masters' model for partial credit scoring (19) (1978, an extension of the Rasch dichotomous model [1960] for polytomous items) was employed. The partial credit model jointly estimates the respondent's skill level and the difficulty of the items, insofar as both parameters are represented on the same simple linear continuum, in log-odds units (logits); as the items and the estimates of the latent trait are measured by the same metric, the estimate of the respondent's skill will correspond to a likelihood of answer or endorsement of the item category. Its assumptions were tested by: 1) factor analysis to confirm the one-dimensional nature of the instrument; 2) monotonicity (principle by which the likelihood of endorsing a particular item category increases as the participant's latent variable increases); and 3) local independence (the test items are independent of each other, or do not influence each other in leading to the answer).
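To make the partial credit model more concrete, the following sketch computes the probability of each response category for a given latent trait level and a set of item step difficulties, all in logits. The function name and the step values are assumptions chosen for illustration; they are not the estimates reported in Table 2.

```python
import math


def pcm_probs(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: latent trait level in logits.
    thresholds: step difficulties of the item in logits.
    Returns one probability per category (len(thresholds) + 1 values).
    """
    # Cumulative sums of (theta - delta_j); the lowest category has sum 0.
    cum = [0.0]
    for delta in thresholds:
        cum.append(cum[-1] + (theta - delta))
    peak = max(cum)  # subtract the maximum for numerical stability
    weights = [math.exp(c - peak) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical step difficulties for one four-category item.
steps = [-1.0, 0.0, 1.0]
for theta in (-2.0, 0.0, 2.0):
    print(theta, [round(p, 3) for p in pcm_probs(theta, steps)])
```

With ordered steps such as these, the expected category rises as the latent trait increases, which is the monotonicity assumption described above.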
IRT was also used to compute infit and outfit statistics for the items. For this study, values ranging from 0.5 to 1.5 were considered; according to Wright and Linacre (20), these infit and outfit limits provide productive measurement parameters. Statistical significance (by the chi-square test) was used as a tiebreaker criterion for the two goodness-of-fit measures. For example, if any item presented unsatisfactory infit and satisfactory outfit values, comparison of the difference between the model-predicted values of each item and the actual empirical values collected was performed using the chi-square test.
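The sketch below shows how infit and outfit mean squares are conventionally obtained for a single item under the partial credit model: outfit is the unweighted mean of squared standardized residuals, and infit weights the squared residuals by the model variance. This is a simplified illustration under the assumption that trait estimates and step difficulties are already known; it is not the routine implemented in the packages used for the study, and responses are coded from 0 (so an SRW answer scored 1 to 4 would first have 1 subtracted).

```python
import math


def pcm_probs(theta, thresholds):
    """Partial credit model category probabilities (logit scale)."""
    cum = [0.0]
    for delta in thresholds:
        cum.append(cum[-1] + (theta - delta))
    peak = max(cum)
    weights = [math.exp(c - peak) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]


def item_fit(responses, thetas, thresholds):
    """Return (infit, outfit) mean squares for one item.

    responses: observed categories coded 0..m, one per respondent.
    thetas: latent trait estimates in logits, one per respondent.
    thresholds: the item's step difficulties in logits.
    """
    sq_resid, variances, z_sq = [], [], []
    for x, theta in zip(responses, thetas):
        probs = pcm_probs(theta, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        var = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        sq_resid.append((x - expected) ** 2)
        variances.append(var)
        z_sq.append((x - expected) ** 2 / var)
    infit = sum(sq_resid) / sum(variances)  # information-weighted mean square
    outfit = sum(z_sq) / len(z_sq)          # unweighted mean square
    return infit, outfit


def productive(value, low=0.5, high=1.5):
    """True when a mean square lies in the 0.5 to 1.5 range adopted here."""
    return low <= value <= high


# Hypothetical responses, trait estimates, and steps (NOT study data).
infit, outfit = item_fit([0, 1, 2, 3, 2], [-2.0, -0.5, 0.3, 2.1, 0.9], [-1.0, 0.0, 1.0])
print(productive(infit), productive(outfit))
```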
Spearman correlations were calculated to analyze convergent validity. Finally, temporal stability was analyzed by means of test-retest of the teachers' responses to the scale after a 2-to-4-month interval. For an instrument to be considered reliable and temporally stable, these correlation coefficients must be equal to or greater than 0.8 (12) . Analyses were conducted in the Statistical Package for the Social Sciences (SPSS), Mplus (21) and R (22) software environments, including the psych (23) and mirt (24) packages for R.
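For illustration, Spearman's coefficient can be computed as the Pearson correlation of average ranks, and the 0.8 criterion for temporal stability applied to the result. The sketch below uses only the Python standard library; the function names and the test-retest totals are hypothetical and do not reproduce the study's analysis.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical SRW totals at test and retest (NOT study data).
test_scores = [16, 22, 30, 41, 58, 60]
retest_scores = [17, 20, 33, 40, 61, 57]
rho = spearman(test_scores, retest_scores)
print(round(rho, 2), "stable" if rho >= 0.8 else "not stable")
```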

RESULTS
Given the many different analyses conducted (factor analysis, IRT, convergent validity, test-retest), the results will be presented in separate sections below.
Analysis of internal consistency (Supplementary Table 02 of the supplementary material) showed high Cronbach's alpha coefficients (0.968), indicating a high degree of covariance between the items on the scale (the lowest alpha was 0.9643, and the highest, 0.9678).

Item response theory (IRT): sample-independent evidence
The results of IRT (Table 2) show that the items were monotonic, indicating that variability is linked to a single construct (reading and writing; values equal to or greater than 0.6). The borderline value of item 15 (0.59) was disregarded, as the infit and outfit statistics were good. On residual analysis, items 1 and 10 showed borderline misfit on both infit and outfit measures (item 1: outfit = 0.48, infit = 0.53, χ² = 0.04; item 10: outfit = 1.64, infit = 1.28, χ² = 0.04). Furthermore, the thresholds of item 11 (never-rarely = -0.90; rarely-sometimes = 0.63; sometimes-often/always = -0.12) did not present an adequate relationship between the frequency of symptom manifestations and the respondents' skill as calculated by IRT.
The scale was able to measure the reading and writing ability of children from the 10th percentile upward (Table 3). By overlaying the difficulty parameters (Table 2) on the skill curve (Figure 1, supplementary material), it can be observed (Table 3) that the most informative part of the scale lies between the 10th and 90th percentiles, i.e., between scores 16 and 58.

Convergent validity: relationship between the SRW and reading and writing tests
Spearman's correlation coefficients were moderate (≥0.40 to <0.60) for 11 of the 17 external variables (Table 4).

Test-retest reliability: stability of evaluations made with the SRW over time
Assessment of temporal stability was performed by correlating the latent trait estimates of the first and second data collections (r_s = 0.80, p < 0.0001). There were no significant differences between the mean estimates obtained at the two time points (mean difference = -0.001, t(29) = -0.025, p = 0.98).
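The comparison of means can be sketched as a paired t statistic, assuming (the text does not state it explicitly) that a paired test was used; the 29 degrees of freedom reported above are consistent with 30 paired observations. The function name and the estimates below are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired observations."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical latent trait estimates at test and retest (NOT study data).
time1 = [-1.2, -0.4, 0.1, 0.8, 1.5, 2.3]
time2 = [-1.0, -0.6, 0.3, 0.7, 1.4, 2.4]
t_value, df = paired_t(time1, time2)
print(round(t_value, 3), df)
```

A t value close to zero, as reported in the text, indicates that the mean estimate did not shift between the two assessments.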

DISCUSSION
The results of factor analysis indicate that the scale assesses a single construct, i.e., the reading and writing aspects measured in this test behaved as a single skill. This suggests that other cognitive skills, such as executive function and attention, do not substantially interfere with the findings of the scale.
It is well known that reading and writing involve distinct cognitive processes, each with its own peculiarities (27) , and can even be learned separately. Therefore, we believed that the SRW would be composed of two skill constructs. However, because these two skills are highly interdependent (15) , the items on the scale were unable to measure their distinct features, at least in third-graders.
Analysis of the internal structure of the scale allowed us to obtain an indicator of the reliability of the scale and of the symptoms it investigates (11) . However, it is worth noting that such high Cronbach's alpha coefficients as found in this analysis can represent item redundancy, i.e., the presence of very similar items on the scale. To investigate this issue, we performed IRT analyses, which indicate the level of skill that a given item evaluates (11,12) without overlap.
First, the results of IRT confirmed that the SRW evaluates a single construct (monotonicity values equal to or greater than 0.6). Analysis of residuals indicated that items 1 (Takes longer than peers to read words) and 10 (Is better at telling a story aloud than at writing it down) showed misfit of outfit and infit statistics. However, due to the clinical value of these items and because the misfits were considered borderline, both items were retained. Analysis of symptom frequency (Never-Rarely; Rarely-Sometimes; Sometimes-Often/Always) revealed a discrepancy between the frequency of symptom manifestation and respondent skill as calculated by the IRT for item 11 (Takes longer than peers when copying (e.g., from the blackboard)). This may indicate that more than one variable, such as attention, interferes with the process of responding to this item (28). Nevertheless, exclusion of this item was deemed unnecessary, since all other measures resulting from the item showed good values.
One more measure of reliability of the scale was obtained by IRT analysis. By overlaying the difficulty parameters of the items (Table 2) on the information curve (Figure 1, supplementary material), we found that evaluation was most accurate in the "intermediate-high", "intermediate", and "intermediate-low" skill ranges. The analysis also showed that there was no overestimation and little underestimation of skill ranges as measured by the SRW.
As shown in Table 3, percentiles equal to or less than 10% and those equal to or greater than 90% were not in the most informative region of the IRT information curve (Figure 1, supplementary material); therefore, the level of ability of children in these ranges is poorly assessed by the scale. Notably, the scale is unable to detect skill differences up to the 10th percentile, with an IRT value of -3.747. This result suggests that, above a certain degree of skill, all students would be classified in the "Never" category. The analysis also suggests that those students (n = 9) who obtain scores equal to or greater than 58 points (90th percentile) should be referred for diagnostic evaluation due to their degree of difficulty reading and writing. These children would be at risk for DD. The percentage of scores in this range (approximately 9% of the sample) corroborates the prevalence of DD reported in different countries (5-10%) (29).
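The referral rule and the proportion discussed above can be expressed in a few lines. The function name `at_risk` is a hypothetical helper; the cut-off of 58 points and the counts (9 of 95 children) are those reported in the text.

```python
CUTOFF = 58  # score at the 90th percentile, as reported in the text


def at_risk(total_score, cutoff=CUTOFF):
    """Flag a child for diagnostic referral when the SRW total reaches the cut-off."""
    return total_score >= cutoff


flagged, sample = 9, 95  # children at or above the cut-off, final sample size
share = 100 * flagged / sample
print(f"{share:.1f}% of the sample at or above the cut-off")
print(at_risk(57), at_risk(58))
```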
Regarding convergent validity, 11 of the 17 variables showed moderate correlation. There is no consensus in the literature as to the appropriate magnitude of correlation for convergent validity (30) . Urbina (11) notes simply that the correlation must be strong, while DeVon et al. (30) , in their review, indicate that values higher than 0.50 are infrequent, as it is often impossible to find a validated task with the same specificities as the construct of interest to perform the correlation.
Regarding the tests performed, we must make note of a problem with the comparison criteria used for the present study. There is no gold-standard instrument for assessment of reading and writing in Brazil. The most widely used assessment tool, cited in 478 studies (Google Scholar, 2016), is the School Performance Test (Teste de Desempenho Escolar, TDE) (31) . However, the version available at the time of the study was constructed more than 20 years ago, and is now outdated (32) . None of the tests used in this study have been validated, and only one (the Balanced Dictation task) has had norms described (15) .
Based on the arguments advanced by DeVon et al. (30) , we believe that SRW results have an important association with actual performance on reading and writing tasks. Convergent validity had to be assessed with schools as the unit of analysis, as there were major differences in the average performance on reading and writing tasks (7) across institutions. Because of this variation, given a student who made 60 spelling errors on the Balanced Dictation task (15) , teachers from different schools would probably score this same student differently on the corresponding SRW item.
The differences found in average reading and writing performance scores in this research may be largely related to differences in methodology and syllabus across the sampled schools. The Brazilian National Curriculum Guidelines for Basic Education (33) are limited to methodological principles (interdisciplinary and problem-based learning), without specifying what they are or how they should be worked on. It is thus up to each teacher to choose the best teaching methodology for their group; therefore, strategies for presenting content to the class may vary from educator to educator, thus leading to differences in student performance.
The SRW investigates reading and writing, skills that involve different cognitive processes but are interdependent, and can thus be compared to a battery of tests. For instance, all comprehension tasks are essentially related only to item 7 of the scale, whereas those related to the Balanced Dictation task have a more intrinsic relationship with item 8. Although factor analysis indicates that reading performance and writing performance correlate strongly with one another, to the extent of being considered representative of a single factor, the greater specificity of individual tasks may have decreased the correlation strength, as observed when comparing test batteries to an isolated task (11) .
Only one variable was uncorrelated with the scale: the number of errors made when copying. We presume this occurred because children with persistent difficulties, being aware of their problem, create strategies to avoid mistakes. Thus, there is no impact on accuracy, but rather on the speed with which they complete the task.
The weakest correlation was that of reading speed. As there are no parameters for assessment of reading fluency in school settings (34) , this evaluation is entirely subjective. The low correlation with comprehension scores can be a reflection of conceptual flaws about comprehension and of the screener instrument. In this line, studies have shown gaps in teacher knowledge regarding the processes that underlie the development of reading (35) .
The subjective nature of teachers' perceptions of differences between students may also be associated with the strength of the correlations found in this study. The strength of correlation varied widely across institutions, even ranging from positive to negative. Teacher training and seniority may be directly involved in this difference between institutions, as well as other social and demographic variables of schools.
Finally, regarding the temporal stability (test-retest reliability) coefficient, optimal values are generally defined as those equal to or greater than 0.90 (11) ; however, values as low as 0.80 are considered acceptable (12) . Several factors may explain subpar values.
Issues such as the time elapsed between assessments and a potential decrease in participant motivation when retaking a test interfere with this correlation (12) . An interval of 15 to 30 days between measurements is recommended. However, returns for the retest step were only received 2 to 4 months after the initial evaluation; reasons included a delay in delivery, as initially only one teacher had agreed to participate, which also appears to indicate a reduction in motivation among the sample of teachers.

CONCLUSION
The processes described in this article provide evidence of the validity of the SRW screening instrument (Appendix A), according to the principles set forth by the American Psychological Association and the Conselho Federal de Psicologia (36) . Although the SRW was developed to assess the reading and writing skills of students from the first to the fifth grade, assessment of its validity was restricted to third-graders.
As noted in the introduction, this was a deliberate choice, considering the diagnostic criteria for DD and the provisions of the Pacto Nacional pela Alfabetização na Idade Certa (PNAIC, National Pact for Literacy at the Right Age), at the time of assessment, and of the current BNCC. However, in 2019, conceptual changes led to a major update of the Brazilian National Literacy Policy (37) . The new policy aims to ensure that children are able to read and write simple texts by completion of the second grade of elementary school. This new concept in no way invalidates the present study or the SRW. Its items continue to represent the symptoms of DD, and this reconceptualization does not require any changes to the statistical analyses, especially those referring to the SRW items and their results, which were shown to correlate with performance on reading and writing tasks. In addition, as previously noted, the BNCC continues to regard the third grade as a literacy milestone, and it is in the third grade that the national literacy assessment is carried out.
By demonstrating that the proposed screener actually measures what it sets out to measure, this study fills an important gap in the field. The SRW provides physicians, speech therapists, psychologists, and educators with a tool which yields reliable evidence for the identification of students at risk of DD. The SRW can also be used in research settings, by investigators who wish to select third-graders with and/or without impairments in the development of reading and writing skills for study samples. Finally, the scale can serve as a population-level screening instrument for research purposes.