Evidence of validity and reliability of a phonological assessment tool

Purpose: To present evidence of the validity and reliability of a phonological assessment tool developed to assess the phonological inventory of Brazilian Portuguese. Methods: The study included 866 children aged between 3 years and 8 years 11 months, divided into three groups: typical, control and clinical. Participants were evaluated using phonological assessment software, which prompted the spontaneous naming of a series of images. The children’s responses were audio recorded and transcribed at the time of the assessment, using the software itself. Cronbach’s alpha coefficient was used to evaluate the internal consistency of the instrument for reliability and validity purposes. Criterion validity was examined by comparing the performance of different groups using Student’s t-test for independent samples. Intra- and inter-rater agreement were investigated using Kendall’s tau. Results were considered significant at p ≤ 0.05. Results: The present study provided evidence of validity and reliability (internal consistency) for this phonological assessment tool, confirming the reliability of its items and demonstrating excellent agreement rates between examiners regarding its scoring (intra- and inter-rater reliability). The criterion validity assessment demonstrated that the control group outperformed the clinical group across all phonemes, showing that test scores were successful in identifying children with speech sound disorders (phonological disorders). Conclusion: The present findings provide strong evidence of the validity and reliability of this phonological assessment tool.


INTRODUCTION
Speech sound disorders can be caused by various etiologies, and lead to impairments at several levels of speech production, including linguistics/phonology and/or motor planning (1-4). Speech pathologists face the constant challenge of correctly diagnosing and distinguishing between these conditions in order to plan an effective intervention.
The present study will discuss the phonological aspects of speech sound disorders, known as phonological deviations, which are among the most prevalent alterations in children (4,5) and a major focus of scientific research (4,6,7). Phonological deviations are characterized by the linguistic disorganization of the phoneme inventory. Children with these conditions may omit or replace certain phonemes, especially consonants and consonant clusters, at an age where such behaviors should no longer occur (8,9). Phonological deviations can be evaluated and diagnosed using assessment instruments such as the Goldman-Fristoe Test of Articulation 2 (GFTA-2) (10) and the Clinical Assessment of Articulation and Phonology (CAAP) (11). In Brazil, the instruments most frequently used for this purpose are the Children's Phonological Assessment (Avaliação Fonológica da Criança - AFC) (12) and the Child Language Test - Phonology (ABFW - Teste de Linguagem Infantil - Fonologia) (13). Both instruments are elaborate and quite comprehensive, and available for use in both clinical practice and research settings in speech pathology. However, neither has undergone psychometric evaluation in order to examine their validity and reliability in the assessment of Brazilian Portuguese (BP)-speaking children. The importance of assessing the validity and reliability of diagnostic instruments is emphasized in the international literature (14,15). In fact, such instruments should only be made available for general use after undergoing psychometric evaluation.
The development of any assessment instrument should ensure that it measures what it is intended to measure, and that its results reflect a given ability with no influence from any variables other than those stated in the aims of the instrument (2,10,14,15). This can be ensured by submitting the instrument to validity and reliability testing.
Validity refers to the precision of any inferences drawn from the scores of a given instrument (2). As such, the evidence of validity for an assessment instrument determines the extent to which it measures what it is intended to measure (10,16). This is an especially important feature of diagnostic instruments, since the precision of scores obtained from these measures will contribute directly to diagnostic decision-making.
Like validity, reliability is crucial for test development, as it speaks to the degree to which an instrument is susceptible to different sources of error (e.g., variation between raters or poor intra-rater consistency). These issues can also be a threat to test validity (2,17). As such, it is important to analyze the evidence of intra- and inter-rater reliability for the items in an instrument. This procedure will show the extent to which the test scores are susceptible to error.
The international literature has shown growing interest in the search for instruments with established psychometric properties in the assessment of speech sound disorders (2,14,15,17-19). Some of these studies (12,13,15) discuss multiple phonological assessment instruments whose validity and reliability have already been examined. Yet this is still a recent development in Brazil, where none of the available phonological assessment instruments have undergone psychometric evaluation (14,20,21).
This constitutes a limitation in phonological assessment in BP, which compromises the diagnosis and treatment planning for children with speech disorders, especially in the case of inexperienced examiners. Therefore, the development of assessment protocols with robust psychometric properties is crucial for clinical practice and research in speech pathology (14,15,20). In addition to psychometric studies, information technology may also contribute to phonological assessment in speech pathology. The use of computer software for phonological assessment may be both more appealing to children, and advantageous for practitioners, as it makes for faster and simpler test administration.
In light of these observations and the growing concerns regarding the availability of valid and reliable instruments for clinical and research purposes, this study aimed to evaluate the validity and reliability of the Phonological Assessment Instrument INFONO, developed specifically to evaluate phonemes in BP.

METHODS
Participants
The sample consisted of children aged between 3 years and 8 years 11 months, attending 12 different schools (8 public and 4 private) in southern Brazil. Of the 1448 children invited, 1076 (73%) were authorized to take part in the study. Participants who were bilingual, had subjective or suspected hearing loss, neurological and/or psychological conditions, intellectual disability, a diagnosis of autism or Down's syndrome, or previous speech therapy were excluded from the sample. This information was obtained from screening questionnaires administered to parents/guardians and teachers. Children with signs of any alterations which could lead to speech impairments (e.g., anterior open bite, lisp, tongue thrust, probable language and/or vocabulary deficits) were also excluded from the study. These alterations were detected during brief informal conversations with each child, where they were asked about their school, their age, what day it was, what they liked to play, what they liked to eat, what they were currently doing in class, if they liked animals, which animals they liked, etc., or during the administration of the INFONO. These criteria led to the exclusion of 210 children from the study.
The final sample therefore comprised n = 866 participants, divided into two groups: a typically developing group and an atypical group. The former was composed of children with typical phonological development (n = 733), who were able to produce all phonemes expected for their age (e.g., a 3-year-old who could not pronounce the sound /r/ would be included in this participant group). The atypical group (n = 133), on the other hand, included children with atypical phonological development, who displayed omissions or substitutions beyond the age at which these phenomena were expected to cease. Examples include a 3-year-old who showed alterations in the production of the phoneme /b/ or a 5-year-old who was unable to pronounce the /r/ sound. This classification was carried out by the examining speech pathologist, based on her clinical experience in assessment and treatment, and on the national literature (22,23) on phonological acquisition.
A matched subsample was then formed from both typically developing and atypical participants for the comparison between control and clinical groups. The control group (n = 228) was drawn from the typically developing participants and the clinical group (n = 114) from the atypical participants, and the two were matched by age group, type of school and gender. This subsample comprised a total of 342 children (clinical and control groups). The clinical and control groups did not significantly differ in terms of their mean age (F = 0.000; p = 0.977), gender distribution (χ² = 0.000; p = 1.000) or school type (χ² = 0.000; p = 1.000). Sample characteristics are described in Chart 1.

Instrument
The INFONO is a software package developed for assessment purposes, to be used by speech pathologists with the help of a computer. The instrument was developed in four stages: literature review/stimulus selection (24); analysis by expert panels (24); analysis by non-expert users (24); and pilot study (25). These stages were crucial for the development of the INFONO software. The developmental process has already been completed, and the INFONO is now undergoing final psychometric testing before being made available for the general use of speech therapists. The international literature (14,15) has suggested that psychometric studies are crucial for all assessment instruments involved in diagnostic decision-making and should only be made publicly available after these studies are completed.
The INFONO software allows for data collection in the form of audio recordings for later review and transcription. Data are collected via spontaneous naming of test stimuli. After each item, the examiner must select the transcription which corresponds to the child's response from a list of options provided by the software. After the evaluation is complete, these selections can be reviewed and compared to the audio recordings. The list of possible transcriptions was developed based on the phonological processes observed in children with both typical development and phonological deviations. If none of the options corresponds to the child's response, the examiner can enter a new transcription using a keyboard provided by the software, noting any distortions present in the child's speech.
The test stimuli for spontaneous naming consist of 84 colorful animated drawings in the form of GIFs (animations formed by merging multiple GIF images into a single file, creating the illusion of movement), representing the target words. The examiner is also given a prompt question which they can ask the child during the assessment to facilitate the identification and production of the target word, such as: "He uses the pencil to...?" (write), "What animal is this?" (dog), etc. Other questions can also be used to facilitate the production of the target word. It is important that the child produces all target words in the test, since they contain all phonemes of Brazilian Portuguese in all possible word and syllable positions.
After completing the assessment and transcriptions, the examiner clicks on a button to generate the results. The software then provides the following data: analysis of test results (list of target words and phonetic transcriptions of the child's responses); contrast analysis (number of correct answers, omissions and substitutions for every phoneme in Brazilian Portuguese), in order to describe the child's phonetic inventory (presence or absence of different phonemes) and phonological inventory (percentage of errors, omissions and substitutions for each phoneme and onset clusters [OC]); analysis of distinguishing features; analysis of phonological processes; and severity of the phonological disorder.
The instrument does not evaluate vowels, since these are acquired early in development and are much less likely to be affected in Brazilian Portuguese speakers with phonological disorders.

Procedures
This study was conducted according to all relevant ethical guidelines, including approval by the Research Ethics Committee of the Universidade Federal de Santa Maria (UFSM), under protocol number 23081.005433/2011-65. All parents and guardians provided written consent to their children's participation in the study. The children, in turn, were only evaluated after assenting to participate in the assessment.
Firstly, a brief conversation was carried out with each child to detect any impairments which might justify their exclusion from the study. Both this procedure and the administration of the INFONO were conducted by three doctoral students in speech pathology as well as a speech pathologist with over 10 years' experience in this area of study. All examiners had previous training on how to administer and use the INFONO in a research context. After the initial conversation, participants were administered the INFONO. Data were collected in individual testing sessions conducted in a location provided by the children's schools. The administration of the INFONO takes approximately 20 minutes and was carried out according to recommended standard procedures. Children's responses to the test were recorded and transcribed during the assessment using the software itself.
The data provided by the software regarding the phonological inventory of each child were later used to create a database in SPSS, where psychometric analyses were performed in order to investigate the validity and reliability of the scores provided by the INFONO, as follows:

(a) Validity
The precision of INFONO scores was determined based on its internal consistency and criterion validity (i.e., distinction between clinical and control groups). Internal consistency refers to the extent to which the test items reflect the intended purpose of the instrument (14). Criterion validity speaks to the effectiveness of the instrument at describing the performance of a specific group of individuals (19). In the present study, internal consistency was calculated using scores from the typically developing group (n = 733), while criterion validity was examined based on the control group.

(b) Reliability
Reliability refers to the consistency of test scores, ensuring they do not change when administered by the same examiner at different points in time (intra-rater reliability) or by different raters altogether (inter-rater reliability). These procedures reflect the extent to which the results of an instrument are reliable. Three sources of reliability data were examined in the present study: internal consistency, intra-rater reliability and inter-rater reliability.
Internal consistency was calculated as described in the validity section, since consistency (or stability of measurement) is also crucial for test validity. Intra-rater reliability was evaluated by asking examiners to listen to the audio recordings of their own test sessions and manually transcribe participants' responses (without the use of the software). The results of this procedure were then analyzed independently of the first assessment. This was done for a subgroup of 77 children (approximately 10% of the sample) assessed by the same examiner. These participants were randomly selected at least a month after their original evaluation.
Similarly, inter-rater reliability was analyzed using the audio recordings of 120 children (approximately 15% of the sample). These were examined by two final-year undergraduate students, both of whom had experience in the area, as they worked as research assistants in the laboratory where this study was conducted. The new transcriptions and analyses were performed independently of the first evaluation.

Data analysis
Data were analyzed using SPSS, version 22, for Windows. Internal consistency, which is relevant to both validity and reliability, was measured using Cronbach's alpha. This method was chosen because participants completed a single instrument on only one occasion. The Cronbach's alpha coefficient describes the correlation between each test item and the remaining items on the instrument, or between the item and the total score on the test. This technique provides a coefficient that typically ranges from 0 to 1, where higher scores are indicative of greater reliability. Values greater than 0.7 reflect adequate validity and reliability.
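To make the coefficient concrete, the sketch below computes Cronbach's alpha from a small matrix of item scores using the usual variance-based formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). It is only an illustration and not the SPSS procedure used in this study; the function name `cronbach_alpha` and the toy scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per child, one column per item)."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("at least two items are needed")
    # Sample variance of each item across children
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each child's total score
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 1 = target phoneme produced correctly, 0 = not produced
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(round(cronbach_alpha(toy_scores), 3))
```

When every child obtains the same total score, the denominator is zero and the coefficient cannot be computed, which is why the sketch raises an error in that case.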
Criterion validity was evaluated by comparing the performance of typically and atypically developing children (control and clinical groups), and by discriminant analysis. Firstly, performance was compared between the clinical and control groups using Student's t-test for independent samples. Then, a stepwise discriminant analysis using Wilks' lambda was conducted to verify the ability of the INFONO to differentiate between typically and atypically developing children. Assumptions of normality and homogeneity of variance-covariance matrices were examined using the Shapiro-Wilk and Box's M tests, respectively.
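The group comparison can be sketched as follows. This is a minimal pooled-variance (Student's) t statistic for two independent samples, written for illustration rather than as a reproduction of the SPSS output; the function name `student_t` and the accuracy values are hypothetical. The sketch returns `None` when both groups have zero variance, a situation in which the statistic is undefined.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Return (t, degrees of freedom) for two independent samples, or None if undefined."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = sqrt(pooled * (1 / na + 1 / nb))
    if se == 0:
        return None  # both groups are constant, so t cannot be computed
    return (mean(group_a) - mean(group_b)) / se, df


# Hypothetical production accuracy (proportion correct) for one phoneme
control = [1.0, 0.9, 1.0, 0.95, 1.0]
clinical = [0.6, 0.7, 0.5, 0.8, 0.65]
print(student_t(control, clinical))
```

A p-value would then be obtained from the t distribution with the returned degrees of freedom, for example with a statistics package.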
The reliability of the INFONO was also evaluated using intra- and inter-rater agreement as measured by Kendall's tau, where values greater than 0.6 suggest adequate agreement, and therefore, satisfactory reliability. Intra- and inter-rater reliability was calculated by dividing phonemes according to the following positions: general onset (GO); initial onset (IO); medial onset (MO); general coda (GC); medial coda (MC); final coda (FC). Results were considered significant at p ≤ 0.05.
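For readers unfamiliar with this agreement measure, the sketch below computes Kendall's tau by counting concordant and discordant pairs of observations. The text does not state which variant of tau was used, so the tie-corrected tau-b is assumed here; the function name `kendall_tau_b` and the ratings are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long lists of scores (None if undefined)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # pair tied on both ratings: contributes to neither term
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both raters order the pair the same way
        else:
            discordant += 1  # raters order the pair in opposite ways
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        return None  # at least one rater gave identical scores throughout
    return (concordant - discordant) / denom


# Hypothetical number of correct productions per child, scored on two occasions
first_rating = [10, 8, 9, 4, 7, 10]
second_rating = [10, 7, 9, 5, 7, 9]
print(kendall_tau_b(first_rating, second_rating))
```

Under the criterion stated above, a value greater than 0.6 from such a comparison would be read as adequate agreement.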

RESULTS
The scores provided by the INFONO were examined for evidence of validity and reliability. The results of the internal consistency analysis are shown in Table 1, which contains the Cronbach's alpha coefficients for each phoneme and age group assessed by the instrument.
These findings showed high internal consistency (>0.7) across all age groups, with values ranging from 0.713 to 0.922 (median = 0.816). These results indicate that the items used to evaluate each phoneme in BP have acceptable validity and reliability within each age group.
Findings pertaining to intra-rater reliability are shown in Table 2, which displays the agreement rates for correct production of phonemes and consonant clusters in the INFONO.
Intra-rater reliability for these scores ranged from 0.622 to 1.0. Most phonemes were associated with excellent agreement rates across different word and syllable positions, with values often greater than 0.8. Agreement was considered "adequate" only for /s/ in GC, MC and FC; /k/ in GO and MO; /m/ in GO and IO; /r/ in MC; plosive+/r/ in GO and IO; and fricative+/r/ in GO, IO and MO. When separated according to syllable and word positions, mean agreement within raters was considered excellent, with values ranging from 0.822 to 0.944.
Results of inter-rater reliability analyses are shown in Table 3, which displays agreement rates for the correct production of phonemes and consonant clusters in Brazilian Portuguese as measured by the INFONO. Inter-rater agreement ranged from 0.663 to 1.0. The lowest agreement rates were observed for /d/ in GO, /k/ in GO and MO, /m/ in GO and IO, and /l/ in GO, which were nevertheless classified as "good." All other values were greater than 0.826, and therefore indicative of excellent inter-rater agreement. When examined separately for each word and syllable position, mean agreement between raters was excellent, with values of at least 0.93.
Evidence of validity for the INFONO can be found in Table 4, which compares the production of all phonemes in GO, GC and OC (means and standard deviations) between the clinical and control groups.
The results show that mean production accuracy was lower for the clinical group than for control participants across all phonemes examined. The production of individual phonemes differed significantly between the clinical and control groups in most cases, except for /p/, /d/ and /ɲ/ in GO and /n/ in GC; /p/ in IO; /b/, /d/ and /m/ in MO; /n/ in MC; and /l/ and /s/ in FC. These findings show that INFONO scores can differentiate between children with impaired vs. typical phonological development.
In order to analyze which of these phonemes in simple onset, coda and/or OC were most useful in differentiating between the clinical and control groups, a stepwise discriminant analysis was performed. The procedure produced a discriminant function based on 18 phonemes and OC, which accounted for 100% of the variation between groups (Λ = 0.494; χ²(18) = 602.403; p ≤ 0.001). Table 5 shows the standardized coefficients and phonemes included in the discriminant function. These phonemes made the most important contributions to the differentiation of typically and atypically developing children.
The discriminant function classification rates are shown in Chart 2. In 91.7% of cases, the discriminant function classification agreed with the original assessment, which speaks to the efficacy of the INFONO at differentiating between children with typical and atypical phonological development.

DISCUSSION
The present study achieved its original aims and was able to demonstrate the validity and reliability of the INFONO, which has been confirmed as a useful tool for the phonological assessment of children with suspected phonological disorders.
Validity and reliability analyses showed that the instrument has adequate internal consistency as demonstrated by its Cronbach's alpha coefficient (median = 0.816). This finding suggests that the items in the INFONO which evaluate phonemes in Brazilian Portuguese have satisfactory reliability. The internal consistency of other international instruments (10,26) used in phonological assessment has also been evaluated using Cronbach's alpha coefficient. The internal consistency of the GFTA-2 (10), for instance, ranged from 0.85 to 0.98 across genders and age groups, and is therefore considered adequate. The Test para Evaluar Procesos de Simplificación Fonológica - TEPROSIF-R (26) and the CAAP (11), both of which are used for phonological assessment, have also shown high internal consistency (0.90) according to the literature.
A test is considered reliable when all its items or tasks provide similar measures of performance or evaluate the same domain (10). In the present study, the internal consistency of the INFONO was not examined by age group, gender or school type, due to insufficient variability in participant scores.
The AFC (12) and ABFW-Fonologia (13) evaluate phonology in Brazilian Portuguese and are the most widely used instruments in Brazilian research (7,27-29). However, the validity and reliability of these instruments have not been evaluated. Therefore, the contribution of this study to the development of an assessment software package which has been subjected to rigorous validity and reliability testing is of particular relevance to the advancement of speech pathology and the evaluation of children with suspected phonological impairments. The use of standardized instruments with evidence of both validity and reliability increases the accuracy with which the presence or absence of a disorder can be determined.
There are several factors which can compromise the precision of an instrument, and as a result, influence diagnoses (18) and treatment. As such, it is important to evaluate the validity and reliability of assessment instruments, to ensure they measure what they intend to measure, and that their scores reflect performance in the area of interest (2,10,14,15). For instance, the results of a phonological assessment instrument should allow for the analysis of all phonemes in the target language, across all word and syllable positions, in order to determine the presence or absence of developmental impairments. The validity of an instrument can be affected by the quality and reliability of its scores and by the skills of the examiner.
Reliability can be defined as the consistency of an instrument over time and across changes in test administration or scoring (19). The scores obtained from a reliable instrument can be generalized from the assessment conditions to a wider range of situations. In the present study, reliability was examined using measures of intra- and inter-rater agreement. Intra-rater reliability was measured by comparing two distinct transcriptions of an assessment session carried out by the same examiner at two different time points. In the present study, excellent intra-rater agreement was observed for phonemes in different positions, with values ranging from 0.822 to 0.944. These findings suggest that the ratings provided by an examiner at different time points show an agreement of at least 0.822, which speaks to the reproducibility and diagnostic reliability of INFONO scores.
No studies to date have evaluated the intra-rater reliability of phonological assessment instruments. However, this is an extremely important factor for speech assessment instruments (15), since it ensures the consistency of a measure when administered by the same practitioner across different contexts. In these situations, scores are not expected to change, since instruments should be able to produce consistent measurements regardless of when they are administered. Intra-rater agreement has been examined for an instrument used to evaluate speech apraxia, which yielded values ranging from 0.81 to 0.95 (2), not unlike those displayed by the INFONO in the present study.
The present study also revealed excellent inter-rater reliability for phoneme transcriptions, with agreement values of at least 0.93 for all word and syllable positions. In other words, when two examiners were asked to transcribe the same set of responses, their ratings were highly concordant. Given the subjective nature of this assessment method, and the role of examiners' skills and experience in ensuring its precision, it was especially important to examine the inter-rater reliability of the INFONO in order to confirm its reproducibility and diagnostic reliability. The GFTA-2 (10) has also yielded inter-rater agreement rates ranging from 70 to 100%, with a mean value of 93% or greater, depending on phoneme position. The inter-rater agreement for the CAAP (11) has been calculated at 99%, suggesting that raters are highly consistent in their scoring of this instrument.
The validity procedures, which included a discriminant analysis between the clinical and control groups, provided strong evidence of criterion validity for the INFONO. As expected, mean scores across all phonemes were lower for the clinical than the control group, which suggests that the INFONO is sensitive to between-group differences in phonological development. Significant differences between the groups were observed for most phonemes and onset clusters, except for /p/, /d/ and /ɲ/ in GO and /n/ in GC; /p/ in IO; /b/ and /m/ in MO; /n/ in MC; and /l/ and /s/ in FC. It is possible that these phonemes are acquired and stabilized before the age of three, which is younger than the participants included in the present study. Previous studies (22,23) appear to confirm this hypothesis. The order of phoneme acquisition in Brazilian Portuguese tends to be the following: plosives and nasals, followed by fricatives and finally, liquids (23,30) and complex onsets.

Chart 2.
Results of the discriminant function analysis for the = number of participants
Caption: N = total number of participants; n = number of participants; M = Mean; SD = Standard deviation

Table 1.
Internal consistency of the INFONO for children of different ages
Caption: n = number of participants

Table 4.
Comparison of phoneme production between clinical and control groups
Caption: M = Mean; SD = Standard deviation; n = number of participants; F = Ratio; IO = Initial onset; MO = Medial onset; * = t-test could not be conducted since standard deviations were equal to zero