Vocal quality assessment: methodological approach for a perceptive data analysis

Purpose: to present a methodological approach for interpreting perceptual judgments of vocal quality by a group of evaluators who used the Vocal Profile Analysis Scheme script. Methods: cross-sectional study based on 90 speech samples from 25 female teachers with voice disorders and/or laryngeal changes. Prior to the perceptual judgment, we performed three perceptual tasks to select samples, which were then presented to five evaluators using the Experiment script MFC 3.2 (software PRAAT). Next, we applied a sequence of tests based on successive approaches to inter- and intra-evaluator behavior. Data were treated by statistical analysis (Cochran and Snedecor tests). Results: with respect to the analysis of the evaluators’ performance, and after excluding one evaluator, it was possible to define those that presented the best results, in terms of reliability and proximity of analyses, compared to the most experienced evaluator. The results of the cluster analysis also allowed designing a voice quality profile of the group of speakers studied. Conclusions: the proposal of a methodological approach allowed defining evaluators whose judgments were based on phonetic knowledge, and drawing a vocal quality profile of the group of samples analyzed.


INTRODUCTION
Perceptual voice assessment is one of the oldest and most widely used procedures for assessing and diagnosing voice disorders [1][2][3] . The efficacy of its results depends heavily on the evaluator's experience 1,[4][5][6] . Although it is considered as a gold standard for vocal evaluation, there is constant mention of possible interferences arising from the subjectivity of evaluators, lack of reliability of judgments, variety of evaluation methods, inconsistencies of instruments and lack of standardization of the terminology used 1,4,[7][8][9][10] .
The search for overcoming such limitations is based on different strategies, such as presentation of anchor stimuli, training and calibration of evaluators, repetition of stimuli, application of scripts to randomize the order of presentation of speech samples (programming applicable to open source software), and a statistical approach of perceptual judgments 6,[11][12][13] . In this perspective, among the evaluation instruments available for clinical use, there are few scripts and scales based on theoretical models such as phonetic theory, as is the case of the Vocal Profile Analysis Scheme (VPAS) 14 .
It is worth mentioning that the use of instruments of Phonetics in Speech-Language Pathology clinics has contributed to a detailed identification of speech structures in cases of speech disorders 15,16 , as well as speech control cues in the process of language acquisition of children with and without hearing impairment (HI) from early ages [17][18][19] . In addition, the use of such instruments may offer possibilities for the characterization of language sonority and linguistic variants 20 .
The VPAS script, and its adaptation to Brazilian Portuguese VPAS-PB 21 , details the occurrence of several vocal quality adjustments in phonatory, articulatory and tension areas, as well as vocal dynamics elements (pitch, loudness, use of pauses, speech rate and respiratory support) from the perspective of phonetic theory. The application of the VPAS script results in the voice quality profile of samples. An example could be a sample whose vocal quality profile is characterized by the combination of closed jaw adjustments (level 1), elevated larynx (level 2) and laryngeal hyperfunction (level 2).
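The example profile above can be written down as a small data structure. The following is a minimal sketch, not part of the VPAS-PB script itself: the class name `VoiceQualityProfile`, its fields and the sample identifier are hypothetical, and the only constraint assumed is that a present adjustment has a degree of at least 1.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceQualityProfile:
    """Hypothetical container: adjustment name -> judged degree (level)."""
    sample_id: str
    adjustments: dict = field(default_factory=dict)

    def add(self, name: str, level: int) -> None:
        # Assumption: an adjustment that is present has a degree of 1 or more.
        if level < 1:
            raise ValueError("a present adjustment needs a degree of at least 1")
        self.adjustments[name] = level

    def describe(self) -> str:
        # Adjustments are listed in the order in which they were added.
        return ", ".join(
            f"{name} (level {level})" for name, level in self.adjustments.items()
        )


# The combination given as an example in the text.
profile = VoiceQualityProfile("sample_01")  # hypothetical sample identifier
profile.add("closed jaw", 1)
profile.add("elevated larynx", 2)
profile.add("laryngeal hyperfunction", 2)
print(profile.describe())
```

Keeping each judgment as a name-to-degree mapping makes it straightforward to compare the profiles produced by different evaluators for the same sample.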
Part of the complexity referred to by clinicians in their initial contact with the VPAS script lies in the theoretical principles behind it. The principles of compatibility and interdependence are related to relationships between vocal quality adjustments: the first deals with actions physiologically compatible or incompatible with each other; the second, in turn, focuses on actions physiologically interdependent. A third principle, susceptibility, refers to the relationship between adjustments and segments (vowels and consonants), that is, how vocal quality adjustments affect segments along the speech chain. In this last principle, a segment (vowel and consonant) may be susceptible to the interference of an adjustment, that is, it reflects the degree of vulnerability of segments in relation to adjustments, especially of segments considered as "key" for the detection of vocal quality events 22 . Thus, when adjustments have characteristics not shared by the segment, the latter becomes more susceptible to the influence of the former.
Another aspect to be considered in the field of perceptual evaluation of vocal quality refers to the demand for adoption of a group of examiners/evaluators to address especially the question of subjectivity in perception tests. This issue, which applies to the universe of studies that use the script VPAS, refers to a demand for the establishment of a vocal profile based on judgments of several evaluators. That is, the final result of the evaluation of each vocal sample should be considered in light of the judgments made by evaluators individually, resulting in the definition of a vocal quality adjustment (and its degree of manifestation), which comprises the voice quality profile from a phonetic point of view.
Few studies substantiate a method for analyzing perceptual judgments of vocal quality that combines phonetic models with statistical procedures, makes it possible to estimate which judgments are most similar between evaluators (in pairs), and identifies the evaluator with the greatest reliability on an analysis instrument with a scale of several dimensions. This scarcity justifies the interest of this study. In addition, discussing the principles, procedures and, especially, the possibilities and limitations of applying perceptual analysis to clinical care routines and research environments encourages reflection on the nature and theoretical basis of the evaluation protocols and perceptual descriptions of voice used in scientific and clinical contexts, as well as on their relation to the vocal history of the speaker.
It should also be noted that, in several studies, the description of perceptual data of vocal quality is a step of the analysis and that such findings will often be compared to acoustic and/or physiological data. Thus, the approach that allows perceptual judgment data to result in information that may be analyzed statistically is a current demand.
This study aims to present a methodological approach for the interpretation of perceptual judgments of vocal quality by a group of evaluators who used the script Vocal Profile Analysis Scheme (VPAS) adapted to Brazilian Portuguese 21 .

METHODS
This research was approved by the Ethics Committee on Research with Human Beings of the Federal University of Paraíba, UFPB, under protocol no. 298/2008. The corpus of the study comprised samples from a database containing 54 teachers' voices. Teachers were chosen to participate in this study because they are the voice professionals with the highest incidence of voice disorders, and they frequently seek speech and hearing care 3 . The inclusion criteria were: being a female teacher with voice disorders and laryngeal changes (by perceptual information and otorhinolaryngological diagnosis), and having samples recorded in three speech styles. Based on such criteria, 33 teachers were selected.
The audio recordings were made in the teachers' work environment, during intervals between classes. They consisted in the following tasks: semi-spontaneous speech (interview situation), semi-spontaneous speech (lecture simulation) and reading out loud 23 . The choice for different speech styles (tasks) lies in variations already studied in vocal quality adjustments and vocal dynamics 24 .
The reading out loud task comprised reading a passage of standard text 23 . In addition, a semi-spontaneous speech-interview (SSI) was conducted starting with the question "What factors do you think interfere with the voice? Why?", as proposed by the authors 2 . Finally, lecture simulation consisted in a lecture excerpt with a topic chosen by the teacher (without a specific time limit) following the examiner's request.
Speech samples were recorded in a quiet room using a Plantronics GameCom PRO 1 headset microphone at a distance of approximately 15 cm from the right labial commissure, coupled to an HP Pavillion ZE 4920 CEL M330 1.4G notebook. The software used was SoundForge 7.0 set at a sampling frequency of 22,050 Hz, 16 bits, extension ".wav".
All 99 samples from the 33 teachers were submitted to three perception tasks, performed by different groups of evaluators. From the results of such tasks, we selected a set of samples that became the corpus of this study. This corpus is detailed below. The objectives and methods, as well as the results of each task, are summarized in Figure 1.
After analyzing the results of the three perception tasks, samples of eight teachers were excluded. Thus, the corpus of the perception experiment of vocal quality consisted of 90 speech samples from 25 teachers: 25 readings out loud (RL), 25 semi-spontaneous speech-interviews (SSI), 25 semi-spontaneous lecture simulations (SLS) and 15 replications of some samples to assess the reliability of the evaluators' answers, totaling a 20% randomized sample replication of the corpus 25 . The samples were edited into extracts of approximately 20 seconds taken from the recordings of the three speech tasks.
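The composition of the corpus (75 original samples plus 15 replications, i.e., 20%) can be checked with a short sketch. It only illustrates the arithmetic and one possible way of drawing the replications at random. The speaker identifiers, the seed and the function name `build_corpus` are hypothetical, and the actual selection procedure used in the study is not specified beyond being randomized.

```python
import random

# Speech styles: reading out loud, speech-interview, lecture simulation.
STYLES = ("RL", "SSI", "SLS")


def build_corpus(n_speakers=25, n_replications=15, seed=0):
    """Return (originals, repeats, corpus) for a replication design.

    Assumption: replications are drawn without replacement from the
    original samples; identifiers such as "T01" are placeholders.
    """
    rng = random.Random(seed)
    originals = [
        (f"T{speaker:02d}", style)
        for speaker in range(1, n_speakers + 1)
        for style in STYLES
    ]
    repeats = rng.sample(originals, n_replications)
    corpus = originals + repeats
    rng.shuffle(corpus)  # randomized presentation order
    return originals, repeats, corpus


originals, repeats, corpus = build_corpus()
print(len(originals), len(repeats), len(corpus))
print(len(repeats) / len(originals))
```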
All 90 samples were labeled as statements related to reading out loud (RL), semi-spontaneous-interview (SSI) and semi-spontaneous-lecture simulation (SLS), and analyzed based on a perceptual-auditory point of view (VPAS-PB) using the software PRAAT and the Experiment script MFC 3.2, version 5143 (available at: http://www.fon.hum.uva.nl/praat/). The script Experiment MFC 3.2 was used as a tool to randomize stimuli to be presented to all five evaluators (E1 to E5), who would evaluate vocal quality with a phonetic motivation. In the first screen of the perception experiment, a test instruction was presented. In the other screens, controlled by the evaluator, 90 sound stimuli were presented.
The duration of the experiment corresponded to approximately four hours per evaluator distributed into four sessions on different days, lasting one hour each session. There were intervals (pauses) of five minutes for auditory rest after the presentation of ten samples.
The selection of the group of evaluators was based on expertise in Phonetics and experience in the application of the script VPAS-PB. We decided to select evaluators with different levels of experience and expertise in order to discuss interferences of the variables with the vocal quality evaluation, according to Figure 2.

PT1
Objective: Investigate the influence of the nature of speech samples (RL, SSI and SLS tasks) on the judgments using VPAS-PB.
Method: Selection of 8 speech samples (4 of RL and 4 of SSI) from 4 female teachers, all with vocal complaints; 2 presented an otorhinolaryngological diagnosis of laryngeal changes. These samples were edited, randomized and submitted to 3 evaluators experienced in the use of the VPAS-PB script.
Results: The manifestation of vocal quality adjustments and vocal dynamics elements in the script VPAS-PB varied according to speech task. Impacts on research planning: inclusion of the 3 speech styles (RL, SSI and SLS).

PT2
Objective: Define the duration of the samples to be used in the experiment.
Method: Selection of 4 speech samples from 2 female teachers with laryngeal changes in 2 speech tasks, which were edited into different durations (20 s, 30 s and 40 s). Ten evaluators participated in this experiment and were asked to evaluate the vocal quality (script VPAS-PB) using a specific form.
Results: The duration of samples had no impact on the perceptual judgments of vocal quality. Impacts on research planning: duration of speech samples set to 20 s.

PT3
Objective: Select the recordings with the best audio quality using signal-to-noise ratio (SNR).
Method: Selection of 36 samples: 2 noise levels were added to 24 samples (at the medium and the maximum sound waves). Samples were analyzed by 8 evaluators with varying levels of experience in the use of the script VPAS-PB and by 2 experienced evaluators, considered as reference.
Results: PT3 evidenced a compatibility of answers of evaluators with various degrees of experience, with a slight increase in the number of mistakes proportional to the increase in the noise level of the sample. Impacts on research planning: exclusion of stimuli with a SNR < 2
Caption: RL = reading out loud, SLS = semi-spontaneous lecture simulation, SSI = semi-spontaneous speech-interview, PT1 = Perception task 1, PT2 = Perception task 2, PT3 = Perception task 3.

The initial approach of judgments made by E1 to E5 was based on inter- and intra-evaluator concordance and reliability in the use of the script VPAS-PB and in serial tests, by which scores and a classification were gradually established between evaluators so that the analysis of the most discrepant evaluators was excluded. Then, the tests were reapplied to the remaining group. The Cochran test was used for homogeneity of variances. The Snedecor test was used with 95% confidence levels (in an Excel worksheet). Both tests were used to define intra- and inter-group reliability (including pairs). We noted that all evaluators, but one, presented a significant congruence between them. Thus, the judgments made by the incongruous evaluator were excluded in a blind procedure (i.e., the data analyst did not know the evaluators, having access only to the judgment worksheet). Then, the evaluators were classified according to their intra-judgment reliability scores and their congruence with the other evaluators.
In addition, at the final stage, after analyzing all evaluators, the judgments were compared to the evaluator considered as a reference based on the classification generated by this analysis.
The criteria used for the selection of the reference evaluator were specific training, time of use and expertise on the VPAS script, participation in the workshop on the use of this script, and inter- and intra-judgment congruence and reliability.
Having determined the statistical parameters for the valuation of judgments, the voice quality profile of each sample was designed. The profile was established based on the mean values of data from analyses of evaluators which were congruent with each other, determining a judgment composed of expected values based on univariate statistical analysis (computation of confidence intervals).
The values of distances and relative products of inter-evaluator judgments (in pairs) for perceptual evaluation results allowed estimating the closest judgments among evaluators in pairs.
Relative distance is a measure of relative dissimilarity between evaluators defined by the difference of positions between evaluators, i.e., the difference between the ranking of each evaluator compared two by two. The ranking, in turn, is defined as the ranking position of each evaluator according to an average of correct judgments. Since evaluators were ranked from 1 to 5, the relative distances may assume values ranging from 1 to 4. High values indicate a great dissimilarity among evaluators. The relative product is a composite index constructed by multiplying the ranking values of the evaluators compared two by two by the relative distance between them, as defined in the previous paragraph, indicating the quality of both judgments and their relative proximity. The lower the value of this index, the better the composition formed by two evaluators.
The comparison of intra-evaluator judgment data based on congruence values (index of correct judgments) allowed us to estimate the most experienced and congruent evaluator.

RESULTS
With respect to the analysis of the initial behavior of the evaluators (Cochran test, inter-evaluator analysis; and Snedecor test, inter- and intra-evaluator analysis), the E5 was excluded from the next step (reapplication of the test) precisely because it presented the largest intra- and inter-evaluator variance, less time of use and expertise in the script VPAS. This evaluator also did not participate in the training on the use of the script. The test was reapplied until the two smallest variances within the group, which represented the two evaluators with the greatest reliability in terms of judgment, were reached through the VPAS-PB: E2 and E4.
The correct answer indexes and confidence intervals (upper and lower limits), based on intra-evaluator analysis, are presented in Tables 1 and 2. Correct answers were considered according to comprehensiveness of results in perceptive judgments.
The correct answer indexes and confidence intervals (upper and lower limits), based on inter-evaluator analysis, are presented in Tables 3 and 4. Correct answers were considered according to comprehensiveness of results in perceptive judgments. Table 5 shows the distances and the relative products of inter-evaluator judgments (in pairs) for the results of perceptual judgments.
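The relative distance and the relative product defined in the Methods can be sketched in a few lines. This is one reading of the verbal definition (distance as the absolute difference of ranks; product as the two ranks multiplied by that distance), not a reproduction of the worksheet used in the study. The ranks below are illustrative placeholders, not the values behind Table 5, and the function name `pairwise_indices` is hypothetical.

```python
from itertools import combinations


def pairwise_indices(ranking):
    """Return (pair, relative distance, relative product), best pairs first.

    ranking maps an evaluator label to its rank (1 = best).
    Assumed reading of the definition:
      distance = |rank_a - rank_b|
      product  = rank_a * rank_b * distance
    """
    rows = []
    for a, b in combinations(sorted(ranking), 2):
        distance = abs(ranking[a] - ranking[b])
        product = ranking[a] * ranking[b] * distance
        rows.append((f"{a}-{b}", distance, product))
    # A lower product indicates a better composition of two evaluators.
    return sorted(rows, key=lambda row: row[2])


# Illustrative ranks only (hypothetical input).
ranking = {"E1": 3, "E2": 2, "E3": 4, "E4": 1, "E5": 5}
table = pairwise_indices(ranking)
for pair, distance, product in table:
    print(pair, distance, product)
```

Because the product grows with both the rank values and the distance, a pair scores well only when both evaluators are highly ranked and close to each other.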
The approach of distances and relative products allowed estimating which judgments were closer between the evaluators; the result was E2-E4, E1-E2, E1-E4 and E3-E5. From the comparison of intra-evaluator judgment data, we noted that the E4 presented the highest correct answers ratio in total, followed by the E2 and the pair E1-E3. In turn, the E4 was considered the most experienced and congruent evaluator. It is noteworthy that the group of evaluators did not reveal an absolutely similar behavior at all steps when considering the inter-evaluator approach. However, they were consistent in their judgments in repeated samples in intra-evaluator analysis. Therefore, their contributions to judgments could be considered in the composition of vocal quality profile. In this respect, the results of the answers to the repetition of stimuli revealed a homogeneity among the four evaluators (E1 to E4), and the segregation of the E5 (Figure 3). In view of the data presented, the evaluators were classified according to their experience in the VPAS-PB script in the following descending order: E4, E2, E1, E3 and E5.

Table 5. Distances and relative products of inter-evaluator judgments (columns: pairs of evaluators, relative distances, relative products)

The profile of vocal quality and elements of vocal dynamics traced for this study contemplated a set of analyses of four evaluators (from the inter- and intra-evaluator approaches), which reached a level of distribution of variances and confidence intervals of similar judgments made between them. The vocal quality profile of the studied group was characterized by decreasing order of occurrence: laryngeal hyperfunction, rough voice, elevated larynx, vocal tract hyperfunction, closed mandible, pharyngeal constriction, raised tongue body and breathiness. As for vocal dynamics aspects, in descending order, the following stood out: inadequate respiratory support, decreased variability of pitch, usual high pitch, high habitual loudness, fast elocution rate and increased loudness variability.
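The composition of a mean profile from the judgments of congruent evaluators, with a confidence interval per adjustment, might be sketched as follows. The study states only that confidence intervals were computed in a univariate analysis. The use of a Student t interval, the function names and the degrees in the example are assumptions made for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}


def mean_with_ci(scores):
    """Return (mean, lower, upper) for a small set of evaluator scores.

    Assumption: a t-based interval; the study does not name the formula.
    """
    n = len(scores)
    center = mean(scores)
    half_width = T_CRIT_95[n - 1] * stdev(scores) / sqrt(n)
    return center, center - half_width, center + half_width


def mean_profile(judgments):
    """judgments maps an adjustment to the degrees given by each evaluator."""
    return {name: mean_with_ci(values) for name, values in judgments.items()}


# Hypothetical degrees from four congruent evaluators for one sample.
judgments = {
    "laryngeal hyperfunction": [2, 2, 3, 2],
    "elevated larynx": [2, 2, 2, 2],
}
result = mean_profile(judgments)
for name, (center, lower, upper) in result.items():
    print(f"{name}: mean={center:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With only four evaluators the interval is wide whenever they disagree, which makes the size of the interval itself a useful indicator of how settled a given adjustment is.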

DISCUSSION
In speech-language practice, the perceptual evaluation of vocal quality is considered the gold standard 26 . Although some researchers classify it as a subjective, inconstant method, with a great terminological variability, we emphasize that perceptual evaluation depends on the expertise and the experience of the evaluator, as well as on their attention throughout the procedure [27][28][29] .
There are few studies presenting a methodological approach for the analysis of perceptual judgments of vocal quality based on a phonetic model 14 , as well as on statistical treatment procedures for the consideration of judgments of several evaluators together. This study aimed to present a methodological approach to develop an experiment of perceptive evaluation of vocal quality with samples of teachers with voice disorders and/or laryngeal changes using the script Vocal Profile Analysis Scheme (VPAS-PB), and also aimed to evaluate the performance of a group of evaluators. The choice to compose a group of evaluators with a varied experience in terms of time of exposure to the script, history of training and participation in training prior to the application of the perception task of vocal quality had the objective of discussing precisely the evaluator training demands for the method under analysis.
The issue of the experience of evaluators has been widely debated in the literature, especially regarding possible subjectivity in the analyses 30 . In this study, it was possible to relate the training time and the performance of evaluators to the classification of experience established from the statistical analysis of their judgments using the VPAS-PB script. The degree of experience was related to the time spent using the VPAS-PB in a descending order: thirteen years (E4), three years (E2), one year and six months (E1 and E3), and six months (E5).
In the sample of five evaluators, it is possible to identify important factors in the definition of the evaluators' experience: time spent with the instrument VPAS-PB, specific training in a phonetic approach to vocal quality, participation in the workshop on the use of this script, and inter- and intra-judgment congruence and reliability. The statistical procedures adopted, which were applied without knowledge of the individual characteristics of the evaluators, allowed establishing a scale of experience of evaluators congruent with their training and time of activity in phonetic evaluation of vocal quality (VPAS-PB), as well as with their participation in the training on the VPAS-PB. It is worth mentioning that the script was applied by the evaluators after the training (workshop), aiming to reach a level of calibration using anchor stimuli, a procedure also defended by authors who explore the complexity of experiments of perception of vocal quality 12 . Thus, Evaluator 5 was excluded for having the shortest time of training and use of the VPAS-PB script, for not participating in the workshop and for presenting the greatest variance in intra- and inter-judgment analysis. This evaluator was therefore considered incongruent.
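The variance-based screening that singles out the most discrepant evaluator can be illustrated with the two statistics named in the Methods. The sketch below computes Cochran's C (largest variance divided by the sum of all variances) and a Snedecor F ratio (larger variance over smaller variance) using only the standard library. It does not compute critical values, which would have to be taken from tables or a statistics package. The scores are hypothetical and do not come from the study.

```python
from statistics import variance


def cochran_c(groups):
    """Return (label of the group with the largest variance, Cochran's C).

    Cochran's C assumes groups of equal size.
    """
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("Cochran's C requires groups of equal size")
    variances = {name: variance(values) for name, values in groups.items()}
    worst = max(variances, key=variances.get)
    return worst, variances[worst] / sum(variances.values())


def f_ratio(a, b):
    """Snedecor's F statistic: larger sample variance over the smaller one."""
    var_a, var_b = variance(a), variance(b)
    if min(var_a, var_b) == 0:
        raise ValueError("F ratio is undefined when a variance is zero")
    return max(var_a, var_b) / min(var_a, var_b)


# Hypothetical degrees assigned by three evaluators to the same five samples.
scores = {
    "E1": [2, 2, 3, 2, 2],
    "E2": [2, 2, 2, 3, 2],
    "E5": [1, 4, 2, 5, 1],
}
worst, c_value = cochran_c(scores)
print(worst, round(c_value, 3))
print(round(f_ratio(scores["E5"], scores["E1"]), 2))
```

An evaluator whose variance dominates the sum yields a C close to 1, which is the situation in which exclusion followed by reapplication of the test to the remaining group would be considered.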
The choice of the interpretation method for the findings of the initial group of five evaluators was also challenging, especially as regards the complexity of considering the uniqueness of individual analyses in search for a "consensus", or one analysis that reflected the opinion of the group. In order to discuss the specificities and, in particular, the complexity of perceptual analyses by groups of evaluators, we decided not to work with consensus analyses or reliability assessments of the answers of evaluators which could lead to a choice for one of the evaluators. In view of the demand for a discussion on the advantages and disadvantages of adopting a phonetic model for the description of vocal quality 14 , we decided to study in more detail the set of perceptual judgments of the five initial evaluators until it was possible to define a set of tests which allowed the definition of a vocal quality profile of a group of voice samples.
After the global analysis of judgments and the analysis of the general behavior of evaluators, it was possible to develop a sequence of tests which resulted in the choice of the evaluators whose judgments were based on the principles of the phonetic model.
It is worth emphasizing that this evaluation was not intended to qualify evaluators in terms of their perceptual skills, but to qualify and estimate their performance in terms of the proposed task, considering their consistency of answers for the same stimuli at different moments of the analysis, which characterizes an intraevaluator analysis.
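Consistency of answers for the same stimuli at different moments can be expressed as a simple proportion over the replicated samples. The sketch below is one possible operationalization, assuming exact agreement of the judged degree counts as consistent. The index of correct judgments used in the study may be defined differently, and the sample identifiers and degrees are hypothetical.

```python
def intra_evaluator_agreement(first_pass, second_pass):
    """Proportion of replicated samples judged identically on both passes.

    Each argument maps a sample identifier to the degree an evaluator
    assigned to one adjustment; only samples present in both are compared.
    """
    shared = first_pass.keys() & second_pass.keys()
    if not shared:
        raise ValueError("no replicated samples to compare")
    matches = sum(first_pass[s] == second_pass[s] for s in shared)
    return matches / len(shared)


# Hypothetical judgments by one evaluator for four replicated samples.
first_pass = {"s01": 2, "s02": 1, "s03": 3, "s04": 2}
second_pass = {"s01": 2, "s02": 1, "s03": 2, "s04": 2}
agreement = intra_evaluator_agreement(first_pass, second_pass)
print(agreement)
```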
Another important point is that the four evaluators whose analyses comprised the average profile of vocal quality judgments of the set of samples studied did not have a similar behavior when analyzed using an inter-evaluator approach. However, they were consistent as for their judgments for repeated samples. Thus, although the group is not absolutely homogeneous, their judgments are consistent at different moments. Such findings were similar to those found by another study involving a group of students of Speech-Language Therapy. There was a concordance of intra-evaluator answers in relation to the analyses of evaluators (speech-language therapists with expertise on voice) 6 .
The information from intra-evaluator analyses was also interesting since it provided a comparison of a group of evaluators with the judgments of an evaluator with more experience with the instrument, who could be considered as a reference evaluator, a procedure used at other stages of studies based on VPAS-PB judgments of two evaluators experienced in this script 11 . In one of such studies, the authors 11 reported relevant results arising from the training of a group of 16 evaluators (14 first-year students in a voice specialization course; the two other evaluators were Speech-Language Therapy and Linguistics professors with experience in VPAS) upon investigating the validity and the consensus in the use of VPAS among examiners. One of the highlights was the lack of consensus among the participating evaluators regarding the group of phonatory adjustments and their possible combinations, which, according to the authors 11 , revealed a lack of systematization of auditory-based methods of vocal evaluation and familiarity with the mentioned model. These data were compatible with this study because, by the inter-evaluator analysis, there were some discrepancies in judgments, which reinforce that the extension of the training period or, more precisely, the constant updating and the continuous work with the evaluators becomes essential to create a cohesively qualified group to conduct phonetic analyses of vocal quality.
The care taken in the procedures of this study for the perceptual analysis of vocal quality, regarding the training and experience of the group of evaluators in the script VPAS and the comparison between the judgments of this group and the judgments of an evaluator with more experience, brings to light a complexity inherent to studies focusing on the answers of evaluators according to several modes of perception. As for the perception of vocal quality, we highlight the criticism of the way in which the statistical analysis was conducted in many studies, in which some correlations may be effects of test artifacts and specificities of samples 12 . In this study, we considered the several steps of the study of the evaluators' behavior and of successive approaches in intrinsic terms (intra-evaluator approach: consistency between task repetitions) and extrinsic terms (inter-evaluator approach: in relation to other evaluators, or more precisely, to each evaluator). The new proposal of statistical approach presented in this study may be a contribution to the continuity of the exploration of studies of auditory perception of vocal quality 31 .
We emphasize that the analysis of inter- and intra-evaluator agreement is a fundamental factor to provide reliability to the perceptual evaluation of voice 32 . Such agreement may increase according to the experience and training in analyses of vocal changes, and is influenced by factors such as fatigue, attention lapses and misunderstandings during the evaluation 27,33,34 , in addition to the very conception and structuring of the perception experiment.
At this point, we may state that the data collected reinforce that time of training in the method is fundamental. Inter-evaluator data, in which there were some discrepancies, reinforce that the lengthening of the training period or, more precisely, the constant updating and the continuous work with the evaluators becomes essential in order to create a cohesively qualified group to conduct vocal quality analyses. Although such analyses may be considered subjective, that is, without an objective and extrinsic standard, they may be replicated by training. Another point to be considered in the design of future studies on this subject is the inclusion of a larger group of evaluators.
The findings reinforce the multidimensional character of vocal quality and the complexity involved in the perceptive judgments of this phenomenon, as well as the demand for training and perception experiments to select evaluators.
The issue of voice multidimensionality lies in the fact that vocal quality emerges from a combination of actions, so that it is not possible to analyze vocal quality based on only one parameter. The VPAS script is presented as an alternative to address phonatory, muscular tension and supralaryngeal activity aspects. To do so, it requires, as do other analysis scripts, training and familiarity in the use of the instrument. This leads researchers who adopt the script to follow several steps of composition and selection of their examiners 12,30,[35][36][37] . With a proper phonetic training, the evaluator becomes able to evaluate the prominent sound quality in the speech of an individual. This was the path taken, by which the adoption of references from Phonological Sciences provided conditions for detailing events related to vocal quality in a group of teachers with voice disorders.

CONCLUSION
Based on the perceptual analysis of the four congruent evaluators, the mean vocal quality profile of the group (female teachers of the public education network with voice disorders and/or laryngeal changes) was studied. The most frequent adjustments in this group, in a descending order, were laryngeal hyperfunction, rough voice, elevated larynx, vocal tract hyperfunction, closed mandible, pharyngeal constriction, raised tongue body and breathiness.
As for vocal dynamics, in a descending order, the following aspects were seen: inadequate respiratory support, decreased variability of pitch, usual high pitch, high habitual loudness, fast elocution rate and decreased loudness variability.
The proposed methodological approach to evaluate the performance of a group of evaluators for voice quality assessment was adequate, since the proposed set of tests allowed defining evaluators whose judgments were based on phonetic principles, as well as designing the mean vocal quality profile of a group of voice samples.