Acoustic Voice Quality Index (AVQI) for Brazilian Portuguese speakers: analysis of different speech material

Accepted: August 15, 2018
Study conducted at Universidade Federal de São Paulo (Unifesp) in partnership with the Centro de Estudos da Voz (CEV), São Paulo (SP), Brazil; University of Antwerp, Antwerp, Belgium, as a pilot study of a doctoral thesis.
1 Universidade Federal de São Paulo (Unifesp), São Paulo (SP), Brazil.
2 Centro de Estudos da Voz (CEV), São Paulo (SP), Brazil.
3 University of Antwerp, Antwerp, Belgium.
Financial support: nothing to declare.
Conflict of interests: nothing to declare.


INTRODUCTION
Voice quality evaluation is performed by several professionals and is essential in the clinical routine of voice patients. This evaluation uses different protocols (1) that need to be applied by trained professionals with sufficient know-how (2). The protocols include several assessments, such as perceptual-auditory judgment, acoustic analysis, laryngeal imaging, aerodynamic measures, and self-evaluation. These assessments provide subjective or objective measures; perceptual-auditory judgment and acoustic analysis are two commonly used procedures (3,4).
The perceptual-auditory judgment of voice quality is the most commonly used assessment in voice clinics (5). It is considered subjective, since it is strongly influenced by the listener's experience and internal standards, which are formed throughout life according to previous judgment experiences. To reduce the subjectivity of this rating, it is usually complemented by patient self-assessment and acoustic measures (4,6,7).
Acoustic measures provide more objective data on several voice parameters, such as fundamental frequency, cepstral peak prominence, jitter, shimmer, and harmonics-to-noise ratio, among others; however, more often than not only one single acoustic measure is considered (3). Voice is an acoustic phenomenon that must be evaluated in a multidimensional manner. Thus, the extraction of single parameters seems insufficient to characterize voice quality. Therefore, interest in multiparametric acoustic models of overall voice quality is rising; two examples are the Cepstral Spectral Index of Dysphonia (CSID) and the Acoustic Voice Quality Index (AVQI) (4,8,9).
Both indices, the CSID and the AVQI, combine several acoustic parameters into one single score for voice quality. For the analysis, the indices consider two vocal tasks: a sustained vowel (sv), traditionally used in the acoustic analysis of the voice, and a continuous speech (cs) sample, which provides more information about real vocal use although it is less common in acoustic voice quality analyses (10).
The CSID evaluates cs and sv separately; thus, the software generates two scores for voice quality, each ranging from 0 to 100 points. This index runs on the commercial KayPENTAX (9) program Analysis of Dysphonia in Speech and Voice.
The AVQI, on the other hand, runs on the freeware Praat using an individually designed script that quantifies vocal deviation from concatenated voice samples of cs and sv (11). The AVQI Praat script generates one single score ranging from 0 to 10 points by combining six acoustic measurements (i.e., the smoothed cepstral peak prominence, harmonics-to-noise ratio, shimmer percent, shimmer dB, general slope of the spectrum, and tilt of the regression line through the spectrum) (11).
The AVQI was originally developed in Dutch but has already been validated in other languages (German, English, French, Finnish, Korean and Lithuanian) (12-15). The index shows a strong correlation with the auditory-perceptual judgment (APJ), ranging from 0.794 to 0.929, as well as consistent diagnostic accuracy. It is worth mentioning that the AVQI validations use the reading of a phonetically balanced text, such as "The Rainbow Passage" (16) or its equivalents, as the connected speech sample.
Although many voice clinics and researchers from various countries use a phonetically balanced text to analyze cs, Brazilian voice evaluation traditionally uses automatic speech (e.g., months of the year, counting numbers), repetition of sentences and/or a spontaneous speech sample. There is no standardized and phonetically balanced text for such evaluation in the Brazilian Portuguese language. In addition, fluent reading cannot be assumed in the general Brazilian population because of high illiteracy rates (8%) and low schooling (52%), according to data from the Pesquisa Nacional por Amostra de Domicílios (PNAD) of 2015 (17).
In order to complement teaching, research and vocal evaluation in the Brazilian Speech-Language Pathology Clinic, the validation of an objective and robust measure is essential. Thus, aiming at the future validation of this index, the objective of the present research was to verify the best speech material for the AVQI in the Brazilian Portuguese language, identifying which stimulus yields the best correlation between the APJ and the AVQI score and which stimulus has the best diagnostic accuracy.

METHODS
This study was approved by the Committee for Ethics in Research under protocol number 2.106.335, June 2017. All participants agreed to participate in the study and signed an informed consent form.
Voice samples of 50 individuals (mean age: 40.3 years; standard deviation: 16.99) were recorded. The participants comprised 38 dysphonic individuals (5 men and 33 women) and 12 vocally healthy individuals (4 men and 8 women). The vocally healthy individuals had no vocal complaint and a VHI-10 score below 5 points. The 38 dysphonic patients presented various medical dysphonia diagnoses. The data were recorded in several different speech-language pathology services and voice clinics; thus, the diagnosis was taken from the most recent medical and health report containing the patient's clinical history. Subsequently, the authors classified the diagnoses according to the classification system of Behlau et al. (18). The dysphonia diagnoses of the voice-disordered patients were as follows: 20 patients with functional dysphonia, 14 patients with organic-functional dysphonia and 4 patients with organic dysphonia. The individuals had various backgrounds and different professions. This variable was not controlled in the present study.

Voice sample
The individuals were instructed to say aloud the months of the year from January to December, to count from 1 to 20, to repeat the six sentences of the CAPE-V protocol (19), and to sustain the vowel /a/ at comfortable pitch and loudness.
All recordings were made in a soundproof booth using an AKG C420 head-mounted condenser microphone, digitized at a sampling rate of 44 kHz and 16-bit resolution with the AKG MPA V L and the Focusrite iTrack Solo, using the Audacity program version 2.0.6. The same program was used to edit the vowel /a/ to a 3-second segment without voice onset and offset, avoiding the instabilities of the rise and decay phases (20).
The cs samples of all individuals were edited in the Praat program in order to obtain different durations, as follows:
1. D1 (Duration 1) - Total continuous speech material
Months of the year, January to December (32 syllables), counting numbers 1 to 20 (42 syllables) and all the CAPE-V sentences (60 syllables).

2. D2 (Duration 2) - Customized continuous speech material
The customization was performed so that the voiced segments of the continuous speech totaled 3 seconds, the same duration as the sustained vowel (11).
The customization was performed in the following steps:
2a: Extraction of all voiceless segments using the extraction Praat script from Maryn et al. (8); the first 3 seconds of the voice sample were analyzed.
2b: The original audio file was hand-marked so that the audio voice sample file would contain 3 seconds of voiced segments, as determined in step 2a. This hand-marked cut-off point was determined using the spectrogram, pitch contour and auditory feedback.
The duration of each sample was verified by running the AVQI script on the edited voice sample and the 3-second vowel /a/; this script removes the voiceless segments from the speech and concatenates the remaining voiced segments with the vowel sample. A tolerance margin of 0.1 seconds below or above 3 seconds was accepted for the continuous speech.
Table 1 presents the average number of syllables for each continuous speech voice sample after the customization.
3. D3 (Duration 3) - Pre-defined cut-off point for the continuous speech material
20 syllables for months of the year (January to August), 15 syllables for counting numbers (1 to 10) and 32 syllables for the CAPE-V sentences (3 sentences: "Érica tomou suco de pera e amora; Sônia sabe sambar sozinha; Olha lá o avião azul", equivalent to the English sentences: The blue spot is on the key again; How hard did he hit him; We were away a year ago).
Table 2 presents the continuous speech duration of each sample, considering both voiced plus voiceless segments and voiced segments only.
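The acceptance rule for the customized duration (D2), 3 seconds of voiced segments with a margin of 0.1 seconds, can be illustrated with a minimal Python sketch. The function name, the speaker labels and the durations below are hypothetical and are not part of the study data; the actual verification was done by running the AVQI script in Praat.

```python
def within_tolerance(voiced_seconds, target=3.0, tolerance=0.1):
    """Return True if the voiced duration lies within target +/- tolerance."""
    # Round the difference so floating-point noise at the boundary
    # (e.g. 3.10 - 3.00) does not reject a sample that is exactly on the limit.
    return round(abs(voiced_seconds - target), 6) <= tolerance


# Hypothetical voiced-segment durations (in seconds) after hand-marking.
samples = {"speaker_01": 2.95, "speaker_02": 3.10, "speaker_03": 3.24}

# Map each sample to whether it would be accepted as a D2 sample.
accepted = {name: within_tolerance(d) for name, d in samples.items()}
```

Under this sketch, a sample outside the margin would be re-marked and checked again.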

Auditory-perceptual judgment
Three Brazilian speech-language pathologists who are voice experts, with a mean of 8.67 years of clinical experience (minimum of 6 and maximum of 10 years), rated the voice samples. They evaluated the overall voice quality for each of the different speech materials (months of the year, counting numbers and the CAPE-V sentences) considering the total continuous speech (D1).
The final voice samples for the APJ contained the concatenation of a particular cs part and three seconds of sv. Thus, each audio file consisted of the continuous speech, 1 second of silence, and the vowel. A total of 3 contexts were obtained for the APJ: first, months of the year + 1 second of silence + vowel /a/; second, numbers + 1 second of silence + vowel /a/; and third, CAPE-V sentences + 1 second of silence + vowel /a/. Therefore, 3 listening sessions were held to complete the APJ, one session per context. To minimize memory effects, a break of at least 1 hour was taken between sessions.
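The structure of each listening stimulus (continuous speech, 1 second of silence, sustained vowel) can be sketched as a simple concatenation of sample sequences. This is only an illustration under stated assumptions: the function name is hypothetical, audio is represented as plain lists of amplitude values, and the sampling rate used in the example is deliberately tiny for readability. It does not describe the software actually used to assemble the files.

```python
def build_stimulus(speech, vowel, sample_rate, silence_seconds=1.0):
    """Concatenate speech + silence + vowel into one list of samples."""
    # Silence is a run of zero-amplitude samples of the requested length.
    silence = [0.0] * int(round(sample_rate * silence_seconds))
    return list(speech) + silence + list(vowel)


sample_rate = 100          # hypothetical toy rate (samples per second)
speech = [0.1] * 250       # 2.5 s of placeholder "continuous speech"
vowel = [0.2] * 300        # 3 s of placeholder "sustained vowel"

stimulus = build_stimulus(speech, vowel, sample_rate)
duration_seconds = len(stimulus) / sample_rate
```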
The raters used the G parameter of the GRBAS scale (21) to perform their analysis for each context; G represents the degree of hoarseness. A 4-point Likert scale was used, in which: 0 = clear voice/no hoarseness; 1 = slightly hoarse; 2 = moderately hoarse; 3 = severely hoarse. The judgments took place in a quiet environment and the raters used headphones. To analyze the intra-rater reliability, 10% of the samples were repeated.
Furthermore, the listeners were blinded to the identity and diagnosis of the speakers. Additionally, anchor voices were presented to the raters before they began the analysis, with the expectation of improving reliability (22). The anchors were representative of G = 0, G = 1, G = 2 and G = 3 and were specific to each context.
The intra- and inter-rater reliability were assessed with Cohen's Kappa coefficient (Ck) and Fleiss' Kappa coefficient (Fk), respectively. Acceptable values were observed for the intra-rater reliability in the different contexts: months of the year (Ck = 0.667 to 1.000); numbers (Ck = 1.000 for all evaluators) and CAPE-V sentences (Ck = 0.688 to 1.000). The inter-rater reliability also presented acceptable values: 0.5038 for months of the year, 0.5788 for numbers and 0.6386 for the CAPE-V sentences. All raters were reliable; thus, the APJ related to the AVQI was the mean G score of the 3 listeners. Furthermore, for the analysis of the AVQI's accuracy, cut-off points of G<0.5 and G<0.68 were used to separate absence from presence of dysphonia. The G<0.68 cut-off point was added in order to include the counting numbers in the analysis.
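As an illustration of the intra-rater statistic, the following minimal Python sketch computes an unweighted Cohen's Kappa for two rating passes by the same rater. It is a generic textbook implementation, not the SPSS procedure used in the study, and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    if len(first) != len(second) or not first:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: proportion of items given the same label twice.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal proportions, summed over labels.
    count_first, count_second = Counter(first), Counter(second)
    expected = sum(count_first[c] * count_second[c] for c in count_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical G ratings (0 to 3) for the repeated samples of one rater.
first_pass = [0, 1, 2, 3, 1]
second_pass = [0, 1, 2, 3, 2]
kappa = cohen_kappa(first_pass, second_pass)
```

Here the rater agrees with herself on 4 of 5 samples, and the chance-corrected agreement is correspondingly lower than the raw 80%.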

AVQI analysis
Version 03.01 of the AVQI was used for the acoustic analysis; it combines six acoustic parameters according to the formula given in (11). To run the script by Barsties and Maryn (11), each continuous speech voice sample, in each of its durations (D1, D2 and D3), was opened in the Praat program (version 6.0.6) together with the audio file of the 3-second vowel /a/ of the same individual.
There was a total of 9 cs parts for the AVQI analysis: months of the year + vowel /a/, numbers + vowel /a/, and CAPE-V sentences + vowel /a/, each in the three durations.
For each combination of cs and sv, a report with the AVQI score was generated. Hence, each individual had 9 different reports and therefore 9 different AVQI scores, one for each of the 3 contexts in each of the 3 durations (D1, D2 and D3). Thus, there was a total of 450 scores for the 50 recorded individuals.
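The count of analyses follows directly from crossing contexts and durations; a short Python sketch makes the arithmetic explicit. The label strings are illustrative only.

```python
from itertools import product

contexts = ["months of the year", "numbers", "CAPE-V sentences"]
durations = ["D1", "D2", "D3"]
n_speakers = 50

# Every context is analyzed in every duration, always paired with the vowel /a/.
combinations = list(product(contexts, durations))

# One AVQI score per combination and per speaker.
total_scores = len(combinations) * n_speakers
```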

Statistical analysis
SPSS (version 23.0) and MS Excel (MS Office 2013) were used for the statistical analysis. The level of significance was set at 0.05 (5%).
Spearman rank-order correlation was used to assess which context presented the best concurrent validity, that is, the highest correlation between the overall voice quality given by the mean G score of the APJ and the AVQI score. The area under the receiver operating characteristic curve (AROC) was used to determine which context had the best diagnostic accuracy; the thresholds used to separate absence from presence of vocal deviation were G<0.5 and G<0.68.
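Both statistics can be illustrated with a minimal, dependency-free Python sketch. It is not the SPSS procedure used in the study: the helper names are hypothetical, the data are invented, and the area under the ROC curve is computed through its rank interpretation (the probability that a dysphonic voice receives a higher AVQI score than a healthy one, with ties counting one half).

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def auc(scores, is_dysphonic):
    """Area under the ROC curve via pairwise comparison of the two groups."""
    pos = [s for s, d in zip(scores, is_dysphonic) if d]
    neg = [s for s, d in zip(scores, is_dysphonic) if not d]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical mean G scores (three raters) and AVQI scores for six voices.
mean_g = [0.33, 0.67, 1.0, 1.67, 2.33, 0.33]
avqi = [1.2, 2.5, 2.1, 4.0, 5.3, 0.9]

rho = spearman(mean_g, avqi)
dysphonic = [g >= 0.5 for g in mean_g]  # G<0.5 taken as absence of deviation
area = auc(avqi, dysphonic)
```

With only three raters on a 4-point scale, the mean G can take few distinct values, so ties are common; this is why the ranking helper averages tied ranks.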

RESULTS
The context with the best correlation between the APJ and the AVQI was counting numbers from 1 to 10 (Table 3).
The contexts with the strongest correlations were counting numbers and the CAPE-V sentences. For both of these contexts, the durations of the customized speech material (D2) and of the pre-defined cut-off point (D3) were different; however, the AVQI scores in both durations were considered to be the same (Table 4).
As to diagnostic accuracy, considering all contexts, the AVQI presented high specificity (values ranging from 0.625 to 1.000) and low sensitivity (values ranging from 0.413 to 0.881). For G<0.5, the D3 CAPE-V sentences presented the best sensitivity and AROC (0.72 and 0.578, respectively). On the other hand, this context presented the lowest AVQI threshold (1.1). No G<0.5 was found for numbers; thus, the diagnostic accuracy for this context was analyzed using G<0.68. Numbers 1 to 10 presented good diagnostic accuracy, whereas the highest sensitivity was found for numbers 1 to 20; both durations had low threshold values (Table 5).

DISCUSSION
The AVQI is an index based on a multiparametric acoustic construct: it combines several acoustic parameters to generate one single score for overall voice quality (4,8). Hence, it has higher ecological validity.
The AVQI uses an individually designed Praat script that considers both cs and sv to perform its acoustic analysis and generate one single score for voice quality (11). This index has been validated in several different languages; because it uses continuous speech, linguistic differences may influence the AVQI accuracy. The validations found good concurrent validity, ranging from 0.794 to 0.929, and also good diagnostic accuracy (12-15). The present study was the first to apply the AVQI to the Brazilian Portuguese language.
Usually, the AVQI uses an excerpt of a read text for the analysis of the continuous speech part. The Brazilian voice clinic does not have a standardized text for vocal evaluation. Also, fluent reading cannot be assumed in the general Brazilian population, which could be an important limitation for the use of the AVQI in Brazil. Thus, in order to validate the index for the Brazilian Portuguese language, the continuous speech sample must first be defined. The present study selected three types of continuous speech that are commonly used in voice evaluation for possible use in the AVQI validation process. The cs voice samples were: months of the year, counting numbers from 1 to 20 and the repetition of the CAPE-V sentences.
The intra- and inter-rater reliability were acceptable for all contexts; studies in other languages considered values of at least 0.53 for intra-rater and at least 0.37 for inter-rater reliability as acceptable for correlation with the AVQI (4,8,14,15).
Numbers presented the strongest correlation between the APJ and the AVQI score (Table 3); counting numbers is also the speech task most commonly used in Brazilian voice evaluation (23-26). Brazilian evaluators therefore listen to numbers more often while evaluating the voice and have more training with this stimulus, so their evaluation becomes more precise and has a greater chance of correlating well with the AVQI score as an objective analysis. The CAPE-V, on the other hand, is a standardized protocol, but it is only beginning to become widespread, and many Brazilian professionals still do not use it in their daily clinical routine.
In general, the AVQI uses a text for the cs part that enables the analysis of fluency and intonation, parameters that are considered in determining the final score of the index. Therefore, the best speech material for the AVQI is theoretically spontaneous speech. However, standardizing spontaneous speech in order to compare scores among different individuals, and across different evaluation moments of the same individual, is quite challenging. Counting numbers, on the other hand, is habitual for most people and is usually produced in an automatic and natural way, which may explain its better outcome. The CAPE-V sentences, besides being new to the individual, have no semantic meaning and are out of context, which may cause the speaker to hesitate while repeating them, making the speech sound less natural.
The best precision and concurrent validity for the AVQI 03.01 are obtained when the voiced segments of the continuous speech reach 3 seconds, the same length as the vowel (27). The best concurrent validity for counting numbers was found for D3, 1 to 10 (Table 3), with a mean duration of 2.85 seconds; D2 had a longer duration, with 3 seconds (Table 4). Numbers in D3 (i.e., 1 to 10) presented duration values close to 3 seconds (Table 2); thus, it seems that counting from 1 to 10 (15 syllables) is more closely related to the best AVQI precision and concurrent validity. Moreover, the AVQI scores for both durations were the same (Table 4). Differences in the continuous speech duration were also observed for the CAPE-V sentences; similarly, this difference was not reflected in a different AVQI score (Table 4).
The AVQI validation in Japanese considered 30 syllables the most appropriate length, since it presented outcomes most similar to the customized length with 3 seconds of voiced segments (14). For the Brazilian Portuguese language, around 15 syllables might be more appropriate for the AVQI analysis.
The AVQI is an index with high specificity (Table 5): it correctly identifies vocally healthy individuals, so it is quite unlikely that a vocally healthy person is identified as dysphonic. On the other hand, the AVQI presented limitations regarding its sensitivity; therefore, many individuals with vocal deviation could be identified as vocally healthy.
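The practical meaning of high specificity combined with low sensitivity can be shown with a small Python sketch. The scores, the perceptual labels and the threshold of 2.5 are hypothetical and chosen only to reproduce the pattern described; they are not values from Table 5.

```python
def sensitivity_specificity(scores, is_dysphonic, threshold):
    """Classify a voice as dysphonic when its score is at or above the threshold."""
    tp = fn = tn = fp = 0
    for score, dysphonic in zip(scores, is_dysphonic):
        predicted = score >= threshold
        if dysphonic and predicted:
            tp += 1   # dysphonic voice correctly flagged
        elif dysphonic:
            fn += 1   # dysphonic voice missed (lowers sensitivity)
        elif predicted:
            fp += 1   # healthy voice wrongly flagged (lowers specificity)
        else:
            tn += 1   # healthy voice correctly passed
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical AVQI scores and perceptual classification for six voices.
scores = [0.8, 1.5, 2.0, 2.4, 3.1, 4.2]
is_dysphonic = [False, False, True, True, True, True]

sens, spec = sensitivity_specificity(scores, is_dysphonic, threshold=2.5)
```

In this toy example no healthy voice is flagged (specificity of 1.0), but half of the dysphonic voices fall below the threshold and are missed (sensitivity of 0.5).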
Compared with other studies of the AVQI in different languages (8,13-15,28,29), the present study showed lower values for sensitivity and for the area under the ROC curve. Considering G<0.5, the best sensitivity was found for the D3 CAPE-V sentences. However, this context also presented the lowest AVQI threshold value (Table 5).
No G score below 0.5 was obtained for numbers; therefore, to analyze this speech material, the cut-off point of G<0.68 was considered. It is noteworthy that this analysis has not been reported in other publications on the AVQI. Nevertheless, the present study was a pilot study and numbers presented the best perceptual-acoustic correlation. For this reason, the ROC curve analysis was also performed with the G<0.68 cut-off point. In this way, the CAPE-V sentences (the context with the highest accuracy considering G<0.5) could be compared with numbers (the context with the highest perceptual-acoustic correlation) in the ROC curve analysis.
The best diagnostic accuracy was found for numbers 1 to 10 with G<0.68; in addition, higher sensitivity values were found with extremely low thresholds (Table 5). These values are lower than the findings in other languages: Japanese = 0.915 (14) and Dutch (Belgium) = 0.923 (11). However, the Brazilian outcomes with numbers might be more favorable for the AVQI in the Brazilian Portuguese language.
The preliminary data for the Brazilian Portuguese language showed moderate validity results, which were the lowest in comparison with other studies and languages (4,8,14,15). They also suggest that, at this stage, the AVQI for Brazilian Portuguese does not seem to be useful as a screening or diagnostic tool. It is worth mentioning that the aim of this study was not to validate the AVQI, but rather to define which speech stimulus should be used in its validation, whose analyses must examine such aspects in greater depth. A future study is therefore essential to validate the AVQI in the Brazilian Portuguese language and to assess its validity for potential recommendation as a screening or diagnostic tool for voice clinics and research in Brazil.
It is suggested that the AVQI validation, and further analyses with this index in Brazilian Portuguese, use counting numbers as the continuous speech sample, since numbers presented the best perceptual-acoustic correlation and higher accuracy values. In addition, counting numbers presented an average duration very similar to the one associated with the best AVQI accuracy and internal consistency, and it offers an easy cut-off point for editing. Also, counting numbers is a commonly adopted speech task for voice analysis in Brazilian studies (23-26); thus, its use will enable more retrospective research using the AVQI, once validated.

CONCLUSION
The stimulus with the best correlation between the APJ and the AVQI was numbers 1 to 10. The CAPE-V sentences presented the best diagnostic accuracy considering G<0.5; numbers presented the best diagnostic accuracy considering G<0.68. Counting numbers is usual in the Brazilian clinical routine and was the speech sample with values closest to those of the best AVQI precision and concurrent validity; thus, counting numbers is suggested as the cs part for the AVQI in the Brazilian Portuguese language.

Table 1. Mean, median, minimum, maximum and standard deviation of the number of syllables for each continuous speech voice sample after the customization process

Table 2. Continuous speech duration for each voice sample, considering voiced plus voiceless segments and voiced segments only
Caption: D1 = Total continuous speech material; D2 = Customized continuous speech material; D3 = Pre-defined cut-off point speech material; Voiceless and voiced = Voiced and voiceless segments; Voiced = Only voiced segments

Table 4. Comparison of the AVQI scores and duration of voiced-only segments between D2 and D3 for numbers and CAPE-V sentences