Investigation of the neural discrimination of acoustic characteristics of speech sounds in normal-hearing individuals through Frequency-following Response (FFR).

PURPOSE
To evaluate how the auditory pathways encode and discriminate the plosive syllables [ga], [da] and [ba] using the auditory evoked Frequency-following Response (FFR) in children with typical development.


METHODS
Twenty children aged 6-12 years were evaluated using the FFR for the [ga], [da] and [ba] stimuli. The stimuli were composed of six formants and differed in the F2 to F3 transition (transient portion). The other formants were identical in the three syllables (sustained portion). The latencies of the 16 waves of the transient portion (<70 ms) and of the 21 waves of the sustained portion (90-160 ms) of the stimuli were analyzed in the neural responses obtained for each of the syllables.


RESULTS
The transient portion latencies were different in the three syllables, indicating a distinction in the acoustic characteristics of these syllables through their neural representations. In addition, the transient portion latencies progressively increased in the following order: [ga] < [da] < [ba], whereas no significant differences were observed in the sustained portion.


CONCLUSION
The FFR proved to be an efficient tool to investigate the subcortical acoustic differences in speech sounds, since it demonstrated different electrophysiological responses for the three evoked syllables. Changes in latency were observed in the transient portion (consonants) but not in the sustained portion (vowels) for the three stimuli. These results indicate the neural ability to distinguish between acoustic characteristics of the [ga], [da] and [ba] stimuli.


INTRODUCTION
In speech, the auditory perception of vowels can be determined by a small number of frequencies of the first formants, which reflect the resonance properties of the vocal tract (1). Plosive or stop consonants are produced through a temporary obstruction of airflow through the vocal tract in three phases: total obstruction of the oral cavity, pressure build-up while the oral cavity remains blocked, and sudden release of the air current, which causes a noise also called the burst. The acoustic register corresponding to the airflow release refers to the source of transient noise (2,3).
Plosives provide rich acoustic cues that serve as a basis to identify the place of articulation and voicing, such as formant transition, burst spectrum, presence or absence of aspiration, and duration of Voice Onset Time (VOT), which corresponds to the time of voicing start or attack (4).
Studies conducted with animal models have shown that perception of this acoustic information is encoded at many levels of the auditory system and through different neural events. Both peripheral and central structures, such as the auditory nerve and fibers of the cochlear nuclei, are able to synchronize their phases (phase-locking activity) to the harmonics (integer multiples of the fundamental frequency) of a speech stimulus (5,6). In addition, these structures, as well as the rostral part of the inferior colliculus, also show increased activity (discharge rate) for VOT (7).
In humans, neural synchrony in response to acoustic characteristics of speech has been measured using the Frequency-following Response (FFR), an auditory evoked potential also known as Brainstem Auditory Evoked Potential with complex or speech stimuli (BAEPc or BAEPs). The terminology changed in mid-2015 (8) so as not to restrict the concepts associated with this potential, such as its integrated nature (top-down and bottom-up) and its relation to enriching experiences and stimuli (9).
The FFR reflects the aggregate response of several different cell types, mainly neurons, in the rostral portion of the brainstem.
The brainstem responds with a high level of neural synchrony and is exceptionally well tuned to the spectral and temporal characteristics of sound, including speech sounds. However, the mechanisms involved in the neural encoding accuracy of many acoustic cues in speech remain speculative.
A large number of studies have investigated how auditory brainstem potentials respond to the speech sound [da] (10). Based on this research, a framework was proposed suggesting that different neural mechanisms are responsible for encoding different acoustic aspects of speech sounds (11). Speech sounds consist of three fundamental components: pitch (a source characteristic conveyed by the fundamental frequency); formants (filter characteristics conveyed by the selective enhancement and attenuation of harmonics); and the timing of major acoustic aspects. All of these aspects are important for speech perception and, although they are simultaneously present in the speech signal and in the responses to it, specific components of the brainstem response reflect each of them separately (4).
In a mature auditory system, the basal region of the cochlea is more responsive to high frequencies, while its apical region responds better to low frequencies. This tonotopic organization is preserved throughout the auditory pathway to the cortex, and it is believed that it can assist with preserving the spectral relationship in the neural activity pattern (12,13).
Studies using cortical auditory evoked potentials have shown that the perception of differences between phonemes (for instance, [da], [ga] and [ba]) is related to the frequencies contained in the formants of the stimuli used (14-18).
Formant transitions are one of the essential cues underlying the identification of plosive consonants (19). Thus, an interesting way to study how this transition is neurally encoded in the auditory system is to assess stimuli that differ only in their filter characteristics (or harmonics), as in the case of the [da], [ga] and [ba] syllables. A primary difference between these syllables is the frequency transition from the second to the third formant (F2 and F3).
Since F2 and F3 are beyond the brainstem phase-locking capacity, differences between these spectral cues can be observed through the latencies of neural responses.
Based on the tonotopic organization of the auditory system, low-frequency sounds, encoded in the apical portion of the cochlea, generate responses milliseconds later than high-frequency sounds, encoded in the basal portion of the cochlea. Thus, responses to high-frequency stimuli should have shorter latencies than responses to low-frequency stimuli. This progression of latency as a function of frequency has been demonstrated in auditory brainstem responses to pure-tone stimuli (20).
Thus, using the FFR to investigate the neural encoding of the distinctive features of the [da], [ga] and [ba] syllables, which lie in the F2 and F3 formant transitions, can assist with assessing the neural encoding of formants and with understanding the processes that underlie the neural differentiation of acoustic contrasts between speech stimuli, such as plosive consonants.
Aiming to expand knowledge on neural discrimination between different acoustic characteristics, this study assessed how the auditory pathways encode and differentiate the plosive consonant-vowel syllables [ga], [da], and [ba], presented through speech stimuli, using the FFR in children with typical development.
The following hypotheses were considered: (1) because of the tonotopic organization of the auditory system, which promotes faster encoding of high frequencies, the differing F2 and F3 frequencies of the presented stimuli should manifest themselves as latency shifts; (2) latency differences should decrease throughout the response until they disappear by the time the three syllables reach their steady state; (3) there should be no differences between the latencies of the electrophysiological responses for the three stimuli in the sustained portion.

METHODS
This study was approved by the Research Ethics Committee of the University of São Paulo Medical School (FMUSP) under protocol no. 109/12. The parents and/or legal guardians of the participating children were informed about the procedures and signed an Informed Consent Form (ICF) prior to study commencement.

Study sample
The study sample was composed of 20 children with typical development (according to information obtained through interviews with the teachers and parents and/or legal guardians of the participating children) and no neurological, cognitive or psychiatric disorders, school complaints, or speech and language impairments.
All participants presented hearing thresholds within the normal range (≤15 dB HL) for the assessed frequencies (500-4000 Hz), speech recognition scores >90%, normal tympanometric measures, and normal click-evoked BAEP. These children also performed normally in the auditory processing assessment. Auditory processing deficits were ruled out following the criteria recommended by the AAA (21) and ASHA (22), using a monotic test, a dichotic test, and two temporal tests. Individuals in whom auditory, neurological, cognitive or psychiatric alterations were identified would be excluded from the study and referred to a specialized service.

Stimuli and response capture parameters
The FFR was obtained through the presentation of acoustic speech stimuli: the plosive consonant-vowel syllables [da], [ga], and [ba]. The speech stimuli were synthesized (23) at a sampling rate of 20 kHz, with 16-bit resolution and 170 ms duration. The stimuli were composed of six formants and differed in the onset frequencies (initial portion of the stimulus), in the transition from the second (F2) to the third (F3) formants (Table 1). These stimuli were the same as those used by Johnson et al. (4).

Procedures
The stimuli were presented only to the right ear using an electroneuromyograph (SmartEP model) equipped with the cABR module (Intelligent Hearing Systems, Miami, FL, USA) at a rate of 4.35 stimuli/s and an intensity of 80 dBnHL.
The FFR was captured through surface electrodes in the positions Cz, M2 (right mastoid), and Fpz as ground, with an analysis window of 230 ms (45 and 185 ms corresponding to the pre- and post-stimulus periods, respectively).
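As a quick consistency check on the timing parameters reported above, the presentation rate can be converted into an onset-to-onset interval and compared with the analysis window. The sketch below is illustrative only; the function and variable names are hypothetical and are not part of the recording software.

```python
# Timing check for the recording parameters reported in the text.
RATE_PER_S = 4.35          # stimuli per second (from the text)
PRE_MS, POST_MS = 45, 185  # pre- and post-stimulus periods in ms (from the text)


def onset_to_onset_ms(rate_per_s: float) -> float:
    """Interval between successive stimulus onsets, in milliseconds."""
    return 1000.0 / rate_per_s


window_ms = PRE_MS + POST_MS
interval_ms = onset_to_onset_ms(RATE_PER_S)
print(f"window = {window_ms} ms, onset-to-onset interval = {interval_ms:.1f} ms")
```

With these values, the 230 ms analysis window is essentially as long as the interval between successive stimulus onsets.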
Two scans of 2000 stimuli each were performed for each syllable, presented with alternating polarity. The two waveforms generated by the scans were combined by weighted sum, and the resulting final waveform, based on 4000 stimuli, was analyzed.
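The combination of the two scans can be sketched as follows. The text does not specify how the equipment weights the scans, so this example assumes that each sub-average is weighted by its number of stimuli; the function name and the toy waveforms are hypothetical.

```python
from typing import List, Sequence


def combine_subaverages(
    avg_a: Sequence[float], avg_b: Sequence[float], n_a: int, n_b: int
) -> List[float]:
    """Combine two sub-averaged waveforms, weighting each by its stimulus count.

    Assumption: weights are proportional to the number of stimuli per scan.
    """
    if len(avg_a) != len(avg_b):
        raise ValueError("waveforms must have the same number of samples")
    total = n_a + n_b
    return [(n_a * a + n_b * b) / total for a, b in zip(avg_a, avg_b)]


# Toy waveforms (arbitrary amplitude units), not real data.
scan_1 = [0.10, 0.30, -0.20]
scan_2 = [0.20, 0.10, -0.40]
final_wave = combine_subaverages(scan_1, scan_2, n_a=2000, n_b=2000)
```

With two scans of 2000 stimuli each, this weighting reduces to the simple mean of the two waveforms; the weights only matter if the scans contain different numbers of accepted stimuli.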

Formant transition period
The formant transition period was defined as the portion of the response corresponding to the onset and formant transition periods of the stimuli (0-50 ms). Based on the first hypothesis, latency differences between the stimuli were expected in this portion of the response.
In this portion, a total of 16 transient peaks were recorded, with six positive peaks and 10 negative peaks (Figure 1) in the initial 70 ms of the electrophysiological response.

Sustained response period
Sustained response was defined as the portion corresponding to the steady part of the stimulus (51-170 ms).
In this portion, a total of 21 peaks were recorded, with seven positive peaks and 14 negative peaks between 90 and 160 ms of the electrophysiological response.

Qualitative analysis
The data were qualitatively analyzed using the Cross-phaseogram technique (24). This technique calculates the phase differences between two electrophysiological responses as a function of time and frequency and illustrates the differences in the transient portion in a color plot. When the [ga] response leads the [ba] response, the plot consists of shades of yellow, orange and red, with the largest differences represented in dark red. If the opposite occurs, i.e., the [ba] response leads the [ga] response, the representation is in blue shades, and when there is no difference between the phases, the plot appears green.
[Caption fragment: Adapted from Johnson et al. (4)]
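The idea of a phase difference computed as a function of time at a given frequency can be illustrated with a simplified sketch. This is not the published Cross-phaseogram implementation: it uses a plain single-frequency Fourier coefficient on successive windows, and the sampling rate, frequency, window length and test signals are all hypothetical values chosen for the example.

```python
import cmath
import math


def dft_coefficient(segment, freq_hz, fs_hz):
    """Complex Fourier coefficient of `segment` at a single frequency."""
    return sum(
        x * cmath.exp(-2j * math.pi * freq_hz * n / fs_hz)
        for n, x in enumerate(segment)
    )


def phase_lead(resp_a, resp_b, freq_hz, fs_hz):
    """Phase of A relative to B at freq_hz, in radians (positive: A leads B)."""
    cross = dft_coefficient(resp_a, freq_hz, fs_hz) * dft_coefficient(
        resp_b, freq_hz, fs_hz
    ).conjugate()
    return cmath.phase(cross)


def phase_lead_over_time(resp_a, resp_b, freq_hz, fs_hz, win, hop):
    """Phase lead in successive windows of `win` samples, stepped by `hop`."""
    starts = range(0, min(len(resp_a), len(resp_b)) - win + 1, hop)
    return [
        phase_lead(resp_a[s:s + win], resp_b[s:s + win], freq_hz, fs_hz)
        for s in starts
    ]


# Toy check with hypothetical values: two 100 Hz sinusoids sampled at 2 kHz,
# where A is advanced by 0.5 rad relative to B.
FS, F, SHIFT = 2000, 100, 0.5
resp_b = [math.sin(2 * math.pi * F * n / FS) for n in range(400)]
resp_a = [math.sin(2 * math.pi * F * n / FS + SHIFT) for n in range(400)]
frames = phase_lead_over_time(resp_a, resp_b, F, FS, win=200, hop=100)
```

Each value in `frames` should be close to 0.5 rad. Repeating the calculation over a range of frequencies and mapping positive, negative and near-zero values to warm, cool and green colors would give a plot in the spirit of the one described above.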

Quantitative analysis
A grand mean (GM) latency for the peaks obtained across the three stimuli was computed to normalize these values for all peaks (16 in the transient portion and 21 in the sustained portion) so that they could be described on the same scale. After that, this GM was subtracted from each individual peak latency (Latency_individual - Latency_GM). Thus, earlier peaks are negative numbers, later peaks are positive numbers, and peaks near the GM are close to zero (4).
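The normalization described above can be sketched for a single peak as follows. The example assumes that the GM pools all available latencies of that peak across participants and stimuli with equal weight, and that absent peaks are stored as `None`; both are assumptions made for illustration, and the latencies are toy values rather than study data.

```python
from statistics import mean


def normalize_to_grand_mean(latencies_ms):
    """Subtract the grand mean (GM) of one peak from each individual latency.

    `latencies_ms` maps stimulus name -> list of latencies (ms), one per
    participant; None marks a participant in whom the peak was absent.
    Returns (gm, deviations) where deviations has the same shape as the input.
    """
    pooled = [v for values in latencies_ms.values() for v in values if v is not None]
    gm = mean(pooled)
    deviations = {
        stim: [None if v is None else v - gm for v in values]
        for stim, values in latencies_ms.items()
    }
    return gm, deviations


# Toy latencies (ms) for one peak in two participants; not real data.
peak = {"ga": [22.0, 22.2], "da": [22.4, 22.6], "ba": [22.8, 23.0]}
gm, dev = normalize_to_grand_mean(peak)
```

In this toy case the GM is 22.5 ms, so the [ga] values become negative (earlier than the GM) and the [ba] values positive (later than the GM).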
Multivariate analysis of variance with repeated measures (repeated-measures MANOVA) was performed to compare the mean latencies across the three studied stimuli (25). In the repeated-measures MANOVA, the p-value and the F ratio, which is used to test the global difference between groups, were analyzed using the Wilks' Lambda (λ) test.
To complement the descriptive analysis, confidence intervals (CI) were used to assess the extent to which the mean could vary at a given level of confidence. The CI established for data analysis was 95%, with a significance level of 0.05 (5%).

RESULTS
[...] due to the progression between the frequency differences of the sounds. As for the sustained portion of the stimulus (90-170 ms), it was hypothesized that there would be no significant differences in the electrophysiological responses between the three stimuli used.
In the qualitative analysis carried out using the Cross-phaseogram technique (24), the greatest discrimination occurred between [ga] and [ba], followed by [da] and [ba] (Figure 2). This discrimination is represented in shades of yellow, orange, and red. The smallest discrimination occurred for the pair with the least difference, that is, [ga] and [da], represented in Figure 2 by predominantly green shades. The differences occurred only in the transient portion, in which F2 and F3 differ between the three syllables (10-50 ms). In the sustained portion (similar in the three syllables), no difference between the response phases was observed, as shown in green in Figure 2.
Since no differences between the stimuli were identified in the sustained portion, statistical analysis was performed only in the transient portion.
The latencies of the 16 waves that compose the transient portion (0-70 ms) in each of the stimuli were analyzed. Table 2 shows the descriptive analysis of the latency measures of the 16 waves for all children with typical development. Since some individuals did not present all 16 waves, the label 'N' specifies the number of participants who presented each wave and, consequently, the number of participants included in the other analyses.
Repeated-measures MANOVA was conducted to determine whether there were differences between the three syllables studied. Results of the analyses were divided into four parts: a) latency of the onset peaks (1,2); b) latency of the major peaks (3,4,6,7,9,10,12,13,15,16); c) latency of the minor peaks (5,8,11,14); d) latency of the end-point peaks (15,16). Univariate analysis was used to evaluate the relative contribution of each latency measure to the difference found. These analyses indicated statistically significant differences between the three stimuli for the following waves: 3 (p<0.001), 4 (p<0.001), 6 (p<0.001), 7 (p<0.001), 9 (p<0.001), and 10 (p<0.001).

Analysis of onset peak latencies
For paired comparison, the paired t-test was applied to verify the differences between the stimuli (Table 3).
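The paired comparison can be illustrated with a minimal sketch of the paired t statistic, computed here with the Python standard library only. The function name is hypothetical, the data are toy values, and pairs with a missing latency (stored as `None`) are skipped, which is an assumption about how absent waves might be handled. Obtaining the p-value would additionally require the t distribution, for example from a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t(lat_a, lat_b):
    """Return (t, df) for the paired differences lat_a - lat_b.

    Pairs in which either latency is None (peak absent) are skipped.
    """
    diffs = [a - b for a, b in zip(lat_a, lat_b) if a is not None and b is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two complete pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Toy latencies (ms) for one peak in three participants; not real data.
t_value, df = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Here the differences are 1, 2 and 3 ms, giving a mean difference of 2 ms, a standard deviation of 1 ms, and t = 2·√3 with 2 degrees of freedom.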

Analysis of minor peak latencies
Repeated-measures MANOVA could not be used to assess the latency values of minor peaks since the number of absences found in these waves (Table 2) hindered the application of this type of analysis. This finding demonstrates greater inconstancy of these waves compared with peaks of the major, onset and end-point waves.
Thus, only the paired t-test was used to verify the differences between the stimuli (Table 4).

Analysis of end-point peak latencies
Repeated-measures MANOVA showed multivariate difference in latency measures between the electrophysiological responses [...]
For paired comparison, the paired t-test was applied to verify the differences between the stimuli (Table 5).

DISCUSSION
Because the neural processing of acoustic transitions over time is important for the integrity of speech processing, there is great interest in understanding how a normal central auditory nervous system encodes this information, so that what happens when this encoding is disrupted, or is still under development, can be understood.
This study aimed to understand how the auditory pathways located in the brainstem reflect subtle acoustic differences existing between the plosive consonant-vowel syllables [ga], [da] and [ba], which differ only in the transition from the F2 to F3 frequencies.
The results confirmed the first hypothesis of this study, that is, the differing F2 and F3 frequencies were manifested in the neural processing of the acoustic characteristics of the studied stimuli. In other words, changes in the latency of the electrophysiological response were demonstrated, with a progressive increase in the response latency for the [ga], [da] and [ba] stimuli, in that order. This latency difference between the stimuli was evident mainly for the major and minor peaks. However, the major peaks had a clearer and more stable morphology and were present in all participants, unlike the minor peaks (Table 2).
According to the theory presented by Johnson et al. (4), this distinction in the responses between major and minor peaks supports the idea that separate neural mechanisms are responsible for encoding different acoustic aspects of speech sounds. The major peaks would represent the fundamental frequency (F0) and correspond to the glottal pulse in the stimulus, thus transmitting information about the pitch. In contrast, the minor peak latencies reflect the formant transitions of the stimulus, which vary between the [ga], [da] and [ba] syllables and are expressed in the time domain in the electrophysiological response, because variation of these frequencies is beyond the phase-locking capacity of the auditory system.
Since major peaks reflect the stimulus F0, these peaks would be expected to be identical across the neural responses obtained for all the syllables used in this study; however, differences between latencies were also observed for the major peaks.
One hypothesis for this observed difference is that the major peaks are influenced by the patterns observed in the minor peaks. Another factor to be considered would be that, in natural articulations, pitch perturbations caused by articulatory movements in the vocal tract could be present. In this case, the systematic pattern observed in the minor peaks could also be evidenced in the major peaks.
Since the minor peak latencies are neural representations of the formants of these stimuli, the smallest difference found between the [...] between F2 and F3. These contrasts in the neural representation of electrophysiological responses were also verified through the Cross-phaseogram analysis, as shown in Figure 2. Therefore, the presented results corroborate the hypothesis of Johnson et al. (4) and Hornickel et al. (26), demonstrating that neural encoding for the different acoustic elements is manifested in a different, independent way and can be studied through the FFR.
The second hypothesis of this study was also confirmed. Table 2 and Figure 3 show that the difference between the mean latencies obtained across the three stimuli decreased during the response, until it disappeared when the three syllables reached their steady state (vowel) (Figure 2).
However, no statistically significant differences were observed between the mean latencies in the initial or onset part of the response (waves 1 and 2). These findings corroborate those reported by Johnson et al. (4), who related the neural response onset to the initial burst of the voiced plosive syllable stimulus, which is similar in the three studied stimuli.
The third hypothesis of this study has also been demonstrated, since no differences between the latencies of the electrophysiological responses were observed across the three stimuli in the sustained portion (vowel). This result was already expected since the acoustic properties in this portion are identical across the studied stimuli.
Thus, the different electrophysiological representations of the acoustic characteristics of the transient and sustained portions of the speech stimuli in the brainstem in children with typical development show that different neural mechanisms, mediated by neural synchrony or phase-locking, have separately encoded these acoustic cues.
This study contributes to the understanding of the subcortical neural mechanisms that underlie formant transition encoding. The results showed that the electrophysiological responses in the first 70 ms were responsible for differentiating between spectral cues that assist with distinguishing between consonants. This suggests that different neurons have specific responses to different acoustic aspects, that is, high-frequency stimuli elicit earlier responses than low-frequency stimuli. This progression in latency as a function of frequency has already been demonstrated in the pure-tone brainstem response (20). In a mature auditory system, the basal region of the cochlea is more responsive to high frequencies and the apical region is more responsive to low frequencies. This tonotopic organization is preserved along the neural auditory pathways, which would assist with preserving spectral information in neural encoding activity (12,13).
Although this study has contributed new information regarding the representation of transient and sustained acoustic cues in the subcortical auditory pathways, there is still much to be investigated. Regarding the normal encoding of acoustic characteristics, it is hoped that future studies will add a wider repertoire of syllables, including consonants with different places of articulation.
Finally, it is believed that the FFR with speech stimuli (or other complex stimuli), together with other measures and clinical assessments, can inform processes that underlie the biological nature of auditory processing and speech and language changes, assist with therapeutic strategies, and promote an objective index of therapeutic evolution. For example, some populations may present deficits in neural encoding for specific elements of onset and/or end-point, or specific for formant rapid transition encoding. In contrast, there is also a possibility that some children present deficits in neural encoding, both in transient and sustained information.
Thus, the results of this study suggest that such populations could be more precisely identified and that more targeted therapeutic programs and strategies could be developed to suit the specific area of difficulty.

CONCLUSION
The Frequency-following Response (FFR) proved to be an efficient tool to investigate the subcortical discrimination of acoustic differences in speech sounds, since the data demonstrate that the electrophysiological responses present differences relevant to each of the three evoked syllables. In the transient portion (consonants), latency shifts were observed, whereas no differences between the latencies across the three stimuli were found in the sustained portion (vowel). In other words, different neural representations for the different acoustic characteristics of the [ga], [da] and [ba] syllables could be observed.
Considering the existing knowledge on the encoding of acoustic characteristics of speech sounds, these data assist with understanding how the brainstem encodes the important perceptual differences in speech through the FFR. It is believed that this study has significance in expanding the knowledge on how the neural encoding of these acoustic differences occurs in clinical populations.