A phonetic study of Zootopia characters’ voices in Brazilian Portuguese dubbing: the role of stereotypes

This work takes an experimental approach to the investigation of expressive speech, integrating methodological procedures of perceptual and acoustic analysis. Its object is voice quality and vocal dynamics. Speech samples from the four main characters, each with a distinct personality, in the animated feature film “Zootopia”, dubbed by Brazilian voice actors, were analysed. Given the expressive function of voice quality, we posed the following question: what types of voice quality and vocal dynamics settings were used by the voice actors in the Brazilian dubbing of “Zootopia” to compose the vocal profiles of the characters? Perceptual evaluation of the 54 speech stimuli was performed using the Vocal Profile Analysis protocol (Laver & Mackenzie Beck, 2007). Acoustic measures were automatically extracted using the ExpressionEvaluator script (Barbosa, 2008) for PRAAT. The profiles for each of the four characters were composed based on the psychological traits described in the film script. The results of the acoustic analysis and of the perceptual analysis of voice quality and vocal dynamics settings were correlated using the MFA (Multiple Factor Analysis) method in the R environment, based on 40 variables (quantitative and qualitative); the speech stimuli were distributed in 6 clusters according to the variables analysed. The quantitative variables that presented the highest correlation percentage were: Standard Deviation of f0 Derivative, Standard Deviation of Spectral Tilt, and f0 Median. The qualitative variables that presented the highest correlation percentage were: Lowered Larynx, Lip Rounding, Breathy Voice and Minimised Pitch Range. The research presents evidence in favor of the symbolic use of phonic matter and contributes to the understanding of how vocal stereotypes are established.


Introduction
The potential of speech sounds to express a variety of linguistic, extralinguistic and paralinguistic meanings is what makes speech such an effective means of communication. Expressivity in speech is built on the relationship between segmental elements and prosodic elements. It is the interaction of those speech structures that yields multiple meanings for the same sequence of segmental elements (Madureira, 2004, 2011, 2018). As such, the investigation of vocal expressivity must consider speech in all its dimensions, including the phonetic aspects which characterize the speakers' vocal performances as animators (Goffman, 1981) and impress listeners, who attribute meaning effects based on vocal profiles. Vocal profiles are therefore important objects of investigation, since they are powerful indices of physical and psychological states (Gobl and Chasaide, 2003).
This study aims to investigate and characterize the vocal profiles of the four main characters of the animated film "Zootopia" in its Brazilian Portuguese dubbing. The film is an animation featuring anthropomorphic animals living in a metropolis populated by mammals, called Zootopia. The film's narrative centers around the issue of prejudice and assumptions based on stereotypes, and how harmful they can be. The "modern fable" aspect of the film and the mixture of physical and biological characteristics of animals with human traits and attitudes allow us to correlate our data with patterns described in the sound symbolism codes and examine the role of vocal stereotypes in the characters' vocal profiles.
This study takes an experimental approach to the investigation of vocal expressivity that integrates methodological procedures of perceptual and acoustic phonetic analysis, and multivariate analysis. It seeks to answer the following questions: what types of voice quality and vocal dynamics settings were used by the voice actors in the Brazilian dubbing of "Zootopia" to compose the vocal profiles of the characters? Do they reflect stereotypes understood as sound-symbolically coded?
Sound symbolism emerges from the direct relations that hold between the physical properties of sounds and meanings, as stressed in works on the associations between phonetic characteristics and linguistic, paralinguistic and extralinguistic meanings, such as those by Köhler (1947), Jakobson (1978, 1979), Albano (1988), Tsur (1992), Hinton, Nichols and Ohala (1994), Abelin (1999), and Fónagy (1983, 2001), to mention a few that address this complex issue. The fact that sounds have meaning effects can be explored in the study of vocal stereotypes, considering that both biological and culture-specific constraints might determine them, as pointed out by Kreiman and Sidtis (2011) based on evidence from the perception of personality from voice.

Theoretical Framework
The basic frameworks for the analysis of voice quality and vocal dynamics and findings of previous studies on the links between voice and physical, symbolic and psychological features are considered in this section.

Voice quality and vocal dynamics
The phonetic descriptive model of voice quality proposed by Laver (1980) covers both phonatory (laryngeal) and articulatory (supralaryngeal) elements of the vocal tract. The model considers the possible configurations of all those elements that define an individual's speech, and describes them in articulatory, physiological, acoustic and auditory terms.
This phonetic model for assessing voice quality considers the inherent anatomical features of the human phonatory system, as well as extrinsic factors, such as long-term articulatory and phonatory settings, polysegmental, recurrent muscle mobilizations that occur in the vocal tract during speech. These recurrent and long-lasting extrinsic factors during the production of the speech segments constitute the analytical unit of this model, the setting.
Laver presents two principles in this descriptive model of voice quality: compatibility and susceptibility. With the first principle, Laver establishes that certain settings are incompatible and cannot occur simultaneously during the production of a segment. For example, the lip spreading setting is incompatible with the lip rounding setting. This principle also governs the relationship between settings and the individual anatomy of the speaker: the individual anatomical configuration of a speaker determines the degree of ease in adopting a specific setting in their speech.
The principle of susceptibility establishes that segments are more vulnerable to certain settings, and thus, such settings are more easily perceived in those segments. In general, phonic segments are more susceptible to the influence of settings with which they do not share articulatory, acoustic or auditory features. For example, oral vowels are more vulnerable to the influence of nasal settings than nasal vowels.
The VPA model and script are based on the neutral setting, a set of settings that occur simultaneously in the vocal tract. This setting is characterized by no disturbances at any point along the vocal tract due to the actions of the lips, jaw, tongue, pharynx or larynx (Laver, 2000).
The remaining voice quality settings are described in relation to the neutral setting. The protocol consists of two stages of analysis: 1. the identification of settings distinct from the neutral setting in the subject's voice; 2. the attribution of values to these non-neutral settings, ranging from 1 to 6; the higher the degree, the greater the difference in relation to the neutral setting. The combination of these settings and their degrees constitutes the vocal profile of a subject (Laver, 1980).
The intrinsic and extrinsic settings exhibited by a speaker convey not only information about the linguistic message, but also information about the speaker, which can be divided into three categories (Laver & Trudgill, 1979): markers of social characteristics, such as social class, level of education and place of origin; markers of physical features, such as sex, body type and age; and markers of attitude and affective state.
The first category considers that an individual who speaks a certain language and belongs to a speech community adopts certain voice quality settings that relate to the sounds in their language and dialect (Honikman, 1964), and that these particular vocal tract configurations can be recognized by the listeners.
The second category is linked to intrinsic factors of voice quality, since they determine vocal characteristics resulting from anatomical attributes shared by groups of speakers, such as the shorter vocal tracts of children.
The third category is related to voice quality settings resulting from emotional discharges and their consequences for the physiological configuration of the vocal tract, as well as long-term attitudes and personality traits signaled in a speaker's voice.

Sound symbolism codes
Four kinds of sound symbolism codes have been proposed in the phonetic literature. They establish connections between the physical properties of sounds and the expression of meaning, the basis of speech expressivity (Madureira, Fontes and Camargo, 2019). The Frequency Code (Ohala, 1980) is linked to fundamental frequency (f0), an acoustic parameter related, in terms of speech production, to the number of vibrations per second of the vocal folds and, in perceptual terms, to the auditory sensation that goes from low to high (pitch). The Frequency Code is based on the evolution of vocalizations in animals, as a survival instinct of the species: low f0 values (few vibrations of the vocal folds per second) are linked to larger animals, and signal power, force, hostility and aggressiveness. In contrast, high f0 values (many vocal fold vibrations per second) are associated with smaller animals and signal submission and fragility.
In the animal kingdom, the use of frequency variations, combined with other ways of manipulating how one's physical size is perceived, is decisive in situations of confrontation, especially in situations where the intention is precisely to avoid direct confrontation. In certain aspects of human vocalizations, the variation in f0, associated with other visual and gestural communication, signals attitudes and intentions of the speaker. Thus, the Frequency Code reveals the direct relationships that can exist between sound and meaning.
The Effort Code concerns articulation; the higher the degree of articulatory effort, the more precise the articulation, which carries some potential meanings. A greater degree of articulatory effort is related to tension, determination, while a lesser degree of effort is linked to disinterest, relaxation.
The Production Code, or Respiratory Code (Gussenhoven, 2004), is related to subglottic air pressure. At the beginning of a sentence, subglottic air pressure rises, and it decreases at the end. Thus, it also generates potential meaning effects. Low subglottic air pressure in speech can be associated with weakness, while high subglottic pressure can give the impression of arousal.
The Sirenic Code (Gussenhoven, 2016) is related to the voice characterized by air escaping between the vocal folds (breathy voice). This type of voice quality has a linguistic function (interrogative marker), and refers to paralinguistic (low excitement and seduction) and extralinguistic (female sensuality and fragility) elements in speech. Its name is linked to these last two elements, a reference to the mythical tales of sirens, who attract sailors (usually to their deaths) with their voice and singing.

Vocal cues to personality and vocal stereotypes
Sapir (1927) wrote an article about the expressive value of speech in which he refers to speech as a personality trait. He opened the path to the formal studies on the relationship between voice and personality which were developed in the 1930s. These early studies follow the first of the three main approaches taken in studies of personality markers in the voice: accuracy studies, which were concerned with how accurate listeners were when judging a speaker's personality from their voice.
The first of such works was the study by Pear (1931), conducted in England. The research consisted in the broadcast of a reading of an excerpt from a text by Charles Dickens, produced by 9 different speakers with different professions. After this broadcast, the listeners sent the researchers a form published in the newspaper with their opinions on the personalities and professions of the speakers.
The study adopted no specific parameters for the recordings or for the selection of the judges. However, an important finding of the work was the existence of vocal stereotypes, shared by the judges who responded to the survey. For some of the voices, the percentage of agreement on the speaker's profession was higher than the percentage of judges who answered the question correctly (Kreiman & Sidtis, 2011).
Other formal studies with more robust experimental designs were carried out during the 1930s and 1940s, such as the study by Allport and Cantril (1934), in which listeners judged characteristics such as the physical appearance, professions, values and political preferences of three speakers. The results of these studies consistently replicated the conclusions of Pear's study. Teshigawara (2003) and Kreiman & Sidtis (2011) report that the accuracy of the findings in the research carried out in the 1930s and 1940s may be affected by the inherent experimental limitations of the studies.
The studies carried out after the 1940s, for the most part, can be divided into two other approaches: studies of externalization and studies of attribution (or inference). The first approach refers to studies that aim to explore the correlation between acoustic analysis, expert analysis and coding of the voices of speakers, and personality tests based on self-assessment. Studies that took this approach also found it difficult to produce solid conclusions from these correlations due to the inadequate nature of the personality assessment parameters, and the inaccuracy of the acoustic measures of the speech samples used. In addition, Scherer (1979) points out how cultural differences can make the comparison of studies in different languages difficult. The 1970s were indeed productive in terms of research focusing on the roles of voice qualities in determining psychological and social states (Scherer, 1972, 1979, 1989; Scherer, Uno and Rosenthal, 1972; Scherer, Rosenthal and Koivumaki, 1972; Giles, Scherer and Taylor, 1973; Scherer, London and Wolf, 1973).
The latter approach refers to the investigation of the attribution of characteristics to speakers by lay judges, without concern for the accuracy of such judgments. This approach has characterized most of the studies on voice and personality in recent decades. Within this approach are studies on the "vocal attractiveness" stereotype.
Just as physically attractive individuals are perceived as more confident, competent and sympathetic, individuals with voices assessed by judges as "attractive" are also perceived in this way (Berry, 1991, 1992; Zuckerman & Driver, 1989). According to Zuckerman, Hodgins and Miyake (1990), this stereotype, however, is prominent when the judges do not know the speaker, and applies less and less as the listener becomes familiar with the speaker. This study also demonstrated that the vocal stereotype proved to be more influential than the physical stereotype in highly familiar relationships.
Some studies, such as that of Hecht and LaFrance (1995), have combined the externalization and attribution approaches. In this study, the authors investigated whether the vocal characteristics of telephone attendants and the impressions caused by their voices correlated with the speed with which these professionals served customers. The researchers asked judges to hear statements from selected attendants and to classify personality traits and vocal characteristics. The traits were grouped into a single factor called positive attitude; correlations were calculated between vocal characteristics and positive attitude. The vocal characteristics that showed significant correlations with positive attitude were described as "modulated" and "clear". The first characteristic can be understood as a high variability of pitch and intensity in relation to the duration of the speech segments. The second characteristic, in turn, may indicate a high variability of articulatory movement in speech.
The study by Yarmey (1993) combined investigations of vocal and facial gestures. In the study, judges were asked to classify the vocal characteristics of 15 subjects in three situations: stimuli presenting only the subjects' faces (in this situation, the judges had to imagine the vocal characteristics of the subjects), stimuli presenting only the subjects' voices, and stimuli combining the visual and sound information. The results of the study suggest that the voice configurations of noncriminal subjects are more typical and more pleasant, while the voices of criminals have unique and less pleasant characteristics, and are perceived as monotonous, rigid and unclear.
In a study of vocal expressiveness in Japanese animation (anime), Teshigawara (2013) examined the characters' voices using a modified version of Laver's voice quality model (2000), acoustic measures such as f0 and formants, and vocal types correlated with the categories of characters (heroes and villains). The results of these analyses were compared statistically with a perceptual experiment, which consisted of the assessment by lay judges of the age, sex, physical characteristics, personality traits, emotional states and vocal characteristics of these same characters. The study showed that villains and heroes in anime show differences in voice quality, especially in relation to pharynx and larynx settings. Most villains presented pharyngeal expansion and laryngeal sphinctering, while the heroes exhibited breathy voice and a neutral setting for the pharynx. The study also demonstrated that these settings have different impressive effects on the listeners, with the judges of the perceptual experiment attributing unfavorable physical traits, personality traits and affective states to the voices that had non-neutral pharynx and larynx settings.
One issue with which discussions about vocal stereotypes are concerned is the origin of stereotypes; whether their empirical bases are biological or cultural. Montepare and Zebrowitz-McArthur (1987) tested their hypothesis that vocal stereotypes were based on ecological factors, in a study comparing the perception of "infantilized" voices by North American and Korean listeners. Listeners of both nationalities associated childish voices with weak, incompetent and affectionate personalities. The difference between the two groups was shown only in the association of female voices with affectionate personalities by North American listeners, but not by Korean listeners. These results indicate that the two cultures share these particular vocal stereotypes, and give credibility to the hypothesis of the biological origin of these stereotypes.
However, the field also has a number of studies that point to the social origins of vocal stereotypes, such as the study by Peng, Zebrowitz and Lee (1993), also focused on North American and Korean judges. Due to the differences in how the two cultures see youth and old age, the authors hypothesized that the personality traits perceived from the vocal characteristics associated with youth and old age would be different for North American and Korean subjects. The study showed that American judges (coming from a culture that values youth) considered voices that were louder and exhibited faster speaking rates (characteristics associated with young voices) as stronger and more dominant. For Korean judges, this association was observed only in relation to intensity.
Thus, evidence from studies carried out in the area demonstrates both biological and cultural influences on the way in which personality is perceived through the voice; different aspects of personality and the vocal stereotypes associated with them may have different origins.
Some studies on vocal attractiveness in children's voices suggest a possible additional influence on the relationship between voice and personality: the "self-fulfilling prophecy", as proposed by Scherer and Scherer (1981). According to this theory, individuals may develop personality dispositions and behaviors based on inferences made by their significant others when interacting with them.

Material and methods
This section presents the study's material and the methodological procedures. The first subsection presents the material selected for the analysis and the criteria for the selection; the following subsections give an overview of the methods applied in the study.

Material
As a first step in the methodological procedures, the Brazilian Portuguese audio track and the video file from the Blu-ray disc of "Zootopia" were extracted and then converted into .mp4 format for video and .wav for audio.
The files were then loaded into ELAN. With the software, the different scenes of the film were annotated and given a brief description, and the audio track was transcribed. Those tiers were then converted into .TextGrid files and loaded into PRAAT along with the audio track.
The voice samples analysed in this study belong to the film's four main characters: Judy Hopps (bunny), Nick Wilde (fox), Chief Bogo (buffalo) and Assistant Mayor Bellwether (sheep). The voice acting for each of these characters in the Brazilian Portuguese dubbing is performed by a different actor.
The characters were profiled based on information taken from interviews with the film's creators, writers and voice actors, and promotional materials released by the studio.
The speech samples were taken from scenes at different points of the film's narrative, in which these characters interacted with at least one of the other three selected characters, and their demeanor was consistent with their psychological profiles. The samples did not have a musical score or sound effects that would interfere with the acoustic and perceptual analyses of the voices. In total, 16 speech samples were selected, four for each character. The samples were between 7 and 14 seconds long.
For the VPA analysis and the extraction of acoustic parameters in PRAAT, the 16 speech samples were edited into smaller sections (stimuli) to eliminate long pauses, sounds and noises that were not part of the lines of the character being analysed. This generated 54 stimuli.
Chart 1 below lists the numbers of stimuli analysed for each character.

Figure 1 - Stimuli selected for analysis
The full transcription of the stimuli is included at the end of this paper.

Perceptual analysis
The stimuli were rated according to the VPA protocol (Laver and Mackenzie Beck, 2007). This rating was performed by a phonetician with twenty years of experience with the VPA.
The protocol consists of 55 settings, defined in relation to the neutral setting. Those settings are divided into six categories: Vocal Tract Features, Overall Muscular Tension, Phonation Features, Prosodic Features, Temporal Organization and Other Features.
The rating of voices with the protocol is a two-stage evaluation. In the First Pass, raters note whether a subject's voice presents any of those 55 non-neutral settings. In the Second Pass, raters assess the degree of the non-neutral settings marked in the First Pass on a six-point grading scale: 1 to 3 represent moderate degrees, 4 to 6 represent extreme degrees.
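The two-pass rating described above can be pictured as a small data structure. The following Python sketch is illustrative only: the setting names are taken from the results reported in this paper, but the helper functions `classify_degree` and `build_profile`, the use of 0 as a marker for a neutral setting, and the example ratings are hypothetical and are not part of the VPA protocol itself.

```python
from typing import Dict


def classify_degree(degree: int) -> str:
    """Map a scalar degree (1-6) to the band used in the Second Pass."""
    if not 1 <= degree <= 6:
        raise ValueError("VPA scalar degrees range from 1 to 6")
    return "moderate" if degree <= 3 else "extreme"


def build_profile(ratings: Dict[str, int]) -> Dict[str, str]:
    """Keep the non-neutral settings (First Pass) and label their band (Second Pass).

    A rating of 0 is a hypothetical convention meaning 'neutral'.
    """
    return {
        setting: classify_degree(degree)
        for setting, degree in ratings.items()
        if degree > 0
    }


# Hypothetical ratings for a single stimulus (not data from the study).
example_ratings = {"lip rounding": 4, "lowered larynx": 3, "breathy voice": 0}
profile = build_profile(example_ratings)
print(profile)
```

Storing only the non-neutral settings mirrors the way the protocol defines every setting as a departure from the neutral one.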
The VPA protocol is included at the end of this paper. The ratings were done in PRAAT and annotated in .TextGrid files.

Acoustic analysis
For the acoustic analysis of the characters' voices, the script ExpressionEvaluator (Barbosa, 2008) for PRAAT was applied to the 54 stimuli. The script was used to extract five classes of acoustic parameters: fundamental frequency (F0), the first derivative of the fundamental frequency (dF0), intensity, spectral tilt (SpTt) and Long-Term Average Spectrum (LTAS). One to four descriptors were used for each class, producing the following parameters: F0 median; F0 inter-quartile semi-amplitude; F0 0.995 quantile; F0 skewness; F0 first derivative mean; F0 first derivative standard deviation; F0 first derivative skewness; Global intensity skewness; Spectral tilt; Spectral tilt standard deviation; Spectral tilt skewness; and Long-Term Average Spectrum (LTAS) standard deviation. Some remarks about these parameters are worth making, since they are not obvious choices. Spectral tilt, for example, is a known correlate of vocal effort. The first derivatives are sensitive to changes in the respective parameters; the first derivative of F0 can thus be used to detect abrupt changes in the F0 contour. Finally, skewness is a way of quantifying deviation from a normal distribution.
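To make the F0 descriptors listed above more concrete, the sketch below computes them for a short contour in plain Python. It is not the ExpressionEvaluator script: the function names, the linear-interpolation quantile, the moment-based skewness formula, the finite-difference derivative and the example contour are all assumptions made for illustration, and the script's own definitions and normalizations may differ.

```python
import statistics
from typing import Dict, List


def quantile(values: List[float], q: float) -> float:
    """Quantile by linear interpolation between order statistics (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def skewness(values: List[float]) -> float:
    """Mean cubed deviation divided by the cubed (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def f0_descriptors(f0: List[float], step: float) -> Dict[str, float]:
    """Summarize an F0 contour (Hz) sampled every `step` seconds."""
    # Finite-difference approximation of the first derivative (Hz per second).
    df0 = [(b - a) / step for a, b in zip(f0, f0[1:])]
    return {
        "f0_median": statistics.median(f0),
        "f0_iq_semi_amplitude": (quantile(f0, 0.75) - quantile(f0, 0.25)) / 2,
        "f0_q995": quantile(f0, 0.995),
        "f0_skewness": skewness(f0),
        "df0_mean": statistics.fmean(df0),
        "df0_sd": statistics.pstdev(df0),
        "df0_skewness": skewness(df0),
    }


# Hypothetical contour with one abrupt rise (not data from the study).
contour = [200.0, 210.0, 220.0, 215.0, 205.0, 260.0, 250.0]
summary = f0_descriptors(contour, step=0.01)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy contour the single abrupt rise dominates the derivative's standard deviation, which is the behaviour the text attributes to the first derivative of F0.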
The script operates with the option of parameter reference values for male and female speakers. Thus, the reference values for each character's gender were set for each of the measurements. With the extracted data, the script automatically generates a .txt file and PRAAT graphics.
The values for each of the parameters extracted to the .txt file were then exported to an Excel spreadsheet along with the results of the VPA rating.

Multivariate analysis
Multivariate analysis allows data to be projected onto different planes and measures to be weighted in different dimensions. By combining principal component methods, hierarchical clustering and partitioning, this method of analysis makes it possible to highlight the similarities and differences between stimuli (Husson et al., 2013).
For multivariate analysis, we used R, a programming language and free software for statistical calculations and data visualization, and within this environment, the R Commander and FactoMineR interfaces.
From the FactoMineR package, designed for the exploratory analysis of multiple variables, we chose the MFA (Multiple Factor Analysis) method, an extension of the PCA (Principal Component Analysis) method, used for analyses in which a set of elements is characterized by variables structured in groups, which can come from different sources of information. The method calculates the principal components of each of the groups of variables in order to map the similarities between the stimuli in relation to all the variables and to group them based on these similarities (clustering).
The basic methodology of this analysis consists in looking at the data from different angles to understand its complexity as a data "cloud". The variables are represented by points in a multidimensional space, which can be defined provided that each of these variables has a value for every stimulus. The multivariate analysis presented in this work was performed in 7 dimensions.
In this work, the variables are the VPA protocol settings identified in the 54 stimuli and the acoustic parameters extracted by the ExpressionEvaluator script. All values of the considered variables were normalized by z-score.
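As an illustration of the normalization and of the way MFA balances groups of variables, the sketch below z-scores each variable and then divides a whole group by the square root of the first eigenvalue of its own PCA, so that no group dominates the first dimension merely because it contains more or more strongly correlated variables. This is a minimal sketch for a quantitative group only; the function names, the use of the population standard deviation, the power-iteration eigenvalue routine and the example data are assumptions, and in the study itself these steps are handled by FactoMineR in R.

```python
import math
from typing import List

Columns = List[List[float]]  # one inner list per variable, one value per stimulus


def z_score(column: List[float]) -> List[float]:
    """Center a variable and scale it to unit variance (population SD, an assumption)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if sd == 0:
        raise ValueError("a constant variable cannot be z-scored")
    return [(v - mean) / sd for v in column]


def covariance_matrix(columns: Columns) -> Columns:
    """Covariance matrix of variables that are already centered."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]


def first_eigenvalue(sym: Columns, iterations: int = 200) -> float:
    """Largest eigenvalue of a symmetric positive semi-definite matrix (power iteration)."""
    vec = [1.0 + 0.37 * i for i in range(len(sym))]  # arbitrary asymmetric start
    for _ in range(iterations):
        nxt = [sum(r * v for r, v in zip(row, vec)) for row in sym]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    projected = [sum(r * v for r, v in zip(row, vec)) for row in sym]
    return sum(p * v for p, v in zip(projected, vec))  # Rayleigh quotient


def balance_group(columns: Columns) -> Columns:
    """Z-score every variable, then rescale the group so its first eigenvalue is 1."""
    standardized = [z_score(c) for c in columns]
    scale = math.sqrt(first_eigenvalue(covariance_matrix(standardized)))
    return [[v / scale for v in col] for col in standardized]


# Hypothetical group of two acoustic variables measured on five stimuli.
group = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0]]
balanced = balance_group(group)
print(first_eigenvalue(covariance_matrix(balanced)))
```

Qualitative groups, such as the VPA settings, require an indicator-style coding before an analogous weighting can be applied; that step is deliberately left out of this sketch.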

Results
This section presents the results of the analyses performed in the study, which are divided into three subsections. The first of these subsections, Perceptual analysis, is further divided into four parts, each with a brief overview of the results of the perceptual analysis for one of the four characters.

Perceptual analysis
We begin this section by presenting the results of the perceptual analysis using the VPA protocol. In Table 3 below, we have listed the settings exhibited by each of the characters in their stimuli. The settings are color-coded according to type of setting.

Judy Hopps
The character was the only one that did not exhibit any non-neutral phonatory setting in any of her stimuli.
High pitch variability was the most recurrent setting in the character's stimuli, appearing in 13 of the 16 stimuli analysed, always at a moderate degree (2).

Nick Wilde
Nick Wilde was the character who exhibited the highest number of non-neutral voice quality and vocal dynamics settings; 14, in total.
The breathy voice setting was the main setting identified in the character's stimuli, appearing in 12 of the 14 stimuli, always in moderate degrees.

Chief Bogo
The character was the only one among the four who exhibited non-neutral settings to extreme degrees.
The Lip rounding and Lowered larynx settings appear in 10 of the 12 stimuli from the character, in similar degrees. The main differences between the stimuli were the muscle tension and vocal dynamics settings.

Assistant Mayor Bellwether
In her first six stimuli, the character maintained a stable set of voice quality and vocal dynamics settings. In the final six stimuli, when the character shows her true manipulative personality, there was greater variation between the non-neutral settings present in the segments.
Among the settings exhibited in the first six stimuli, breathy voice and high mean loudness were the only ones that did not appear in any of the final stimuli. On the other hand, the tense vocal tract setting appeared in most of the final stimuli and in none of the first six stimuli.

Acoustic Analysis
This section presents the results of the analysis of the stimuli's acoustic and statistical parameters, related to fundamental frequency and intensity.
Regarding the AMT (Global intensity skewness), the values for this measurement were higher in the stimuli from Judy Hopps than in the stimuli from the other characters. Higher values of AMT may be linked to high pitch and laryngeal tension.
The average values for F0 were calculated in PRAAT, correcting extraction errors. The F0 averages of the characters' speech stimuli were: 239 Hz for Assistant Mayor Bellwether in the scenes where she appears to be friendly and affable, and 282 Hz in the scenes where she acts as the villain of the film; 126 Hz for Chief Bogo; 326 Hz for Judy Hopps; 145 Hz for Nick Wilde.
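Because pitch is perceived roughly logarithmically, the distances between these averages may be easier to compare on a musical scale. The sketch below converts the reported means into semitones above the lowest one, using the standard relation 12 * log2(f1 / f2). The conversion is an added illustration and is not part of the analysis reported in this paper; the function name and the choice of Chief Bogo as the reference are assumptions.

```python
import math


def semitone_distance(f_high: float, f_low: float) -> float:
    """Interval between two frequencies in semitones: 12 * log2(f_high / f_low)."""
    return 12.0 * math.log2(f_high / f_low)


# Mean F0 values (Hz) as reported in the text above.
mean_f0 = {
    "Chief Bogo": 126.0,
    "Nick Wilde": 145.0,
    "Bellwether (friendly scenes)": 239.0,
    "Bellwether (villain scenes)": 282.0,
    "Judy Hopps": 326.0,
}

reference = mean_f0["Chief Bogo"]  # lowest mean F0, used here as the reference
for name, hz in mean_f0.items():
    distance = semitone_distance(hz, reference)
    print(f"{name}: {distance:.1f} semitones above Chief Bogo")
```

On this scale, a doubling of F0 corresponds to 12 semitones, so the gap between the lowest and highest averages is somewhat more than an octave.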

Multivariate Analysis
For this analysis, 40 variables were considered. Those variables are divided into two groups: the ExpressionEvaluator script measurements (quantitative data), and the voice quality and vocal dynamics settings (qualitative data) identified as non-neutral in the selected stimuli. Charts 4 and 3, below, show the variables that make up each of the groups.
The analysis showed that the group of quantitative variables (Gc1) was projected farther than the group of qualitative variables (Gq1) in Dimension 1 of the vector space. In Dimension 2, the two groups of variables were equally projected. This means that these two dimensions are relevant to our analysis. The two dimensions have 21.68% (11.75% in Dim1, and 9.33% in Dim2) of explanatory power for the data. Figure 4, below, shows the projection of each of the groups of variables in the two dimensions.
The two groups of variables were relevant to differentiate the stimuli. The relevance of each group can be verified from its distance from the origin and from the two axes: the more distant, the more relevant. This distance is shown in Figure 5, below.
The Lg coefficient, which explains the degree of projection of the variables in the vector space, indicated that the group of variables with the highest projection (highest value of Lg) in this study was Gq1 (VPA settings). Table 5, below, lists all reported values for Lg. The Rv coefficient, which explains the degree of similarity between the groups, indicated that the group of variables with the highest similarity index (highest Rv value) in this study was also the Gq1 group. In Table 3, below, all reported values for Rv are listed. Table 7, below, lists the contribution of the two groups in the 7 dimensions; the higher the value, the more relevant the contribution of the group to that dimension.
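For readers unfamiliar with the Rv coefficient mentioned above, the sketch below computes it in its usual form for two groups of quantitative variables measured on the same stimuli: the trace of the product of the two stimulus-by-stimulus cross-product matrices, normalized so that the result lies between 0 and 1. The function names and the example data are hypothetical, and the sketch does not reproduce the group weighting or the coding of qualitative variables that MFA applies before computing its coefficients.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = stimuli, columns = variables


def center_columns(rows: Matrix) -> Matrix:
    """Subtract each variable's mean so that every column sums to zero."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
    return [[r[j] - means[j] for j in range(len(r))] for r in rows]


def gram(rows: Matrix) -> Matrix:
    """Stimulus-by-stimulus cross-product matrix X X^T."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]


def trace_of_product(a: Matrix, b: Matrix) -> float:
    """trace(A B) for two square matrices, without forming the product."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))


def rv_coefficient(x: Matrix, y: Matrix) -> float:
    """RV = trace(Wx Wy) / sqrt(trace(Wx Wx) * trace(Wy Wy))."""
    wx = gram(center_columns(x))
    wy = gram(center_columns(y))
    return trace_of_product(wx, wy) / math.sqrt(
        trace_of_product(wx, wx) * trace_of_product(wy, wy)
    )


# Hypothetical groups measured on the same four stimuli (not data from the study).
group_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
group_b = [[2.0], [1.0], [4.0], [5.0]]
print(round(rv_coefficient(group_a, group_b), 3))
```

A value of 1 means that the two groups arrange the stimuli in the same configuration up to rotation and scaling, while a value near 0 means that the configurations are unrelated.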
In this study, we have considered the contribution of the two groups of variables only in Dimensions 1 and 2, which have sufficient explanatory power for the data analysed. The multivariate analysis has shown that the quantitative variables with the highest percentage of correlation in the two dimensions were: F0 median (MED), F0 first derivative standard deviation (DPF), and Spectral tilt standard deviation (DES). DPF may be interpreted as the degree of abrupt fundamental frequency changes, whereas DES may be seen as the degree of vocal effort variation.
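The interpretation of DPF as a measure of abrupt F0 change can be illustrated with two invented contours that share the same median but differ in their dynamics. The sketch below is an assumption-laden illustration: the function `f0_descriptors`, the contours and the 10 ms frame step are hypothetical, and the units and normalisation applied by the ExpressionEvaluator script are not reproduced here.

```python
import statistics


def f0_descriptors(f0_hz, frame_step_s):
    """Return (median F0 in Hz, SD of the F0 first derivative in Hz/s).

    f0_hz: F0 values of consecutive voiced frames (hypothetical input).
    frame_step_s: time between two frames, in seconds.
    """
    if len(f0_hz) < 3:
        raise ValueError("need at least three F0 frames")
    median_f0 = statistics.median(f0_hz)
    # First derivative approximated by finite differences between frames.
    derivative = [(b - a) / frame_step_s for a, b in zip(f0_hz, f0_hz[1:])]
    return median_f0, statistics.stdev(derivative)


# Two made-up contours with the same median F0 but different dynamics.
smooth = [200, 202, 204, 206, 208, 210, 212]
abrupt = [200, 230, 190, 206, 240, 185, 212]

for label, contour in (("smooth", smooth), ("abrupt", abrupt)):
    median_f0, sd_derivative = f0_descriptors(contour, frame_step_s=0.01)
    print(f"{label}: median = {median_f0} Hz, SD of derivative = {sd_derivative:.1f} Hz/s")
```

The steadily rising contour has a constant slope, so the standard deviation of its derivative is zero, whereas the contour with sudden jumps yields a large value even though both medians coincide.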

In Table 5, below, we have listed all the Gc1 variables that showed significance in Dimensions 1 and 2. In the table, the percentage of correlation and the statistical significance of each variable (p-value) are reported. The variable with the correlation value closest to 1 is the variable that best describes a dimension. Figure 6, below, shows the projection of all Gc1 variables in the vector space; the closer a variable is to the edge of the circumference, the more significant it is.
As for the Gq1 group, the qualitative variables that presented the highest percentages were: Lowered larynx, Lip rounding and Breathy voice in Dimension 1; Lowered larynx, Lip rounding and Minimised pitch range in Dimension 2. In Table 7, below, the correlation (R2) and statistical significance of each setting (p-value) are reported. The variable with the R2 value closest to 1 is the one that best describes a dimension.
The acoustic and vocal variables clustered the 54 stimuli into 6 groups, under the influence of the acoustic measurements (Gc1) and the voice quality and vocal dynamics settings (Gq1). The grouping of stimuli in the factor map is shown in Figure 7 below. The factor map shows the distribution of the stimuli in Dimensions 1 and 2. Each cluster in the factor map is shown in a different color. The small squares within the graph are the centroids of each cluster, which represent the mean values for the stimuli that make up each cluster.
Cluster 4, in dark blue, presents the largest number of stimuli. It contains the stimuli from Judy Hopps' speech productions, as well as half of the stimuli from Nick Wilde's speech productions. Furthermore, this cluster contains five of the stimuli from Assistant Mayor Bellwether, taken from the final scenes of the film. The cluster also contains two stimuli from Chief Bogo's speech productions.
Cluster 5, in light blue, is composed solely of stimuli from Chief Bogo's speech productions.
The sixth and final cluster, in magenta, is composed of three stimuli made up of fixed expressions: one from Judy Hopps, one from Chief Bogo and one from Assistant Mayor Bellwether. As fixed expressions, they tend to have a predetermined pitch contour and to be produced with a higher rate of articulation. These two factors, pitch and rate, belong to the prosodic part of the VPA.
We interpret the fact that some characters' speech samples were distributed into a greater number of clusters in relation to the complexity of the roles they play in the animation. Chief Bogo's speech samples are distributed into four clusters, Assistant Mayor Bellwether's into three, Nick Wilde's into two and Judy Hopps' into one. Chief Bogo alternates between a very severe attitude, when demanding that the police officers get to work, and a flexible one, when welcoming a bunny (Judy Hopps) and a fox (Nick Wilde) into his team of brave police officers. He also shows empathy towards Judy, acting as a counsellor when she mentions that she has decided to leave the Police Department because she feels guilty. Assistant Mayor Bellwether alternates from pretending to be docile to revealing herself as a villain, and Nick Wilde changes from a deceptive to a trustworthy attitude. Judy remains diligent and optimistic all the time.
The cluster analysis was derived by means of the MFA (Multiple Factor Analysis) method, because there are two types of variables (qualitative and quantitative). The number of clusters is based on the inertia gain in relation to the number of variables, the number of tokens and the precision index. The cut in the dendrogram of the hierarchical cluster analysis is defined according to these settings (Husson and Pagès, 2011). Figure 8 shows the hierarchical clustering of the 54 stimuli into 6 groups, under the influence of the acoustic measurements (Gc1) and the voice quality and vocal dynamics settings (Gq1).
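The idea of cutting a dendrogram according to inertia gain can be sketched on toy data. The example below agglomerates two-dimensional points with Ward's criterion, records the inertia associated with each merge, and suggests a number of clusters with a simple ratio heuristic. It is only a sketch under stated assumptions: the points are invented stand-ins for factor-map coordinates, and the function `suggest_k` is our own illustrative heuristic, not the criterion implemented in the software used for the study.

```python
def ward_merge_costs(points):
    """Agglomerate points with Ward's criterion; return the cost of each merge."""
    clusters = [(1, tuple(p)) for p in points]  # each cluster: (size, centroid)
    costs = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (na, ca), (nb, cb) = clusters[i], clusters[j]
                dist2 = sum((a - b) ** 2 for a, b in zip(ca, cb))
                cost = na * nb / (na + nb) * dist2  # increase in within-cluster inertia
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (na, ca), (nb, cb) = clusters[i], clusters[j]
        centroid = tuple((na * a + nb * b) / (na + nb) for a, b in zip(ca, cb))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((na + nb, centroid))
        costs.append(cost)
    return costs


def inertia_gain_by_k(costs):
    """Map k to the between-cluster inertia gained when moving from k-1 to k clusters."""
    n = len(costs) + 1
    return {k: costs[n - k] for k in range(2, n + 1)}


def suggest_k(gains, k_min=2, k_max=6):
    """Heuristic (an assumption): keep the k whose gain most exceeds that of k+1."""
    candidates = [k for k in range(k_min, k_max + 1) if gains.get(k + 1, 0) > 0]
    return max(candidates, key=lambda k: gains[k] / gains[k + 1])


# Hypothetical coordinates of nine stimuli forming three well-separated groups.
toy_points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.2, 5.1), (4.9, 5.3),
    (10.0, 0.0), (10.1, 0.2), (9.8, 0.1),
]

gains = inertia_gain_by_k(ward_merge_costs(toy_points))
for k in sorted(gains):
    print(f"k = {k}: inertia gain = {gains[k]:.3f}")
print("suggested number of clusters:", suggest_k(gains))
```

With Ward's criterion, the merge costs add up to the total inertia of the data, so each cost can be read as the share of inertia that is lost (or, read in the other direction, gained) at that level of the tree.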

Discussion
The results of the acoustic analysis showed that the measurements extracted by the ExpressionEvaluator script varied with the speech situation: the same character, in a more relaxed or a more tense situation, would change voice quality settings, which would affect the acoustic output and the values of the acoustic parameters. Thus, the clusters generated by the MFA showed that the stimuli were widely dispersed across the two-dimensional space and, for most characters, grouped into multiple clusters.
Judy Hopps, the bunny, was the only one whose stimuli were grouped in a single cluster. The character, described as determined and optimistic, and also perceived as fragile and silly by the other characters, showed no behavioral changes throughout the narrative. The main setting identified in the character's stimuli was High pitch variability, which was exhibited in 13 of the 16 stimuli. According to the Frequency Code, higher F0 values correlate to smaller animals and amiable attitudes.
The stimuli from Nick Wilde, described as charming and persuasive, were located in two distinct clusters, 2 and 4. These clusters, however, are shown to be close together in the two-dimensional space created by the MFA method, as seen in the factor map, Figure 7. The first of these clusters contains all the stimuli of the character that present Minimised pitch range, Low mean loudness, and Tense larynx and vocal tract. The stimuli in the second cluster present none of these settings. The most recurring voice quality setting used by Nick is Breathy voice.
The latter setting also appears in the first six stimuli of Assistant Mayor Bellwether, when she presents herself as friendly and helpful as a front to earn the trust of the other characters. According to the Sirenic Code, Breathy voice is related to seductiveness and bashfulness.
The settings of laryngeal and vocal tract tension presented by the character of Assistant Mayor Bellwether in her final scenes characterize her speech at a time when she demonstrates more assertive facets of her personality, which contrasts with the docile attitude she presented in the earlier parts of the narrative.
Chief Bogo, the buffalo, was the character that showed the most variation in the attitudes shown in the scenes, and the distribution of his speech stimuli in the clusters reflected this variation, as his stimuli were distributed across 4 clusters. Unlike Nick Wilde's, Chief Bogo's stimuli not only form different clusters; these clusters are also dispersed in the two-dimensional space, as demonstrated in the factor map of Figure 7.
The Lowered larynx and Lip rounding settings, which were present in most of his stimuli, result in lower F0 values, which correlate to a lower pitch. These acoustic features of Chief Bogo's voice also align with patterns described in the Frequency Code, in which lower-pitched vocalizations are linked to larger animals and, in human voices, relate to aggressive and dominant attitudes.
In his first three stimuli, taken from a scene in which he berates Judy, the authoritarian side of the character's personality is reflected in an aggressive and hostile way of speaking. These stimuli are all grouped into a single cluster (1). Cluster 5 consists of stimuli taken from a scene in the film in which the character seeks to console Judy Hopps, still as an authority figure, but with a compassionate disposition.
In stimuli 22 and 23, which are grouped with the stimuli of the other characters in cluster 4, the character presents an increased pitch variability setting. At this point in the film, the character adopts a friendlier demeanor.
The final stimulus, 24, on the other hand, exhibits the Lowered larynx and Lip rounding settings, and represents the return to his hostile and dismissive attitude (albeit as a joke).

Conclusion
As presented in our Introduction, this study sought to answer the following questions: what types of voice quality and vocal dynamics settings were used by voice actors in the Brazilian dubbing of "Zootopia" to compose the vocal profile of the characters? Do they reflect stereotypes understood as sound-symbolic codes?
We found that all four of the characters used non-neutral settings for Vocal Tract features, Overall Muscular Tension and Prosodic Features. Three of the characters used at least one non-neutral Phonation setting; Judy was the only one not to exhibit this type of setting in any of her stimuli. Temporal organization settings also appeared in the stimuli of three characters; Chief Bogo was the only exception.
We found that the overall configurations of the vocal profiles of the characters were distinct from one another, showing symbolic uses of phonic matter for the expression of meanings, in accordance with the context of interaction between the characters.
The voice actor for Judy Hopps, the bunny, used non-neutral settings that shorten the vocal tract (Lip spreading, Decreased lip extension, Raised larynx, Extensive jaw range and Tense vocal tract) and vocal dynamics settings like high mean pitch, high mean loudness and fast speaking rate. These settings and the resulting acoustic features fit with the idea of a small, agile animal.
The voice actor for Chief Bogo, the buffalo, used non-neutral settings that lengthen the vocal tract (Lip rounding, Lowered larynx, Opened jaw and Lowered tongue body), and vocal dynamics settings like low mean pitch and high mean loudness. This vocal characterization is consistent with the strong and authoritarian image of the character.
For the character Nick Wilde, the fox, settings that lengthen the vocal tract (such as Backed tongue body and Lowered larynx) and settings that shorten the vocal tract (Raised larynx) were exhibited. Unlike Judy Hopps and Chief Bogo, the voice actor exhibited lax settings for both the vocal tract and the larynx. Breathy voice was the most recurring setting in the character's stimuli. In terms of vocal dynamics, the settings for mean pitch (high and low) and speaking rate (fast and slow) changed between segments, but the degrees of pitch extension and loudness variability were maintained through all the stimuli.
The voice actor for Assistant Mayor Bellwether also used settings that are opposite either in their configuration or in their impressive effects: Fast and slow speaking rate; Lowered tongue body and Raised larynx; Breathy voice, which is pleasant, and Harsh voice and Pharyngeal constriction, which cause unpleasant impressive effects.
The presence of opposite settings in Assistant Mayor Bellwether's speech samples signals the different roles played by the character at different moments of the narrative: the affable little sheep who is friendly and helpful, and the bitter, conniving villain detailing her plan.
The first six stimuli, which were taken from earlier scenes in the film when she presented herself as affable, all exhibited the Breathy voice setting. That setting appeared in none of the final six stimuli, where she reveals herself as the villain; in those stimuli, the most recurring setting was Tense vocal tract.
As for the acoustic parameters, we highlight F0 values to separate the voices of large and small animals, and the spectral tilt descriptor to characterize breathiness in the expression of persuasion and tension in contexts of conflict between the characters.
The comparison between F0 values demonstrates the relationship between the vocalizations and the size of the animals according to the Frequency Code, and the expression of attitude (low values indicating dominance, and high values, fragility) and of emotion in speech (low values for affable feelings and high values for irascible feelings).
As such, the results of the analyses confirmed our hypotheses that the vocal profiles for each of the characters would be distinct from one another, and that the vocal profiles and their resulting acoustic features would reflect the patterns described in the Frequency Code (Laver, 1980) and the Sirenic Code (Gussenhoven, 2016).
The findings of this study regarding the tension settings and Breathy voice can be compared to the results of Teshigawara (2003). In that study, Teshigawara noted that the Breathy voice setting appears in the voices of the heroes in Japanese animations. In the Brazilian Portuguese dubbing of "Zootopia", the setting appears in Nick's voice, who is described as charming and persuasive, and in Assistant Mayor Bellwether's voice, when she presents herself as friendly and docile. In contrast, the laryngeal tension setting appears in the latter character's stimuli when she assumes the role of villain. In Teshigawara (2003), the voices of the villains were characterized by laryngeal sphinctering, which correlates to the laryngeal tension setting.
This last setting also appears in the voice of Chief Bogo in the scenes in which he antagonizes Judy and shows a hostile demeanor.
When compared, these results point towards a biological and universal basis for the link between those settings and the characters who exhibit them, since they show that characters meant to be sympathetic or attractive to the audience, on the one hand, and "villains", on the other, have similar voice quality settings in two different cultures.
Choices like the bunny for an energetic character, who is seen as naive by the other characters, and the fox for a con man show how the biological foundation and the folklore associated with the behavior of these animals have shaped the setup for the narrative. The presence of vocal stereotypes in the characters' vocal profiles, especially those related to F0, can be seen as an extension of the stereotypes that guide the choice of each animal to embody the characters with regard to their personality, attitudes and social roles in the allegorical narrative of the film. The fox, for instance, is associated with cunning in many cultures (Eastern and Western), as manifested in sayings such as "as sly as a fox" in English, "Listig som en räv" (smart as a fox) in Swedish, "esperto como uma raposa" in Portuguese, "rusé comme un renard" in French and "astuto como un zorro" in Spanish, as well as in other languages such as Japanese, Chinese, Turkish and Russian. The same holds for the associations between the rabbit and activeness, the bull and strength, the sheep and meekness, and the black sheep and oddness.
Phonetic knowledge of the production and perception of meaning effects linked to the voice is of importance for the characterization of attitudes and emotions in speech. The choice of settings of voice quality and vocal dynamics by the speaker affects the attribution of meanings by the listeners due to the symbolic value that the acoustic features convey about the speaker (Ohala, 1983; 1984; Chuenwattanapranithi, 2008; Gussenhoven, 2002).
This study contributes to the field of investigation of expressive vocal prosody in that it provides evidence in favor of the symbolic use of sound, that is, of the motivated relationship between sound and meaning, in a genre that is little explored in the phonetic literature.
Concerning voice acting, we believe this study will contribute to the improvement of training for voice actors and provide them with theoretical and practical resources that will help them with their performances when playing animated characters.
As "Zootopia" has been dubbed in over 25 languages, we intend to extend this research by investigating the characters' vocal profiles in different languages in order to map linguistic and cultural factors that may interfere in the codification of sound symbolism, and to investigate how the interactions between biological, psychological, social and linguistic factors shape vocal stereotypes.