Cross-cultural and cross-linguistic perception of authentic emotions through speech: An acoustic-phonetic study with Brazilian and Swedish listeners
Wellington da Silva, Plínio Almeida Barbosa, Åsa Abelin

This study was conducted to investigate whether the listeners’ culture and mother language influence the perception of emotions through speech and which acoustic cues listeners use in this process. Swedish and Brazilian listeners were presented with authentic emotional speech samples of Brazilian Portuguese and Swedish. They judged on 5-point Likert scales the expression of basic emotions as described by eight adjectives in the utterances in Brazilian Portuguese and the expression of five emotional dimensions in the utterances in Swedish. The PCA technique revealed that two components explain more than 94% of the variance of the judges’

Acknowledgments: This work was funded by a fellowship from the São Paulo Research Foundation, FAPESP (2012/04254-4 and 2013/06082-9), to the first author. We thank all Brazilian and Swedish subjects who took part in the perception experiments reported here. Sandra Madureira and Emílio Pagotto are also thanked for helpful discussions.


Introduction
Emotions are very important for social interactions and for interpersonal relationships, as they signal information about others' feelings, attitudes, behavioural intentions, as well as aspects of events or situations in the environment. Observers, thus, may use such information to adjust their behaviour according to a specific situation (Frijda & Mesquita 1994; Van Kleef 2009). Decades of research have shown that the voice is a powerful tool for expressing emotion. In addition to conveying linguistic information, the speech signal (i.e. the utterance) also carries indexical information related to the speaker (sex, age, dialect, etc.) and information about his/her affective state (Laver & Trudgill 1979). In fact, listeners can satisfactorily recognise emotions expressed through speech in perception experiments, showing an accuracy rate much higher than what would be obtained by chance (Pittam & Scherer 1993).
The influence of the speaker's emotional state on his/her voice occurs because emotions cause, among other reactions, physiological responses in the individual. These responses, in turn, cause variations in respiration, phonation and articulation, which are processes directly related to speech, and thus affect the speech prosody (Scherer 1986). Moreover, emotions can also affect the attention and the cognition of the speaker, which influences the speech as well (Johnstone & Scherer 2000).
According to the evolutionary theory of emotions (Darwin 1872/2009), these physiological and motor responses have the function of helping the individual to cope with relevant events in the environment by providing energy to specific parts of the body and preparing the organism for dealing with the situation (Scherer 1981). Thus, this theory understands that emotions have been selected in the course of the evolution of the species because these responses helped our ancestors to survive. As a result, one expects to find the same patterns of expression (facial and vocal) for a given emotion in all human cultures (Cornelius 2000). Following this approach, some researchers have postulated the existence of a small set of universal emotions (known as basic, discrete, or fundamental), which are very different from each other and have specific patterns of cognitive appraisal, expression, and physiological changes (Izard 1977; Tomkins 1984; Ekman 1992; and others). Natural languages label these emotions with words such as joy, sadness, fear, disgust, surprise, and anger.
Early attempts at finding evidence to support the hypothesis of the universality of these emotions focused mostly on facial expressions (Scherer, Banse, & Wallbott 2001). In recent decades, researchers from different areas have become increasingly interested in the cross-cultural recognition of emotions through speech, in an attempt to shed some light on important aspects of human communication, the functions of speech prosody, the variability of the acoustic parameters of speech, as well as the nature of the emotional phenomenon itself (Lieberman 1961; Frick 1985; Barbosa 2012). The main hypothesis of such studies is that emotions should be expressed and recognized through speech in the same way in all cultures, regardless of the language, given that the emotion-related changes in the acoustic parameters of speech are mainly results of the physiological and cognitive responses which emotions cause in the speaker (Frick 1985; Scherer 1986).
A classic study on this subject is that carried out by Scherer, Banse, & Wallbott (2001). The study involved listeners from nine countries (Germany, French-speaking Switzerland, England, Netherlands, United States, Italy, France, Spain and Indonesia). The authors used meaningless sentences spoken by four German actors (two male and two female) portraying anger, sadness, fear, joy, and neutral voice. After listening to each stimulus, the listeners chose one or two of the emotions given in a list. All emotions, including the neutral speech, were recognized by listeners from all countries with an accuracy rate above the chance level (the rate expected when guessing the answers). However, the accuracy rate varied between the nationalities of the raters: the German listeners (who heard the stimuli produced by speakers of their native language) performed the best, followed by the Swiss, the English, the Dutch, the American, the Italian, the French, the Spanish and, finally, the Indonesian listeners.
Despite the apparent universality in the recognition of emotions through the voice, this and other studies (Abelin & Allwood 2000; Menezes, Erickson, & Han 2012) have also suggested that listeners are better at recognizing emotions that are expressed by speakers of their own language than by speakers of a foreign one.
The discrete approach is not the only view of the emotional phenomena, though. Some researchers (e.g. Wundt 1874; Schlosberg 1941; Russell 1980) describe the emotions according to the degree of some emotional dimensions or primitives. These dimensions exist along a continuum and not only as two discrete poles of minimum and maximum intensity (Schlosberg 1954). The intensity of the dimensions varies along the continuum depending on how the event is appraised by the organism. The most studied emotional dimensions are activation and valence. Activation corresponds to the degree of arousal of the organism and varies from calm to agitated. Valence corresponds to the subjective feeling of the degree of intrinsic pleasantness caused by the antecedent event, and emotions are commonly distinguished within this dimension as either positive or negative (Kehrein 2002). However, many more dimensions have been proposed in the literature (see, for example, Frijda et al. 1995). Barbosa (2009) used, in addition to these three dimensions, the dimension of involvement, which is related to the degree of involvement of the individual with the event and can capture the attention-rejection opposition (used by Schlosberg 1941). Laukka & Elfenbein (2012) have found that emotional dimensions related to the appraisal of the emotion-eliciting events (e.g. valence, novelty, urgency, goal conduciveness, etc.) can also be inferred reliably from vocal expressions, which suggests that speech can also signal information about the cognitive representation of events.
In addition to the dimensions of activation, valence, and involvement, we investigate in this study the perception of the dimensions of fairness and motivation, which were used by Frijda et al. (1995) among other dimensions in an inter-cultural study which these authors conducted. These two dimensions were chosen in particular because, as the appraisal of an event as unfair can trigger and increase the intensity of various emotions, especially anger (Ellsworth & Scherer 2003:581), and the emotion the individual is expressing signals his/her disposition to establish any kind of relationship with another individual (approaching, avoidance, touching, etc.), we hypothesise that they can also be inferred from speech. Thus, the dimension of fairness is related to the appraisal of the eliciting event by the individual, i.e., whether the individual considered what happened fair or unfair. Motivation is a dimension related to action readiness, i.e., whether the eliciting event enhanced or diminished the individual's disposition to act on the event.
The use of emotional dimensions seems to be better for distinguishing and describing the vocal expression of emotions than the use of labels of discrete emotions (Pereira 2000; Lugger & Yang 2007; Barbosa 2009). Perhaps this is so because, as some studies have suggested, emotions with a similar level for some dimensions (e.g. activation and valence) share the same patterns for some acoustic parameters (e.g. fundamental frequency and intensity) and this might cause confusion when trying to discriminate these emotions by means of labels (Pereira 2000). In addition, labels of discrete emotions may be used idiosyncratically by the listeners due to their emotional experiences (Barbosa 2009).
The majority of the studies on the vocal expression of emotions have been conducted by using emotional speech samples portrayed by professional or lay actors (see Elfenbein & Ambady 2002 or Scherer 2003, for a review). For this reason, it is unclear to what extent acted emotions correspond to real emotions, and few studies have investigated this relation. It is possible that actors exaggerate the expression of the emotions and emphasize stereotypical features, missing subtle ones which might be found in real expressions (Scherer 2003; Wilting, Krahmer, & Swerts 2006; Audibert, Aubergé, & Rilliard 2008).
In the study presented in this paper we investigated by means of two perception experiments how Swedish and Brazilian listeners perceive real emotions (as described by discrete emotional labels and emotional dimensions) expressed through speech in their mother language and also in the foreign language (Swedish and Brazilian Portuguese). Our main objectives were to investigate whether the listeners' culture and mother language affect the perception of the emotions and which acoustic-phonetic parameters among those extracted are used by the Brazilian and Swedish subjects to judge the degree of expression of these emotions in the perception experiments. Because the emotion-related changes in speech are mainly results of the physiological and cognitive responses that the emotions cause in the speaker, we hypothesise that subjects from both cultures perceive the emotions expressed in our corpora in a similar fashion and that their perception is based on the same acoustic parameters.

Stimuli
The stimuli of this experiment consisted of 30 emotional speech samples of Brazilian Portuguese, extracted from interviews with 8 women in the 2007 Brazilian documentary film "Jogo de Cena" ("Playing"). This film compares life narratives told by ordinary women in interviews with the director with the same stories as played by professional actresses. Because the aim of the present study was to investigate real emotional expressions, the utterances were selected from the real interviews only (i.e., the participation of the actresses was not considered). The duration of each speech sample varied from 3 to 10 seconds and all of them had acceptable quality for performing acoustic analysis. The utterances were saved in WAV format (.wav), mono, with a sampling frequency of 44.1 kHz.

Participants
Brazilian listeners: 26 subjects completed this experiment (17 women and 9 men). All of them were born in Brazil, have lived most of their lives there, and have Portuguese as their mother language. They were either undergraduate or graduate students and reported having no hearing problems. The average age of the judges was 23 years, ranging from 18 to 35 years.
Swedish listeners: 16 subjects completed the experiment (7 men and 9 women). All of them were born in Sweden, have lived most of their lives there, and have Swedish as their mother language. They also reported having no knowledge of Portuguese and no hearing impairment. They were undergraduate and graduate students of the University of Gothenburg. Their average age was 26 years, ranging from 20 to 47 years.

Procedure
In this experiment, subjects were asked to rate on 5-point scales ranging from 0, "not at all adjective", to 4, "very adjective", the degree to which the speaker in each stimulus was expressing the discrete emotions described by eight adjectives (joyful, moved, surprised, sad, contented, anguished, distressed, and enthusiastic). The experiment, thus, consisted of eight parts, and in each one the listeners evaluated one adjective for all thirty stimuli, which were presented in random order. In order to prevent fatigue, the experiment was split into two sessions, carried out on different days. In the first session, the judges assessed the labels joyful, moved, surprised, and sad, whereas in the second one they assessed the adjectives contented, anguished, distressed, and enthusiastic, exactly in this order. The structure of the experiment was kept constant in both sessions and only the adjective to be evaluated was changed. Only the responses of the subjects who took part in both sessions were considered.
The main reason for allowing the listeners to rate the utterances according to multiple labels is the fact that an utterance may convey more than one affective (and emotional) state, which renders some emotions difficult for listeners to discriminate, e.g., anger and frustration or happiness and engagement (Scherer 1998; Hirschberg, Liscombe, & Venditti 2003; Douglas-Cowie et al. 2005). In addition, the most used method in studies on the cross-cultural recognition of emotions through speech (and also through facial expressions), which consists in asking subjects to choose one label from a small set of alternatives to describe the emotion expressed in a particular stimulus, has posed many problems for this area of investigation and has been widely criticised in the literature (see, for example, Goddard 2002). We used one adjective per scale (together with the terms "not at all" and "very") rather than two antonyms (as is the case with the semantic differential scale proposed by Osgood 1952) because some studies suggest that some antonyms are not treated by judges as opposite values of a scale, but rather behave as two distinct affects (Schimmack 2001). The term "not at all" does not imply neutral speech. It means that the speaker does not express the indicated emotion at any level. Thus, it does not rule out the possibility that the speaker is expressing other emotions.
The experiment was developed and carried out over the internet through the "Survey Gizmo" online software (http://www.surveygizmo.com/). The link for accessing the experiment was sent by email to the subjects who were interested in taking part in it. They were asked to use earphones and to do the experiment in a quiet room. The texts of the experiment (instructions, questionnaires, and adjectives) were presented in the mother language of the subjects, i.e. in Portuguese for the Brazilian subjects and in Swedish for the Swedish subjects. One speech sample was presented on each screen along with its corresponding scale, and it was played automatically as soon as the page finished loading. The subjects had to mark their response on the scale by clicking on the desired value and then click on the "next" button at the bottom of the page to proceed to the next page (stimulus). It was not possible to return to the previous page or to proceed to the next one without having marked the response on the scale. The judges' responses were automatically converted to a linear scale ranging from 0 to 1 (0, 0.25, 0.50, 0.75, and 1) to bring them closer to the range of the z-scored values of the acoustic parameters.
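The conversion of the responses described above amounts to dividing each scale value by the highest value of the scale. A minimal sketch in Python follows; the function name `rescale_likert` is hypothetical and is not part of the tools used in the study.

```python
def rescale_likert(response, n_points=5):
    """Map an integer response on a 0..(n_points - 1) scale onto [0, 1]."""
    highest = n_points - 1
    if not 0 <= response <= highest:
        raise ValueError("response lies outside the scale")
    return response / highest


# The five possible responses of the 5-point scale (0 to 4).
rescaled = [rescale_likert(r) for r in range(5)]
print(rescaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```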

Analyses and Results
The statistical analyses reported in this paper were performed with the software R, version 2.11.1 (R Development Core Team 2010).

Agreement between the subjects
Because we are not interested in the distances between scale points, but rather in the classification of the intensity of the expressed emotions (as indicated by the labels placed on the left and on the right of the scales), the subjects' responses were analyzed as categories. Therefore, we verified the reliability of the listeners' responses in the experiment by computing the Fleiss' kappa index (Fleiss 1971), which gives an estimation of the agreement between n raters. This test is significant for α = 0.001 when z > 3.09. The kappa index is at most 1: the closer to 1, the greater the agreement, whereas values at or below 0 indicate no agreement beyond what is expected by chance.
The index was calculated separately for each emotional adjective, with the five levels of responses of the 5-point scales. Table 1 shows the kappa values for the eight emotional labels in descending order, as well as the corresponding z value, for the Brazilian and Swedish listeners. All kappa values are significant (p < 0.001) and similar to other studies on the perception of emotions through speech (Alm & Sproat 2005; Devillers et al. 2006; Barbosa 2009). There was satisfactory agreement for all labels but for surprised, which, although significant, had a low kappa value for both groups. This might be due to the fact that surprise is better expressed through facial expressions than through the voice (Ekman 1999) and that it is often confused with other emotions, such as joy and anger (e.g. Abelin & Allwood 2000; Paulmann & Uskul 2014).
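The agreement index used in this section can be illustrated with a short Python sketch of Fleiss' kappa for one adjective. It assumes that every utterance was rated by the same number of listeners; the function name and the toy counts are invented for illustration and are not the data of this study, which were analysed in R.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of counts.

    counts[i][j] is the number of raters who assigned utterance i
    to response category j (here, one of the five scale points).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every utterance needs the same number of raters")
    n_categories = len(counts[0])

    # Proportion of all assignments that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    # Observed agreement among rater pairs, per utterance.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Invented example: 3 utterances, 4 raters, 5 scale points.
toy_counts = [
    [0, 0, 1, 3, 0],
    [4, 0, 0, 0, 0],
    [0, 2, 2, 0, 0],
]
toy_kappa = fleiss_kappa(toy_counts)
print(round(toy_kappa, 3))
```

The significance test mentioned in the text (the z value) is not reproduced in this sketch.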

Identifying the correlated variables
A principal component analysis (PCA) was carried out with the mean of the listeners' responses for each speech sample in order to investigate the behaviour of the emotional lexical items according to the listeners' perception. PCA is a statistical technique which identifies the correlated variables in the data and groups them into uncorrelated dimensions (factors or principal components), using the least necessary number of factors to account for the variance of the data. The rationale behind the use of this analysis is that the lexical items whose loadings present the same sign for a single PCA factor are behaving like synonyms (or antonyms, for the labels with opposite signs) to that emotion, according to the listeners' perception.
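The idea behind this analysis can be sketched in Python as follows. The sketch extracts the first two principal components from the covariance matrix of the mean ratings by power iteration with deflation, a technique chosen here only to keep the example free of external libraries; it is not the routine used in R for the analyses reported in this paper. All names and ratings are invented, and "loadings" here are the entries of the unit-length eigenvectors.

```python
import math


def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (one row per utterance)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def leading_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    vec = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-uniform start
    for _ in range(iterations):
        product = [sum(matrix[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm < 1e-15:
            break
        vec = [x / norm for x in product]
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(p) for b in range(p))
    return value, vec


def pca(rows, n_components=2):
    """Return (proportion of variance, loadings) for the first components."""
    cov = covariance_matrix(rows)
    p = len(cov)
    total_variance = sum(cov[j][j] for j in range(p))
    explained, loadings = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        explained.append(value / total_variance)
        loadings.append(vec)
        # Deflation: remove the component just found before the next search.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(p)]
               for a in range(p)]
    return explained, loadings


# Invented mean ratings per utterance: [joyful, contented, sad].
toy_ratings = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.2, 0.1, 0.9],
    [0.1, 0.2, 0.8],
    [0.5, 0.5, 0.4],
]
explained, loadings = pca(toy_ratings)
print([round(e, 3) for e in explained])
print([round(x, 2) for x in loadings[0]])
```

In this toy example the two positive adjectives receive loadings of the same sign on the first component and the negative adjective receives the opposite sign, which is the pattern the rationale above relies on. The overall sign of a component is arbitrary.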
For the Brazilian listeners' responses two PCA factors account for 96.2% of the variance. The first factor explains 80.8% of the variance and the second one, 15.4%. For the Swedish listeners' responses two factors account for 94.5% of the variance. The first factor explains 80.3% of the variance and the second factor, 14.2%. Table 2 shows the loadings, which correspond to how much each variable contributes to each one of the factors. The pattern of the loadings was very similar between the two groups of judges. The adjectives related to emotions of positive intrinsic pleasantness (joyful, contented, and enthusiastic) presented a positive loading for the first factor, whereas those adjectives related to negative emotions (moved, sad, anguished, and distressed) presented a negative loading for this factor. All variables presented a negative loading for the second factor. One can see in figure 1 that the first factor reveals two emotional groups: the adjectives related to emotions of positive valence are clustered together in the upper right corner of the scatter plot, whereas the adjectives related to emotions of negative valence are grouped together in the upper left corner. The adjective surprised does not belong to either of the groups, as it presented a relatively high and negative loading for factor 2 but a relatively low and positive loading for factor 1.
Therefore, it seemed appropriate to call factor 1 HAPPINESS and factor 2 NEUTRALITY (the negation of any expressiveness). It is important to emphasize that these two words (happiness and neutrality) were used by us only to name the two principal components. Moreover, this result does not imply that "joy" is the semantic opposite of "sadness", for example, since the negative group negates the positive group as a whole, which only indicates that the listeners were able to distinguish between "happiness" and "non-happiness".

Acoustic analysis
The utterances evaluated by the judges were also subjected to acoustic analysis, in which some acoustic parameters were automatically extracted by means of the script "Expression Evaluator", implemented for the software Praat (Boersma & Weenink 2011) by Barbosa (2009). The classes of acoustic parameters extracted by this script and used here are: fundamental frequency (f0), fundamental frequency first derivative (df0), global intensity, spectral tilt and Long-Term Average Spectrum (LTAS). The fundamental frequency is an acoustic correlate of the rate of vocal fold vibration and is perceived as the pitch of the voice. Sound intensity corresponds to the variations in the air pressure of a sound wave and is usually measured in decibels (dB). It is the major contributor to the sensation of loudness of a sound. Spectral tilt measures the degree of the drop in intensity as the frequencies of the spectrum increase. The LTAS is a spectrum obtained from the average of several spectra extracted from the speech sample for a particular frequency range. The f0 first derivative is used as a means of revealing abrupt changes in the intonation contour.
These acoustic parameters were chosen because they are likely to undergo changes due to the physiological responses triggered by the emotional processes, being thus potential correlates of the vocal expression of emotions (Frick 1985;Scherer 1986).The parameters spectral tilt and LTAS are acoustic correlates of vocal effort and voice quality, since the increase of vocal effort enhances the energy in the harmonics of high frequencies due to changes in subglottal pressure and in the characteristics of vocal fold vibration (Laukkanen et al. 1997;Traunmüller & Eriksson 2000).
The script searches for f0 within the range of 75-360 Hz in the case of male speakers and 110-700 Hz in the case of female speakers by means of the autocorrelation algorithm of Praat and smooths it by applying a 10-Hz low-pass (LP) filter. The statistical descriptors related to f0 and df0 are normalized through the z-score technique by using the following reference values (mean, standard deviation) of f0 in Hz for adult males: (136, 58) and females: (231, 120). The interquantile semi-amplitude is calculated as the difference between the 95% and 5% quantiles, divided by two. The f0 skewness is taken as the difference between f0 mean and f0 median, divided by the f0 interquantile semi-amplitude. The f0 first derivative is computed as the difference in Hz between successive odd-numbered f0 points of the PitchTier object taken in pairs. Spectral tilt is estimated by the difference of intensity in dB between the intensity points of the low band (0-1250 Hz) and the high band (1250-4000 Hz), taken every ten points. For the sake of normalisation, these values are divided by the complete-band intensity median. Finally, the LTAS slope is computed as the difference of intensity in dB between the bands 0-1000 Hz and 1000-4000 Hz, divided by 10 for the sake of scale.
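Three of the f0 descriptors defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the "Expression Evaluator" script: the function names are hypothetical, the quantiles use linear interpolation (the script may define them differently), and only the median is z-scored here because the text does not detail how each descriptor is normalised. The reference values are those given in the text.

```python
import statistics

# Reference (mean, standard deviation) of f0 in Hz, as given in the text.
REFERENCE_F0 = {"male": (136.0, 58.0), "female": (231.0, 120.0)}


def quantile(values, q):
    """Quantile by linear interpolation between order statistics (assumed)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def f0_descriptors(f0_hz, sex):
    """Median (z-scored), interquantile semi-amplitude and skewness of f0."""
    ref_mean, ref_sd = REFERENCE_F0[sex]
    median = statistics.median(f0_hz)
    mean = statistics.mean(f0_hz)
    # Half the distance between the 95% and 5% quantiles.
    semi_amplitude = (quantile(f0_hz, 0.95) - quantile(f0_hz, 0.05)) / 2
    if semi_amplitude == 0:
        raise ValueError("flat f0 contour: skewness is undefined")
    return {
        "median_z": (median - ref_mean) / ref_sd,
        "semi_amplitude_hz": semi_amplitude,
        "skewness": (mean - median) / semi_amplitude,
    }


# Invented f0 samples (Hz) for a female speaker.
toy_f0 = [171.0, 201.0, 231.0, 261.0, 291.0]
toy_descriptors = f0_descriptors(toy_f0, "female")
print(toy_descriptors)
```

With this definition, a contour whose mean exceeds its median (a few high f0 excursions) yields a positive skewness.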

Relating the acoustic parameters to the listeners' perception
Multiple linear regression models were fitted to associate the twelve acoustic parameters (independent variables) with the scores of each of the two PCA factors (dependent variable) and thus investigate which acoustic parameters better predict the listeners' perception of the emotions expressed in the utterances of our corpus. P-values up to 10% were considered as marginally significant and the β values presented below refer to the standardised regression coefficients.
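Standardised regression coefficients of the kind reported here can be obtained by z-scoring the predictors and the dependent variable and then solving the least-squares normal equations. The Python sketch below illustrates this with two invented predictors; the function names and data are hypothetical, and the study itself fitted its models in R with twelve acoustic parameters.

```python
import math


def zscore(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def standardized_betas(predictors, response):
    """Least-squares coefficients after z-scoring all variables.

    predictors: list of columns, one list of values per acoustic parameter.
    response:   list of PCA factor scores, one per utterance.
    """
    zx = [zscore(col) for col in predictors]
    zy = zscore(response)
    k = len(zx)
    # Normal equations (no intercept is needed once everything is centred).
    xtx = [[sum(u * v for u, v in zip(zx[i], zx[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(zx[i], zy)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented data: the response is exactly 2 * x1 - x2.
x1 = [0.0, 1.0, 0.0, 1.0]
x2 = [0.0, 0.0, 1.0, 1.0]
scores = [2 * a - b for a, b in zip(x1, x2)]
betas = standardized_betas([x1, x2], scores)
print([round(beta, 3) for beta in betas])  # [0.894, -0.447]
```

Because all variables are on the same scale after z-scoring, the absolute values of the coefficients can be compared across predictors, which is how the relative contribution of each acoustic parameter is read in this section.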
The parameter spectral tilt mean, which is the one that most contributed to the explained variance of the PC1 of both cultures, presented a negative correlation with this factor (as was also the case with the spectral tilt standard deviation). This means that an increase in the value of this parameter, which is related to the decrease of the energy concentrated in the harmonics of higher frequencies, tended to be perceived by the listeners as a decrease of the level of HAPPINESS of the speakers. Such utterances were judged with high levels for the emotional adjectives moved, sad, anguished, and distressed. The opposite is true for the parameter f0 skewness, which presented a positive correlation with this factor.

Discussion
The main hypotheses tested in this experiment were that the Brazilian and the Swedish listeners would perceive the emotions expressed in the utterances of our Brazilian Portuguese corpus in a similar fashion and that they would rely on the same acoustic parameters to make their judgements. The analyses of the data from experiment I corroborated these hypotheses, as they revealed that the perception of the two groups was quite similar in this experiment.
Similar levels of agreement were observed between the Swedish and Brazilian listeners' responses, and the label surprised presented the lowest level of agreement in both groups of listeners. PCA showed that the dimensionality of the responses of the listeners from both cultures could be reduced to two components, which together accounted for more than 94% of the total variance for each culture. The first of these components accounted for about 80% of the total variance and revealed two emotional groups for both populations, one which is composed of emotions of positive valence and another which consists of emotions of negative valence. This result indicates that these two PCA factors are very robust and that the Swedish and Brazilian listeners evaluated our Brazilian Portuguese corpus in a very similar way, which means that they have a similar perception of these emotions across the utterances.
Furthermore, the correlation of the scores of the two PCA factors with the acoustic parameters suggested that the acoustic parameters that best explain the responses of listeners from both cultures are, in general, the same: f0 skewness, spectral tilt mean and standard deviation for factor 1, and f0 median and spectral tilt skewness for factor 2. The proportion of the variance accounted for by the combination of acoustic parameters was slightly larger for the Brazilian listeners for the first PCA factor and larger for the Swedish listeners for the second one. This latter difference is mainly due to the contribution of f0 interquantile semi-amplitude, which is higher for the Swedes.
These findings lead us to the conclusion that the listeners' mother language and the emotional experience in both countries did not influence the perception of the emotions expressed by the speakers of our Brazilian Portuguese corpus, that is, the Swedish listeners' perception of emotions expressed by native speakers of Brazilian Portuguese (a foreign language) was quite similar to the perception of native speakers of this language.

Stimuli
The stimuli used in this experiment consisted of 40 speech samples from 5 Swedish female speakers (see note 5), each one with a duration between 1 and 6 seconds and with acceptable quality for performing acoustic analysis. These utterances were extracted from authentic speech (talk shows and interviews) broadcast on Swedish television and from one Swedish interview programme which was freely available over the internet as a podcast. They were selected jointly by the first author (a native speaker of Brazilian Portuguese) and the third author (a native speaker of Swedish) after a careful discussion of their emotional content. They were saved in WAV format (.wav), mono, with a sampling frequency of 44.1 kHz.

Participants
Swedish listeners: 19 subjects completed the experiment (6 men and 13 women). They were students and staff of the University of Gothenburg with no hearing impairment. Their average age was 30 years, ranging from 21 to 56 years. All of them were born in Sweden, have lived most of their lives there, and have Swedish as their mother language.
Brazilian listeners: 20 subjects completed the experiment (7 men and 13 women). Their average age was 25 years, ranging from 18 to 34 years. They were either graduate or undergraduate students, were born in Brazil, have lived most of their lives there, and have Portuguese as their mother language. They reported having no knowledge of the Swedish language at all and no hearing impairment.

Note 5: The programmes from which the utterances were extracted did not provide information regarding the age of the participants. According to our perception, the speakers were aged between 30 and 60 at the time of the recording.

Procedure
In this experiment, judges were asked to rate on 5-point scales ranging from 0, "not at all adjective", to 4, "very adjective", the degree to which the speaker in each stimulus was expressing the emotional state described by emotional dimensions. The experiment consisted of 5 parts, which were carried out in a single session. In each part the listeners evaluated one emotional dimension for all 40 speech samples. The dimensions investigated in this experiment were: activation, fairness, valence, motivation, and involvement. The stimuli were presented in a random order, but the emotional dimensions were evaluated by all listeners in this order. After listening to each utterance, the subjects had to judge the degree of expressivity of the emotional dimension specific to that part of the experiment by answering a question related to that dimension (e.g. "How agitated was the speaker?"). The questions and adjectives used for the dimensions are shown in table 3. The texts of the experiment (instructions, questionnaires, and adjectives) were presented in the mother language of the subjects, i.e. in Portuguese for the Brazilian subjects and in Swedish for the Swedish subjects. The remainder of the procedure is the same as that followed in experiment I (described in section 2.1.3).
Table 3 - Questions and adjectives used for the dimensions in experiment II (presented to the judges in their mother language, Brazilian Portuguese or Swedish). In each part, the second row shows the question and the third row shows the adjective used in the 5-point scales.

Identifying the correlated dimensions
PCA was carried out with the mean of the listeners' responses for each speech sample in order to identify which of the five dimensions correlate with each other. This analysis revealed that for the Swedish listeners' responses two factors account for 98.2% of the total variance of the responses. The first of these factors (PC1) accounts for 86% of the variance and the second one (PC2), 12.2%. For the Brazilian listeners' responses two factors account for 95.4% of the total variance of the responses. The first of these factors (PC1) accounts for 76% of the variance whereas the second factor (PC2) accounts for 19.4% of the variance.
The loadings of the dimensions for these PCA factors are shown in table 5. The dimensions are also plotted according to these loadings on a scatter plot in figure 2. It can be observed from this figure that both axes of the principal components are inverted between the two groups of listeners. For PC1 the dimensions of fairness and valence presented negative loading for the Swedish listeners' responses and positive loading for the Brazilian listeners' responses, whereas activation, motivation, and involvement had negative loading in the latter group and positive loading in the former one. For PC2 all five dimensions presented negative loading for the Swedish listeners' responses and positive loading for the Brazilian listeners' responses. However, that does not mean that the Brazilian and Swedish subjects' perception was different: what matters most is the similarity in the pattern of the loadings of the dimensions in both principal components, which shows that both groups of listeners distinguished fairness and valence from involvement, motivation, and activation. This inversion of the axes between the cultures was caused by the rotation procedure applied in the PCA.
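Since the overall sign of a principal component is arbitrary, one way to make the comparison between the two groups explicit is to flip the loadings of one group whenever they point in the opposite direction to those of the other. The Python sketch below illustrates this; the function name and the loading values are invented and are not those of table 5.

```python
def align_sign(reference, loadings):
    """Flip `loadings` if they point opposite to `reference`.

    Both arguments list the loadings of the same dimensions, in the
    same order, on one principal component.
    """
    dot = sum(r * l for r, l in zip(reference, loadings))
    return [-l for l in loadings] if dot < 0 else list(loadings)


# Invented PC1 loadings, ordered as:
# activation, fairness, valence, motivation, involvement.
toy_swedish = [0.5, -0.4, -0.45, 0.45, 0.4]
toy_brazilian = [-0.5, 0.4, 0.45, -0.45, -0.4]

aligned = align_sign(toy_swedish, toy_brazilian)
print(aligned)  # [0.5, -0.4, -0.45, 0.45, 0.4]
```

After alignment, the two groups in this toy example show the same pattern: fairness and valence on one side, activation, motivation, and involvement on the other.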
For both cultures there was a tendency for the utterances rated high on the dimension of fairness to be also rated high on the dimension of valence. Conversely, the speech samples rated high for activation tended to be also rated high for motivation and involvement (and thus lower for fairness and valence). By listening to these utterances, one can notice that the speakers expressed some degree of "anger" (according to the listeners' perception, these speakers were very activated, very involved and very motivated to act on the situation). Therefore it seemed appropriate to name the first factor (PC1) ACTION READINESS for the Swedish listeners and CALMNESS for the Brazilians (antonyms are used because of the inversion of the axes). Because the dimension of valence presented the most extreme loading on PC2 for both groups of judges, we called it DISSATISFACTION for the Swedish listeners and SATISFACTION for the Brazilians.

Relating the acoustic parameters to the listeners' perception
The twelve acoustic parameters described in section 2.2.3 were also automatically computed with the script "Expression Evaluator" for all 40 utterances used as stimuli in this experiment. Multiple linear regression models were then applied to associate these acoustic parameters with each of the two PCA factors for the Brazilian and Swedish listeners.
As indicated by the standardised regression coefficients shown above, the parameter LTAS slope presented a negative correlation with the PC1 of the Swedish listeners and a positive correlation with the PC1 of the Brazilian listeners; that is, a decrease in the LTAS slope (caused by an increase in the relative intensity of the higher-frequency harmonics) tended to be interpreted by the Swedish listeners as an increase in the degree of ACTION READINESS of the speakers and by the Brazilian listeners as a decrease in the degree of CALMNESS of the speakers. The parameter f0 median (the second largest contributor to the prediction of PC1 in the multiple linear regression models) presented a positive correlation with the PC1 of the Swedish listeners and a negative correlation with the PC1 of the Brazilian listeners, suggesting that an increase in this parameter tended to be interpreted by the Swedish listeners as an increase in the degree of ACTION READINESS of the speakers and by the Brazilian listeners as a decrease in the degree of CALMNESS of the speakers. These findings are consistent with the literature (Scherer 1986; Scherer 2003).
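To make the notion of standardised regression coefficients concrete, the sketch below z-scores two predictors and a response and fits an ordinary least-squares model. The data are synthetic, only two of the twelve parameters are used, and the names `standardised_betas`, `f0_median`, `ltas_slope`, and `pc1` are illustrative assumptions rather than part of the "Expression Evaluator" script.

```python
import numpy as np

def standardised_betas(X, y):
    """Standardised OLS coefficients: z-score predictors and response, then fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    # No intercept is needed because all variables are centred.
    betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return betas

# Synthetic values for 40 utterances (assumed, not the study's measurements).
rng = np.random.default_rng(1)
n = 40
f0_median = rng.normal(220.0, 30.0, size=n)
ltas_slope = rng.normal(-15.0, 3.0, size=n)
# A PC1-like score built to rise with f0 median and fall with LTAS slope.
pc1 = (0.02 * (f0_median - 220.0)
       - 0.2 * (ltas_slope + 15.0)
       + rng.normal(scale=0.3, size=n))

predictors = np.column_stack([f0_median, ltas_slope])
betas = standardised_betas(predictors, pc1)
print(dict(zip(["f0_median", "ltas_slope"], np.round(betas, 2))))
```

Fitting the same predictors to `-pc1` yields coefficients with the opposite signs, which is why an inverted principal component shows the opposite direction of correlation for the same acoustic parameter.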

Discussion
The main objective of this experiment was to investigate whether the Brazilian listeners perceive the emotions expressed by Swedish speakers in the same way as the Swedish listeners, even when evaluating emotional dimensions rather than discrete emotions.
The agreement between the listeners, which was verified by means of Fleiss' kappa, indicated that all five emotional dimensions could be reliably inferred by the listeners from both cultures.
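For reference, Fleiss' kappa can be computed from a table that counts, for each item, how many raters chose each category. The following is a minimal sketch with made-up counts; the function name `fleiss_kappa` is hypothetical, and treating the points of a rating scale as unordered categories is an assumption of the sketch, not a description of how the ratings were coded in this study.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    Every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if not np.allclose(counts.sum(axis=1), n_raters):
        raise ValueError("each item must be rated by the same number of raters")
    # Overall proportion of ratings falling in each category.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    # Observed agreement for each item, then averaged over items.
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    # Agreement expected by chance.
    p_exp = np.sum(p_cat ** 2)
    return (p_bar - p_exp) / (1.0 - p_exp)

perfect = [[5, 0], [0, 5], [5, 0], [0, 5]]   # five raters always agree
split = [[3, 2], [2, 3]]                     # five raters nearly evenly split
print(fleiss_kappa(perfect), fleiss_kappa(split))
```

Perfect agreement yields a kappa of 1, whereas the evenly split example yields a value slightly below zero, that is, agreement no better than chance.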
The PCA carried out on the mean of the listeners' responses for each utterance in the five dimensions revealed that more than 95% of the total variance of the responses is explained by two uncorrelated components. The distribution of the dimensions according to their loadings in the first of these principal components suggests that the listeners of both cultures evaluated the five emotional dimensions by distinguishing between a state of "calmness" (rated with higher values for valence and fairness) and a state of higher emotional agitation (rated with higher values for activation, motivation, and involvement).
The parameters f0 median and LTAS slope explained a large proportion of the variance of the first principal component of both cultures (ACTION READINESS and CALMNESS). However, the parameters f0 interquantile semi-amplitude, spectral tilt mean, and spectral tilt standard deviation also contributed to the multiple linear regression for the Swedish listeners. The parameter f0 first derivative mean was useful for predicting the second principal component of both cultures (DISSATISFACTION and SATISFACTION), but the parameters f0 median, spectral tilt standard deviation, and spectral tilt skewness also contributed to the multiple linear regression of this principal component for the Swedish listeners. Thus, we can conclude from the results of this experiment that the perceptual judgements of the degree of expression of the five emotional dimensions in our Swedish corpus were very similar between the Swedish and Brazilian listeners, despite some small differences. This means that the listeners' mother language, culture, and emotional experiences in both countries did not significantly influence the perception, in terms of emotional dimensions, of the emotions expressed by Swedish speakers.

General discussion
This study was designed to explore some ongoing questions in the field of vocal expression of emotions, namely: I. Are the patterns of the vocal expression of emotions universal or dependent upon the culture and the prosodic organization of the speaker's language? II. In addition to activation, valence, and dominance, can other emotional dimensions be reliably inferred from speech?
The main finding of the present study is that the perceptual judgements of the Brazilian and Swedish listeners were virtually the same in both conditions (when judging utterances in Brazilian Portuguese and in Swedish). This was evidenced by the PCA, which revealed for both cultures a similar pattern in the distribution of the emotional labels and dimensions according to their loadings for the two principal components. This result indicates that the listeners from both cultures had a similar perception of the emotions across the utterances. The only observed differences were in the agreement between the judges and in the correlation of some of the acoustic parameters with the PCA factors. The acoustic parameters which explained the listeners' judgements were in general the same for both cultures (which is noteworthy, given that Brazilian Portuguese and Swedish have different prosodic organizations and thus some parameters such as f0 are used differently by these languages in the linguistic domain). In spite of this fact, some parameters presented a better correlation with one culture (e.g. f0 interquantile semi-amplitude for the factor NEUTRALITY of the Swedish listeners in experiment I or f0 median for the PC1 of the Brazilian judges in experiment II and intensity skewness in experiment I). When the Swedish subjects evaluated stimuli in Brazilian Portuguese, the agreement between them was similar to the agreement between the Brazilian subjects for the same stimuli. However, the Brazilian listeners presented a slightly worse agreement than the Swedish listeners when evaluating speech samples in Swedish.
Since the present study investigated only two cultures (countries and languages), it is not possible to conclude from our data that the patterns of vocal expression of emotions are universal. Nevertheless, the findings of this study do suggest that these patterns may be universal to some extent and that some minor cultural and language-specific differences are also possible. This result is consistent with a number of studies which have investigated other cultures and languages (Scherer, Banse, & Wallbott 2001; Burkhardt et al. 2006; Menezes, Erickson, & Han 2012; Paulmann & Uskul 2014). Differences may arise from the fact that listeners with different mother languages have learned to interpret the variations in the acoustic parameters differently due to the role these parameters play in the prosody of their mother language and also because of possible culture-specific expressions or display rules (Ekman & Friesen 1969; Scherer 1985). Future studies should investigate other cultures and languages which differ from each other with regard to prosodic structure in order to advance our understanding of this problem.
For experiment I, the PCA showed that the listeners of both nationalities evaluated the emotions described by the eight adjectives jointly by means of two major dimensions: HAPPINESS and NEUTRALITY. The first one, HAPPINESS, is a combination of two emotional groups: one with the emotions of positive valence (joy, contentment, and enthusiasm) and the other with the negative emotions (moved, sadness, anguish, and distress). These groups reveal that the listeners judged the adjectives within each group in the same way, despite the possible semantic differences between them. This suggests that the PCA technique can help to avoid the problem of labelling the emotions with discrete labels, which arises for example because of the semantic differences between languages (Goddard 2002).
The results from experiment II indicate that, apart from the classic dimensions of activation and valence, other emotional dimensions related to the appraisal of the eliciting event and to action tendency can also be inferred from speech. This finding has implications for our understanding of the functions of the vocal communication of emotions in social interactions, as it suggests that from the speaker's voice one can, in addition to merely labelling the speaker's emotional state with a discrete label, infer what happened to this person (a good or a bad event), the intensity of this event, and how he/she might act on the situation. As noted by some authors (Darwin 1872/2009; Cornelius 2000; Scherer 2000), this fact may have helped our ancestors to cope with events in the environment, such as warning others about the presence of predators or enemies, and also to engage in social activity by showing affection, approval, or disapproval. In addition, the proportion of the variance of the first principal component (the one that represents the highest percentage of the total variance of the judges' responses in both experiments) that is accounted for by the acoustic parameters was higher for the emotional dimensions. This indicates that dimensions can be more directly interpreted as the result of the physiological changes of the speaker and also more directly related to the acoustic parameters of speech, as previously suggested by some authors (Scherer 1986; Pereira 2000; Schröder et al. 2001; Barbosa 2009). It is for future research to investigate whether other emotional dimensions can also be reliably inferred from speech and how useful they are for describing and distinguishing emotions.
A limitation of the present study is that only speech samples from female speakers were used. This casts doubt on whether the findings reported here can be generalised to male speakers. However, there are no studies, to our knowledge, which have presented evidence suggesting that female speakers' vocal expressions of emotions are perceived or expressed differently from male speakers'. In a review of 75 studies which presented recognition rates for males and females at decoding nonverbal cues of emotion conveyed through the face, body, or voice tone, Hall (1978) found an advantage for female judges and that the magnitude of this effect did not vary significantly with the sex or the age of the person in the stimulus. In a study using event-related potentials (ERPs) to determine the time-course of neural responses to emotional speech, Paulmann et al. (2008) found that emotional vocal expressions of anger, fear, disgust, and sadness can be distinguished from neutral vocalizations within 200 ms after the beginning of the sentence regardless of the sex or the age of the speaker. Therefore, it is unlikely that our findings would be different if utterances from male speakers had also been used. Nonetheless, future research should use utterances from speakers of both sexes to reveal any possible effect of this variable on the perception of vocal expressions of emotions.
In conclusion, this study contributes to our understanding of the cross-cultural perception of emotions through speech by providing evidence from a comparison between Brazilian and Swedish listeners' judgements of emotional speech in their mother language and also in the foreign language (Swedish and Brazilian Portuguese). This comparison is also an important contribution of the current study, since the majority of the studies on cross-cultural emotion recognition have used as stimuli emotional expressions from only one cultural group and language (Elfenbein & Ambady 2002; Paulmann & Uskul 2014). Furthermore, the method used in this work (which combines automatic extraction of acoustic parameters from the speech signal, PCA, and multiple linear regression analysis) has applications in speech technology and artificial intelligence, as it can be used in the development of software that automatically recognises the speaker's emotional state from the voice (which has been applied to monitoring the affective state of customers during call centre conversations), of human-machine interaction systems, and of more sophisticated text-to-speech converters.

Figure 1 - Adjectives of experiment I plotted according to their loadings for the first factor of the PCA (HAPPINESS) and for the second one (NEUTRALITY). Top: Brazilian listeners' judgements; Bottom: Swedish listeners' judgements.

Figure 2 - Dimensions of experiment II plotted according to their loadings for the first factor of the PCA (horizontal axis) and for the second (vertical axis). Top: Swedish listeners' judgements; Bottom: Brazilian listeners' judgements.

Table 1 - Kappa values for the eight emotional adjectives judged by the Brazilian and Swedish listeners in experiment I and their corresponding z values.

Table 2 - Loadings of the adjectives judged by the Brazilian and Swedish listeners in experiment I for the first principal component (PC1) and for the second (PC2).

Table 5 - Loadings of the five emotional dimensions judged by the Brazilian and Swedish listeners in experiment II for the first principal component (PC1) and for the second (PC2).