Construction of face databases for tasks to recognize facial expressions of basic emotions: a systematic review

ABSTRACT. Recognizing others' emotions is an important skill for social interaction that can be modulated by variables such as gender, age, and race. A number of studies have sought to develop specific face databases to assess the recognition of basic emotions in different contexts. Objectives: This systematic review sought to gather these studies, describing and comparing the methodologies used in their development. Methods: The databases used to select the articles were the following: PubMed, Web of Science, PsycInfo, and Scopus. The following search string was used: "Facial expression database OR Stimulus set AND development OR Validation." Results: A total of 36 articles showed that most of the studies used actors to express the emotions, which were elicited from specific situations to generate the most spontaneous emotion possible. The databases were mainly composed of color and static stimuli. In addition, most of the studies sought to establish and describe standards for recording the stimuli, such as the color of the garments worn and the background. The psychometric properties of the databases are also described. Conclusions: The data presented in this review point to the methodological heterogeneity among the studies. Nevertheless, we describe their common patterns, contributing to the planning of new research studies that seek to create databases for new contexts.


INTRODUCTION
Emotions play an important role in social life, as they enable interaction among people. According to evolutionary theories, all emotions derive from a set of basic emotions common to both humans and animals, which are genetically determined 1,2 . One of the ways we recognize others' emotions is through facial expressions, since the face is one of the most expressive visual stimuli in social life 3 . The ability to recognize emotions through the face can already be perceived in newborns, a fact that supports the innate nature of this skill 4 .
From a study using a systematized task, Ekman and Friesen 5 postulated six basic emotions, which are related to evolutionary adaptations and can be universally recognized, namely, happiness, sadness, fear, disgust, surprise, and anger. In addition, they identified that the cultural aspects did not modulate the way in which these emotions were expressed 5 . Thus, the evidence indicated that all human beings had the same movements of the facial muscles under certain circumstances 6,7 , turning the ability to express emotions into a behavioral phenotype.
However, a number of studies began to notice that, within this phenotype common to human beings, some variables could modulate the recognition of these facial expressions, such as cultural context 8 , age 9 , gender 10 , and race 11 . Taking these variables into account, several studies started to construct and validate specific face databases to assess the ability to recognize emotions through facial expressions [12][13][14][15][16] since, when selecting a set of facial expression stimuli, it is necessary to consider the characteristics of the models who express the emotions, as well as of those who will recognize them. Therefore, the existing facial expression databases present great diversity with regard to the physical characteristics of those who express the emotions, the way in which emotions are induced during the construction of the image database, and how they are presented in the validation stage [12][13][14] . Despite the methodological differences across the studies, they follow important standards for the construction and validation of the series of stimuli. Comparing the methodology used in the creation of these databases, regardless of the characteristics of who expresses the stimuli, can contribute to the planning of new research studies that seek to create face databases for new contexts. Thus, the objective of this systematic review was to gather studies that constructed face databases to assess the recognition of facial expressions of basic emotions, describing and comparing the methodologies used in the stimuli construction phase.

METHODS

Search strategies and eligibility criteria
The search strategy for this systematic review was created and implemented prior to study selection, in accordance with the checklist presented in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 17 . The databases used to select the articles were the following: PubMed, Web of Science, PsycInfo, and Scopus. The following search string was used: "Facial expression database OR Stimulus set AND development OR Validation." The searches were conducted from June to December 8, 2021.
The reference lists of the selected articles were also searched for additional sources. The inclusion criteria were studies that constructed face databases to assess the recognition of basic emotions, published in original articles or disclosed on official websites, without language or time restrictions. Letters to the editor, books and book chapters, reviews, comments, notes, errata, theses, dissertations, and bibliographic/systematic reviews were excluded. In addition, it is worth noting that only the construction stage of the databases was included in this review.
Therefore, additional studies conducted after construction, such as normative data, were not contemplated in the analysis.

Study selection
All the articles found in the databases were saved in the Rayyan electronic reference manager. After removal of duplicates, all articles were evaluated by two independent researchers (DF and BF) based on their titles and abstracts, according to the inclusion criteria of this study. In this stage, the researchers classified the articles as "yes," "no," or "perhaps." Subsequently, the researchers reached consensus as to whether the articles classified as "perhaps" should be included in the review.
After the inclusion of these studies, three researchers (DM, BF, and MB) read the articles in full and extracted information such as year of publication and study locus, name of the database built, characteristics of the participants who expressed the emotions (number of participants, place of recruitment, gender, age, and race), basic emotions expressed, and the final total of stimuli included in the database and their specific characteristics (Table 1) [12][13][14][15][16] . Subsequently, the methodological characteristics of the databases were collected, such as the method used to elicit the emotions, standards in the capture of stimuli, criteria used in the validation stage, sample characteristics in the validation stage, and psychometric qualities assessed (Table 2) [12][13][14][15][16] .

Risk of bias
The studies selected in this review concern the construction of face databases. In this sense, the traditional risk of bias tools used in randomized and nonrandomized studies are not applicable. The task elaborated by each study must offer valid and interpretable data for the assessment of facial recognition of basic emotions of individuals in certain contexts. Therefore, the quality of the included studies can be assessed based on the analyses performed for the reliability and validity of the databases elaborated 18,19 .

Data analysis
We analyzed the psychometric properties assessed by the studies in the stimulus validation stage (Table 2) 64,65 . This information is important to assess the quality of the elaborated database. Qualitatively, we followed the standards for educational and psychological testing of the American Educational Research Association 20 and the stages specified in Resolution 09-2018 of the Brazilian Federal Council of Psychology 21 , which regulates the dimensions necessary for the assessment of psychological tests. Consequently, information based on the analysis of the database items and the measures of validity evidence were obtained (Table 2).
In addition, in Table 2 we sought to identify when the psychometric measures assessed by the studies presented satisfactory indexes. For accuracy, as a reference standard we used the consensus among most of the studies on the construction of face databases, which include stimuli with recognition rates ≥70%. In some cases, the studies established other recognition rates, which are indicated with symbols in the table.
Since accuracy is a fundamental indicator for stimuli selection and has been widely used as a quality parameter in construction studies, this variable is included in the table as an indicator of both precision and content-based validity evidence, since it is a precision measure that was used to validate the database content. For agreement among the evaluators, the studies generally use Cohen's or Fleiss' kappa indexes; therefore, we used values ≥0.60 as a reference 22,23 . For internal consistency, we used Cronbach's alpha values >0.70 as a reference 24 .

RESULTS

Figure 1 presents the search and selection process for the 36 articles included in this systematic review [12][13][14][15][16][17] . Table 1 presents the general characteristics of the face databases included, and Table 2 presents the methodological characteristics used to create each of them.
In relation to the theoretical framework used for the construction of the databases, 75% of the studies were empirically based; in other words, the limitations of previously built databases were the basis for their construction.

Most of the articles (61.1%) elaborated databases composed of the six basic emotions (i.e., happiness, sadness, fear, anger, disgust, and surprise), as well as neutral faces. Some databases did not include neutral faces, or surprise and disgust. Two databases only included happiness and neutral faces; one database only included happiness, fear, and neutral faces; and another included only happiness, sadness, anger, and surprise.
In relation to the participants, 41.7% of the selected studies resorted to actors (either amateur or professional) to express the emotions. The mean age of the actors varied from 13.24 to 73.2 years, with four studies including different age groups in their databases. Only five of the studies with actors included different races in their samples, and seven studies included a single specific race, namely, Caucasian, Japanese, Korean, Polish, Indian, or Chinese. Three studies did not report the actors' race.
The other studies, that is, those in which the basic emotions were expressed by community-dwelling individuals from various contexts, presented ages varying from 4 months to 93 years, and five of these studies included volunteers of different ages. Of these, 10 studies included participants of different races, and the remaining studies included only one race, namely, Korean, Caucasian, Indian, or Chinese. Three studies did not report the participants' race. With regard to the presentation of the stimuli, 86.1% of the studies included color faces in their databases, four studies used black-and-white faces, and one study included both color and black-and-white faces in its database.
Most of the included databases (75%) present static stimuli, four studies present dynamic stimuli, and five databases have both static and dynamic stimuli. Five studies presented open- and closed-mouth expressions, and other studies included additional features such as varying intensities and varying angles. The final total of stimuli included in the databases varied from 42 to 18,800.

Method used to elicit the emotions
The method used to elicit the emotions varied across the studies; in general, more than one method was used at this stage. Predominantly, 44.4% of the studies used specific situations as one of the ways to elicit the intended emotions, such as "Imagine that you have just won the lottery; imagine that you have just lost a loved one." Other studies used instructions based on the muscle movements of the emotions, following protocols such as the Investigator's Guide for the Facial Action Coding System (FACS); others used a photograph as a model; and others elicited the emotions from photographs and/or videos. Two studies that built databases with infants and children used an instructional protocol, performed by the parents, to elicit the intended emotions. In one study, the individuals could express the emotion any way they wanted. Three studies elicited emotions in the participants through verbal instructions, such as "Make a happy face," and one study used workshops to teach children how to express basic emotions, as well as a Directed Facial Action Task used to guide the movement of anatomical landmarks.

Recording the stimuli
Most of the studies sought to establish and describe standards for recording the stimuli. For example, the images were photographed against a white, black, or gray background, and the individuals wore black or white garments. In addition, 55.6% of the studies established distractors that should be removed from the volunteers before the images were recorded, such as jewelry, accessories, and strong makeup.

Validation stage
The number of participants who validated the faces constructed by the studies varied from 4 to 1,362, and most of the participants who validated the stimuli were recruited in a university context. The way to validate the final stimuli in the database varied across the studies. The majority included recognition accuracy as one of the criteria, with inclusion thresholds ranging from >50% to ≥75%. The studies also used other criteria to include the stimuli in the final database, such as agreement among the evaluators.
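As an illustration of this kind of inclusion criterion, the sketch below filters hypothetical stimuli by recognition accuracy using the common ≥70% threshold. The stimulus IDs, response data, and function names are assumptions for the example and do not come from any reviewed study.

```python
# Illustrative sketch (hypothetical data): selecting stimuli for a final
# database using a recognition-accuracy inclusion criterion.

def recognition_accuracy(responses, intended_emotion):
    """Proportion of evaluators who labeled the stimulus with the intended emotion."""
    return sum(r == intended_emotion for r in responses) / len(responses)

def select_stimuli(stimuli, threshold=0.70):
    """Keep only stimuli whose recognition accuracy meets the threshold."""
    return [s for s in stimuli
            if recognition_accuracy(s["responses"], s["intended"]) >= threshold]

stimuli = [
    {"id": "F01", "intended": "happiness",
     "responses": ["happiness"] * 18 + ["surprise"] * 2},   # 90% accuracy
    {"id": "F02", "intended": "fear",
     "responses": ["fear"] * 12 + ["surprise"] * 8},        # 60% accuracy
]

final = select_stimuli(stimuli)
print([s["id"] for s in final])  # only F01 passes the 70% criterion
```

Since the reviewed studies adopted thresholds anywhere from >50% to ≥75%, the cutoff is best kept as a parameter rather than a fixed value.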

Psychometric properties of the final database
Only one study did not include accuracy as a precision measure. In most cases, accuracy was also used to validate the task content and even for item analysis. One study also used the split-half method as a precision measure. In 66.7% of the studies, the stimuli were recognized with ≥70% accuracy.
Test-retest reliability was a variable used to assess task precision in four studies, all presenting satisfactory indexes for this dimension. Regarding the measures of validity evidence, 10 studies used Cohen's kappa or Fleiss' kappa to validate the task content according to the agreement among the evaluators. All of them presented satisfactory indexes in this dimension. Only one study used Cronbach's alpha to assess internal consistency, also reporting a satisfactory value.
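The two indexes mentioned above can be computed with short, self-contained formulas. The sketch below is illustrative only (the ratings are hypothetical, not data from any reviewed study): Cohen's kappa for agreement between two evaluators and Cronbach's alpha for internal consistency.

```python
# Hedged sketch of two psychometric indexes on hypothetical data.

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    p_expected = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                     for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

def cronbachs_alpha(items):
    """Cronbach's alpha from item-score columns (one list of scores per item)."""
    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Hypothetical labels from two evaluators for four stimuli:
rater1 = ["happiness", "sadness", "happiness", "fear"]
rater2 = ["happiness", "sadness", "happiness", "sadness"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.6, a satisfactory index per the review's cutoff
```

Fleiss' kappa, used by studies with more than two evaluators, follows the same chance-correction logic but is computed over category proportions per stimulus.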
Six studies analyzed the items' difficulty. Three studies used Item Response Theory (IRT); one study analyzed difficulty according to the intensity and representativeness scores; one study used the Classical Test Theory (CTT); and one study used discrimination.
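For the CTT analyses cited above, item difficulty is typically the proportion of correct responses, and a simple discrimination index contrasts high and low scorers. The following is a minimal sketch on hypothetical 0/1 recognition data (1 = stimulus recognized correctly; rows are respondents, columns are stimuli); it is not the procedure of any specific study.

```python
# Hypothetical 0/1 response matrix: 4 respondents x 4 stimuli.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]

def item_difficulty(responses, item):
    """CTT difficulty: proportion of respondents recognizing the item correctly."""
    col = [row[item] for row in responses]
    return sum(col) / len(col)

def item_discrimination(responses, item):
    """Upper-lower discrimination: difficulty in the top half minus the bottom
    half of respondents ranked by total score."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[half:]
    return item_difficulty(upper, item) - item_difficulty(lower, item)

print(item_difficulty(responses, 3))      # 1.0: everyone recognized stimulus 3
print(item_discrimination(responses, 3))  # 0.0: it does not discriminate
```

A very easy item (difficulty near 1.0) discriminates poorly, which is one reason construction studies combine difficulty with discrimination or IRT-based analyses.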
Two studies presented validity evidence based on the internal structure: one used exploratory factor analysis and the other resorted to factor analysis through a two-parameter Bayesian model. In addition, another study presented validity evidence based on convergent relationships, presenting a descriptive comparison of the database built with the POFA database, with satisfactory indexes.
Fourteen (38.9%) studies presented validity evidence based on the relationship with other variables.

DISCUSSION
The ability to recognize emotional facial expressions can be modulated by variables such as gender, age, and race. In this sense, a number of studies have sought to elaborate valid facial expression databases to assess the recognition of emotions in specific populations and contexts. However, the methodological heterogeneity among construction studies can make it difficult to establish patterns for the construction of these stimuli, regardless of the context and the characteristics of those who express them. This systematic review sought to gather the studies that built face databases to assess the recognition of basic emotions, describing and comparing the methodologies used in their development.

General characteristics of the face databases included
The way to present the stimuli of an emotion recognition test has long been a target of discussion among researchers in the area, since a pioneering study showed that the recognition of static and dynamic facial emotional stimuli involves different neural areas 66 . In this review, most of the studies consist of static stimulus databases. The difference in the recognition of static versus dynamic stimuli is still an open discussion, given that some studies report a higher recognition rate for dynamic stimuli 67,68 , while others point to minimal or no difference in the recognition of these stimuli 69,70 .
Khosdelazad et al. 71 investigated differences in performance on three emotion recognition tests in 84 healthy participants. The results point to a clear difference in performance between tests with static and dynamic stimuli, with the stimuli that change from a neutral face to the intended emotion (dynamic) being the most difficult to recognize, given the low performance on that test 71 . Variables such as age and schooling also modulated performance on the tests, highlighting the importance of normative data regardless of the type of stimulus chosen 71 .
Several stimuli databases for facial expressions of emotions were developed in order to be used in specific populations and cultures 72 . Cultural issues must be taken into account when understanding these emotional expressions, as they can exert an influence on their recognition 73 . A study that considered ethnicity as an influencing factor in the performance of emotion recognition tasks and compared this ability to identify emotions between Australian and Chinese individuals verified that people perform worse when classifying emotions that are expressed on faces of another ethnicity 74 . In this sense, the cultural characteristics of the stimulus presented can also modulate performance in the test.
In addition to the difference in the pattern of response when recognizing emotions from another culture, studies showed that there is still a difference in the pattern of intensity recognized, regardless of the race or gender of the stimulus presented 75,76 . This probably happens because we manage our emotions according to what we learn throughout our lives, clearly shaped by the cultural context in which we are inserted 76,77 . Thus, we learn in certain situations to hide or amplify our emotions, consequently affecting how we recognize emotions and highlighting the clear influences of culture on our social and cognitive abilities 76,78 .
Furthermore, when we think about the modulating character of the cultural context in the recognition of emotions, it is important to highlight the impact that socioeconomic status can also have on this ability. In particular, some countries and regions with greater socioeconomic disparities may reflect different patterns of cognitive abilities 79 . For example, a large international study investigated, in 12 countries and 587 participants, the influence of nationality on core social cognition skills 80 .
After controlling the analyses for other modulating variables such as age, sex, and education, the results showed that a variation of 20.76% (95%CI 8.26-35.69) in the score of the test that evaluated emotion recognition can be attributed to the nationality of the individuals evaluated 80 . These results make us reflect on the cultural disparities that exist in underdeveloped countries and how these aspects can influence social and cognitive variables, as well as the recognition of emotions discussed here.
In addition, aspects related to the profile of the stimuli can also interfere with task performance. Five studies in this review presented open- and closed-mouth expressions, and other studies included additional features such as varying intensities, gaze directions, and varying angles. These variables can also modulate task performance. Emotions expressed with the mouth open seem to increase the intensity of the emotion perceived by the subject 81,82 . Consequently, incorporating this face variation into the database can be important to assess the emotion experienced by the individual who recognizes the stimuli. In addition, open-mouthed facial expressions seem to draw the respondent's attention more than closed-mouthed expressions 81 .
Hoffmann et al. 83 found a correlation between the intensity of an emotion and the accuracy of its recognition, whereby higher intensities were associated with greater accuracy in the perception of the face. However, Wingenbach et al. 84 did not find effects of the intensity level on expression recognition. Despite these conflicting results regarding emotion intensity, it can still be an important variable to take into account in the construction of databases, in order to compare recognition across different degrees of intensity.
The perception of the expressed emotion can also be modulated by the gaze direction of the person expressing it 85 , so that when the gaze is directed at the participant, recognition is greater than when the gaze is averted 86 . In addition, photographing the expressions from different angles can increase the ecological validity of the database built 38 .

Methodological characteristics used in the studies
Method used to elicit the emotions
An important methodological choice in the studies that elaborate face databases is the way in which the stimuli will be elicited and who is going to express them. Our results show that most of the studies included in this systematic review resort to actors (either amateur or professional) to express the emotions. Such a methodological choice can be justified by the fact that people who have experience in acting are able to express more realistic emotions than individuals without any experience 87 . Thus, resorting to actors to act out emotions can be advantageous with regard to bringing the expressed emotions closer to a real context.
The literature indicates that there are three different ways to induce emotions 88,89 , namely:
• Posed emotions;
• Induced emotions; and
• Spontaneous emotions.
Posed emotions are those expressed by actors or under specific guidance, tending to be less representative of an emotion expressed in a real context 89 . Induced emotions have a more genuine character than posed emotions, as varied eliciting stimuli are presented to the participant in order to generate the most spontaneous emotion possible 89 . However, it is noteworthy that this way of inducing emotion can also have limitations as to its veracity, since induction is carried out in a context controlled by the researcher 89 . Spontaneous emotions are considered closer to a real-life context. However, due to their observable character, their recording could only be possible when the individuals are not aware that they are being recorded. Thus, any research procedure can bias this spontaneity 89 .
To increase induction effectiveness, the studies use a combination of techniques and procedures to facilitate achievement of the intended emotions. Among the 36 studies analyzed in this review, 44.4% used specific hypothetical situations as one of the ways to elicit the intended emotions, such as "Imagine that you have just won the lottery; imagine that you have just lost a loved one." Thus, despite induction being generated in a controlled context, using hypothetical everyday situations aims at remedying the limitation of expressions that are not very representative of real life.

Recording the stimuli
All construction studies try to capture stimuli following some kind of pattern. Some explore this pattern more in detail and others are more objective. Despite this, the data included in this review indicate that it is important to standardize the clothes worn by the participants and the background they are positioned against during the capture of stimuli.
In addition, most construction studies have established distractors that should be removed prior to image capture, such as jewelry, accessories, and strong makeup. Our hypothesis is that these distractors could direct the attention of those who respond to the task and exert an impact on recognition performance, since attention can be a modulating variable in emotional tasks 90 .

Validation stage
The way to validate the stimuli in the databases elaborated varies greatly across the studies. Based on the methods used in the construction, the validation criteria are defined. Accuracy is the most used precision indicator in the development and validation of face databases that assess recognition of emotions 12,13 , which is why it was presented in most of the studies included. Recognition rate ≥70% is the most frequently used. However, the choice of which criterion to adopt at this stage is varied, and it is common to adopt other rates and criteria to validate the database, such as intensity, clarity, and agreement between evaluators.

Psychometric properties of the final database
We sought to follow the standards established by Resolution 09-2018 of the Federal Council of Psychology, which regulates the dimensions necessary for the assessment of psychological tests, to verify the psychometric qualities of the databases. Although the studies present the construction of tasks and not of instruments, recognition of emotions is an important skill that allows for interaction in society and can be used in the assessment of social cognition to predict the diagnosis of mental disorders 91 .
The analyses presented by the studies at this stage are also heterogeneous. However, some dimensions presented in the studies are strictly necessary to verify the quality of the elaborated database. With regard to the technical requirements, it is important to evaluate dimensions related to precision and validity evidence of the constructed task 20,21 . It is worth noting that normative data are also important to assess the quality of the task; however, this variable and other important analyses were not included in this review, as they are found in articles published separately.

This review showed that the studies that elaborate face databases for the recognition of emotions present heterogeneous methods. However, similarities between the studies allow us to trace important patterns for the development of these stimuli, such as using more than one method to elicit the most spontaneous emotion possible, standardizing the characteristics of the volunteers for capturing the stimuli, validating the database based on preestablished criteria, and presenting data referring to precision and validity evidence. With regard to future directions related to the research methods, greater standardization of the methods for eliciting and validating emotions would make the choice of the type of task to be used in each context more reliable.