A comparison of three statistical methods applied in the identification of eating patterns

This work aimed to compare the results of three statistical methods applied in the identification of dietary patterns. Data from 1,009 adults between the ages of 20 and 65 (339 males and 670 females) were collected in a population-based cross-sectional survey in the Metropolitan Region of Rio de Janeiro, Brazil. Information on food consumption was obtained using a semi-quantitative food frequency questionnaire. A factor analysis, cluster analysis, and reduced rank regression (RRR) analysis were applied to identify dietary patterns. The patterns identified by the three methods were similar. The factor analysis identified "mixed", "Western", and "traditional" eating patterns and explained 35% of the data variance. The cluster analysis identified "mixed" and "traditional" patterns. In the RRR, the consumption of carbohydrates and lipids was included as the response variables, and again "mixed" and "traditional" patterns were identified. Studies comparing these methods can help to inform decisions as to which procedures best suit a specific research scenario.

Keywords: Food Consumption; Food Habits; Statistical Factor Analysis

Cunha DB et al. Cad. Saúde Pública, Rio de Janeiro, 26(11):2138-2148, nov, 2010


Introduction
The human diet is extremely complex. When energy and nutrient intakes are used to study the effects of diet on the development of disease, it is often difficult to observe associations between specific outcomes and diet. The analysis of food consumption through an examination of dietary patterns is an alternative strategy to understand the relationships among dietary elements. The identification of eating patterns yields a more coherent model in the study of eating habits that makes it possible to identify population subgroups at risk of disease and to propose well-grounded dietary guidelines 1,2,3.
Dietary patterns can be identified by a priori hypothesis-oriented methods (such as indexes and scores based on food guides and nutritional recommendations) or by exploratory statistical procedures that analyze the food intake data co-variation structure and reveal dietary patterns, which are interpreted a posteriori 3,4. Two approaches have been frequently used: the factor analysis 5,6,7,8,9,10 and the cluster analysis 11,12,13,14. Recently, the method known as reduced rank regression analysis (RRR) or maximum redundancy analysis has been applied for the same purpose 15,16,17,18,19. This approach is both a priori hypothesis-oriented and an exploratory statistical method, since its application requires the definition of response variables, based on the scientific knowledge of the disease physiology, and the subsequent analysis aims to explore the data 20.
In the present study, three methods (factor analysis, cluster analysis and RRR) are compared and applied to the identification of dietary patterns in a sample of adults living in a low socioeconomic neighborhood in the greater metropolitan region of Rio de Janeiro, Brazil.

Material and methods
The analysis included male and female subjects between the ages of 19 and 65; data were obtained in a population-based cross-sectional study carried out in the municipality of Duque de Caxias in the state of Rio de Janeiro, Brazil. Subject selection was based on a three-stage random sampling process (census tract, residences and individuals). The calculated sample size was 1,125 households and was based on a 14.5% prevalence rate of "extreme poverty" (defined as a monthly per capita income equal to one quarter of the monthly minimum wage, or approximately US$ 30 in 2005) and an acceptable maximum error of 5%.
Data were collected in household interviews conducted between May and December 2005. All subjects signed a free consent form, and the research was approved by the Institutional Research Board of the Institute of Social Medicine at the State University of Rio de Janeiro. Detailed information about sampling and data collection methods can be found in Salles-Costa et al. 21.
Food consumption was estimated using a semi-quantitative food frequency questionnaire (FFQ) that had been validated for the adult population of the Rio de Janeiro area 22. The FFQ included 82 foods, three serving sizes and eight options for reporting the frequency of food intake (ranging from "never or almost never" to "more than three times per day").
In the first step of the analysis, the reported intake frequencies of the 82 food items in the FFQ were converted to daily consumption frequencies; these food items were grouped into 21 categories according to their nutritional characteristics and usual frequency of consumption in the study population (Table 1). Some foods were not combined with others and were kept as separate categories for different reasons: for example, eggs, soft drinks and sweetened fruit juices were kept separate because of their nutritional composition, whereas rice, beans and bread constituted three different categories because they are major staple foods in the regional diet and are consumed in large quantities by the studied population.
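The conversion and grouping step described above can be sketched as follows. This is only an illustration: the two extreme frequency labels come from the questionnaire description and the placement of pizza and popcorn follows Table 1, but the intermediate labels and all numeric weights are hypothetical placeholders, not the values used in the study.

```python
# Hypothetical mapping from FFQ frequency categories to times per day.
# Only the first and last labels appear in the text; the rest are assumed.
DAILY_FREQUENCY = {
    "never or almost never": 0.0,
    "1-3 times per month": 2 / 30,
    "once per week": 1 / 7,
    "2-4 times per week": 3 / 7,
    "5-6 times per week": 5.5 / 7,
    "once per day": 1.0,
    "2-3 times per day": 2.5,
    "more than three times per day": 4.0,
}

# Small excerpt of an item-to-group assignment (rice and beans are their
# own categories; pizza and popcorn belong to "snacks and fast food").
FOOD_GROUPS = {
    "rice": "rice",
    "beans": "beans",
    "pizza": "snacks and fast food",
    "popcorn": "snacks and fast food",
}


def daily_group_frequencies(responses):
    """Sum the daily frequencies of FFQ items within each food group."""
    totals = {}
    for item, category in responses.items():
        group = FOOD_GROUPS[item]
        totals[group] = totals.get(group, 0.0) + DAILY_FREQUENCY[category]
    return totals


# One hypothetical respondent.
example = {
    "rice": "2-3 times per day",
    "beans": "once per day",
    "pizza": "once per week",
    "popcorn": "never or almost never",
}
groups = daily_group_frequencies(example)
```

Summing item frequencies within a group is one simple aggregation choice; a study could equally weight items by serving size before summing.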

Factor analysis
Factor analysis is "a set of statistical techniques that are applied with the aim of representing or describing a number of initial variables by a smaller number of hypothetical variables" 23 (p. 223). Thus, a smaller set of variables (called "factors" or "components") can be identified from the correlation structure of a given set of variables; these factors represent the variance of the original data 24,25.
The Bartlett Test of Sphericity (BTS) and the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) were used to evaluate whether the data were suitable for a factor analysis. The principal component analysis (PCA) method was used for factor extraction, and the factors were orthogonally rotated using the varimax procedure in order to improve the interpretation of the results 25,26.
The procedure generated factor loadings for each variable (in this case the food groups) related to each factor (or dietary pattern) identified. These loadings measure the correlation between the identified dietary pattern and the original variables (or food groups) 26. Thus, factor loadings with positive values indicate that the variable is associated with the eating pattern, and negative values show that the food group is inversely related to the pattern. Larger factor loading values indicate a greater contribution of that food group to the dietary pattern. Food items were retained in the pattern if the factor loading value was equal to or above 0.30. The communalities represent the variance of each item explained by all factors combined. In this study, a minimum communality value equal to or greater than 0.25 was considered to be acceptable 23.
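The retention rule and the communality check can be illustrated with a short sketch. The loadings below are invented for illustration, and the communality is computed as the sum of squared loadings across factors, which assumes orthogonal factors (as with the varimax rotation described above).

```python
LOADING_CUTOFF = 0.30      # retention threshold stated in the text
COMMUNALITY_CUTOFF = 0.25  # minimum acceptable communality stated in the text

# Hypothetical rotated loadings: food group -> loading on each of three factors.
loadings = {
    "vegetables": [0.62, 0.05, -0.10],
    "rice": [-0.08, 0.12, 0.71],
    "soft drinks": [0.10, 0.55, 0.02],
    "eggs": [0.20, 0.15, 0.10],
}


def communality(row):
    """Variance of one food group explained by all factors combined
    (sum of squared loadings; valid for orthogonal factors)."""
    return sum(value ** 2 for value in row)


def retained_groups(table, factor_index, cutoff=LOADING_CUTOFF):
    """Food groups whose loading on the given factor meets the cutoff."""
    return [group for group, row in table.items() if row[factor_index] >= cutoff]


first_pattern = retained_groups(loadings, 0)
low_communality = [
    group for group, row in loadings.items()
    if communality(row) < COMMUNALITY_CUTOFF
]
```

The rule is applied here literally, to the signed loading; an analysis that also wants to flag inversely related food groups would compare the absolute value instead.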
The analysis also estimates eigenvalues, which indicate the amount of the total data variance that can be explained by each factor. The decision to consider a dietary pattern was based on the scree test 27, a chart that depicts the eigenvalues and their corresponding extracted factors (Figure 1). The points were plotted on a chart, and factors with eigenvalues located before the inflection point of the plotted line were retained.
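To make the link between eigenvalues and explained variance concrete, the sketch below assumes that the extraction is run on the correlation matrix, so that the total variance equals the number of variables (here, the 21 food groups). The eigenvalues are hypothetical and are not the ones obtained in the study.

```python
N_FOOD_GROUPS = 21  # number of variables entered in the analysis


def explained_proportions(eigenvalues, n_variables):
    """Proportion of total variance per factor for a correlation-matrix
    extraction, where the total variance equals the number of variables."""
    return [value / n_variables for value in eigenvalues]


def cumulative(values):
    """Running total, e.g. cumulative variance explained."""
    totals, running = [], 0.0
    for value in values:
        running += value
        totals.append(running)
    return totals


# Hypothetical eigenvalues for the first three extracted factors.
eigenvalues = [4.2, 2.1, 1.05]
proportions = explained_proportions(eigenvalues, N_FOOD_GROUPS)
cumulative_proportions = cumulative(proportions)
```

The scree test itself remains a visual judgment about where the plotted eigenvalues level off, so it is not automated here.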
Factor scores are standardized variables that represent the adherence of the subject to each factor 24,25. Those values were saved for subsequent analysis of the association between dietary patterns and a given outcome. Most statistical software programs are able to perform a factor analysis. In this study, the analysis was performed with the Data Reduction - Factor Analysis procedure in the SPSS version 16.0 package, selecting the following options: in ex[...]

Figure 1
Scree plot test to define the factors to be retained in the model (factor analysis).

Cluster analysis
Cluster analysis is another method used to identify dietary patterns. Instead of grouping variables into factors, this method is based on the clustering of individuals according to regularities in their food intake. Thus, the aim of the method is to assign individuals to subgroups (or clusters) in which food consumption is relatively homogeneous; the variability among individuals within a cluster is supposedly small, while the between-groups variability is important because of the differences in food intake among subgroups 1,28,29. In this procedure, individuals were classified in a predefined number of clusters by means of the Euclidean distance metric. The initial cluster seeds were followed by repeated comparisons between the means of the initial clusters and subsequent updates of the cluster groupings and means. Subjects were moved between mutually exclusive clusters and new means were computed until the distances between the observations within a cluster were minimized relative to the distances between the clusters 30,31,32.
Most statistical software packages are able to perform a cluster analysis. In this study, the Classify - K-Means Cluster procedure in the SPSS program (version 16.0) was used with a maximum of 20 iterations. This procedure is particularly useful for large samples. The analysis was performed for two and three clusters, and the most interpretable solution was compared with the other methods of dietary pattern identification.
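The iterative reassignment described above can be sketched in a few lines. This is a generic k-means illustration with Euclidean distance and a cap of 20 iterations, not a reproduction of the SPSS procedure; the intake values and the choice of seeds are hypothetical.

```python
import math


def kmeans(points, seeds, max_iter=20):
    """Assign each point to its nearest centre (Euclidean distance) and
    update the centres until assignments stop changing or max_iter is hit."""
    centers = [list(seed) for seed in seeds]
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centre if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical daily frequencies of (rice, soft drinks) for six subjects.
points = [(2.0, 0.1), (2.5, 0.0), (1.8, 0.2), (0.3, 1.5), (0.5, 2.0), (0.2, 1.8)]
labels, centers = kmeans(points, seeds=[points[0], points[3]])
```

Running the same function with two and with three seeds mirrors the comparison of two- and three-cluster solutions; which solution is more interpretable remains a judgment made by the analyst.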

Reduced rank regression analysis (RRR)
The RRR (also called the maximum redundancy analysis method), introduced in nutritional epidemiology by Hoffmann et al. 19, is similar to the factor analysis. However, the RRR includes two different kinds of variables: predictors and responses. The factor analysis cannot always satisfactorily identify dietary patterns that are predictors of disease, and the inclusion of response variables in the RRR approach represents an advance in the study of the effects of diet on the development of chronic disease. The RRR simultaneously encompasses both hypothesis-oriented and exploratory approaches 2. Usually, the predictor variables are the intakes of specific foods or food groups and the response variables are defined by biomarkers or other mediating variables (e.g., nutrients or ratios of nutrients) that are presumed to be important for the development of the disease under study 19. Unlike the factor analysis (which yields factors that explain the maximum variation in the consumption of food groups), the RRR analysis yields linear combinations of food groups that explain the maximum variation in the response variables 2,33,34.
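One common way to write this objective, given here as a hedged sketch in the redundancy analysis formulation rather than as the exact algorithm implemented by any particular software, is the following. The symbols are introduced only for illustration: X is the n × p matrix of standardized food group intakes, Y is the n × q matrix of standardized response variables with columns y_j, and S denotes sample covariance matrices.

```latex
% First RRR pattern: the weight vector whose score Xw is most strongly
% related, on average, to the q response variables.
\mathbf{w}_1 \;=\; \arg\max_{\mathbf{w}} \; \sum_{j=1}^{q}
  \operatorname{corr}^2\!\left(\mathbf{X}\mathbf{w},\, \mathbf{y}_j\right)
  \quad \text{subject to} \quad \operatorname{var}(\mathbf{X}\mathbf{w}) = 1 .

% The solutions are eigenvectors of a p x p matrix built from the
% covariances between predictors and responses:
\mathbf{S}_{XX}^{-1}\,\mathbf{S}_{XY}\,\mathbf{S}_{YX}\,\mathbf{w}
  \;=\; \lambda\,\mathbf{w},
\qquad
\operatorname{rank}\!\left(\mathbf{S}_{XY}\mathbf{S}_{YX}\right) \le \min(p, q).
```

Because the rank of the matrix is at most q, no more than q patterns with nonzero eigenvalues can be extracted, so an analysis with two response variables can yield at most two patterns.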
In an RRR, the number of response variables limits the number of dietary patterns that can be identified 19,35. In this study, the 21 food groups mentioned above were the predictor variables in the RRR, while the RRR response variables were the consumption of carbohydrates and fat (both in grams per day). These response variables were chosen because of their known importance to weight variation 36,37,38. The RRR procedure calculates X- and Y-scores 15; in the present analysis, the X-scores were derived from the intakes of each food group (the predictor variables) and the Y-scores were based on the intakes of carbohydrates and fat (the response variables). The RRR was completed with the PLS procedure in SAS for Windows, version 9.1 (SAS Inst., Cary, USA). The SAS code for the RRR procedure is shown in Table 2. Cronbach's alpha value was estimated to ascertain the extracted patterns' internal consistency, and, finally, the patterns were named based on the interpretation of the data.
To compare the methods, the means of the X-scores (RRR) and factor scores (factor analysis) were calculated and analyzed according to the identified cluster solutions.
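A minimal sketch of this comparison is shown below: pattern scores are grouped by cluster membership, and a mean with a 95% confidence interval is computed for each cluster. The scores are invented, and the interval uses a normal approximation (z = 1.96), which is an assumption of this sketch rather than a detail reported in the study.

```python
import math
from statistics import mean, stdev


def mean_ci_by_cluster(scores, clusters, z=1.96):
    """Return {cluster: (mean, lower, upper)} using a normal-approximation
    confidence interval for the mean score within each cluster."""
    grouped = {}
    for score, cluster in zip(scores, clusters):
        grouped.setdefault(cluster, []).append(score)
    summary = {}
    for cluster, values in grouped.items():
        centre = mean(values)
        half_width = z * stdev(values) / math.sqrt(len(values))
        summary[cluster] = (centre, centre - half_width, centre + half_width)
    return summary


# Hypothetical factor scores for one pattern and the cluster of each subject.
scores = [1.2, 0.8, 1.0, -0.3, -0.1, -0.2]
clusters = ["mixed"] * 3 + ["traditional"] * 3
summary = mean_ci_by_cluster(scores, clusters)
```

The same function can be applied to the X-scores of each RRR pattern, giving one mean and interval per pattern and cluster.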

Results
A total of 1,253 individuals in the examined age range were interviewed; 222 individuals (17% of the total data set) were excluded because they reported implausible energy consumption [less than 500 kcal/day (n = 5) or more than 6,000 kcal/day (n = 217)]. Another 22 subjects were excluded because of incomplete information. The final sample consisted of 1,009 people; 34% (n = 339) were male and 66% (n = 670) were female. The average age of the participants was 39 years (standard deviation, SD = 12 years), and there was no statistically significant difference in mean age by sex (Student's t test; p-value = 0.73).
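The exclusion rule and the sample accounting above can be expressed as a short check. The thresholds and counts are taken from the text; the individual records are hypothetical.

```python
MIN_KCAL = 500    # reports below this were considered implausible
MAX_KCAL = 6000   # reports above this were considered implausible


def plausible_energy(kcal_per_day):
    """True when the reported energy intake is within the accepted range."""
    return MIN_KCAL <= kcal_per_day <= MAX_KCAL


# Hypothetical reported intakes (kcal/day) for four subjects.
reported = [450, 2300, 6100, 3800]
kept = [value for value in reported if plausible_energy(value)]

# Sample accounting with the counts reported in the text.
interviewed = 1253
excluded_energy = 5 + 217   # below 500 kcal/day + above 6,000 kcal/day
excluded_incomplete = 22
final_sample = interviewed - excluded_energy - excluded_incomplete
```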
Table 3 describes the dietary patterns identified through the factor, cluster and RRR analysis. The factor analysis indicated that three factors should be retained. These factors were: "mixed" (characterized by the consumption of cereals, fish and shrimp, vegetables, fruits, eggs, meats and caffeinated beverages), "traditional" (characterized by the consumption of rice and beans, breads, sauces, sugar and fats) and "Western" (characterized by a higher intake of juices, cakes [...]).
The solution of the cluster analysis identified two patterns, as follows: "mixed" (characterized by the consumption of fruit, juices, cakes and cookies, soft drinks, dairy products, pastries, snacks, fast food, sauces and fats) and "traditional" (characterized by the consumption of rice, beans, bread and sugar). In this analysis, 202 individuals were associated with the "mixed" cluster and 807 subjects were included in the "traditional" cluster.
The two patterns defined by the RRR were the "mixed" (including cereals, leafy greens, vegetables, roots, meat, eggs, sausage, caffeinated beverages, soft drinks, juice, milk and dairy products, sweets, cakes and cookies, sauces and fats, fast foods and sugar) and the "traditional" (which included caffeinated beverages, bread, rice and beans, and which was inversely associated with the consumption of cereals, fish and shrimps, meat and eggs).
Table 4 shows the factor loadings of patterns obtained from the factor analysis and RRR, the variation in food groups and nutrient sources explained by each of the dietary patterns and the Cronbach's alpha values. The "mixed" dietary pattern identified by the factor analysis explained 16% of the intake variance; the "Western" pattern explained 10% of the intake variance and the "traditional" pattern explained 9% of the intake variance. These patterns together explained 35% of the total variance in food consumption.
The patterns identified by the RRR explained, together, 17% of the data variance. The "mixed" pattern explained 54% of the carbohydrate intake variance and 67% of the fat intake variance. The "traditional" pattern explained 60% of the carbohydrate intake variance and 72% of the fat intake variance. Together, the response variables explained 60% of the variability of the first pattern and 66% of the variability of the second pattern. The Cronbach's alpha values identified for the three methods were acceptable: the values in the factor analysis were 0.64, 0.55 and 0.48, for "mixed", "Western" and "traditional" patterns, respectively. In the cluster analysis, the Cronbach's alpha values observed for "mixed" and "traditional" patterns were 0.67 and 0.44, respectively. In the RRR, the Cronbach's alpha values were 0.56 for the "mixed" pattern and 0.47 for the "traditional" pattern (Table 5). For the three analyses, there was no indication that the removal of any food group would improve the Cronbach's alpha.
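For readers unfamiliar with the statistic, Cronbach's alpha can be computed from the item variances and the variance of the summed score. The sketch below uses the standard formula with made-up data; treating the food groups that make up a pattern as the "items" is an assumption about how the statistic was applied, since the text does not spell this out.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of subject values:
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    totals = [sum(values) for values in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical intake frequencies of three food groups for four subjects.
items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]
alpha = cronbach_alpha(items)
```

Recomputing alpha after dropping each food group in turn is the usual way to check whether removing an item would improve internal consistency.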
The means and 95% confidence intervals of the factor scores obtained from the factor analysis and of the X-scores from the RRR were calculated for the two cluster solutions (Figure 2). The "mixed" cluster presented the highest mean factor scores for the "mixed" pattern and the lowest mean scores for the "traditional" pattern. On the other hand, in the "traditional" cluster, the factor scores were less pronounced, but the highest scores were observed for the "traditional" pattern identified in the factor analysis. For the RRR patterns, the highest scores in the "mixed" cluster were observed for the "mixed" RRR pattern. The highest scores in the "traditional" cluster were observed for the "traditional" pattern, although the mean was small, nearly zero.

Table 5
Median intake frequency of food groups according to the clusters of dietary patterns (and respective Cronbach's alpha). Adults (N = 1,009) living in a low socioeconomic neighborhood in the Rio de Janeiro Metropolitan Area, Brazil, 2005.

Discussion
This study found that factor, cluster and RRR analyses identified two comparable patterns: the "mixed" and the "traditional". The "mixed" patterns yielded by the three methods presented common food groups such as cereals, leafy greens, vegetables, roots, meat, eggs, sausage and caffeinated beverages; similarly, the "traditional" patterns described by the three methods shared foods like rice, beans and bread. The factor analysis also allowed for the identification of a third pattern, named "Western", that included some of the items assigned to the "mixed" patterns derived by the cluster and the RRR analysis, for example, fast foods, soft drinks, juice, milk and dairy, sweets, cakes and cookies. Although there are many published scientific articles discussing the identification of dietary patterns, few studies directly compare these three methods. Some studies compared the results of factor and cluster analysis 24,39,40,41,42. Additionally, factor analysis was compared with RRR by Di Bello et al. 43 and Hoffmann et al. 34. As in many other fields of knowledge, comparative studies are very useful when researchers must choose the most appropriate method for a specific scenario.
Other studies have also observed similar dietary patterns when comparing factor and cluster analysis 24,29,31,39,42. Lozada et al. 39 carried out a cross-sectional study involving 477 women aged 12-19 years; like the present study, they compared the means of extracted factor scores in each identified cluster in order to characterize cluster versus factor results. The authors also concluded that the dietary patterns obtained from the two methods were highly comparable. Newby et al. 29 analyzed the dietary patterns of 459 healthy subjects enrolled in the Baltimore Longitudinal Study of Aging. The authors compared cluster and factor analysis in relation to plasma lipid biomarkers and observed a high degree of compatibility between the patterns.

Figure 2
Mean (and 95% confidence interval, 95%CI) of factor scores for patterns derived by factor and reduced rank regression (RRR) analysis according to the clusters identified in cluster analysis. Adults (N = 1,009) living in a low socioeconomic neighborhood in the Rio de Janeiro Metropolitan Area, Brazil, 2005.
Cluster analysis has the advantage of categorizing each subject exclusively into one pattern. Factor analysis, however, attributes factor score values based on the extracted dietary patterns of all subjects and can make it difficult to translate results to the individual level. The concept of assigning a factor score is less intuitive than assigning individuals to a subgroup; on the other hand, factors are traditionally estimated by procedures that guarantee their statistical independence (such as orthogonal rotation), which allows them to be used in linear regression or other types of modeling 31. Also, factor analysis appears to be reproducible over time and across different dietary assessment methods 44. In this study, factor analysis was able to reveal the diversity and specificity of the dietary patterns; it identified the "Western" pattern in addition to the "mixed" and "traditional" patterns.
In the present study, RRR explained less of the consumption variability than factor analysis, since RRR focuses on the variation of the response variables that can be explained by the dietary patterns. The RRR allows the inclusion of an a priori hypothesis through the response variables; therefore, this approach allows researchers to examine biologically important intermediate variables in order to identify and interpret the associations between diet and disease. However, it is necessary to include response variables that are presumably predictive of the health outcomes under study, and this may not always be feasible 30. Additionally, Schulze & Hoffmann 20 pointed out that it is unlikely that the RRR would be able to identify dietary patterns that are related to all of the potential pathways involved in the relationship between diet and disease.
The use of any of the three compared procedures involves some degree of subjectivity, since some decisions are made arbitrarily, e.g., the definition of the variables to be included in a model, the number of factors that should be retained and the name associated with each pattern. This freedom of action and the fact that the identified dietary patterns are specific to a particular data set make it harder to compare results between different studies 28.
In summary, the factor analysis is more formalized and more commonly used in studies that are trying to identify dietary patterns. Cluster analysis has the advantage of generating food patterns that are mutually exclusive. RRR represents an interesting alternative approach for the study of dietary patterns, since, by including pre-defined variables, prior information is taken into account in the study. This makes RRR a powerful tool for the confirmation or rejection of a hypothesis 45.
The assumptions and objectives of a particular study must guide the selection of the statistical technique that will be used to identify food patterns. Studies of the identification of dietary patterns are ongoing, and analyses that compare and improve these methodologies are essential to provide technical and scientific bases for deciding which method is suitable in each specific research scenario.

Table 1
Food groups considered for the identification of dietary patterns of adults (N = 1,009) living in a low socioeconomic neighborhood in the Rio de Janeiro Metropolitan Area, Brazil, 2005.
Sweets: Caramels, ice cream, pudding, flan, other sweets, chocolate in powder form or in bars
Sugar: Added sugar
Snacks and fast food: Fried chips, other salty snacks, pizza and popcorn
Sauces and fats: Mayonnaise, bacon, butter or margarine

Table 2
SAS code for the reduced rank regression analysis (RRR) applied in the identification of dietary patterns.

Table 3
Dietary patterns identified by factor, cluster, and reduced rank regression (RRR) analysis. Adults (N = 1,009) living in a low socioeconomic neighborhood in the Rio de Janeiro Metropolitan Area, Brazil, 2005.

Table 4
Dietary patterns, factor loadings, proportion of variance explained and Cronbach's alpha for factor and reduced rank regression (RRR) analysis. Adults (N = 1,009) living in a low socioeconomic neighborhood in the Rio de Janeiro Metropolitan Area, Brazil, 2005.