Food patterns associated with overweight in 7-11-year old children: machine-learning approach

This longitudinal study aimed to present a better strategy and statistical methods for identifying food patterns predictive of overweight, and to demonstrate their use with data collected over the 2013-2015 period from schoolchildren aged 7 to 11 years, surveyed with the same food questionnaire (WebCAAFE) in Florianopolis, southern Brazil. Six meals/snacks and 32 foods/beverages yielded 192 possible combinations, termed meal/snack-specific food/beverage items (MSFIs). The LASSO algorithm (LASSO-logistic regression) was used to determine the MSFIs predictive of overweight/obesity, and binary (logistic) regression was then used to further analyze a subset of these variables. Late breakfast, lunch and dinner were all associated with increased overweight/obesity risk, as was an anticipated lunch. Time-of-day or meal-tagged food/beverage intake results in a large number of variables whose predictive patterns regarding weight status can be analyzed by machine learning methods such as LASSO, which in turn may identify patterns not amenable to other popular statistical methods such as binary logistic regression.


Introduction
Most food questionnaires applied at early school age are limited to asking about the frequency of consuming foods and beverages of interest, and avoid questions on quantity to reduce the cognitive burden on respondents. Studies based on food frequency questionnaires (FFQ) have provided only sparse evidence of an association between meal/snack frequency and overweight or obesity 1. On the other hand, a nationally representative UK study found that boys ate larger portions in the later part of the day 2.
Furthermore, two recent reviews of dietary patterns derived by dimension-reduction techniques for school-age children 3,4 found that only a few were predictive of overweight, especially when adjusted for well-established factors that contribute to this outcome, such as child age, sex, family income, and educational level. Although the reviews found high consistency in identifying specific food/beverage items associated with overweight/obesity, no evidence of relevant associations could be verified through meta-analysis because of the heterogeneity of the studies.
There are few longitudinal studies of children of primary school age. Various studies have shown that eating behavior at a later age is influenced by the food intake patterns acquired in the early school years, so that about half of obese children become obese adults [5][6][7]. Furthermore, dietary patterns of macronutrients are considered a better representation of complex interactions between nutrients 8, thus highlighting the need for longitudinal studies of schoolchildren's dietary patterns.
Virtually all publications on the association of food patterns with overweight first derive the patterns by a dimension-reduction method, such as principal components or factor/cluster/latent class analysis, and then relate these patterns to the outcome through logistic or multinomial regression 3,4. The latter is applied when overweight includes two categories (with and without obesity), so that weight status has three levels and normal weight is the natural reference category.
However, this strategy is not optimal if the predictive value of eating patterns is the primary goal of the analysis. The present study aims to present a better strategy and statistical methods to address this question and to demonstrate their use with data from a longitudinal study of 7-11-year-old children in southern Brazil.

Material and methods
The data were collected in a longitudinal population study of 7-11-year-old children, recruited in 2013 and followed up in 2014 and 2015 with the same WebCAAFE (acronym for "web-based food intake and physical activity evaluation" in Portuguese) questionnaire 9. Only public schools were targeted, and about 95% of these fully participated in the data collection. Three schools without a computer room were excluded. Primary sampling units were 2nd to 5th-grade classrooms, all of which participated in the surveys. Mentally handicapped and visually impaired children did not take part in this research. Both child and parental consent were obtained for 81.6% of the children.
Among the children with informed consent and complete information on age, sex, weight, height, and food consumption, 9.1% were excluded because of implausible dietary data, such as reporting fewer than four food items per day or falling outside the mean ± 3 standard deviation interval.
The studies were conducted according to the guidelines set out in the Code of Ethics of the World Medical Association (Declaration of Helsinki), and all procedures involving human subjects were approved by the Human Studies Committee of the Federal University of Santa Catarina (protocols 037/02, 028/06). Written informed consent was obtained from the parents and oral assent was obtained from the children.
Schoolteachers instructed the children to click on the WebCAAFE icon on the computer screen and start answering the questions; they also made themselves available to clarify possible doubts. An animated robot-like avatar guided the children while they answered the questionnaire. Before closing the block on food consumption, the children were presented with a tray of the foods/beverages they had selected for each meal and asked to check and revise their answers if necessary.
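The plausibility screen described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual procedure: the function name `plausible_reports`, the use of the total number of reported items as the screened quantity, and the example counts are all hypothetical.

```python
from statistics import mean, stdev


def plausible_reports(item_counts, min_items=4, n_sd=3.0):
    """Return the indices of reports that pass both plausibility checks.

    A report is kept when its total item count is at least `min_items`
    and lies inside the mean +/- n_sd standard deviation interval.
    """
    centre = mean(item_counts)
    spread = stdev(item_counts)  # sample standard deviation
    lower, upper = centre - n_sd * spread, centre + n_sd * spread
    return [
        index
        for index, count in enumerate(item_counts)
        if count >= min_items and lower <= count <= upper
    ]


# Hypothetical daily totals of reported food/beverage items, one per child.
counts = [12, 15, 9, 3, 14, 11, 13, 10, 16, 12,
          14, 11, 13, 12, 10, 15, 9, 13, 12, 90]
kept = plausible_reports(counts)
```

In this toy input the report with 3 items fails the minimum-count check and the report with 90 items falls above the upper bound, so both are dropped.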
Being overweight or obese was the primary outcome of interest. It was defined as a body mass index (BMI) z-score of 2.00 or higher, adjusted for child sex and age 10, following the definition of the World Health Organization. Anthropometric weight and height measurements were obtained by trained physical education teachers following standard procedures 11 from the children who were present at the school on the day of data collection.
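For illustration, the outcome definition can be sketched in code. Growth references are commonly tabulated as LMS parameters (Box-Cox power L, median M, coefficient of variation S) for each age and sex, and the sketch assumes that form. The LMS values below are hypothetical placeholders, not values from the reference cited in the text, and the function names are illustrative only. The 2.00 cutoff is the one stated above.

```python
import math


def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def lms_z_score(value, L, M, S):
    """z-score of `value` under the LMS method for one age-sex stratum."""
    if L == 0:
        return math.log(value / M) / S
    return ((value / M) ** L - 1.0) / (L * S)


def is_overweight_or_obese(z_score, cutoff=2.00):
    """Binary outcome using the cutoff stated in the text."""
    return z_score >= cutoff


# Hypothetical LMS parameters for a single age-sex stratum (assumption).
L_HYP, M_HYP, S_HYP = -1.5, 16.0, 0.1

child_bmi = bmi(weight_kg=32.0, height_m=1.25)
child_z = lms_z_score(child_bmi, L_HYP, M_HYP, S_HYP)
child_outcome = is_overweight_or_obese(child_z)
```

A child whose BMI equals the stratum median M receives a z-score of zero, which is a quick sanity check on the formula.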
Statistical analysis used three steps to evaluate the additional predictive value of food/beverage items and their patterns concerning overweight/obesity as a binary dependent variable, in addition to already established factors predictive of this outcome available from the WebCAAFE: child age, sex, school shift, and the survey year. In the first step, the lasso (least absolute shrinkage and selection operator), based on accuracy improvement in five-fold cross-validation, was used to select the variables associated with the outcome over and above the control variables (child age, sex, school shift, and survey year). In the second step, the lasso-selected and control variables were the independent variables for overweight/obesity in multiple logistic regression, and the time-of-day food patterns were extracted from the results, such as drinking milk for breakfast and after dinner, eating sweets only for dinner, and similar. In the third step, post hoc contrasts were calculated to compare the children with a particular food pattern to all other children.
Family income was available only through the mean census sector income depending on the school location and was transformed into quintiles of the income distribution. The choice of the control variables was based on publications in the area of schoolchild nutrition, in particular on earlier research in the same population 12. In total, 192 exposure variables (MSFI) and five control variables (child age, sex, school shift, survey year, and family income) were considered for statistical analysis.
Descriptive statistics presented key demographic characteristics and the most frequent MSFI choices. The command "xpologit" with accuracy improvement based on five-fold cross-validation was used within Stata software, version 16.1 (StataCorp. Stata: Release 16. Statistical Software. College Station, TX: StataCorp LLC, 2019). The area under the ROC curve (AUROC) for logistic model accuracy was calculated using five-fold cross-validation implemented in the cvauroc Stata program.
WebCAAFE has been amply validated and showed good external validity 13,14, reproducibility 15,16, and usability 9. Its accuracy was shown to be close to that of other similar instruments 13.
The appeal of machine learning methods, such as lasso, for selecting independent variables predictive of overweight among 192 MSFI variables lies in their better statistical properties compared with more traditional methods such as stepwise selection, especially when there is a large number of independent variables, possibly highly correlated. Stepwise selection is known as a "greedy" algorithm with a tendency to select too many independent variables, thus overfitting the regression model, whereas lasso is less prone to this difficulty 17. While the stepwise method produces an inconsistent estimator whose asymptotic distribution is not normal in this situation and does not converge to the true value 18, the lasso selection is consistent 19,20 and has better finite-sample properties, especially when combined with cross-validation 21. The latter was employed in the present study by partitioning the data into five non-overlapping subsets and averaging over these to calculate the model estimates.
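The selection step can be illustrated with a small, self-contained sketch of L1-penalized (lasso) logistic regression. It is not the Stata "xpologit" procedure used in the study. It is a plain proximal-gradient fit in which the penalty value is hand-picked rather than cross-validated, the control variable is simply left unpenalized, and the data are a synthetic balanced design. All names (`fit_l1_logistic`, `msfi_a`, `msfi_b`) are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink, or set exactly to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(rows, outcomes, penalty, penalised, step=1.0, n_iter=1000):
    """Proximal-gradient fit; `penalised[j]` says whether column j is shrunk."""
    n, p = len(rows), len(rows[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (predicted probability minus outcome) at the current fit.
        residuals = [
            sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) - y
            for row, y in zip(rows, outcomes)
        ]
        intercept -= step * sum(residuals) / n  # intercept is never penalised
        for j in range(p):
            gradient = sum(r * row[j] for r, row in zip(residuals, rows)) / n
            moved = coefs[j] - step * gradient
            coefs[j] = soft_threshold(moved, step * penalty) if penalised[j] else moved
    return intercept, coefs


# Synthetic balanced design: columns are [control, msfi_a, msfi_b].
# Overweight is more frequent when msfi_a is reported; msfi_b is unrelated.
rows, outcomes = [], []
for control in (0, 1):
    for msfi_a in (0, 1):
        for msfi_b in (0, 1):
            cases = 10 if msfi_a else 4  # cases among 20 children per cell
            for k in range(20):
                rows.append([control, msfi_a, msfi_b])
                outcomes.append(1 if k < cases else 0)

intercept, coefs = fit_l1_logistic(
    rows, outcomes, penalty=0.02, penalised=[False, True, True]
)
selected = [j for j, c in enumerate(coefs) if penalised_flag and c != 0.0] if False else [
    j for j in (1, 2) if coefs[j] != 0.0
]
```

In this toy design the unrelated item `msfi_b` ends with a coefficient of exactly zero and is therefore dropped, while `msfi_a` keeps a positive (shrunken) coefficient. The selected columns would then enter an ordinary logistic regression together with the controls, as in the second step described above.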
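The construction of the 192 exposure variables can be sketched as a cross product of meals and food items. The meal labels follow those mentioned in the text, whereas the food labels (`food_01` to `food_32`) and the example report are placeholders, since the full item list is not reproduced here.

```python
from itertools import product

MEALS = [
    "breakfast", "morning_snack", "lunch",
    "afternoon_snack", "dinner", "night_snack",
]
FOODS = [f"food_{i:02d}" for i in range(1, 33)]  # placeholder item names

# One binary variable per (meal, food) pair: 6 x 32 = 192 MSFIs.
MSFI_NAMES = [f"{meal}__{food}" for meal, food in product(MEALS, FOODS)]


def encode_report(report):
    """Turn {meal: set of foods} into a 0/1 vector aligned with MSFI_NAMES."""
    return [
        1 if food in report.get(meal, ()) else 0
        for meal, food in product(MEALS, FOODS)
    ]


# Hypothetical 24-hour report of one child.
example_report = {"breakfast": {"food_01", "food_05"}, "dinner": {"food_05"}}
example_vector = encode_report(example_report)
```

Because the same food item reported at two different meals sets two different variables, the encoding keeps the time-of-day information that a daily frequency count would discard.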
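The five-fold partition mentioned above can be sketched as follows. The shuffling seed and function names are assumptions for illustration, and the sketch only shows how the non-overlapping subsets and the corresponding training/held-out splits are formed, not how the study's software tunes the penalty.

```python
import random


def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint subsets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]


def train_test_splits(n_rows, n_folds=5, seed=0):
    """Yield (train, held_out) index lists, one pair per fold."""
    folds = k_fold_indices(n_rows, n_folds, seed)
    for held_out in range(n_folds):
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, folds[held_out]


folds = k_fold_indices(6585)
splits = list(train_test_splits(6585))
```

Each observation is held out exactly once, so a performance measure averaged over the five held-out subsets uses every observation for evaluation without ever scoring a model on data it was fitted to.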

Results
Approximately 3.18% of the WebCAAFE reports over the 2013-2015 period were excluded from analysis due to implausibly low (< 4) total frequency of foods and beverages reported for 24 h, resulting in the analytical sample size of 6585 (Table 1). There were fewer children at the extremes of the age distribution. Also, only a small fraction of the children went to all-day school, and only 1.42% were underweight. Overweight (including obesity) was found in about 35% of the participants.
Lasso-based odds ratios (OR) for multiple binary logistic regression produced 192 P-values but only those ≤ 0.05 are presented in Table 3.
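For readers less familiar with how an odds ratio and its P-value follow from a logistic coefficient, a minimal sketch is given below. It uses the standard Wald formulas and hypothetical inputs. The study's own estimates come from the Stata routine named in the Methods, whose standard errors may be computed differently.

```python
import math


def odds_ratio_summary(coefficient, standard_error, z_crit=1.96):
    """Odds ratio, 95% Wald confidence interval and two-sided P-value."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z_crit * standard_error)
    upper = math.exp(coefficient + z_crit * standard_error)
    z = coefficient / standard_error
    # Two-sided P-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return odds_ratio, (lower, upper), p_value


# Hypothetical coefficient and standard error for one MSFI.
example_or, example_ci, example_p = odds_ratio_summary(math.log(2.0), 0.2)
```

An odds ratio above 1 with a confidence interval excluding 1 corresponds to a P-value below 0.05, which is the reporting threshold used for the table of lasso-based odds ratios.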
Eating a traditional Brazilian breakfast including bread/biscuits and milk, possibly with chocolate, was associated with a significant reduction in the risk of being overweight/obese. Reporting consumption of fried food or cream biscuits for the morning snack was also associated with lowering this risk, whereas eating meat or chicken at lunch increased the risk, as did nuggets consumed for the afternoon snack. The children eating sweets or cream biscuits during the latter snack had lower odds of being overweight/obese. Canned foods, eggs, seafood, chips, or similar products consumed for dinner increased the chance of being overweight/obese, whereas eating cooked beans had the opposite effect. Vegetables consumed at the night snack were associated with an increase in the odds of being overweight/obese, while eating cream biscuits for this snack pointed to a decrease in these odds. The lasso-selected variables predictive of overweight/obesity were further analyzed in multiple binary logistic regression to look for time-of-day patterns in consuming specific foods or beverages associated with the outcome (Table 4).
Eating cereals and drinking milk at lunchtime may be indicative of late breakfast, as these items are typically consumed in the first morning meal. A small group of 38 (17 + 21) children with these MSFIs had their risk of being overweight increased by 24.7% and 17%, respectively, compared to those with different MSFIs (Table 4).
Eating green leaves (salad) only after dinner may be due to dieting, and was associated with an additional overweight risk of 15.2% compared to those who did not share this eating pattern.The same magnitude of effect was found for eating nuggets for lunch.
Drinking milk exclusively for dinner may be indicative of dieting, and increased the risk of being overweight by 13.8% in comparison to all other children (Table 4). A slightly smaller effect was found for eating cooked beans at morning snacks, suggesting an anticipated lunch, and for consuming vegetable soup at major meals, possibly indicative of dieting. Eating instant pasta exclusively after dinner may be suggestive of fast food being prepared for late dinner, and added 12.3% of overweight risk compared to all other eating patterns (Table 4). Having sweets on the lunch menu only was associated with a risk increase of 11.7%. Eating canned food as an afternoon snack may be indicative of a fast-food solution for late lunch, with an 11.3% risk increase. A similar risk increase was found for eating vegetables for major meals, possibly associated with dieting. The 8.5% risk increase for eating fish/seafood exclusively at dinner may be due to the local cultural preference for frying these foods, thus adding a significant amount of fat.
In total, 940 (14.6%) of the children belonged to the patterns with increased overweight/obesity risk, with a median increase in the probability of overweight of 13.5% and an interquartile range of 11.5% to 15.2% (Table 4).
Finally, two multiple logistic regression models with overweight as the dependent variable were compared in terms of the AUROC: using five control variables as the only independent predictors versus adding lasso-selected variables. The former model showed an AUROC of 0.55 (95%CI 0.52-0.56), compared to the latter with a significantly improved AUROC of 0.62 (95%CI 0.60-0.63).
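An increase in the probability of overweight attributable to an eating pattern can be obtained from a fitted logistic model in several ways. The sketch below shows one common choice, the average marginal effect of a binary indicator, under the assumption that the pattern is coded as a 0/1 column. It is not necessarily the exact contrast computed in the study, and the coefficients and rows are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def average_marginal_effect(intercept, coefs, rows, pattern_index):
    """Mean change in predicted probability when the pattern goes from 0 to 1.

    Every child's other covariates are kept as observed; only the pattern
    indicator is switched on and off.
    """
    differences = []
    for row in rows:
        with_pattern = list(row)
        without_pattern = list(row)
        with_pattern[pattern_index] = 1
        without_pattern[pattern_index] = 0
        p_with = sigmoid(intercept + sum(c * x for c, x in zip(coefs, with_pattern)))
        p_without = sigmoid(intercept + sum(c * x for c, x in zip(coefs, without_pattern)))
        differences.append(p_with - p_without)
    return sum(differences) / len(differences)


# Hypothetical fitted model with columns [control, pattern].
example_rows = [[0, 0], [1, 0], [0, 1], [1, 1]]
example_ame = average_marginal_effect(-0.6, [0.1, 0.5], example_rows, pattern_index=1)
```

Unlike an odds ratio, this quantity is expressed on the probability scale, which is the scale on which the percentage increases above are reported.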
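The AUROC used for this comparison equals the probability that a randomly chosen overweight child receives a higher predicted score than a randomly chosen child without overweight, with ties counted as one half. A minimal sketch of that definition follows. The function name and the toy scores are illustrative, and the study itself used the cvauroc Stata program with cross-validation.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one case and one non-case")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy predicted probabilities and observed outcomes.
toy_auc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

A value of 0.5 corresponds to a model that ranks children no better than chance, which puts the reported values of 0.55 and 0.62 in perspective.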

Discussion
To the best of the author's knowledge, this is the first study to derive MSFIs predictive of overweight directly by penalized binary (logistic) regression using a lasso. It allowed simultaneous analysis of 192 binary exposure variables and identified 12 eating patterns with an increased risk of being overweight.
A major methodological shift in the present study compared to other eating pattern studies was to search for the patterns based on their predictive value instead of their homogeneity regarding food/beverage intake. The latter needs an additional step to relate the patterns to the outcome of interest, and their extraction is not guided by the power to separate those with versus those without the outcome. Principal component, common factor, cluster, and latent class analyses are the most popular dimension-reduction techniques to identify food intake, physical activity, and other lifestyle patterns 3,4.
However, it is discriminant analysis that searches for distinct combinations of independent variables regarding the outcome. Binary logistic regression breaks down with hundreds of exposure variables, especially when many of these are low-frequency binary variables. In other words, there is a need for sparse discriminant analysis to identify dietary patterns in this situation 22. A modest but significant increase in the predictive accuracy of overweight/obesity with the addition of lasso-selected variables points to the potential of this analysis to improve the classification of individuals at risk of excess weight during childhood and early adolescence.
Machine learning algorithms have sometimes been criticized as a "black box" approach, but they have proved indispensable in genomics, chemometrics, and other "big data" research, especially when the number of subjects/units relative to the number of variables is small. With a time-of-day or meal tag, the number of variables easily reaches hundreds, not to mention other lifestyle indicators such as physical activity and screen behavior.
Among many machine learning algorithms such as optimal pruning, bagging, boosting, and neural networks, lasso was chosen for its solid statistical theory, as opposed to the predominantly computational definitions of some of these algorithms. Moreover, the lasso has been extensively tested and widely applied in many areas of science 17.
Most of the results from the present study are in line with those of other dietary pattern studies 3,4 regarding obesogenic foods, such as high-fat and/or fast-food items (nuggets, instant pasta, fried fish/seafood, and canned food) and dieting (consuming green leaves/salad after dinner, vegetables and vegetable soup for major meals). Postponing meals may result in consuming MSFIs later in the day or skipping them altogether. The present study showed that late breakfast, lunch, and dinner were all associated with increased overweight/obesity risk, as was anticipating lunch. This is in line with other studies that found evidence for the importance of meal timing, such as chronobiological studies related to diet.
Negative feedback regulates the intake of macronutrients both in the short and long term, with the latter peaking at two days 23. Eating regular meals and snacks with higher frequency and a stable schedule helps regulate weight 24. In adults, higher energy intake at lunchtime and lower intake in the evening are both associated with a lower risk of overweight/obesity 25.
A similar conclusion can be drawn for schoolchildren whose dietary patterns include energy-dense traditional Brazilian foods, such as rice and cooked beans, consumed at midday 26. In mice, eating during nighttime disturbed circadian rhythms and led to leptin resistance, physical inactivity, excessive eating, adiposity increase, metabolic disorders, and obesity 27. The latter points to likely physiological mechanisms in humans too.
The results of a study based on daily food/beverage intake from the same data used in the present study showed no statistically significant association between the dietary patterns extracted by latent class analysis and child weight status 28. However, when the data were meal-tagged and analyzed with lasso, the association was found for several patterns, thus reinforcing the importance of the time-of-day analysis. The findings of such analysis are important for developing effective interventions toward a healthy diet that take into account both group-specific (age, sex, income, culture) and time-of-day-specific dietary patterns 28,29.
A notable sparsity of time trend studies on food consumption in early school age makes it difficult to compare the present study findings with those of other trend analyses, such as similar studies with adolescents. In Brazil, unhealthy eating habits showed a significant increase, especially among adolescents from low-income families 30. The consumption of ultra-processed foods has increased at the expense of unprocessed foods such as rice, beans, and fruits, although sugary drink intake has decreased in the last decade 31. A similar decreasing trend in the consumption of sugary drinks was found in the USA among children and adolescents 32,33. European adolescents increased their intake of fruits and vegetables during the 2010s 34. In most developed countries, the consumption of dairy products decreases significantly in adolescence compared to childhood 35.
Among the limitations of the present study, the lack of food/beverage quantity and possible information bias, inherent to all FFQs, should be kept in mind. Also, each child responded about his/her diet for only one 24-hour period, but the representativeness of the response on the population level should have been maintained as the questionnaires applied referred to every day of the week and one weekend day (Sunday). Furthermore, the meaning assigned to the MSFIs was qualitative rather than quantitative, and thus prone to misinterpretation, in particular concerning dieting. Also, the values of AUROC obtained in the present study were still rather low for individual classification, and require further investigation and improvement before routine use in school settings.
The strengths of this study include large coverage (95%) of the target population and large sample size, thus resulting in large statistical power to detect relatively small associations between dietary patterns and overweight/obesity, as well as the use of an amply validated instrument.

Conclusions
Late breakfast, lunch, and dinner were all associated with increased overweight/obesity risk, as was an anticipated lunch. Time-of-day or meal-tagged food/beverage intake results in a large number of variables whose predictive patterns regarding weight status can be analyzed by lasso. Such analysis may identify patterns that cannot be found by other popular statistical methods such as binary logistic regression.

Collaborations
All authors were involved in analyzing the studies, reviewing and interpreting the results, and writing the manuscript. All authors take responsibility for all aspects of the reliability and freedom from bias of the data presented and their discussed interpretation.

Table 1.
Socio-demographic characteristics and weight status of 7-11-year-old schoolchildren from public schools in Florianopolis, Brazil.

Table 2.
Percentage of top five foods, highlighted in bold letters, consumed for each meal/snack.

Table 3.
Lasso-based odds ratios for overweight/obesity with P-value < 0.05 among 192 binary independent food/beverage variables, controlled for age, sex, school shift, family income, and survey year.

Table 4.
Statistically significant (P < 0.1) average marginal increase in the probability of overweight regarding selected eating patterns, based on post-hoc multivariate logistic regression contrasts (N = 6,420).