MACHINE LEARNING TECHNIQUES APPLIED TO LIGNOCELLULOSIC ETHANOL IN SIMULTANEOUS HYDROLYSIS AND FERMENTATION

Abstract - This paper investigates the use of machine learning (ML) techniques to study the effect of different process conditions on ethanol production from lignocellulosic sugarcane bagasse biomass using S. cerevisiae in a simultaneous hydrolysis and fermentation (SHF) process. The effects of temperature, enzyme concentration, biomass load, inoculum size and time were investigated using artificial neural networks, a C5.0 classification tree and random forest algorithms. The optimization of ethanol production was also evaluated. The results show that ML techniques can be used to evaluate the SHF (R² between actual and model predictions higher than 0.90, absolute average deviation lower than 8.1% and RMSE lower than 0.80) and to predict optimized conditions that are in close agreement with those found experimentally. Optimal conditions were found to be a temperature of 35 °C, an SHF time of 36 h, enzymatic load of 99.8%, inoculum size of 29.5 g/L and bagasse concentration of 24.9%. The ethanol concentration and volumetric productivity for these conditions were 12.1 g/L and 0.336 g/L.h, respectively.


INTRODUCTION
One of the most promising methods to obtain renewable energy in an environmentally sustainable way is to produce it from cheap and abundant biomass sources like lignocellulosic materials. Bioethanol production from waste crop and crop residues could potentially surpass 491 GL/year worldwide. Under such circumstances, ethanol production from lignocellulosic biomass is a promising technology, and several techniques have been proposed to reduce the recalcitrance of the lignocellulosic matrix structure, reduce the cost of enzyme production and improve enzymatic hydrolysis and fermentation (Chen et al. 2014; Wu et al. 2011; Karlsson et al. 2014; Prado et al. 2014). Although companies and academia have made considerable progress, enzymatic hydrolysis remains one of the critical bottlenecks as a result of the large amounts of enzyme required for hydrolysis, the complexity of mass transfer and the large number of chemical reactions with the generation of inhibition products (Khare et al. 2015; Goldbeck et al. 2013). The combination of hydrolysis and fermentation in a simultaneous process represents one strategy that can lower capital cost, facilitate the recovery of the product and reduce contamination and inhibition (Ohgren et al. 2007; Ask et al. 2012; Saha et al. 2011). Therefore, a large number of studies have been conducted to evaluate the effects of different biomasses, solid loading, inhibition and hydrolysis conditions on the feasibility of ethanol production by simultaneous saccharification and fermentation (SSF) (Cuevas et al. 2015; Asada et al. 2015; Narra et al. 2015; Gu et al. 2014; Chong et al. 2014).
There is great interest in using machine learning (ML) procedures like artificial neural networks (ANNs), classification trees (CTs) and random forests (RFs) in the context of achieving feasible production of ethanol from lignocellulosic biomass by SSF. However, few studies are available that use ANN to describe the reduction in the cost of enzyme production or to improve the steps of enzymatic hydrolysis and fermentation (Vani et al. 2015; Das et al. 2015; Giordano et al. 2013; Gitifar et al. 2013), and no study on methodologies other than ANN has yet been reported. Consequently, the aim of this study was to use ML techniques (ANN, RF and CT) to model the effects of temperature, time, biomass and inoculum size on ethanol fermentation by SSF. The optimization of ethanol production was also evaluated.

Raw Materials
All the ethanol fermentations were performed using the enzyme complex (Enz) produced in situ by extraction of the enzyme content provided by solid state cultivation (SSC) and exploded sugarcane bagasse (Bag) with a severity factor (SF) of 3.4 donated by the Centro de Tecnologia Canavieira (CTC, Brazil) which contained about 50% water, 30.0% cellulose, 7.3% hemicellulose, 11.2% lignin and 1.5% ash (content analysis performed as described in Browning, 1967). The SF of 3.4 was chosen from a previous study where Bag samples with SFs of 3.4 and 3.8 were tested, and the best result was obtained using the sample with an SF of 3.4 (data not shown here). The Enz was produced using the same Bag (SF of 3.4) and rice bran (RB), as described below. The RB was purchased from Cocal Foods (Uberlândia-MG, Brazil). The raw materials were stored at -18 °C and subsequently milled and sieved through a 1.8 mm mesh prior to their use as samples in the experiments.

Microorganisms, Fermentations and Enzyme Complex
The SSF was performed using Saccharomyces cerevisiae Y904 (AB Brasil, Pederneiras-SP, Brazil) and an enzyme complex obtained from SSC using a strain of Aspergillus niger reported in a previous study (Fischer et al. 2014). The conditions used in SSC, enzyme production and SSF are described in Table 1.

Experimental Strategy and Overview of Proposed ML Methods
To model the effects of process variables (time, load of bagasse, enzyme, temperature and inoculum concentrations) on ethanol production and find the optimized conditions of SSF, a total of 17 experimental runs with different sampling times were performed, and a total of 1560 experimental points expressing the evolution of cells and SSF products were collected. The operational conditions used in the runs are presented in Table 2.

Process step: description
SSC: A. niger cells were produced by submerged fermentation at 30 ± 2 °C, agitated at 150 rpm in a rotary shaker using Czapek medium composed of (g/L): NaNO3 (2.0), K2HPO4 (1.0), MgSO4 (0.5), KCl (0.5), FeSO4 (0.01) and glucose (20.0). After 48 h of submerged fermentation the cells were harvested by centrifugation at 8000 g for 10 min and the cell pellets were washed twice, re-suspended in sterile water and used to start the SSC (1.0 × 10^8 spores/g of solid medium). The SSC was done in a 0.25 L conical flask reactor at 30 ± 2 °C containing 40 g of solid medium (composed of 40% Bag and 60% RB) and 40 g of water containing the harvested cells.
Enz production: Forty (40) g of solid fermented medium was mixed with 50 mL of 1.0% (w/w) aqueous Tween 80 at 30 °C in a 500 mL closed Duran bottle. The mixture was agitated for 10 min and the extracted slurry was filtered to collect the Enz (liquid fraction).

ANN Model
The ANN model containing three layers was implemented and used to find optimal conditions employing R software and the library AMORE (http://cran.r-project.org/web/packages/AMORE/) as follows. First, the values of variables and responses were normalized using z-score standardization (calculated for each data set of variables and response by subtracting its mean value and dividing the result by the standard deviation). Second, the data set was divided into two random subsets: a training data set (2/3 of the original experimental data set) and a test data set (1/3 of the original experimental data set). Third, a total of 500 ANNs were tested, varying the number of hidden neurons and the transfer functions (purelin, sigmoid and tansig) in the layers to optimize the ANN for both data sets (training and test), that is, to achieve a coefficient of determination (R²) close to 1 and to reduce the root mean squared error (RMSE) and the absolute average deviation (AAD), calculated according to Equations (2), (3) and (4), respectively, where n is the number of points, Y_i^calc is the predicted value, Y_i^exp is the experimental value, Ym is the average value of all experimental data and MSE is the mean square error. Fourth, the connection weights of the ANN were used to calculate the effect of features (variables of the process) on ethanol production, as described in Equation (5) (Garson 1991), where Ij is the relative importance of the j-th input variable on ethanol concentration, Ni and Nh are the number of input and hidden neurons, Ws are the connection weights, the subscripts i, h and O refer to the input, hidden and output layers, respectively, and the subscripts k, m and n represent the input, hidden and output neurons, respectively. Fifth, the optimized conditions related to ethanol production were determined using the ANN and an R script for ant colony optimization (ACO) written as described in Dorigo et al. (1996).
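The standardization step and the three goodness-of-fit criteria named above can be sketched as follows. This is a minimal Python illustration, not the authors' R code; because Equations (2) to (4) are not reproduced in this text, the formulas below are the commonly used definitions of R², RMSE and AAD (in percent) and should be read as assumptions. The function names are hypothetical.

```python
import math

def zscore(values):
    """Standardize a list: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def r_squared(y_exp, y_calc):
    """Coefficient of determination (assumed form: 1 - SS_res / SS_tot)."""
    y_mean = sum(y_exp) / len(y_exp)
    ss_res = sum((e - c) ** 2 for e, c in zip(y_exp, y_calc))
    ss_tot = sum((e - y_mean) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot

def rmse(y_exp, y_calc):
    """Root mean squared error: square root of the mean square error (MSE)."""
    mse = sum((e - c) ** 2 for e, c in zip(y_exp, y_calc)) / len(y_exp)
    return math.sqrt(mse)

def aad(y_exp, y_calc):
    """Absolute average deviation in percent (assumed form; undefined if any y_exp is 0)."""
    n = len(y_exp)
    return 100.0 / n * sum(abs(c - e) / abs(e) for e, c in zip(y_exp, y_calc))
```

In a model search of the kind described, each candidate network would be scored with these three functions on both the training and the test subset.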
The ACO was used with different randomly initiated input variables to secure the solution corresponding to the best multi-objective optimizations. Accordingly, the optimal ethanol concentration for the optimal volumetric ethanol productivity and the optimal ethanol conversion for the lowest time were found. The volumetric productivity was calculated as the ratio between the ethanol concentration and time of fermentation, and the ethanol conversion was defined as described in equation 6 (Naveen et al.
where Et is the ethanol concentration at time t, Et0 is the initial ethanol concentration, 0.51 is the conversion factor for glucose to ethanol based on the stoichiometry of yeast, f is the glucan fraction of dry biomass, Dry_Bag is the dry biomass and 1.11 is the conversion factor for glucan to glucose.
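The two response quantities defined above can be illustrated with a short Python sketch. The volumetric productivity follows directly from the definition in the text. Equation 6 is not reproduced here, so the conversion function assumes the widely used form (Et - Et0) / (0.51 · f · Dry_Bag · 1.11) built from the symbols listed above; the function names and the assumption that Dry_Bag is expressed in g/L are hypothetical.

```python
def volumetric_productivity(ethanol_g_per_l, time_h):
    """Ratio between ethanol concentration (g/L) and fermentation time (h)."""
    return ethanol_g_per_l / time_h

def ethanol_conversion(et, et0, glucan_fraction, dry_bag_g_per_l):
    """Ethanol produced relative to the theoretical maximum from glucan.

    Assumed form of Equation 6:
    0.51 = glucose-to-ethanol factor, 1.11 = glucan-to-glucose factor.
    """
    theoretical = 0.51 * glucan_fraction * dry_bag_g_per_l * 1.11
    return (et - et0) / theoretical

# Reported optimum: 12.1 g/L of ethanol after 36 h of SSF.
print(round(volumetric_productivity(12.1, 36.0), 3))
```

With the experimentally validated optimum (12.1 g/L after 36 h), the ratio reproduces the reported volumetric productivity of 0.336 g/L.h.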

Random Forest (RF) Model
RF is a non-parametric ML algorithm derived from classification and regression trees and performs very well when compared with ANN and other ML methodologies. RF characteristics include robustness to noise, tuning simplicity and the ability to handle high-dimensional non-linear problems (Breiman, Friedman and Stone 1984; Breiman 2001; Seyedosseini and Tasdizen 2015; Liaw and Wiener 2002). In this work, RF was implemented using the RF library for the R language (Random Forest) and was used to describe the SSF and to predict the influence of variables using the available measure of increase of node purity according to MSE (IncMSE) present in the software. To ensure good predictive performance of the RF, a total of 1000 RFs containing different numbers of trees and variables in each of the branches (parameters of the method) were evaluated. The evaluation of the optimal RF model was conducted using the same division of experimental points (two random data sets, containing 2/3 for training and 1/3 for test) and the same criteria described in the ANN methodology (i.e., reduce RMSE and AAD as much as possible, and obtain an R² as close as possible to 1 in both the training and test data sets).
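The idea behind an MSE-based importance measure can be sketched independently of any particular library. The Python example below is an illustration under stated assumptions, not the computation performed by the R package: it shuffles one predictor at a time and records the average increase in MSE of an arbitrary fitted model, whereas the R implementation works on out-of-bag samples of each tree. The names `mse` and `permutation_importance`, the `model` callable and the toy data are all hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of a callable model over rows X and targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    increases = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between predictor j and the response
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            total += mse(model, X_perm, y) - base
        increases.append(total / n_repeats)
    return increases

# Toy check: the response depends only on the first predictor.
X = [[float(i), float(i % 3)] for i in range(30)]
y = [2.0 * row[0] for row in X]
importances = permutation_importance(lambda row: 2.0 * row[0], X, y)
```

A predictor that the model ignores yields an increase of zero, while shuffling an informative predictor raises the MSE; ranking the increases gives the variable ordering.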

C5.0 Model
C5.0 has become the industry standard for producing decision trees. It is based on the concept of entropy of information and recursively separates observations into branches to construct a tree based on rules that are logically understood (Mistikoglu et al. 2015; Lantz 2013). In this work, C5.0 was used as follows: (a) the ethanol concentration was described in three classes: low, if it was below the first quartile, high, if it was equal to or greater than the fourth quartile and medium, if it was between the first and fourth quartile; (b) a C5.0 script was written using the default library for R (http://cran.r-project.org/web/packages/C50/) for ranking the variables of the process based on their ability to partition the data and for finding the rules for the correct classification of ethanol production.
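The entropy-based splitting idea and the three-class labelling can be sketched in Python as follows. This is a simplified illustration, not the C5.0 algorithm itself, which adds refinements such as pruning that are not shown. The cut-off values are passed in explicitly because the exact quartile thresholds are not restated here; all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction obtained by partitioning `labels` into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

def label_ethanol(value, low_cut, high_cut):
    """Assign low / medium / high from two cut-offs (assumed to be quartile-based)."""
    if value < low_cut:
        return "low"
    if value >= high_cut:
        return "high"
    return "medium"
```

A candidate split on a process variable (for example, time above or below some threshold) would be scored by `information_gain`; a split that separates the classes perfectly recovers the full entropy of the parent node.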

Analytical Methods
The cell concentrations in SSF and those required to begin SSC were determined by counting in a Neubauer chamber and by estimation from the optical density at 600 nm, respectively. The estimation methodology used a correlation determined a priori between the optical density and the number of colonies obtained using a spread plate methodology after 48 h of incubation at 40 °C. The inoculated plates contained Czapek medium with agar and the same nutrient concentrations described above. The sugar and ethanol concentrations were determined by high performance liquid chromatography (HPLC; Shimadzu LC-20A) equipped with a refractive index detector, a Supelcogel Ca column operated at 80 °C and deionized water (pH 7.0) as the mobile phase at a flow rate of 0.5 mL/min. Table 3 presents some descriptive statistics of the experimental runs, and Figures 1 and 2 show the results of the sugars and metabolites of fermentation detected during the SSF runs, respectively. Inspection of this table and the figures reveals that yeast cell growth was low, and that the accumulation of arabinose and glycerol was lower than the production of xylose, ethanol and acetic acid. These results suggest active utilization of glucose and that the concentrations of the metabolites and pentoses found are likely to have a significant impact on microorganism viability and ethanol production. Loss of viability of cells during fermentation was not observed (data not presented in figures).
According to the literature, acetic acid can inhibit the cell metabolism as a result of an increase in the ATP required for cell maintenance (Mariorella et al., 1983; Narendranath et al., 2001; Sousa et al., 2012), xylose can inhibit the pathway of glucose-phosphorylating enzymes (Fernandez et al., 1985), arabinose can positively affect the enzymatic hydrolysis of lignocellulosic biomass by reduction of crystallinity (Fengcheng et al., 2013) and glycerol is essential for balancing the redox potential in the absence of oxygen and for osmoregulation of the cell (Neivoig et al., 1997). The high concentrations of inhibitors found suggest that a two-step configuration, separate hydrolysis and fermentation (SHF), may be more favourable than the one-step SSF. However, the process operation in SSF or SHF modes is an open question. Although SSF has been widely described as more favourable than SHF because it results in an improved ethanol yield by reducing product inhibition and a reduction in cost as there is no need for separate reactors (Narra et al. 2015; Ask et al. 2012), both configurations have advantages and disadvantages. According to the literature, the inhibition of cellulase activity by accumulated glucose (Gosh et al., 1982; Alfani et al., 1990; Ohgren et al., 2007) is not observed with the latest generation of commercial enzymes, which work equally well in SSF and SHF (Pachos et al., 2015). On the other hand, the drawback of suboptimal temperatures in SSF is expected to be minimized by using thermotolerant microorganisms (Narra et al., 2015; Naveen et al., 2011).
Figures 1 and 2 (sugars and fermentation metabolites detected during the SSF runs): the captions refer to Table 2, and the data represent the average of measurements in duplicate.
Table 4 presents the Pearson correlation coefficients among the variables obtained.
The correlations obtained are important to better evaluate the process and to select the variables with high predictive power for modelling ethanol production using ML methodologies. A high correlation between the pairs ethanol-glycerol, inoculum-cell concentration, xylose-arabinose, xylose-bagasse and arabinose-bagasse is observed. The bivariate correlations also show that: a) there is not a high correlation between cell concentration and variables distinct from the inoculum; b) ethanol is associated with increased time, cell, xylose, arabinose, acetic acid and glycerol and decreased glucose. According to the correlation coefficients, the variables time, acetic acid, glucose, xylose and arabinose cannot be chosen simultaneously to describe the ML models because they are highly correlated, indicating redundant information. Consequently, the ML models used in this study have time, temperature and the concentrations of enzyme, inoculum and bagasse as independent (input) variables to describe the ethanol concentration. Table 5 summarizes the best models found to describe ethanol using ANN and RF, and Figure 3 presents details of the RF adjustment. Based on the low values of RMSE and AAD found, the deviations of the models from the experimental results are satisfactory, and, according to the high values of R², the accuracy of the RF and ANN in predicting future outcomes is also satisfactory. According to these values, both RF and ANN fitted the experimental data well. In addition, it should be noted that ANN produced lower values of AAD than RF to describe ethanol. For this reason, ANN was selected for additional studies.
Table 4 note: * Statistically significant correlations (P < 0.05). Codes of the variables are presented in Table 3.
Figure 3: Results of the adjustment of the RF methodology: the importance of effects predicted using RF (right), the worst and best results according to the parameter mtry (left). Codes of the variables are presented in Table 3.
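The screening of redundant variables by bivariate correlation described above can be sketched in Python. The Pearson coefficient is the standard definition; the 0.9 threshold, the function names and the toy columns are hypothetical and are not taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.9):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Toy data (not experimental values): acetic acid rises almost linearly with time.
toy = {
    "time": [1.0, 2.0, 3.0, 4.0],
    "acetic_acid": [2.0, 4.0, 6.0, 8.1],
    "temperature": [30.0, 35.0, 30.0, 35.0],
}
flagged = redundant_pairs(toy)
```

In this toy case only the time and acetic acid pair is flagged, mirroring the reasoning that highly correlated variables carry redundant information and should not enter the model together.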

RESULTS AND DISCUSSION
According to the ANN, the importance of time, enzyme, bagasse, inoculum and temperature calculated by Equation 5 were 50.1, 18.3, 17.7, 8.1 and 5.9%, respectively. The RF ranking of the importance of variables on ethanol production provided by IncMSE (an available measure present in the RF algorithm that represents the increase of MSE when each predictor is replaced in turn by random noise) is the same as that observed using the ANN (Figure 3, left). These results indicate that all of the variables used are important to describe the ethanol concentration during SSF. The poor correlation between these input variables and ethanol described by the bivariate correlations (Table 4) and the importance of these variables described by ANN and RF suggest that highly nonlinear interactive effects are found in SSF. Complex interactive effects between the input variables are expected in SSF using a high load of solids. Bellido et al. (2011), studying the inhibition effects in ethanol production from wheat straw using Scheffersomyces stipitis, found a synergistic inhibition effect between acetic acid and furaldehydes. Pietrzak and Kawa-Rygielska (2015), in a study using a high concentration of solid biomass and saccharification of starch, found lower dynamics of ethanol production caused by the synergistic stressing action of sugars and ethanol.
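The weight-based importance attributed to Garson (1991) can be illustrated with a short Python sketch for a network with one hidden layer and a single output. The equation used in the study is not reproduced in this text, so the code follows the usual statement of Garson's algorithm (absolute input-hidden weights multiplied by absolute hidden-output weights, normalized per hidden neuron, then summed per input) and ignores bias terms; the function name and the example weights are hypothetical.

```python
def garson_importance(w_input_hidden, w_hidden_output):
    """Relative importance (%) of each input from ANN connection weights.

    w_input_hidden[i][h]: weight from input i to hidden neuron h.
    w_hidden_output[h]:   weight from hidden neuron h to the single output.
    """
    n_in = len(w_input_hidden)
    n_hid = len(w_hidden_output)
    shares = [0.0] * n_in
    for h in range(n_hid):
        # Contribution of each input routed through hidden neuron h.
        contrib = [abs(w_input_hidden[i][h]) * abs(w_hidden_output[h])
                   for i in range(n_in)]
        total = sum(contrib)
        for i in range(n_in):
            shares[i] += contrib[i] / total
    grand = sum(shares)
    return [100.0 * s / grand for s in shares]

# Two inputs, two hidden neurons (illustrative weights only).
example = garson_importance([[1.0, 2.0], [3.0, 2.0]], [1.0, -1.0])
```

The returned percentages always sum to 100, which is why the five importances quoted for the ANN can be read as shares of a whole.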
According to the C5.0 methodology, different combinations of variables are able to yield a high production of ethanol (Table 6), and the importance of time, enzyme, bagasse, inoculum and temperature were 100, 100, 99.1, 90.3 and 68.9%, respectively. The percentages indicate the number of times each variable was used to describe the rules of classification presented in Table 6. Despite the fact that C5.0 is related to a qualitative analysis, the results found are very close to the results obtained using ANN and RF, suggesting that this methodology, which has not been used before in fermentation studies, can be very useful to describe SSF and other kinds of fermentation. The comparison between C5.0, RF and ANN shows that: a) all variables tested are important to describe ethanol production; b) the relative importance of bagasse and enzyme are nearly the same in value and rank; and c) the relative importance of time, inoculum and temperature are nearly in the same order as that found using ANN and RF.
The optimum values predicted by ACO-ANN for simultaneous optimization of volumetric productivity and ethanol concentration were a volumetric productivity, an ethanol concentration and a conversion of 0.345 g/L.h, 12.1 g/L and 0.29 g/g, respectively, at the set input conditions of 99.8% enzyme, 35 °C, 29.5 g/L of inoculum, bagasse concentration of 24.9% and 36 h of SSF. The experimental validation under optimized conditions determined that the volumetric productivity and ethanol concentration were 0.336 g/L.h and 12.1 g/L, respectively, which is in close agreement with the ACO-ANN results. In terms of the error of these results, it is important to note that, according to the theory of error propagation, the magnitudes of the errors in inoculum size, enzymatic loading and bagasse concentration are 0.15 g/L, 0.14% and 1 g/L, respectively. The comparison of these results with the literature results demonstrates that the optimization goal was met, since a high concentration of ethanol was obtained at optimized conditions. Das et al. (2015), studying ethanol production by different microorganisms (Scheffersomyces stipitis, Candida shehatae and Saccharomyces cerevisiae) using hyacinth as lignocellulosic biomass and a commercial enzyme, found S. stipitis to be the best microorganism, with an optimal ethanol concentration of 10.4 g/L (ethanol conversion of 0.104 g/g) after 36 h. Asada et al. (2015), using thermotolerant yeast S. cerevisiae BA11, commercial enzyme and cedar lignocellulosic biomass, obtained their best results of 9.96 g/L of ethanol (conversion not reported) in a batch process of 24 h and 26.5 g/L (conversion of 0.741 g/g) after 60 h in a fed-batch process using the same yeast and detoxification to reduce inhibition effects. Swain and Khrishnan (2015), studying ethanol production by S. cerevisiae and Candida tropicalis using commercial enzyme in a SHF (72 h of hydrolysis and 18 h of fermentation) and rice straw, found C. tropicalis to be the best microorganism, with an optimal ethanol concentration and conversion of 26.2 g/L and 0.992 g/g, respectively.
The optimum values predicted by ACO-ANN for the optimization of ethanol conversion were 0.45 g/g and 11.5 g/L of ethanol at the set input conditions of 86.0% enzyme, 33.7 °C, 34.2 g/L of inoculum, bagasse concentration of 15.1% and 33.7 h of SSF. This value represents a 1.5-fold increase in ethanol conversion compared to that observed in the first optimization. These results also suggest that the ML model proposed is in good agreement with the expected results, which demonstrate that higher ethanol concentrations can be reached without achieving a very high ethanol conversion (Pachos et al. 2015).
Although the potential of ML to predict and optimize lignocellulosic ethanol production was evaluated in a study using traditional microorganisms for both the production of the enzyme complex and ethanol, it could be used directly to optimize other situations. This is important because the production of lignocellulosic ethanol continues to face technical and economic challenges as it seeks a cost-effective process with ethanol concentration and volumetric productivities higher than 4% and 1 g/L.h, respectively (Petersen et al., 2015; Jin et al., 2012; Kang et al., 2015; Raele et al., 2014), which will be possible in the future by several strategies, including fermentations using a single genetically modified yeast able to ferment both C5 and C6 sugars (He et al., 2015; Lever, 2015; Baeyens et al., 2015) and using yeast strains which combine thermotolerance and higher ethanol productivity (Narra et al., 2015; Hasunuma and Kondo, 2012).

CONCLUSIONS
The ML methodologies were able to predict the effects of temperature, bagasse load, inoculum size and enzyme load without requiring knowledge of the kinetics and the inhibition process. In addition, it was shown that the RF and ANN mathematical models are effective in evaluating the production of ethanol. A temperature of 35 °C, SSF time of 36 h, enzymatic load of 99.8%, inoculum size of 29.5 g/L and bagasse concentration of 24.9% were considered to be the optimum for the simultaneous optimization of volumetric productivity and concentration of ethanol, which were found to be 0.336 g/L.h and 12.1 g/L, respectively.