
Identification of commercial blocks of outstanding performance of sugarcane using data mining

ABSTRACT

In order to achieve more efficient agricultural production systems, studies relating to the patterns of influence factors on commercial blocks of outstanding performance can be performed to assist management practices. The performance is considered to be the difference between the yield of a given block and the average yield of the homogeneous group that it belongs to. The methods available to identify these outstanding blocks are usually subjective. The aim of this study was to propose an objective and repeatable approach to identify outstanding performance blocks. The proposed approach consisted of performance determination, using regression trees, and the classification of these blocks by k-means clustering. This approach was illustrated using a sugarcane model. The main factors influencing the tonnes of cane per hectare (TCH) and total recoverable sugar (TRS) yields were found to be crop age and water availability during ripening, respectively. These were used to create potential yield groups, and blocks with high and low performance were identified. The proposed approach was found to be valid in the identification of outstanding sugarcane blocks, and it can be applied to different crops or in the context of precision agriculture.

clustering; regression tree; yield variability

INTRODUCTION

In the search for more efficient agricultural production systems, studies have sought to identify patterns in the environmental and management factors that influence block-to-block yield variability in diverse crops (ANDRIANASOLO et al., 2014; RENAUD-GENTIÉ et al., 2014; ZHANG, 2012; TITTONELL et al., 2008).

This type of study, focusing on specific blocks of outstanding performance, can help in the adoption of more appropriate management procedures. The patterns associated with blocks of superior performance can be used to confirm that certain measures result in greater productivity, while low performance patterns indicate the need for improvement in management, or the identification of specific conditions that should be avoided (TANUSKA et al., 2012; VAZAN et al., 2011). The performance of a given block is considered to be the difference between the yield of the block and the average yield of the homogeneous group that it belongs to.

Despite the benefits of this type of analysis (LAWES & LAWN, 2005), the methods used to determine the potential yield groups and to identify outstanding blocks are subjective and therefore highly dependent on the interpretation of each specialist (TITTONELL et al., 2008).

Given the large number of variables submitted for analysis in this type of study (TANUSKA et al., 2012; VAZAN et al., 2011; LAWES & LAWN, 2005), data mining techniques are a promising alternative due to their ability to address complex databases with noise (HAN et al., 2012), a common feature of commercial block databases. In particular, the regression tree induction technique is able to identify and rank the factors that influence a production system (WITTEN & FRANK, 2011), and may be useful for establishing potential yield groups to determine the performance of blocks.

The same application can also be extended to precision agriculture, where, instead of the variability from block to block, the variability between different areas within the same block is considered. A study by SOUZA et al. (2010) demonstrated the use of a decision tree to identify the main factors affecting the yield variability within a block. The main factor was identified to be the declivity (slope) of the land. Although this factor is not manageable, this approach defined the potential yield groups and identified outstanding points within the block.

Due to the benefits associated with analysing the patterns of blocks of outstanding performance, and the lack of an objective approach for their identification, the aim of this study was to propose an objective and repeatable approach for the identification of sugarcane blocks of outstanding performance.

MATERIAL AND METHODS

The computational package used in this research was JMP version 11.2 (SAS INSTITUTE INC., 2013).

The approach was conducted using a database which included yield values for sugarcane (Saccharum officinarum) blocks, in addition to the specific influencing factors related to this crop, provided by a production unit located in the western region of São Paulo, Brazil. This database contained a total of 2255 records, and each record corresponded to the data from one block, collected by the plant during normal working conditions. In addition to the goal attributes, tonnes of cane per hectare (TCH) and total recoverable sugar (TRS), there were another 65 descriptor attributes related to the environmental and management conditions. The approach described below addressed the two goal attributes independently; the whole process was therefore conducted separately for TCH and TRS.

The proposed approach for the identification of blocks of outstanding performance consisted of several steps to determine the most relevant performance factors (attributes) for each goal (TCH and TRS) and the clustering of blocks, based on these factors.

The aim of determining the performance is to equalise the potential yield of the different blocks, such that they can be compared while taking into account the environment in which they were grown.

The performance of each block, for the TCH and TRS attributes, was calculated as the difference between its yield and the average yield of the potential yield group to which the block belongs, divided by the standard deviation of the group, according to [eq. (1)]. The two performance variables for TCH and TRS were termed ZTCH and ZTRS, respectively. In this way, the average ZTCH or ZTRS of each group had a value of zero and the standard deviation was equal to one.

$$ZPROD_{ij} = \frac{PROD_{ij} - \overline{PROD}_{j}}{S_{j}} \qquad (1)$$

where,

ZPRODij was the standard performance of block i in the potential yield group j;

PRODij was the yield of block i in the potential yield group j;

$\overline{PROD}_{j}$ was the average yield of the potential yield group j, and

Sj was the standard deviation of the group of blocks with the same potential yield group.
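The standardisation in eq. (1) can be illustrated with a minimal Python sketch. The block identifiers, group labels and yield values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardised_performance(records):
    """Return (block_id, group, z) for each (block_id, group, yield) record.

    z = (yield - group mean) / group standard deviation, following eq. (1).
    Every group needs at least two records for the deviation to be defined.
    """
    by_group = defaultdict(list)
    for _, group, yield_value in records:
        by_group[group].append(yield_value)

    # Sample standard deviation (n - 1) is an assumption of this sketch
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}

    return [
        (block_id, group, (y - stats[group][0]) / stats[group][1])
        for block_id, group, y in records
    ]


# Hypothetical TCH values (tonnes per hectare) for two crop-age groups
blocks = [
    ("A", "year 1", 110.0), ("B", "year 1", 120.0), ("C", "year 1", 130.0),
    ("D", "year 5", 50.0), ("E", "year 5", 60.0), ("F", "year 5", 70.0),
]
z_scores = standardised_performance(blocks)
for block_id, group, z in z_scores:
    print(f"{block_id} ({group}): {z:+.2f}")
```

In this toy data set, block D has a much lower raw yield than block A, yet both receive the same standardised performance because each sits equally far below the average of its own group.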

To determine the potential yield groups, we used regression tree induction for the particular goal attribute, thereby identifying and ranking the factors with the greatest influence on the production system. We then selected the factors of greatest importance. Each combination of levels for the selected factors corresponded to a potential yield group.

The regression tree is a non-parametric modelling technique that can be used to explain the responses of a dependent variable (goal attribute) based on a series of continuous or categorical independent attributes. This technique makes recursive divisions in the data based on the independent variables, in order to increase homogeneity in the response of the goal attribute (WITTEN & FRANK, 2011).
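To make the idea of a division that increases homogeneity concrete, the sketch below searches for the single binary split that most reduces the sum of squared errors of the goal attribute. It is only an illustration in plain Python, not the JMP procedure used in this study; the attribute names (age, clay) and all values are hypothetical, and only numeric attributes are handled. A full tree would apply the same search recursively to each resulting subset.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)


def best_split(rows, target, features):
    """Find the split 'feature <= threshold' that most reduces the SSE.

    rows is a list of dicts; returns (feature, threshold, sse_reduction).
    """
    parent_sse = sum_squared_error([r[target] for r in rows])
    best = (None, None, 0.0)
    for feature in features:
        levels = sorted({r[feature] for r in rows})
        # Candidate thresholds are midpoints between consecutive observed values
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            reduction = parent_sse - sum_squared_error(left) - sum_squared_error(right)
            if reduction > best[2]:
                best = (feature, threshold, reduction)
    return best


# Hypothetical blocks: yield falls with crop age and barely depends on clay content
rows = [
    {"age": 1, "clay": 20, "tch": 120.0},
    {"age": 1, "clay": 40, "tch": 118.0},
    {"age": 2, "clay": 20, "tch": 100.0},
    {"age": 2, "clay": 40, "tch": 102.0},
    {"age": 4, "clay": 20, "tch": 70.0},
    {"age": 4, "clay": 40, "tch": 72.0},
]
feature, threshold, reduction = best_split(rows, "tch", ["age", "clay"])
print(feature, threshold, round(reduction, 1))
```

The reduction in squared error attributed to each attribute across all splits is one common way of ranking attribute importance; whether this matches the importance measure reported by the software is not stated in the text.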

After the performance for each block had been determined, the blocks were grouped in order to identify the outstanding blocks. We used the k-means clustering method, which minimises the difference between points within the same group and maximises the difference between the k groups defined by the user (FORGEY, 1965). This process begins by choosing k group centres at random; each record is then allocated to the group whose centre is closest. The centres, representing the average records of the groups, are recalculated, and the records are then moved if the recalculated centre of another group is closer. This process continues until no record changes group.
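The allocation and recalculation steps described above can be sketched in a few lines of Python for one-dimensional performance values. This is a minimal illustration, not the JMP implementation: the performance values are hypothetical, and the repeated random starts (restarts, seed) are an addition of this sketch to reduce the chance of a poor local solution.

```python
import random


def nearest(value, centres):
    """Index of the centre closest to value."""
    return min(range(len(centres)), key=lambda c: abs(value - centres[c]))


def kmeans_1d(values, centres, max_iter=100):
    """Alternate allocation and centre updates until the centres stop moving."""
    centres = list(centres)
    for _ in range(max_iter):
        labels = [nearest(v, centres) for v in values]
        updated = []
        for c in range(len(centres)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty group keeps its previous centre
            updated.append(sum(members) / len(members) if members else centres[c])
        if updated == centres:
            break
        centres = updated
    return centres, [nearest(v, centres) for v in values]


def best_of_restarts(values, k, restarts=50, seed=0):
    """Run k-means from several random starts; keep the tightest solution."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centres, labels = kmeans_1d(values, rng.sample(values, k))
        inertia = sum((v - centres[lab]) ** 2 for v, lab in zip(values, labels))
        if best is None or inertia < best[0]:
            best = (inertia, centres, labels)
    return best[1], best[2]


# Hypothetical standardised performance values for nine blocks
z = [-1.5, -1.2, -1.0, -0.1, 0.0, 0.1, 1.0, 1.3, 1.6]
centres, labels = best_of_restarts(z, k=3)

# Name the clusters by the order of their centres
order = sorted(range(3), key=lambda c: centres[c])
names = {order[0]: "Low", order[1]: "Medium", order[2]: "High"}
classes = [names[lab] for lab in labels]
print(list(zip(z, classes)))
```

Naming the clusters by the order of their centres mirrors the Low, Medium and High labels used in this study.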

Three (k = 3) clusters were selected, and the algorithm was applied exclusively to the performance data, ignoring the descriptor attributes. The clusters formed were specified as either Low, Medium or High performance, according to their average value. The Low and High clusters were considered to contain the blocks of outstanding performance, while the blocks of the Medium cluster were not considered to be outstanding.

Other values of k could be used in the k-means method (MERTLER & VANNATTA, 2013), thereby identifying different groups of outstanding blocks, which can also help with the subsequent search for patterns. For example, two or five clusters could be used, generating High/Low and Very High/High/Average/Low/Very Low groups, respectively. For a k value greater than 4, the use of the Kruskal-Wallis non-parametric test is recommended (GLANTZ, 2011) to verify if the different k values are indeed generating distinct clusters.
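As an illustration of how such a check might look, the sketch below computes the Kruskal-Wallis H statistic for the performance values of each cluster. The cluster contents are hypothetical, no correction for ties is applied, and the closed-form p-value shown is valid only for three groups (two degrees of freedom) under the usual chi-square approximation, which is rough for samples this small.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    # Average rank for each distinct value, so tied observations share a rank
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


# Hypothetical performance values for the blocks of three clusters
clusters = [
    [-1.5, -1.2, -1.0],  # Low
    [-0.1, 0.0, 0.1],    # Medium
    [1.0, 1.3, 1.6],     # High
]
h = kruskal_wallis_h(clusters)
# Chi-square upper tail in closed form; valid only for 2 degrees of freedom
p_value = math.exp(-h / 2)
print(f"H = {h:.2f}, approximate p = {p_value:.4f}")
```

For more than three clusters the p-value would need a general chi-square distribution function, which the Python standard library does not provide.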

RESULTS AND DISCUSSION

The relative importance of the influence factors on TCH, obtained by the regression tree, showed a dominant effect of crop age compared to the other attributes analysed (Table 1). This attribute was found to be responsible for 56% of the total variability explained by the regression tree, whereas the second most important factor, soil texture, was only responsible for 4% of the variability. For this reason, crop age was selected for the preparation of potential yield groups for TCH.

TABLE 1
Relative and accumulated importance of specific influence factors on the yield of sugarcane per hectare (TCH) obtained by the regression tree.

There is known to be a declining trend in sugarcane yield with increasing crop age (INMAN-BAMBER, 2013). However, the magnitude of the difference between this factor and the others was unexpected, and it might have remained undetected without the use of the regression tree or other more elaborate analysis tools (e.g. multiple regression).

As shown in Figure 1a, the performance results for sugarcane yield per hectare (ZTCH), and the clusters, illustrate the potential yield groups created for each level of crop age and its average (equal to zero), in addition to its classification as either Low, Medium or High. The Low cluster was found to represent 24.1% of the total number of blocks, while the High cluster represented 24.2%. The ZTCH values that separated the Low and High performance blocks were -0.64 and 0.65, respectively. We noted that these values were valid for each of the potential yield groups. The participation percentage of both the Low and High clusters increased with increasing crop age, according to the results obtained from the Cochran-Armitage test for trend (Prob. > Z < 0.001 and Prob. > Z = 0.0386, respectively). In general, the participation in the initial years was found to be 18%, while for the more advanced years this value was 32%.
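The trend test mentioned above can be sketched as follows. The counts are hypothetical and are not the data of this study, the equally spaced scores for the crop-age levels are an assumption, and the variance uses the common form without a finite-sample correction.

```python
import math


def cochran_armitage_z(successes, totals, scores):
    """Z statistic of the Cochran-Armitage test for a linear trend in proportions."""
    n = sum(totals)
    p_bar = sum(successes) / n
    # Score-weighted departure of observed counts from those expected without a trend
    t = sum(s * (r - m * p_bar) for s, r, m in zip(scores, successes, totals))
    s1 = sum(m * s for s, m in zip(scores, totals))
    s2 = sum(m * s * s for s, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n)
    return t / math.sqrt(variance)


# Hypothetical counts: blocks in the High cluster out of all blocks, by crop-age level
high_blocks = [18, 24, 32]
all_blocks = [100, 100, 100]
age_scores = [1, 2, 3]  # equally spaced scores are an assumption

z = cochran_armitage_z(high_blocks, all_blocks, age_scores)
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
print(f"Z = {z:.3f}, Prob > Z = {p_one_sided:.4f}")
```

A positive Z with a small upper-tail probability indicates that the proportion of blocks in the cluster rises with the score, which is the kind of pattern reported for crop age.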

FIGURE 1
Values of (a) yield performance of sugarcane per hectare (ZTCH) and (b) yield of sugarcane per hectare (TCH), according to the levels of crop age and clusters.

Figure 1b shows the clusters obtained for ZTCH plotted against the TCH attribute. We observed that there was no longer a fixed value for the segmentation of outstanding blocks, as identified previously. For example, planting blocks with an age of 18 months that were classified as having Low performance (average yield of 63.1 tonnes ha-1) had a higher TCH than blocks of the 5th year that were classified as High (57.3 tonnes ha-1). This is the main benefit of the proposed approach, as the objective and repeatable identification of outstanding blocks takes into account the environmental conditions in which they developed.

The observation that the High and Low clusters covered a similar number of records may be explained by the approximately normal distribution of TCH data (kurtosis of 0.09 and asymmetry of 0.05, where these values are equal to zero in a perfect normal distribution) and the clustering by k-means. As the data were found to be approximately normally distributed, we expected the cut-off points between the clusters to be symmetrical, in addition to being close to that found previously using a box-plot method (GLANTZ, 2011).
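The two shape statistics quoted above can be computed as in the following sketch, which uses simple moment-based estimators. This is an assumption: the text does not state which estimator the software applied, and bias-adjusted versions give slightly different values for small samples. The input values are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based asymmetry (skewness) and excess kurtosis of a sample.

    Both are close to zero for data drawn from a normal distribution.
    """
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    m4 = sum((v - centre) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical symmetric sample: skewness is zero, kurtosis is below that of a normal
skew, kurt = skewness_and_excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"asymmetry = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A negative asymmetry value indicates a longer tail to the left, and a positive value a longer tail to the right.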

Table 2 shows the results of the influence factors on TRS. The ‘water availability’ attribute during the ripening phase had the highest influence on TRS, with 51% of the variability explained by the model, also confirming a widely accepted concept in the literature (VAN HEERDEN et al., 2013). Therefore, this was the factor selected for preparation of the potential yield groups related to TRS.

TABLE 2
Relative and accumulated importance of the influence factors on the yield of total recoverable sugar (TRS) obtained by the regression tree.

Unlike the ‘crop age’ attribute (selected for the formation of potential yield groups for TCH), which has nominal levels, the ‘water availability’ factor is continuous. Therefore, we considered that the number of groups was equal to the number of different values belonging to this attribute. Thus, the average TRS value for each group was determined by [eq. (2)] (Prob. > F < 0.0001).

where,

$\overline{TRS}_{j}$ was the average TRS value for each group j of blocks (kg sugar per tonne of sugarcane), and

DHMj was the water availability in the ripening phase for the same group of blocks (mm day-1).

ZTRS values lower than -1.0 and greater than 0.31 represented the groups of Low and High performance, respectively (Figure 2a). The High cluster represented 41.7% of the records and the Low cluster represented 14.4%. Unlike the cut-off points found for ZTCH, the cut-off points for ZTRS covered distinct percentages of records, which may be explained by the left-skewed distribution of the data (kurtosis of 1.28 and asymmetry of -0.75). This confirms the ability of the k-means method to adjust itself to the distribution of the data (MERTLER & VANNATTA, 2013).

FIGURE 2
Values of (a) yield performance of total recoverable sugars (ZTRS) and (b) yield of total recoverable sugars (TRS), according to the water availability during the ripening period and the clusters.

The impact of the cluster formation on TRS (Figure 2b), considering the ZTRS, showed similar results to that observed for TCH, despite the selected influence factor being continuous.

The participation of clusters in the total records varied significantly. This highlights the importance of evaluating blocks of outstanding performance with a method that considers the distribution of the data at the moment of classification (as k-means does) instead of a fixed arbitrary value. Using clusters of outstanding blocks that are more homogeneous is expected to facilitate the search for beneficial patterns (HAN et al., 2012; FERRARO et al., 2009).

CONCLUSIONS

The proposed approach was found to be valid for the identification of sugarcane blocks of outstanding performance in an objective and repeatable manner. Therefore, this approach should be tested in other crops and in the context of precision agriculture, in which, instead of investigating the cause of variability from one block to another, we could attempt to understand the variation in georeferenced points within a particular block.

The results from the regression trees for TCH and TRS showed a significant influence of factors which are not easily controlled or manageable, which were used for equalising the potential yield. If the regression trees had identified factors that are easily altered by the adoption of direct management or planning techniques, this analysis would assist in continuous improvement of the production system.

ACKNOWLEDGEMENTS

The authors thank the BioEn Fapesp and Odebrecht Agroindustrial companies for financial support (process No. 2012/50049-3) and the São Paulo Research Foundation (FAPESP; Brazil). The authors also thank the professionals who were interviewed for this study.

REFERENCES

  • ANDRIANASOLO, F. N.; CASADEBAIG, P.; MAZA, E.; et al. Prediction of sunflower grain oil concentration as a function of variety, crop management and environment using statistical models. European Journal of Agronomy, Amsterdam, v. 54, n.3, p. 84–96. doi: 10.1016/j.eja.2013.12.002, 2014.
    » https://doi.org/10.1016/j.eja.2013.12.002
  • FERRARO, D. O.; RIVERO, D. E.; GHERSA, C. M. An analysis of the factors that influence sugarcane yield in Northern Argentina using classification and regression trees. Field Crops Research, Amsterdam, v. 112, n. 2–3, p. 149–157. doi: 10.1016/j.fcr.2009.02.014, 2009.
    » https://doi.org/10.1016/j.fcr.2009.02.014
  • FORGEY, E. Cluster analysis of multivariate data: efficiency vs. interpretability of classification. Biometrics, Oxford, v. 21, n.3, p. 768, 1965.
  • GLANTZ, S. A. Primer of biostatistics. 7th ed. New York: McGraw-Hill Education, 2011. 352 p.
  • HAN, J.; KAMBER, M.; PEI, J. Data mining : concepts and techniques. Amsterdam; Boston: Elsevier; Morgan Kaufmann, 2012. 703 p.
  • VAN HEERDEN, P. D. R.; EGGLESTON, G.; DONALDSON, R. A. Ripening and postharvest deterioration. In: P. H. Moore; F. C. Botha (Eds.); Sugarcane: physiology, biochemistry, and functional biology. Chichester: John Wiley & Sons, 2013. Available at: <http://doi.wiley.com/10.1002/9781118771280>. Accessed: 8 Nov. 2014.
  • INMAN-BAMBER, N. Sugarcane yields and yields-limiting processes. In: P. H. Moore; F. C. Botha (Eds.); Sugarcane: physiology, biochemistry, and functional biology. Chichester: John Wiley & Sons, 2013. p. 579-600. Available at: <http://doi.wiley.com/10.1002/9781118771280>. Accessed: 8 Nov. 2014.
    » http://doi.wiley.com/10.1002/9781118771280
  • LAWES, R. A.; LAWN, R. J. Applications of industry information in sugarcane production systems. Field Crops Research, Amsterdam, v. 92, n. 2–3, p. 353–363. doi: 10.1016/j.fcr.2005.01.033, 2005.
    » https://doi.org/10.1016/j.fcr.2005.01.033
  • MERTLER, C. A.; VANNATTA, R. A. Advanced and multivariate statistical methods: practical application and interpretation. Glendale: Pyrczak Publishing, 2013. 368 p.
  • RENAUD-GENTIÉ, C.; BURGOS, S.; BENOÎT, M. Choosing the most representative technical management routes within diverse management practices: Application to vineyards in the Loire Valley for environmental and quality assessment. European Journal of Agronomy, Amsterdam, v. 56, p. 19–36. doi: 10.1016/j.eja.2014.03.002, 2014.
    » https://doi.org/10.1016/j.eja.2014.03.002
  • SAS INSTITUTE INC. JMP 11 specialized models. Cary: SAS Institute, 2013. 224 p.
  • SOUZA, Z. M. DE; CERRI, D. G. P.; COLET, M. J.; et al. Análise dos atributos do solo e da produtividade da cultura de cana-de-açúcar com o uso da geoestatística e árvore de decisão. Ciência Rural, Santa Maria, v. 40, n. 4, p. 840–847, 2010.
  • TANUSKA, P.; VAZAN, P.; KEBISEK, M.; MORAVCIK, O.; SCHREIBER, P. Data Mining Model Building as a Support for Decision Making in Production Management. In: Wyld, D. C.; Zizka, J.; Nagamalai, D. (Ed.). Advances in computer science, engineering & applications. v. 166, p.695–701. Berlin: Springer Berlin, 2012. Available at: <http://link.springer.com/10.1007/978-3-642-30157-5_69>. Accessed: 11 Jun. 2015.
    » http://link.springer.com/10.1007/978-3-642-30157-5_69
  • TITTONELL, P.; SHEPHERD, K.; VANLAUWE, B.; GILLER, K. Unravelling the effects of soil and crop management on maize productivity in smallholder agricultural systems of western Kenya—An application of classification and regression tree analysis. Agriculture, Ecosystems & Environment, Amsterdam, v. 123, n. 1-3, p. 137–150. doi: 10.1016/j.agee.2007.05.005, 2008.
    » https://doi.org/10.1016/j.agee.2007.05.005
  • VAZAN, P.; TANUSKA, P.; KEBISEK, M. The data mining usage in production system management. World Academy of Science, Engineering and Technology, Dubai, v. 5, n. 5, p. 922–926, 2011.
  • WITTEN, I. H.; FRANK, E. Data mining : practical machine learning tools and techniques. 3rd ed. Amsterdam: Morgan Kaufman, 2011. 665 p.
  • ZHANG, J. Effects of soil properties and agronomic practices on wheat yield variability in Fengqiu County of North China Plain. African Journal of Agricultural Research, Nairobi, v. 7, n. 11. p.1650-1658. doi: 10.5897/AJAR11.1436, 2012.
    » https://doi.org/10.5897/AJAR11.1436

Publication Dates

  • Publication in this collection
    Sep-Oct 2016

History

  • Received
    11 June 2015
  • Accepted
    13 Apr 2016
Associação Brasileira de Engenharia Agrícola SBEA - Associação Brasileira de Engenharia Agrícola, Departamento de Engenharia e Ciências Exatas FCAV/UNESP, Prof. Paulo Donato Castellane, km 5, 14884.900 | Jaboticabal - SP, Tel./Fax: +55 16 3209 7619 - Jaboticabal - SP - Brazil
E-mail: revistasbea@sbea.org.br