Credit analysis using data mining: application in the case of a credit union

Sousa, Marcos de Moraes; Figueiredo, Reginaldo Santana

doi:10.4301/S1807-17752014000200009

Abstract

The search for efficiency in the cooperative credit sector has led cooperatives to adopt new technology and managerial know-how. Among the tools that facilitate efficiency, data mining has stood out in recent years as a sophisticated methodology to search for knowledge that is “hidden” in organizations' databases. The process of granting credit is one of the central functions of a credit union; therefore, the use of instruments that support that process is desirable and may become a key factor in credit management. The steps undertaken by the present case study to perform the knowledge discovery process were data selection, data pre-processing and cleanup, data transformation, data mining, and the interpretation and evaluation of results. The results were evaluated through ten-fold cross-validation, repeated ten times. The goal of this study is to develop models to analyze the capacity of a credit union's members to settle their commitments, using a decision tree (C4.5 algorithm) and an artificial neural network (multilayer perceptron algorithm). It is concluded that, for the problem at hand, the models have statistically similar results and may aid in a cooperative's decision-making process.

Credit Unionism; Data Mining; Decision Tree; Artificial Neural Network

1. INTRODUCTION

This article describes the development of models to analyze the capacity of a credit union's associates to settle their commitments. Data mining techniques were used to develop the models.

To construct the models, the actual database of cooperative borrowers from a SICOOB (Sistema Cooperativo Brasileiro—Brazilian Cooperative System) credit union was used. It must be stressed that such data are difficult to access and collect.

A credit cooperative or union is a society and must be guided by a social purpose. However, it is also a financial institution and is regulated by norms imposed by the National Monetary Council of Brazil and the Brazilian Central Bank. Moreover, a credit union must aim to remain in the market permanently, which requires that resources be managed efficiently.

The cooperative movement may be classified according to two concepts: the Rochdale principle, which aims to transform society and reform man; and the theoretical principle, developed at the University of Münster (Germany), which uses the tools of business administration science and views the cooperative as a modern enterprise (Pinho, 2004).

From the theoretical perspective, the Münster theory is the best developed. It is also known as the economic theory of cooperative cooperation, and its origins lie in the Institute for Cooperative Research at the University of Münster in Germany. In opposition to the Rochdale principle’s doctrinal assumptions, professors at the University of Münster, together with Latin American scholars, developed a “school” with methodological foundations that trace back to critical rationalism (Pinho, 2004).

Pinho (1982, p. 75), following Bettcher, presents the following concept of the cooperative, based on the axioms and assumptions of the Münster theory: “Cooperatives are groups of individuals that defend their individual economic interests by means of a company that they jointly maintain”¹. In this context, Frantz (1985, p. 56) adds that the cooperative may also be understood as embodying a “[...] competition strategy aimed to maximize the results of each producer's individual economic action [...]”.

This study looks at credit unions from the perspective of the theoretical cooperative movement and, based on its assumptions and axioms, it finds that analyzing information for decision-making is a central condition. Tools and methodologies aimed at analyzing managerial information have greatly evolved in recent decades.

Credit-union management is very complex because it must balance cooperative members' yearnings and needs while competing in the market. A credit union’s characteristics of being an association of members, while also being a company in the market, must remain balanced.

The number of cooperative members and unions has been slowly increasing. According to data from the Organization of Brazilian Cooperatives (Organização das Cooperativas Brasileiras, OCB) (2014), there are currently more than 1,047 single credit cooperatives and 4,529 points of service in Brazil. The SICOOB is the largest cooperative credit system in Brazil, with 529 single cooperatives and 1,949 cooperative service points (Portal do Cooperativismo de Crédito, 2014).

The dynamic and competitive environment of Brazil's financial market, together with recent changes in credit offerings, demands a professional posture, which leads credit unions to adopt the use of new technologies and management knowhow.

Oliveira (2001) indicates that the professionalization of cooperative members and unions is a relevant trend. The sector has developed quickly, adopting an integration strategy using central cooperatives; thus, it must be tuned in to the most efficient management tools.

Credit analysis is one of the most important issues for financial institutions. Chaia (2003) stresses the importance of defining the type of analysis to be conducted and its coverage, and further warns of the dangers of copying and using other institutions' models, thus arriving at inadequate assessments.

One of the foremost methods of credit assessment used by financial institutions is credit scoring. Chaia (2003, p. 23) defines this model as the use of statistical tools to identify the factors that determine the probability of a client going into default and notes that its main advantage is that “[...] grant-related decisions are made based on impersonal and standardized procedures, generating a higher degree of reliability”.

Given the previously discussed concept of the cooperative, the use of objective methodologies such as credit scoring in granting credit is highly relevant, because decisions based solely on subjective assessments are avoided. Koh, Tan, and Goh (2006) tie the progress of credit scoring to increased competitiveness, advances in computational technology, and the exponential growth of large databases.

Mester (1997) notes that model exactness, current data, and model evaluation and readjustment are some critical factors in credit scoring and that flaws in these factors limit the use of the model. Given that lending is one of credit unions' central functions, the analysis of that function is fundamental to protecting the cooperative's collective assets.

Obtaining tools that classify and help predict the behavior of future loans is fundamental to credit management, helping to reduce process subjectivity, allowing more efficient resource allocation, and resulting in quicker responses to proposals.

Studies that apply data mining to credit analysis in order to increase model precision have been conducted by several authors in recent years (Abellán & Mantas, 2014; Akkoç, 2012; Bhattacharyya, Jha, Tharakunnel, & Christopher, 2011; Chang & Yeh, 2012; Chen & Huang, 2011; Crone & Finlay, 2012; Cubiles-De-La-Vega, Blanco-Oliver, Pino-Mejías, & Lara-Rubio, 2013; García, Marqués, & Sánchez, 2012; Han, Han, & Zhao, 2013; Koh et al., 2006; Kruppa, Schwarz, Arminger, & Ziegler, 2013; Lai, Yu, Wang, & Zhou, 2006; Lemos, Steiner, & Nievola, 2005; Majeske & Lauer, 2013; Marqués, García, & Sánchez, 2012; Nie, Rowe, Zhang, Tian, & Shi, 2011; Oreski & Oreski, 2014; Saberi et al., 2013; Wang, Ma, Huang, & Xu, 2012; Xiong, Wang, Mayers, & Monga, 2013; Yap, Ong, & Husain, 2011; Zhong, Miao, Shen, & Feng, 2014; Zhou, Jiang, Shi, & Tian, 2011; Zhu, Li, Wu, Wang, & Liang, 2013).

Despite growing interest, there is still little application of this tool in cooperatives. Khatchatourian and Treter (2010) apply fuzzy logic to their analysis of the financial performance of production cooperatives in the Brazilian state of Rio Grande do Sul. Zhu, Li, Wu, Wang, and Liang (2013) use a support vector machine in their credit analysis of a credit union in Barbados.

Currently, there are several data mining techniques available. Therefore, the intention here is to examine which data mining methodology provides the best credit-analysis results for credit unions. To this end, this study’s objective is to determine whether a data mining model can perform well in classification and prediction for credit management in credit unions.

2. THEORETICAL FRAMEWORK

The term knowledge discovery in databases (KDD) was first used in 1989 to stress that knowledge is the final product of the discovery process in databases (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).

Until 1995, the terms KDD and data mining were understood by many authors as synonymous (Lemos et al., 2005). Fayyad et al. (1996) define and distinguish KDD and data mining as follows: the former refers to the general process of discovering useful knowledge from data, whereas the latter refers to the specific application of algorithms to extract patterns and models from data. In the view of these authors, data mining is thus a step in the KDD process that consists of applying algorithms to produce a particular set of patterns and models.

Fayyad et al. (1996) refer to patterns as model components. This study used the concept of model defined by Pidd (1998, p. 23): “A model is an external and explicit representation of part of a reality, observed by a person who wishes to use that model to understand, alter, manage, and control part of that reality”.

Goldschmidt and Passos (2003, p. 6) split KDD activities into three groups: (i) technological development, a group that incorporates “[...] the initiatives for conception, betterment and development of algorithms, support tools and technologies [...]” into the KDD process; (ii) KDD execution, a group composed of activities related to using the algorithms, tools, and technologies developed in the search for knowledge; and (iii) application of results, a group that consists of applying the models developed in KDD execution, in such a way that “[...] the activities turn toward the application of results in the context in which the KDD process was conducted”.

Hybrid or composite techniques and models are usually compared. The following models of credit and risk analysis stand out: (a) logistic regression (Akkoç, 2012; Bhattacharyya et al., 2011; Cubiles-De-La-Vega et al., 2013; Han et al., 2013; Ju & Sohn, 2014; Koh et al., 2006; Kruppa et al., 2013; Nie et al., 2011; Wang et al., 2012; Yap et al., 2011); (b) decision trees (Abellán & Mantas, 2014; Bhattacharyya et al., 2011; Chen & Huang, 2011; Crone & Finlay, 2012; Koh et al., 2006; Kruppa et al., 2013; Lemos et al., 2005; Nie et al., 2011; Wang et al., 2012; Yap et al., 2011); (c) neural networks (Akkoç, 2012; Chen & Huang, 2011; Koh et al., 2006; Lai et al., 2006; Nie et al., 2011; Oreski & Oreski, 2014; Saberi et al., 2013; Wang et al., 2012); (d) support vector machines (Bhattacharyya et al., 2011; Nie et al., 2011; Xiong et al., 2013; Zhong et al., 2014; Zhu et al., 2013); and (e) ensemble methods (Abellán & Mantas, 2014; García et al., 2012; Marqués et al., 2012; Nie et al., 2011; Wang et al., 2012).

The real-world data used in these studies were obtained from organizations in Canada (Xiong et al., 2013), Germany (Han et al., 2013; Koh et al., 2006), Croatia (Oreski & Oreski, 2014), Peru (Cubiles-De-La-Vega et al., 2013), China (Nie et al., 2011), Turkey (Akkoç, 2012), and Barbados (Zhu et al., 2013). Credit analysis using data mining is still rare in Brazil: Lemos et al. (2005) apply the same methodology to bank credit analysis, using a branch office of Banco do Brasil as their locus.

Decision trees are one of the most prominent and popular data mining methods (Wang et al., 2012). According to Lemos et al. (2005, p. 229), a decision tree is the only method that provides results in a hierarchical manner; i.e., “[...] the most relevant attribute is placed in the first node of the tree, and less relevant attributes are placed in subsequent nodes”. A decision tree is therefore a structure that is used to split a large amount of data into successive, smaller sets by applying a sequence of decision rules (Berry & Linoff, 2004).
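How the “most relevant attribute” is chosen for the first node can be illustrated with a small sketch. C4.5 ranks candidate attributes by gain ratio (information gain divided by split information); the code below computes that criterion in plain Python for nominal attributes only. The records, attribute names (`income`, `guarantor`) and class labels are hypothetical and are not taken from the cooperative's database.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute`, normalized by its split information."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted entropy of the target after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    # Split information is the entropy of the attribute's own values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records, for illustration only.
loans = [
    {"income": "low", "guarantor": "no", "status": "default"},
    {"income": "low", "guarantor": "yes", "status": "default"},
    {"income": "high", "guarantor": "no", "status": "paid"},
    {"income": "high", "guarantor": "yes", "status": "paid"},
    {"income": "low", "guarantor": "yes", "status": "paid"},
    {"income": "high", "guarantor": "no", "status": "paid"},
]

# The attribute with the highest gain ratio would be placed at the root.
ranking = sorted(["income", "guarantor"],
                 key=lambda a: gain_ratio(loans, a, "status"),
                 reverse=True)
root_attribute = ranking[0]
print(root_attribute)
```

In this toy sample, `income` separates the classes better than `guarantor`, so it would occupy the first node. A full C4.5 implementation also handles continuous attributes, missing values, and pruning, none of which is shown here.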

The construction of decision trees is especially attractive in the KDD context, which according to Gehrke (2003) is attributable to the following reasons: intuitive and easy-to-understand results; non-parametric properties that are therefore applicable to exploratory treatments; relatively fast construction when compared to other methods; and accuracy that can be compared to the accuracy of other models.

Decision trees are commonly converted into decision rules. A decision tree may be observed as:

[...] a graph in which each non-leaf node represents a predicate (condition) involving an attribute and a set of values. The leaf nodes correspond to the attribution of a value or set of values to a problem attribute (Goldschmidt & Passos, 2005, p. 57).

In light of this observation, paths in the tree correspond to rules of the type, “IF <conditions> THEN <conclusion>”. Several algorithms have been developed based on the induction of decision trees, among which the following stand out: C4.5, CART (classification and regression trees), QUEST (quick, unbiased, efficient statistical trees) and CHAID (chi-square automatic interaction detectors).
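The correspondence between tree paths and rules can be sketched as follows. The nested-dictionary tree layout, the attribute names, and the class labels are assumptions made for this illustration; they do not reproduce the tree obtained in this study.

```python
def extract_rules(node, conditions=()):
    """Return one IF-THEN rule for each root-to-leaf path of the tree."""
    if "label" in node:  # leaf node: it holds the rule's conclusion
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(extract_rules(child, conditions + (test,)))
    return rules


# Hypothetical tree: non-leaf nodes test an attribute, leaves assign a class.
tree = {
    "attribute": "income",
    "branches": {
        "high": {"label": "paid"},
        "low": {
            "attribute": "guarantor",
            "branches": {
                "yes": {"label": "paid"},
                "no": {"label": "default"},
            },
        },
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each printed rule lists the conditions met along one path, which is what makes a decision tree easy for a credit analyst to read.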

An artificial neural network (ANN) is a mathematical model based on the brain structure, ordered into layers and connections. The origin of ANNs dates back to 1943, but it was in the 1980s that greater interest in the method appeared, its development fostered mainly by advances in information technology (Braga, Carvalho, & Ludermir, 2000).

In Goldschmidt and Passos's (2005, p. 175) view, ANNs can be observed as “[...] mathematical models inspired by the working principles of biological neurons and the brain's structure”. Such models, according to those authors, allow the simulation of human abilities such as learning, generalizing, associating, and abstracting.

Braga et al. (2000, p. 1) define ANNs as “distributed parallel systems composed of simple processing units (nodes) that compute some given (usually non-linear) mathematical function, [...] arranged into one or more layers and interconnected by a large number of connections [...]”.

An ANN's structure therefore consists of neuron layers and weighted connections. As shown in Figure 1, neurons are represented by nodes and weighted connections are represented by arrows.

Figure 1 – ANN architecture

Typically, there are three stages of ANN processing: the input layer, in which the data are received; the internal layer, usually called the hidden layer, which is responsible for processing the data and may consist of more than one actual layer; and the output layer, which provides the result (Larose, 2005).
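The three processing stages can be illustrated with a minimal forward pass through a network with one hidden layer. This is only a sketch: the sigmoid activation, the layer sizes, and all weight values are assumptions chosen for illustration, not the configuration used in this study.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights, biases):
    """Each neuron applies the activation to its weighted input sum plus bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)


# Illustrative weights: two inputs, two hidden neurons, one output neuron.
hidden_w = [[0.5, -0.5], [0.25, 0.75]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]

score = forward([0.0, 0.0], hidden_w, hidden_b, output_w, output_b)
print(score)
```

Each inner list of weights corresponds to the arrows arriving at one node, so the same two functions cover networks with any number of neurons per layer.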

The first step when applying ANNs is the network's learning phase, in which parameters are adjusted. This learning may be of two types: supervised or unsupervised. The first type occurs when the values of the output (or target) variables are provided; the second type occurs in the absence of those values.
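Supervised parameter adjustment can be sketched with a single sigmoid neuron whose weights are corrected in proportion to the difference between the target and the output. This is a deliberately simplified illustration: a multilayer perceptron is normally trained with backpropagation, which propagates the same kind of error signal through the hidden layers. The data set (the logical OR function), the learning rate, and the number of epochs are assumptions chosen for the example.

```python
import math


def predict(weights, bias, inputs):
    """Output of one sigmoid neuron for the given inputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=2000, rate=0.5):
    """Adjust weights and bias using the error against the provided targets."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Supervised learning: every input pattern comes with its target value.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(samples)
predicted = [round(predict(weights, bias, x)) for x, _ in samples]
print(predicted)
```

In unsupervised learning the `target` column would be absent, so no error term of this kind could be computed.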

Braga et al. (2000, p. 227) mention the following positive points, which arouse interest in the method: the ability to learn and later to generalize, the possibility of mapping multivariable functions, self-organization, time-series processing, the possibility of using a large number of input variables, and the possibility of using samples. Because the model is considered non-parametric, these authors further stress, “[...] there is no great need to understand the process itself”. However, these authors also consider this last aspect to constitute the primary criticism of the model; i.e., the model's inability to clarify how its results are generated. Due to this peculiarity, ANNs are also called “black boxes”.

3. METHODOLOGY

In this research, we opted to use a case study, which according to Yin (2010) is adequate for studying contemporary events in a real-life context when controlling variables becomes more difficult for the researcher. This is a single case study with one unit of analysis, a credit union of the SICOOB system, structured to investigate the union from the perspective of the theoretical cooperative movement.

The cooperative's database was used to evaluate the credit analysis system's performance. This database comprises the historical data of natural persons' analyses from 2003 to 2007. Due to a change in the information system, it was not possible to gather data from before this period.

Data referring to credit analysis is highly confidential and strategic, due to banking secrecy and the risk of competitors acquiring the data. This makes such data very difficult for third parties to obtain. The choice to study a cooperative was therefore due to its willingness to provide the data.

The cooperative currently uses the SisBr application of the SICOOB system as a credit analysis tool. This application contains the information used by management and the board of directors to make decisions about whether to grant credit.

The study followed the steps suggested by Fayyad, Piatetsky-Shapiro, and Smyth (1996) for the knowledge discovery process: data selection, data pre-processing and cleanup, data transformation, data mining, and the interpretation and evaluation of results.

To reach its goals, this study was based on the activities involved in conducting knowledge discovery in databases (KDD), as discussed by Goldschmidt and Passos (2005). Among the available data mining techniques, neural networks and decision trees were used, both of which are common in empirical studies.

Data collection and selection corresponded to the process of capturing, organizing, and selecting the data made available for the modeling and data-mining phases, and thus required accurate examination. Dasu and Johnson (2003) note the following factors that are helpful in analyzing variables: previous experience, knowledge, quantity of results, and quality of results.

3. RESULTS

This section describes the simulation results of the techniques investigated in this article—namely, the decision tree and ANN techniques—along with the statistical tests comparing them.

No missing values are found in the database. Chart 01 shows the structure of the constructed database, together with its variables and their possible values.

Chart 01 - Database structure

The output vector is given by variables 27 to 39, corresponding to the period from July 2007 through June 2008. Variables 02 through 10 are not part of the credit analysis conducted by the cooperative and were added to broaden the analysis. These variables were gathered from cooperative members' records and represent the data made available by the cooperative.

Variables 11 through 26 are currently used for credit analysis and represent the cooperative member's borrowing history. Codes 27 through 39 represent the output variable adopted in the study and depict the period from July 2007 through June 2008. These are the data that the cooperative provided. Older data were not available.

The number of variables used in the analysis was consistent with other studies. For example, Koh et al. (2006) used 20 variables, and Lemos et al. (2005) used 24.

The variable “cooperative member code” was discarded because it was useful only to identify individuals while collecting data. The variable “attributed risk” was used only in the data pre-processing and cleanup stages; it was not used in the transformation and modeling stages because it refers to the output of the model used by the cooperative, and therefore it represents the result of the model currently being used. The variable “aggregate results” represents the result of the period of analysis (July 2007 through June 2008) and according to the cooperative's business rules and the study's goals, is the model's target output variable.

Data transformation aims to help carry out the data mining techniques. As recommended by Goldschmidt and Passos (2005), data were grouped into a single two-dimensional table.

The data were collected from two sources in the cooperative: first, from credit assessment; and second, from records gathered manually, record by record. In this survey, historical data from 211 individual members were used, out of which 22 were in default and 189 were in good standing. These data represent all of the cooperative's member-borrowers. Given the difference in numbers between default and good standing, there may be bias and overfitting problems (Chawla, 2005; Horta, Borges, Carvalho, & Alves, 2011). To address this problem, a technique called SMOTE (synthetic minority oversampling technique) (Chawla, 2005) was used to insert observations of cooperative members in default. This algorithm is one of the most used in the literature (Horta et al., 2011). Thus, 110 observations of the minority class (i.e., members in default) were created, and the sample totaled 321 observations, out of which 132 were in default and 189 were in good standing. Next, the database was randomized to avoid a concentration of the same values in a given data set while cross-validating, which would have led to overfitting.
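The oversampling step can be sketched as follows. This is a simplified illustration of the core SMOTE idea for numeric attributes only (interpolating between a minority observation and one of its nearest minority neighbours); the function name, the choice of k = 5 neighbours, and the random toy data are assumptions, and the sketch is not the implementation used in the study.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data using the class sizes reported above: 22 minority observations, 110 new ones.
toy_rng = random.Random(1)
minority = [[toy_rng.random(), toy_rng.random()] for _ in range(22)]
new_points = smote(minority, n_new=110)
balanced_minority = minority + new_points  # 132 observations in total
```

Because every synthetic point lies on a segment between two real minority observations, the new observations stay inside the region already occupied by the minority class.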

For the computational implementation of the decision tree and neural network techniques, the database described in this study was used, taking into consideration, for each cooperative member, the previously described variables. For example, a model generated by the decision tree technique was selected for transformation into decision rules during the post-processing phase. The public domain computational tool WEKA (Waikato environment for knowledge analysis) was chosen to perform this task.

Goldschmidt and Passos (2005, p. 50) argue that for a more reliable evaluation of the knowledge model, “[...] the data used in constructing the model should not be the same as used in this model's evaluation”. Those authors further state that there should be at least two divisions: training and testing. The first division comprises the data used in constructing the model; the second division comprises the data for evaluation.

Splitting the data set served to simplify, summarize and reduce the database's size and variability, resulting in the selection of more sophisticated and accurate models (Dasu & Johnson, 2003).

In this study, to increase assessment neutrality, K-fold cross-validation was used for both the decision tree and the ANN. According to Goldschmidt and Passos (2005, p. 51), in this method, the database with N elements is randomly split into K separate subsets: “each of the K subsets is used as a testing set, and the remaining (K-1) subsets are combined into a training set. The process is repeated K times, so that K models are generated and evaluated [...]”. The data were split into ten sets and the procedure was repeated across ten simulations, as proposed by Witten and Frank (2005). Cross-validation has been used in several studies on credit analysis (Akkoç, 2012; Chang & Yeh, 2012; Han et al., 2013).
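The splitting scheme described above can be sketched in a few lines. The function names are hypothetical, the split is a plain (unstratified) random one, and the seed handling is an assumption made for reproducibility; the study itself relied on WEKA for this step.

```python
import random


def k_fold_indices(n, k, seed):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cross_validation(n, k=10, repetitions=10):
    """Yield (repetition, train_indices, test_indices) for every fold of every repetition."""
    for rep in range(repetitions):
        folds = k_fold_indices(n, k, seed=rep)
        for i, test in enumerate(folds):
            # The remaining k-1 folds are combined into the training set.
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield rep, train, test


# 321 observations, ten folds, ten repetitions: 100 train/test splits in total.
splits = list(repeated_cross_validation(n=321))
```

Each observation is used for testing exactly once per repetition, so no model is evaluated on data that was used to build it.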

For this study, multi-layer ANNs, i.e., multilayer perceptron (MLP), were used with the back propagation learning algorithm. The number of neurons in the input layer was 66, plus two in the intermediate and two in the output layers. For all tests, a learning rate of 0.01 was used, given that this rate improved classification as observed in the simulations and was also used by Lemos et al. (2005). The momentum rate was not used as in Lemos et al. (2005); moreover, adding this rate did not improve classification performance.
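The role of the learning rate and of the momentum rate can be illustrated with the usual gradient-descent weight update. The sketch below assumes the textbook form of the rule (new change = minus learning rate times gradient, plus momentum times previous change); the function name and the sample numbers are hypothetical.

```python
def update_weights(weights, gradients, previous_deltas, learning_rate=0.01, momentum=0.0):
    """One back-propagation update step for a flat list of weights.

    With momentum=0.0, as in the configuration described above, the previous
    change has no influence and the step depends only on the current gradient.
    """
    deltas = [
        -learning_rate * g + momentum * d
        for g, d in zip(gradients, previous_deltas)
    ]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


# Hypothetical weights and error gradients for two connections.
new_weights, deltas = update_weights([1.0, -0.5], [2.0, -1.0], [0.0, 0.0])
```

A small learning rate such as 0.01 makes each step conservative: every weight moves by only one hundredth of its error gradient per update.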

Supervised learning was used in this ANN. Ferreira (2005, p. 37) describes this type of learning as follows: “[...] the network is trained by supplying it with input values and the respective output values [...]”.

For a comparative analysis of the models, the total percentage of correctly predicted values was used as the parameter in the two-tailed corrected resampled t-test, at a significance level of 0.05 (or 5%), with nine degrees of freedom, as proposed by Witten and Frank (2005) in equation 1, presented below:
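The corrected resampled t-statistic is commonly written in the following form. The symbol names are assumptions made here for illustration: $\bar{d}$ is the mean of the $k$ paired differences in accuracy between the two models, $\sigma_d^{2}$ is the variance of those differences, and $n_1$ and $n_2$ are the numbers of training and testing instances in each run.

```latex
t = \frac{\bar{d}}{\sqrt{\left(\dfrac{1}{k} + \dfrac{n_2}{n_1}\right)\sigma_d^{2}}}
```

With ten-fold cross-validation, each testing set holds one tenth of the data and each training set nine tenths, so $n_2/n_1 = 1/9$. The extra term $n_2/n_1$ widens the denominator to account for the overlap between training sets across runs.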

3.1 Decision Tree

In this study, the J4.8 tool was chosen for use. It is the WEKA implementation of the C4.5 decision tree algorithm. According to Goldschmidt and Passos (2005), this tree is broadly used and accepted.

A model generated by the decision tree technique was taken to exemplify the rules and confusion matrix. The model below generated 41 leaves, i.e., sets of decision rules of the if-then type. Some rules of the first set are shown below:

If liquidity guarantee = high guarantee liquidity (sale in less than 6 months), then default;
If liquidity guarantee = moderate guarantee liquidity (sale in 6 to 12 months) and level of commitment = up to 20% of the average net income, then good standing;
If liquidity guarantee = moderate guarantee liquidity (sale in 6 to 12 months) and level of commitment = from 20% to 30% of the average net income, then good standing.
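The three rules listed above translate directly into conditional code. In this sketch the function name and the short string labels for the attribute values are hypothetical; only these three rules are encoded, so any other combination returns None rather than a guess.

```python
def classify(liquidity_guarantee, level_of_commitment):
    """Apply the three example rules; return None when none of them fires."""
    if liquidity_guarantee == "high":
        return "default"
    if liquidity_guarantee == "moderate" and level_of_commitment in (
        "up to 20%",
        "20% to 30%",
    ):
        return "good standing"
    return None  # covered by other leaves of the tree, not shown in the text


example = classify("moderate", "up to 20%")
```

Being able to read a model as plain if-then statements like these is what makes decision trees easier to interpret than an ANN.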

Chart 02 shows the confusion matrix generated by the testing set of the tree being evaluated. This matrix shows the instances classified as predicted and actual to assess the models' hit and miss types. The main diagonal contains the correctly classified values. The values are given as absolute numbers.

Chart 02 - Confusion matrix of a model developed using the decision tree method

In this example, the model based on the C4.5 decision tree algorithm correctly classified 302 records, which corresponds to an accuracy rate of 94.08%, and incorrectly classified 19 observations, or 5.92%.

3.2 Neural Network

The ANN in this study, as previously discussed in the theoretical framework, contained three layers: input, intermediate, and output. The network used supervised learning because the model's output values were supplied.

As in the case of the decision tree, a model was used to exemplify the results generated by the WEKA package. Chart 03 shows the confusion matrix for the model obtained.

Chart 03 - Confusion matrix for a model developed using the ANN method

This ANN-based model, constructed using the MLP algorithm, classified 294 records correctly and 27 records incorrectly, corresponding to an accuracy rate of 91.59% and an error rate of 8.41%, respectively.
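The percentages reported for the two example models follow directly from the counts of correctly and incorrectly classified records. The short check below uses only the totals given in the text; the function name is hypothetical.

```python
def rates(correct, incorrect):
    """Return (accuracy %, error %) rounded to two decimals."""
    total = correct + incorrect
    return round(100 * correct / total, 2), round(100 * incorrect / total, 2)


tree_accuracy, tree_error = rates(302, 19)  # decision tree example model
ann_accuracy, ann_error = rates(294, 27)    # ANN example model
```

Both examples are evaluated on the same 321 observations, so the two accuracy rates are directly comparable.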

3.3 Model Evaluation

This section comparatively evaluates the two models developed in this study: ANN (MLP algorithm) and decision tree (C4.5 algorithm). To conduct this evaluation, the total percentage of correctly predicted values was used.

Table 01 shows the percentage result of correctly predicted values and the respective standard deviation of the simulations carried out using the studied models.

Table 01 - Comparative evaluation

The simulations performed using WEKA's experimenter tool indicate that the decision tree technique implementing the C4.5 algorithm is statistically similar, according to a two-tailed test, to the ANN with the MLP algorithm at the 0.05 significance level. Figure 02 shows WEKA's output for the simulations of the previously described problem.

Figure 02: Output for the simulations run using WEKA's experimenter tool

The decision tree's performance for the current problem is better than the one obtained by Yap et al. (2011), in which the error rate is 28.1%. The study by Lemos et al. (2005), despite not performing statistical tests, obtains a higher hit rate with the neural network than with the decision tree. Although statistically similar, decision trees are considered to be easy to use (Lemos et al., 2005).

Our results indicate that the classification models based on data mining developed herein may be useful to the cooperative in its assessments, thus improving performance, as previously found in the analysis of a microcredit organization (Cubiles-De-La-Vega et al., 2013) and a credit union (Zhu et al., 2013).

4. CONCLUSIONS

The goal of this study is to develop and evaluate data mining models to classify and predict cooperative members' behavior in honoring their obligations. The decision tree and ANN, both data mining techniques, were used to develop the models.

The process of data preparation and modeling followed the steps suggested in the literature: data selection, data pre-processing and cleanup, data transformation, data mining, interpretation, and validation of the results. The data were divided into training and testing sets.

Although the decision tree's accuracy in the simulations is 97.07%, compared to 95.58% with the ANN, the decision-tree-based C4.5 algorithm obtains a result that is statistically similar to that of the model that was based on the MLP artificial neural network.

The knowledge discovery process and the use of models based on data mining developed here may provide the cooperative with practical advantages. Understanding the variables and their relationships may help in better classifying and predicting cooperative members' behavior. In-depth assessment of the variables may further help in including variables that might be important and excluding others that turn out not to be relevant, with the advantage of providing more succinct and precise credit management models, reducing execution time and improving decision accuracy. The analysis of discrepant or outlier cases may be relevant to creating a new classification or, conversely, finding undesirable patterns.

This study is limited by the following issues: a lack of other cooperatives' databases for comparison, evaluation, and validation of the model; limitations of the information system used by the cooperative, which precluded collecting input variable data from before 2003 and provided data referring to the output value only for the past six months; and the lack of integration between some database modules and electronic spreadsheets.

The following proposals are left for future studies: using different databases to validate the credit analysis model; using other data mining techniques; using hybrid models, combining different techniques to improve classification and predictive performance; investment analysis, evaluating the type of error and the financial impact that the model has on the cooperative's profitability; and evaluation of discrepant cases, particularly those cases involving the variable "capital", to check for the existence of new patterns and classifications.

REFERENCES

Abellán, J., & Mantas, C. J. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 41(8), 3825–3830.
Akkoç, S. (2012). An empirical comparison of conventional techniques, neural networks and the three stage hybrid Adaptive Neuro Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data. European Journal of Operational Research, 222(1), 168–178.
Berry, M. J. A., & Linoff, G. (2004). Data mining techniques: For marketing, sales and customer relationship management (2nd ed.). Indianapolis: Wiley Publishing.
Bhattacharyya, S., Jha, S., Tharakunnel, K., & Christopher, J. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3), 602–613.
Braga, A. de P., Carvalho, A. P. de L. F., & Ludermir, T. B. (2000). Redes neurais artificiais: Teoria e aplicações. Rio de Janeiro: LTC.
Chaia, A. J. (2003). Modelos de gestão de risco de crédito e sua aplicabilidade ao mercado brasileiro. Dissertação de Mestrado. FEA/USP.
Chang, S.-Y., & Yeh, T.-Y. (2012). An artificial immune classifier for credit scoring analysis. Applied Soft Computing, 12(2), 611–618.
Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 853–867). New Jersey: Springer.
Chen, S. C., & Huang, M. Y. (2011). Constructing credit auditing and control & management model with data mining technique. Expert Systems with Applications, 38, 5359–5365.
Crone, S. F., & Finlay, S. (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting, 28(1), 224–238.
Cubiles-De-La-Vega, M.-D., Blanco-Oliver, A., Pino-Mejías, R., & Lara-Rubio, J. (2013). Improving the management of microfinance institutions by using credit scoring models based on Statistical Learning techniques. Expert Systems with Applications, 40(17), 6910–6917.
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. New Jersey: John Wiley & Sons.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.
Ferreira, J. B. (2005). Mineração de dados na retenção de clientes em telefonia celular. Dissertação de Mestrado. PUC-RIO.
García, V., Marqués, A. I., & Sánchez, J. S. (2012). On the use of data filtering techniques for credit risk prediction with instance-based models. Expert Systems with Applications, 39(18), 13267–13276.
Gehrke, J. (2003). Decision tree. In The handbook of data mining (pp. 3–23). New Jersey: Lawrence Erlbaum Associates.
Goldschmidt, R., & Passos, E. (2005). Data mining: Um guia prático. Rio de Janeiro: Elsevier.
Han, L., Han, L., & Zhao, H. (2013). Orthogonal support vector machine for credit scoring. Engineering Applications of Artificial Intelligence, 26(2), 848–862.
Horta, R. A. M., Borges, C. C. H., Carvalho, F. A. A., & Alves, F. J. S. (2011). Previsão de insolvência: Uma estratégia para balanceamento da base de dados utilizando variáveis contábeis de empresas brasileiras. Sociedade, Contabilidade E Gestão, 6(2), 21–36.
Ju, Y. H., & Sohn, S. Y. (2014). Updating a credit-scoring model based on new attributes without realization of actual data. European Journal of Operational Research, 234(1), 119–126.
Khatchatourian, O., & Treter, J. (2010). Aplicação da lógica fuzzy para avaliação econômico-financeira de cooperativas de produção. Revista de Gestão Da Tecnologia E Sistemas de Informação, 7(1), 141–162.
Koh, H. C., Tan, W. C., & Goh, C. P. (2006). A two-step method to construct credit scoring models with data mining techniques. International Journal of Business and Information, 1(1), 96–118.
Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications, 40(13), 5125–5131.
Lai, K. K., Yu, L., Wang, S., & Zhou, L. (2006). Credit risk analysis using a reliability-based neural network ensemble model. In Artificial Neural Networks-ICANN 2006 (pp. 682–690). Springer Berlin Heidelberg.
Larose, T. D. (2005). Discovering knowledge in data: An introduction to data mining. New Jersey: John Wiley & Sons.
Lemos, E. P., Steiner, M. T. A., & Nievola, J. C. (2005). Análise de crédito bancário por meio de redes neurais e árvore de decisao: Uma aplicação simples de data mining. Revista de Administração Da Universidade de São Paulo, 40(3), 225–234.
Majeske, K. D., & Lauer, T. W. (2013). The bank loan approval decision from multiple perspectives. Expert Systems with Applications, 40(5), 1591–1598.
Marqués, A. I., García, V., & Sánchez, J. S. (2012). Two-level classifier ensembles for credit risk assessment. Expert Systems with Applications, 39(12), 10916–10922.
Mester, L. J. (1997). What’s the point of credit scoring? Business Review, 3, 3–16.
Nie, G., Rowe, W., Zhang, L., Tian, Y., & Shi, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38(12), 15273–15285.
OCB. (2014). Organização das Cooperativas Brasileiras. Números. Retrieved February 20, 2014, from http://www.ocb.org.br/site/ramos/credito_numeros.asp
Oliveira, D. P. R. (2001). Manual de gestão de cooperativas: Uma abordagem prática. São Paulo: Atlas.
Oreski, S., & Oreski, G. (2014). Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Systems with Applications, 41(4), 2052–2064.
Pidd, M. (1998). Modelagem empresarial: Ferramentas para tomada de decisão. São Paulo: Atlas.
Pinho, D. B. (1982). O pensamento cooperativo e o cooperativismo brasileiro. CNPq/BNCC.
Pinho, D. B. (2004). O cooperativismo no Brasil: Da vertente pioneira à vertente solidaria. São Paulo: Saraiva.
Portal do Cooperativismo de Crédito. (2014). Dados consolidados dos sistemas cooperativos. Retrieved February 20, 2014, from http://cooperativismodecredito.coop.br/cenario-brasileiro/dados-consolidados=_dos-sistemas-cooperativos/
Saberi, M., Mirtalaie, M. S., Hussain, F. K., Azadeh, A., Hussain, O. K., & Ashjari, B. (2013). A granular computing-based approach to credit scoring modeling. Neurocomputing, 122(25), 100–115.
Wang, G., Ma, J., Huang, L., & Xu, K. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge-Based Systems, 26, 61–68.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Elsevier.
Xiong, T., Wang, S., Mayers, A., & Monga, E. (2013). Personal bankruptcy prediction by mining credit card data. Expert Systems with Applications, 40(2), 665–676.
Yap, B. W., Ong, S. H., & Husain, N. H. M. (2011). Using data mining to improve assessment of credit worthiness. Expert Systems with Applications, 38(10), 13274–13283.
Yin, R. K. (2010). Estudo de caso: planejamento e métodos (4th ed.). Porto Alegre: Bookman.
Zhong, H., Miao, C., Shen, Z., & Feng, Y. (2014). Comparing the learning effectiveness of BP, ELM, I-ELM, and SVM for corporate credit ratings. Neurocomputing, 128(27), 285–295.
Zhou, X., Jiang, W., Shi, Y., & Tian, Y. (2011). Credit risk evaluation with kernel-based affine subspace nearest points learning method. Expert Systems with Applications, 38(4), 4272–4279.
Zhu, X., Li, J., Wu, D., Wang, H., & Liang, C. (2013). Balancing accuracy, complexity and interpretability in consumer credit decision making: A C-TOPSIS classification approach. Knowledge-Based Systems, 52, 258–267.

1 All direct quotations were translated by the authors.
Published by/ Publicado por: TECSI FEA USP – 2014 All rights reserved.

Credit analysis using data mining: application in a credit union


1. INTRODUCTION

This article describes the development of models to analyze the capacity of a credit union's associates to settle their commitments. Data mining techniques were used to develop the models.

To construct the models, the actual database of cooperative borrowers from a credit union of the SICOOB system (Sistema Cooperativo Brasileiro) was used. It must be stressed that such data are difficult to access and collect.

A credit cooperative or union is a society of people and must be guided by a social purpose. However, it is also a financial institution and is regulated by norms imposed by the National Monetary Council and the Central Bank. Moreover, it must aim to remain in the market permanently, which requires that resources be managed efficiently.

The cooperative movement can be classified into two strands: the Rochdalean doctrinal strand, which sought to transform society and reform mankind; and the theoretical strand, developed at the University of Münster (Germany), which uses the tools of business administration science and views the cooperative as a modern enterprise (Pinho, 2004).

Within the theoretical perspective, the Münster Theory, also known as the “Economic Theory of Cooperative Cooperation”, is the most developed; it originated at the Institute of Cooperativism of the University of Münster, in Germany. Professors from that university, together with Latin American researchers and in opposition to the Rochdalean doctrinal assumptions, developed this “School”, whose methodological foundation derives from critical rationalism (Pinho, 2004).

Pinho (1982, p. 75) presents, following Boettcher's ideas, the following concept of a cooperative based on the axioms and assumptions of the Münster Theory: “cooperatives are groups of individuals who defend their individual economic interests through an enterprise that they jointly maintain”. In this context, Frantz (1985, p. 56) adds that the cooperative may also be understood as the definition of a “[...] competition strategy aimed at maximizing the results of each producer's individual economic action [...]”.

This research views credit unions from the standpoint of theoretical cooperativism; given the assumptions and axioms developed within it, the analysis of information for decision-making is a central condition. Tools and methodologies aimed at analyzing managerial information have evolved considerably in recent decades.

Managing a credit union is complex because it must balance the wishes and needs of its members against the need to compete in the market. Its character as an association for its members and as a business for the market must be kept in a certain balance.

The number of cooperative members and of cooperatives has been increasing gradually. According to data from OCB (2014), there are currently 1,047 individual credit unions and 4,529 service points in Brazil. SICOOB is the largest cooperative credit system in Brazil, comprising 529 individual cooperatives and 1,949 cooperative service points (Portal do Cooperativismo de Crédito, 2014).

The dynamic and competitive environment of the Brazilian financial market, together with changes in the supply of credit in recent years, demands a professional stance, which leads credit unions to adopt new technologies and managerial knowledge.

Oliveira (2001) identifies the professionalization of members and cooperatives as a relevant trend. The sector has developed rapidly, adopting a strategy of integration through central cooperatives, and cooperatives therefore need to be closely attuned to the most efficient management tools available.

Analisar o crédito constitui certamente um dos pontos mais importantes em instituições financeiras. Chaia (2003)Chaia, A. J. (2003). Modelos de gestão de risco de crédito e sua aplicabilidade ao mercado brasileiro. Dissertação de Mestrado. FEA/USP.destaca a importância da definição do tipo de análise a ser feito e da abrangência da mesma e ainda alerta para o perigo de copiar e utilizar modelos de outras instituições, resultando assim em avaliações inadequadas.

One of the main credit assessment methods used by financial institutions is credit scoring. Chaia (2003, p. 23) defines this model as the use of statistical tools to identify the factors that determine the probability of a client defaulting, and points out as its main advantage the fact that “[...] decisions on granting credit are made on the basis of impersonal and standardized procedures, generating a higher degree of reliability”.

Given the concept of the cooperative discussed earlier, the use of objective methodologies such as credit scoring in granting credit becomes highly relevant, since it prevents decisions from being based solely on subjective judgment. Koh, Tan, and Goh (2006) attribute the progress of credit scoring to increased competition, advances in computing technology, and the exponential growth of large databases.

Mester (1997) indicates that the accuracy of the model, the updating of the data, and the evaluation and readjustment of the models are some of the critical factors in credit scoring. Failures in these factors limit the use of such a model. Given that granting credit is one of the core processes of credit unions, the analysis of this process is fundamental to protecting the cooperative's collective assets.

Obtaining tools that classify borrowers and help predict the behavior of future loans is fundamental to credit management, with the advantages of reducing subjectivity in the process, allowing resources to be managed more efficiently, and speeding up the handling of proposals.

Studies of credit analysis by means of data mining, which seek to increase the precision of the models, have been carried out by several authors in recent years (Abellán & Mantas, 2014; Akkoç, 2012; Bhattacharyya, Jha, Tharakunnel, & Christopher, 2011; Chang & Yeh, 2012; Chen & Huang, 2011; Crone & Finlay, 2012; Cubiles-De-La-Vega, Blanco-Oliver, Pino-Mejías, & Lara-Rubio, 2013; García, Marqués, & Sánchez, 2012; Han, Han, & Zhao, 2013; Koh et al., 2006; Kruppa, Schwarz, Arminger, & Ziegler, 2013; Lai, Yu, Wang, & Zhou, 2006; Lemos, Steiner, & Nievola, 2005; Majeske & Lauer, 2013; Marqués, García, & Sánchez, 2012; Nie, Rowe, Zhang, Tian, & Shi, 2011; Oreski & Oreski, 2014; Saberi et al., 2013; Wang, Ma, Huang, & Xu, 2012; Xiong, Wang, Mayers, & Monga, 2013; Yap, Ong, & Husain, 2011; Zhong, Miao, Shen, & Feng, 2014; Zhou, Jiang, Shi, & Tian, 2011; Zhu, Li, Wu, Wang, & Liang, 2013).

Despite the growing interest, these tools are still rarely applied in cooperatives. Khatchatourian and Treter (2010) applied fuzzy logic to the analysis of financial performance in production cooperatives in Rio Grande do Sul. Zhu, Li, Wu, Wang, and Liang (2013) used a support vector machine in the credit analysis of a credit union in Barbados.

Several data mining techniques are currently available. This study therefore sought to examine which data mining methodology offers the best results in credit analysis for credit unions. Accordingly, the question is whether a data mining model can perform well in classification and prediction for credit management in credit unions.

2. THEORETICAL FRAMEWORK

The term “Knowledge Discovery in Databases” (KDD) was first used in 1989 to emphasize that knowledge is the end product of the process of discovery in databases (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).

Until 1995, many researchers treated the terms KDD and data mining as synonyms (Lemos et al., 2005). Fayyad et al. (1996) define and distinguish the two as follows: KDD refers to the overall process of discovering useful knowledge from data, and data mining to the specific application of algorithms for extracting patterns and models from data. In these authors' view, data mining is thus one step in the KDD process, consisting of applying data analysis and algorithms to produce a particular set of patterns and models.

Fayyad et al. (1996) describe patterns as components of models. This study uses the concept of model defined by Pidd (1998, p. 23): “A model is an external and explicit representation of part of reality as seen by the person who wishes to use that model to understand, change, manage, and control that part of reality”.

Goldschmidt and Passos (2003, p. 6) divide KDD activities into three groups: (i) technological development, which comprises “[...] initiatives for the conception, improvement, and development of algorithms, tools, and supporting technologies [...]” in the KDD process; (ii) KDD execution, which includes activities related to using the algorithms, tools, and technologies developed in the search for knowledge; and (iii) application of results, in which, with the models developed during KDD execution, “[...] activities turn to applying the results in the context in which the KDD process was carried out”.

Comparisons of techniques and of hybrid or composite models are common. Among credit and risk analysis models, the following stand out: (a) logistic regression (Akkoç, 2012; Bhattacharyya et al., 2011; Cubiles-De-La-Vega et al., 2013; Han et al., 2013; Ju & Sohn, 2014; Koh et al., 2006; Kruppa et al., 2013; Nie et al., 2011; Wang et al., 2012; Yap et al., 2011); (b) decision trees (Abellán & Mantas, 2014; Bhattacharyya et al., 2011; Chen & Huang, 2011; Crone & Finlay, 2012; Koh et al., 2006; Kruppa et al., 2013; Lemos et al., 2005; Nie et al., 2011; Wang et al., 2012; Yap et al., 2011); (c) neural networks (Akkoç, 2012; Chen & Huang, 2011; Koh et al., 2006; Lai et al., 2006; Nie et al., 2011; Oreski & Oreski, 2014; Saberi et al., 2013; Wang et al., 2012); (d) support vector machines (Bhattacharyya et al., 2011; Nie et al., 2011; Xiong et al., 2013; Zhong et al., 2014; Zhu et al., 2013); and (e) ensemble methods (Abellán & Mantas, 2014; García et al., 2012; Marqués et al., 2012; Nie et al., 2011; Wang et al., 2012).

The real-world data used in these studies were collected from organizations in Canada (Xiong et al., 2013), Germany (Han et al., 2013; Koh et al., 2006), Croatia (Oreski & Oreski, 2014), Peru (Cubiles-De-La-Vega et al., 2013), China (Nie et al., 2011), Turkey (Akkoç, 2012), and Barbados (Zhu et al., 2013). Credit analysis by means of data mining is still scarce in Brazil. Lemos et al. (2005) examined bank credit analysis with the same methodology, using a Banco do Brasil branch as their research site.

Decision trees are one of the main and most popular data mining methods (Wang et al., 2012). According to Lemos et al. (2005, p. 229), this method is the only one to display results hierarchically: “[...] the most important attribute is presented in the tree as the first node, and the less relevant attributes are shown in the subsequent nodes”.

Thus, a decision tree is a structure used to divide a large amount of data into successively smaller sets by applying a sequence of decision rules (Berry & Linoff, 2004).
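
As an illustration of how such a sequence of rules can be chosen, the sketch below scores candidate attributes by information gain, the entropy-based measure on which the gain ratio used by C4.5 builds, and picks the attribute whose split leaves the purest subsets. The records, the attribute names (`late_before`, `has_guarantor`, `status`), and their values are hypothetical and are not taken from the credit union's database.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split_by(records, attribute):
    """Partition the records into one subset per observed attribute value."""
    subsets = {}
    for rec in records:
        subsets.setdefault(rec[attribute], []).append(rec)
    return subsets

def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting the records on the attribute."""
    before = entropy([r[target] for r in records])
    after = sum(
        len(sub) / len(records) * entropy([r[target] for r in sub])
        for sub in split_by(records, attribute).values()
    )
    return before - after

# Hypothetical toy records (illustrative only).
members = [
    {"late_before": "yes", "has_guarantor": "no",  "status": "default"},
    {"late_before": "yes", "has_guarantor": "yes", "status": "default"},
    {"late_before": "no",  "has_guarantor": "no",  "status": "paid"},
    {"late_before": "no",  "has_guarantor": "yes", "status": "paid"},
    {"late_before": "no",  "has_guarantor": "no",  "status": "paid"},
    {"late_before": "no",  "has_guarantor": "yes", "status": "paid"},
]

# The attribute with the highest gain becomes the first decision rule.
best = max(["late_before", "has_guarantor"],
           key=lambda a: information_gain(members, a, "status"))
print(best)
```

In this toy set, splitting on `late_before` separates the two classes completely, whereas `has_guarantor` leaves both subsets as mixed as the original data. A tree-induction algorithm would repeat the same selection inside each resulting subset.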

Building decision trees is especially attractive in the KDD setting. The reasons for this, discussed by Gehrke (2003), are: the result is intuitive and easy to understand; decision trees are non-parametric and therefore suited to exploratory work; construction is relatively fast compared with other methods; and the accuracy of decision trees is comparable to that of other models.

It is common to convert a decision tree into decision rules. A decision tree can be understood as:

[...] a graph in which each non-leaf node represents a predicate (condition) involving an attribute and a set of values. The leaf nodes correspond to the assignment of a value or set of values to an attribute of the problem (Goldschmidt & Passos, 2005, p. 57).

In this sense, the paths of the tree correspond to rules of the type “IF <conditions> THEN <conclusion>”. Many algorithms based on decision tree induction have been developed, notably C4.5, CART (Classification and Regression Trees¹), QUEST (Quick, Unbiased, Efficient Statistical Tree²), and CHAID (Chi-square Automatic Interaction Detector³).
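
The correspondence between paths and rules can be made concrete with a short sketch. It assumes a tree stored as nested dictionaries, a representation chosen here only for illustration, and emits one IF-THEN rule per root-to-leaf path. The tree itself, including its attribute names and labels, is hypothetical and is not the tree obtained in this study.

```python
def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule for every root-to-leaf path of the tree."""
    if "label" in node:  # leaf node: the accumulated tests form the premise
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules

# Hypothetical tree: non-leaf nodes test an attribute, leaves carry a class.
tree = {
    "attribute": "late_before",
    "branches": {
        "yes": {"label": "default"},
        "no": {
            "attribute": "has_guarantor",
            "branches": {
                "yes": {"label": "paid"},
                "no": {"label": "default"},
            },
        },
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Each printed line is a conjunction of the tests met along one path, followed by the class stored in the leaf. This is why a tree with three leaves yields exactly three rules.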

An artificial neural network (ANN) is a mathematical model based on the structure of the brain, organized into layers and connections. ANNs originated in 1943; however, it was in the 1980s that greater interest in the method arose, driven mainly by advances in information technology (Braga, Carvalho, & Ludermir, 2000).

From the perspective of Goldschmidt and Passos (2005, p. 175), ANNs can be understood as “[...] mathematical models inspired by the operating principles of biological neurons and by the structure of the brain”. According to the same authors, such models make it possible to simulate the human capacities to learn, generalize, associate, and abstract.

Braga et al. (2000, p. 1) define ANNs as “parallel distributed systems composed of simple processing units (nodes) that compute certain mathematical functions (usually non-linear), [...] arranged in one or more layers and interconnected by a large number of connections [...]”.

The structure of an ANN is therefore composed of layers of neurons and of connections, which are weighted. As shown in Figure 01, the neurons are represented by the nodes and the weights by the arrows.

Figure 01 - Architecture of an ANN.

ANN processing typically has three parts: the input layer, through which the data are received; the internal layer, commonly called the “hidden layer”, which is responsible for processing the data and may comprise more than one layer; and the output layer, which represents the result (Larose, 2005).
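
The flow of data through these three parts can be sketched as a forward pass. The example below is a minimal illustration rather than the network trained in this study: the sigmoid activation, the layer sizes (three inputs, two hidden neurons, one output), and all weight values are assumptions chosen only to make the code runnable.

```python
from math import exp

def sigmoid(x):
    """Logistic activation (an assumed choice; the text does not specify one)."""
    return 1.0 / (1.0 + exp(-x))

def layer_forward(inputs, weights, biases):
    """Each neuron outputs sigmoid(weighted sum of its inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def mlp_forward(inputs, layers):
    """Propagate the inputs through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation

# Hypothetical network with arbitrary weights: 3 inputs -> 2 hidden -> 1 output.
hidden_layer = ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
output_layer = ([[1.2, -0.7]], [0.05])

score = mlp_forward([1.0, 0.0, 0.5], [hidden_layer, output_layer])[0]
print(score)
```

Each inner list of weights corresponds to the arrows entering one neuron. Training the network means adjusting these numbers, which this sketch does not attempt.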

The first step in applying an ANN is training the network, during which its parameters are adjusted. Learning can be classified into two categories, supervised and unsupervised: the former occurs when output variables are provided, whereas the latter does not require a target variable.

Braga et al. (2000, p. 227) point out, as positive aspects that arouse interest in the method, its ability to learn and then generalize, with the possibility of mapping multivariate functions; self-organization; the processing of time series; the possibility of using a large number of input variables; the possibility of using samples; and its characterization as a non-parametric model. They also emphasize that “[...] there is no great need to understand the process itself”. However, the same authors also consider this last aspect to be the main criticism, namely the model's inability to explain how its results are generated. Because of this characteristic, ANNs are also called “black boxes”.

3. METHODOLOGY

The research adopted the case study approach, which, according to Yin (2010), is appropriate for studying contemporary events in a real-life context, when the researcher has little control over them. The case study is of the single-case type with one unit of analysis, a credit union of the SICOOB system, and is structured to investigate the credit union from the standpoint of cooperative theory.

The credit union's database was used to evaluate the performance of the credit analysis system. This database corresponds to the historical data of the analyses of individuals (natural persons) from 2003 to 2007. Because of a change in the information system, data prior to this period could not be collected.

Credit analysis data are highly confidential and strategic, owing to banking secrecy and to the risk of competitors obtaining them, and are therefore very difficult for third parties to obtain. The credit union was thus chosen for its willingness to provide the data.

The credit union currently performs its credit analysis with a SICOOB system application called SisBr. This application holds the information on which management and the board of directors base their decisions to grant or deny credit.

The research followed the steps suggested by Fayyad, Piatetsky-Shapiro, and Smyth (1996) for the knowledge discovery process: data selection; data pre-processing and cleaning; data transformation; data mining; and interpretation and evaluation of the results.
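
To show how these five steps chain together, the sketch below implements each one as a small function and applies them in order. It is a toy illustration under stated assumptions: the records and field names are hypothetical, the "mining" step is a simple majority-class lookup rather than a decision tree or neural network, and the model is scored on the same rows used to build it, which is acceptable only for a demonstration.

```python
from collections import Counter

def select(raw, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: row.get(f) for f in fields} for row in raw]

def clean(rows):
    """Pre-processing and cleaning: drop rows that have a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Transformation: encode the yes/no attribute as 1/0."""
    return [{**r, "late_before": 1 if r["late_before"] == "yes" else 0} for r in rows]

def mine(rows):
    """Data mining: learn the majority status for each value of late_before."""
    counts = {}
    for r in rows:
        counts.setdefault(r["late_before"], Counter())[r["status"]] += 1
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda r: table[r["late_before"]]

def evaluate(model, rows):
    """Interpretation and evaluation: share of rows classified correctly."""
    return sum(model(r) == r["status"] for r in rows) / len(rows)

# Hypothetical raw records (illustrative only); record 4 has a missing value.
raw = [
    {"id": 1, "late_before": "yes", "status": "default"},
    {"id": 2, "late_before": "no",  "status": "paid"},
    {"id": 3, "late_before": "no",  "status": "paid"},
    {"id": 4, "late_before": None,  "status": "paid"},
    {"id": 5, "late_before": "yes", "status": "paid"},
    {"id": 6, "late_before": "yes", "status": "default"},
]

prepared = transform(clean(select(raw, ["late_before", "status"])))
model = mine(prepared)
accuracy = evaluate(model, prepared)
print(len(prepared), accuracy)
```

Keeping each step as a separate function mirrors the staged nature of the process: any stage can be replaced, for instance the mining step by a tree or a neural network, without touching the others.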

To achieve its objectives, this study relied on KDD execution activities, as discussed by Goldschmidt and Passos (2005). Among the available data mining techniques, neural networks and decision trees were used, both of which appear extensively in empirical studies.

Data collection and selection correspond to the process of capturing, organizing, and selecting the data available for the modeling and data mining stage, and therefore require careful examination. Dasu and Johnson (2003) point to the following elements as aids in the analysis of variables: prior experience, knowledge, and the quantity and quality of the data.

3. RESULTS

This part of the paper presents the results of the simulations of the techniques investigated in this article, decision trees and artificial neural networks, as well as the comparative statistical test between the two.

No missing values were found in the database studied. Chart 01 presents the structure of the database that was built, with the variables and their possible values.

Chart 01 - Structure of the database.

The output variable is represented by variables 27 to 39, which correspond to the period from July 2007 to June 2008. Variables 02 to 10 are not part of the credit analysis performed by the credit union; they were added in order to broaden the analysis. These variables were collected from the members' registration records and represent the data the credit union has available.

Variables 11 to 26 are currently used in the credit analysis and represent the member's historical borrowing behavior. Codes 27 to 39 represent the output variable adopted by the research and cover the period from July 2007 to June 2008. These are the data made available by the credit union; earlier data are not available.

The number of variables used in the analysis is in line with that used in other studies. For example, Koh et al. (2006) used 20 variables and Lemos et al. (2005) used 24.

The variable "member code" was discarded because it served only to identify the member during data collection. The variable "risk attributed" was used only in the pre-processing and data-cleaning stage. It was not used in the transformation and modeling stages because it is the output of the model currently employed by the cooperative. The variable "aggregate result" represents the result for the period analyzed (July 2007 to June 2008) and, according to the cooperative's business rules and the objectives of the study, constitutes the target output variable of the model.

Data transformation aims to support the execution of the data mining techniques. Following Goldschmidt and Passos (2005), the data were grouped into a single two-dimensional table.

The data were collected manually, record by record, from two sources at the cooperative: the credit evaluation and the registration records. The study used historical data on 211 members, all natural persons, of whom 22 were in default and 189 in good standing. The data represent the entire population of members who took credit.

Given the imbalance between members in default and members in good standing, bias and overfitting may occur (Chawla, 2005; Horta, Borges, Carvalho, & Alves, 2011). To address this problem, the SMOTE technique (Synthetic Minority Oversampling Technique) (Chawla, 2005) was used to add observations of members in default.

This algorithm is considered one of the most widely used in the literature (Horta et al., 2011). In this way, 110 observations of the minority class, that is, members in default, were created.

The resulting database contained 132 members in default and 189 in good standing, a sample of 321 observations. The database was then randomized so that identical values would not be concentrated in a particular subset during cross-validation, which could lead to overfitting.
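The oversampling and shuffling steps can be illustrated with a minimal sketch. It is not the implementation used in the study: it assumes purely numeric features, whereas most variables in the database are categorical, and the two features, the function name `smote`, and the random seed are hypothetical. Only the class counts (22, 189, and 110 synthetic observations) come from the text.

```python
import math
import random


def smote(minority, per_sample, k=5, seed=0):
    """Create per_sample synthetic points for each minority point.

    Each synthetic point lies on the segment between a minority point and
    one of its k nearest minority neighbours (numeric features assumed).
    """
    rng = random.Random(seed)
    synthetic = []
    for base in minority:
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> neighbour
            synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical numeric features (for example age and capital) for each member.
rng = random.Random(42)
defaulters = [[rng.uniform(18, 70), rng.uniform(0, 5000)] for _ in range(22)]
good_standing = [[rng.uniform(18, 70), rng.uniform(0, 5000)] for _ in range(189)]

new_rows = smote(defaulters, per_sample=5)  # 22 x 5 = 110 synthetic rows
dataset = (
    [(row, "default") for row in defaulters + new_rows]
    + [(row, "good standing") for row in good_standing]
)
rng.shuffle(dataset)  # randomize the order before cross-validation

print(len(defaulters) + len(new_rows), len(good_standing), len(dataset))
# prints: 132 189 321
```

Generating five synthetic rows per original defaulter reproduces the 110 added observations; how the categorical variables were interpolated in the study is not stated in the text.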

For the computational implementation of the decision tree and neural network techniques, the database described in this study was used, considering for each member the variables described above. As an illustration, one model generated by the decision tree technique was chosen and converted into decision rules in the post-processing phase. The public-domain software tool WEKA (Waikato Environment for Knowledge Analysis) was used.

Goldschmidt and Passos (2005, p. 50) argue that, for a more reliable evaluation of the knowledge model, "[...] the data used to construct the model should not be the same as those used to evaluate it". The same authors state that there should be at least two partitions: the training partition and the test partition. The first contains the data used to construct the model, and the second the data used to evaluate it.

Dividing the data set serves to simplify, summarize, and reduce the variability and size of the database, resulting in the selection of more sophisticated and accurate models (Dasu & Johnson, 2003).

In this study, for a more impartial evaluation of both the decision tree and the ANN, K-fold cross-validation was used. According to Goldschmidt and Passos (2005, p. 51), in this method the database of N elements is randomly divided into K disjoint subsets; "each of the K subsets is used as the test set and the remaining (K-1) subsets are combined into a training set. The process is repeated K times, and K models are generated and evaluated [...]". The data were divided into ten sets and the procedure was repeated in ten simulations, as proposed by Witten and Frank (2005). Cross-validation is found in several credit analysis studies (Akkoç, 2012; Chang & Yeh, 2012; Han et al., 2013).
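The partitioning scheme described above, ten folds repeated ten times, can be sketched as follows. This is an illustrative, unstratified version under stated assumptions: the function name `repeated_k_fold` and the seed are hypothetical, and WEKA's own splitting procedure may differ in detail.

```python
import random


def repeated_k_fold(n, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated K-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh random partition for each repetition
        folds = [indices[i::k] for i in range(k)]  # K disjoint subsets
        for f, test_idx in enumerate(folds):
            # the other K-1 subsets are merged into the training set
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, test_idx


splits = list(repeated_k_fold(n=321))
print(len(splits))  # prints: 100 (10 folds x 10 repetitions)
```

Each of the 321 observations is used for testing exactly once per repetition, so every model is evaluated on data that played no part in building it.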

For this study, a multilayer ANN, the multilayer perceptron (MLP), was used with the backpropagation learning algorithm. The input layer had 66 neurons, the hidden layer 2, and the output layer 2. A learning rate of 0.01 was used in all tests, because the simulations showed that it gave the best classification and because it was also used by Lemos et al. (2005). Following Lemos et al. (2005), no momentum term was used; moreover, adding momentum did not improve classification performance.
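To make the configuration concrete, the sketch below builds a 66-2-2 network and applies plain backpropagation updates with a learning rate of 0.01 and no momentum. It is only an illustration: sigmoid activations, the weight initialization range, the squared-error loss, and the single training example are assumptions, not details given in the text, and the helper names are hypothetical.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 66, 2, 2  # layer sizes reported in the text
LEARNING_RATE = 0.01                     # learning rate reported in the text


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # one weight row per neuron; the last entry of each row is the bias
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layer, inputs):
    return [sigmoid(sum(w * x for w, x in zip(neuron, inputs + [1.0])))
            for neuron in layer]


def train_step(hidden, output, x, target):
    """One backpropagation update without momentum.

    Returns the squared error measured before the weights are changed.
    """
    h = forward(hidden, x)
    y = forward(output, h)
    delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, y)]
    # hidden deltas use the output weights as they were before the update
    delta_hid = [
        hj * (1 - hj) * sum(delta_out[k] * output[k][j] for k in range(len(output)))
        for j, hj in enumerate(h)
    ]
    for k, neuron in enumerate(output):
        for j, value in enumerate(h + [1.0]):
            neuron[j] += LEARNING_RATE * delta_out[k] * value
    for j, neuron in enumerate(hidden):
        for i, value in enumerate(x + [1.0]):
            neuron[i] += LEARNING_RATE * delta_hid[j] * value
    return sum((t - o) ** 2 for t, o in zip(target, y))


rng = random.Random(0)
hidden = init_layer(N_INPUT, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_OUTPUT, rng)

# Hypothetical member encoded as 66 binary inputs; target [1, 0] = good standing.
x = [float(rng.random() < 0.3) for _ in range(N_INPUT)]
target = [1.0, 0.0]

errors = [train_step(hidden, output, x, target) for _ in range(200)]
print(f"squared error: first step {errors[0]:.4f}, last step {errors[-1]:.4f}")
```

The supervised character of the training is visible in `train_step`: every update needs both the input vector and the known output for that member.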

The ANN's learning is supervised. Ferreira (2005, p. 37) describes this type as follows: "[...] the network is trained by supplying the input values and the respective output values [...]".

For the comparative evaluation of the models, the measure used was the overall percentage of correctly predicted values, compared by means of the corrected resampled t-test at a two-tailed significance level of 0.05 (5%) with nine degrees of freedom, as proposed by Witten and Frank (2005), according to Formula 1.
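Formula 1 is not reproduced in this copy of the text. As a hedged reconstruction, the corrected resampled t statistic is usually written in the form below. The symbols are assumptions taken from the usual textbook presentation rather than from the original formula: $\bar{d}$ is the mean of the $k$ paired differences in accuracy between the two models, $\sigma_d^2$ is the variance of those differences, and $n_1$ and $n_2$ are the numbers of training and test instances in each run.

```latex
t = \frac{\bar{d}}{\sqrt{\left(\dfrac{1}{k} + \dfrac{n_2}{n_1}\right)\sigma_d^2}}
```

The term $n_2/n_1$ is what distinguishes this statistic from the standard paired t-test: it inflates the variance estimate to account for the overlap between training sets across runs. The statistic is compared with a Student t distribution with $k - 1$ degrees of freedom.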

3.1 Decision Tree

This study used J4.8, the WEKA implementation of the C4.5 decision tree algorithm. According to Goldschmidt and Passos (2005), this tree is widely used and accepted. One model generated by the decision tree technique was taken to illustrate the rules and the confusion matrix. The model produced 41 leaves, that is, sets of if-then decision rules. Some rules from the first set are presented below.

If Liquidity guarantee = high liquidity guarantee (sale in less than 6 months) then default.
If Liquidity guarantee = moderate liquidity guarantee (sale in 6 to 12 months) and Level of commitment = up to 20% of average net income then good standing.
If Liquidity guarantee = moderate liquidity guarantee (sale in 6 to 12 months) and Level of commitment = from 20% to 30% of average net income then good standing.
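The three rules above can be written as a small function, shown here as a sketch. It uses the numeric codes of variables 18 (liquidity guarantee) and 21 (level of commitment) from Chart 01; the function name `classify` is hypothetical, and the function covers only these three rules, not the full 41-leaf tree.

```python
def classify(liquidity, commitment):
    """Apply the three example rules; return None when none of them fires.

    liquidity:  1 = high, 2 = moderate, 3 = personal or low (variable 18)
    commitment: 1 = up to 20%, 2 = 20% to 30%, 3 = more than 30% (variable 21)
    """
    if liquidity == 1:
        return "default"
    if liquidity == 2 and commitment in (1, 2):
        return "good standing"
    return None  # handled by other branches of the tree, not listed here


print(classify(1, 3))  # prints: default
print(classify(2, 1))  # prints: good standing
```

Rules in this form are what make the decision tree easy for a credit analyst to read and audit.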

Chart 02 shows the confusion matrix generated on the test set for the tree evaluated. The matrix cross-tabulates the instances by predicted and actual class so that the types of correct and incorrect classifications can be assessed. The main diagonal contains the correctly classified instances. Values are expressed in absolute terms.

Chart 02: Confusion matrix of a model developed using the decision tree method.

On this set, the model based on the C4.5 decision tree algorithm classified 302 records correctly, a hit rate of 94.08%, and 19 observations incorrectly, or 5.92%.

3.2 Neural Network

As discussed in the theoretical framework, the ANN in this study comprised three layers: input, hidden, and output. The network's learning was supervised, since the output values of the model were provided.

As with the decision tree, one model was taken to illustrate the results generated by the WEKA software. Chart 03 presents the confusion matrix for one model developed.

Chart 03: Confusion matrix of a model developed using the ANN method.

This ANN-based model, using the multilayer perceptron algorithm, classified 294 records correctly, a hit rate of 91.59%, and 27 observations incorrectly, or 8.41%.
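The hit rates reported for the two example models can be recomputed from their confusion matrices, as in the short sketch below. The matrices are those of Charts 02 and 03, with rows for the actual class and columns for the predicted class, in the order default, good standing; the helper names `accuracy` and `recall` are hypothetical.

```python
def accuracy(matrix):
    """Share of instances on the main diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, cls):
    """Share of the actual instances of class cls that were predicted as cls."""
    return matrix[cls][cls] / sum(matrix[cls])


tree = [[121, 11], [8, 181]]   # Chart 02
mlp = [[118, 14], [13, 176]]   # Chart 03

for name, m in (("C4.5", tree), ("MLP", mlp)):
    print(f"{name}: {accuracy(m):.2%} correct, {1 - accuracy(m):.2%} incorrect")
# prints: C4.5: 94.08% correct, 5.92% incorrect
# prints: MLP: 91.59% correct, 8.41% incorrect
```

Calling `recall(tree, 0)` or `recall(mlp, 0)` gives the share of actual defaulters each model identified, which separates the two types of error that the overall hit rate hides.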

3.3 Evaluation of the Models

This section compares the models developed in this study: the ANN (multilayer perceptron algorithm) and the decision tree (C4.5 algorithm). The comparison is based on the overall percentage of correctly predicted values.

Table 01 presents the percentage of correctly predicted values and the respective standard deviation for the simulations run with the models under study.

Table 01: Comparative evaluation.

The simulations run in WEKA's Experimenter tool indicate that the decision tree technique, implemented with the C4.5 algorithm, is statistically similar, at a two-tailed significance level of 0.05, to the ANN implemented with the multilayer perceptron algorithm. Figure 02 presents the WEKA output for the simulations of the problem described above.
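A comparison of this kind can be sketched in a few lines. The example below computes a corrected resampled t statistic from paired accuracy differences and compares it with 2.262, the two-tailed 5% critical value of Student's t with nine degrees of freedom. The ten differences are hypothetical illustrative numbers, not results from the study, and the training and test sizes (289 and 32) are an assumed approximation of a ten-fold split of 321 observations.

```python
import math
import statistics


def corrected_resampled_t(differences, n_train, n_test):
    """t = mean(d) / sqrt((1/k + n_test/n_train) * var(d)), k = len(d)."""
    k = len(differences)
    mean_d = statistics.mean(differences)
    var_d = statistics.variance(differences)  # sample variance
    return mean_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)


# Hypothetical differences in accuracy (percentage points), C4.5 minus MLP.
diffs = [1.5, -0.5, 3.0, 0.0, 2.5, -1.0, 4.0, 1.0, 0.5, 3.9]

t = corrected_resampled_t(diffs, n_train=289, n_test=32)
CRITICAL_T = 2.262  # two-tailed, alpha = 0.05, 9 degrees of freedom

verdict = "significantly different" if abs(t) > CRITICAL_T else "statistically similar"
print(f"t = {t:.3f}: {verdict}")
```

With these illustrative numbers the statistic stays below the critical value, which is the situation the text reports for the two models: a difference in mean accuracy that is not large enough, relative to its variability, to be called significant.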

Figure 02: Output box of the simulations in WEKA's Experimenter tool.

The performance of the decision tree on the present problem was better than that found by Yap et al. (2011), who obtained an error rate of 28.1%.

The study by Lemos et al. (2005), although it did not perform a statistical test, found a higher hit rate for the neural network than for the decision tree. Despite the statistical similarity, decision trees are considered easy to use (Lemos et al., 2005).

The results indicate that the data mining classification models developed can be useful in the cooperative's credit evaluation and thereby improve performance, as already found in analyses of a microcredit organization (Cubiles-De-La-Vega et al., 2013) and a credit cooperative (Zhu et al., 2013).

4. CONCLUSIONS

The objective of this study was to develop and evaluate data mining models to classify and predict members' behavior in settling the commitments they have assumed. The models were developed using decision trees and ANNs, both of which are data mining techniques.

The data preparation and modeling process followed the steps suggested in the literature: data selection; data pre-processing and cleaning; data transformation; data mining; and interpretation and evaluation of results. The data were partitioned into training and test sets.

Although the decision tree achieved 97.07% correct classification in the simulations and the ANN 95.58%, the C4.5 decision tree algorithm obtained a result statistically similar to that of the model based on the multilayer perceptron artificial neural network.

The knowledge discovery process and the use of the data mining models developed can bring practical advantages to the cooperative. Understanding the variables and their relationships can help to better classify and predict members' behavior.

A deeper evaluation of the variables can also help to include variables that are important and exclude those that prove irrelevant, yielding more concise and precise credit management models that save execution time and improve the accuracy of decisions. The analysis of outliers, that is, cases that fall outside the pattern, may be relevant for forming a new classification or, conversely, may reveal undesired patterns.

The limitations of this work include the following: the absence of databases from other cooperatives with which to compare, evaluate, and validate the model; limitations of the cooperative's information system, which make it impossible to collect data on the input variables prior to 2003 and restrict the data on the output variable to only six months back; and the lack of integration between some modules of the database and electronic spreadsheets.

The following are suggested for future studies: the use of different databases to validate the credit analysis model; the use of other data mining techniques; the use of hybrid models that combine different techniques to improve classification and prediction performance; investment analyses that evaluate the type of error and the financial impact of the model on the cooperative's profitability; and the evaluation of outliers, especially for the capital variable, to verify the existence of new patterns and classifications.

1 Classification and regression trees.
2 Quick, unbiased, efficient statistical tree.
3 Chi-squared automatic interaction detector.

Publication Dates

Publication in this collection
Aug 2014

History

Received
17 July 2012
Accepted
21 Mar 2014

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

[1] Abellán, J., & Mantas, C. J. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 41(8), 3825–3830.

[2] Akkoç, S. (2012). An empirical comparison of conventional techniques, neural networks and the three stage hybrid Adaptive Neuro Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data. European Journal of Operational Research, 222(1), 168–178.

[3] Berry, M. J. A., & Linoff, G. (2004). Data mining techniques: For marketing, sales and customer relationship management (2nd ed.). Indianapolis: Wiley Publishing.

[4] Bhattacharyya, S., Jha, S., Tharakunnel, K., & Christopher, J. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3), 602–613.

[5] Braga, A. de P., Carvalho, A. P. de L. F., & Ludermir, T. B. (2000). Redes neurais artificiais: Teoria e aplicações. Rio de Janeiro: LTC.

[6] Chaia, A. J. (2003). Modelos de gestão de risco de crédito e sua aplicabilidade ao mercado brasileiro. Dissertação de Mestrado. FEA/USP.

[7] Chang, S.-Y., & Yeh, T.-Y. (2012). An artificial immune classifier for credit scoring analysis. Applied Soft Computing, 12(2), 611–618.

[8] Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 853–867). New Jersey: Springer.

[9] Chen, S. C., & Huang, M. Y. (2011). Constructing credit auditing and control & management model with data mining technique. Expert Systems with Applications, 38, 5359–5365.

[10] Crone, S. F., & Finlay, S. (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting, 28(1), 224–238.

[11] Cubiles-De-La-Vega, M.-D., Blanco-Oliver, A., Pino-Mejías, R., & Lara-Rubio, J. (2013). Improving the management of microfinance institutions by using credit scoring models based on Statistical Learning techniques. Expert Systems with Applications, 40(17), 6910–6917.

[12] Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. New Jersey: John Wiley & Sons.

[13] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.

[14] Ferreira, J. B. (2005). Mineração de dados na retenção de clientes em telefonia celular. Dissertação de Mestrado. PUC-RIO.

[15] García, V., Marqués, A. I., & Sánchez, J. S. (2012). On the use of data filtering techniques for credit risk prediction with instance-based models. Expert Systems with Applications, 39(18), 13267–13276.

[16] Gehrke, J. (2003). Decision tree. In The handbook of data mining (pp. 3–23). New Jersey: Lawrence Erlbaum Associates.

[17] Goldschmidt, R., & Passos, E. (2005). Data mining: Um guia prático. Rio de Janeiro: Elsevier.

[18] Han, L., Han, L., & Zhao, H. (2013). Orthogonal support vector machine for credit scoring. Engineering Applications of Artificial Intelligence, 26(2), 848–862.

[19] Horta, R. A. M., Borges, C. C. H., Carvalho, F. A. A., & Alves, F. J. S. (2011). Previsão de insolvência: Uma estratégia para balanceamento da base de dados utilizando variáveis contábeis de empresas brasileiras. Sociedade, Contabilidade e Gestão, 6(2), 21–36.

[20] Ju, Y. H., & Sohn, S. Y. (2014). Updating a credit-scoring model based on new attributes without realization of actual data. European Journal of Operational Research, 234(1), 119–126.

[21] Khatchatourian, O., & Treter, J. (2010). Aplicação da lógica fuzzy para avaliação econômico-financeira de cooperativas de produção. Revista de Gestão da Tecnologia e Sistemas de Informação, 7(1), 141–162.

[22] Koh, H. C., Tan, W. C., & Goh, C. P. (2006). A two-step method to construct credit scoring models with data mining techniques. International Journal of Business and Information, 1(1), 96–118.

[23] Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications, 40(13), 5125–5131.

[24] Lai, K. K., Yu, L., Wang, S., & Zhou, L. (2006). Credit risk analysis using a reliability-based neural network ensemble model. In Artificial Neural Networks-ICANN 2006 (pp. 682–690). Springer Berlin Heidelberg.

[25] Larose, T. D. (2005). Discovering knowledge in data: An introduction to data mining. New Jersey: John Wiley & Sons.

[26] Lemos, E. P., Steiner, M. T. A., & Nievola, J. C. (2005). Análise de crédito bancário por meio de redes neurais e árvore de decisão: Uma aplicação simples de data mining. Revista de Administração da Universidade de São Paulo, 40(3), 225–234.

[27] Majeske, K. D., & Lauer, T. W. (2013). The bank loan approval decision from multiple perspectives. Expert Systems with Applications, 40(5), 1591–1598.

[28] Marqués, A. I., García, V., & Sánchez, J. S. (2012). Two-level classifier ensembles for credit risk assessment. Expert Systems with Applications, 39(12), 10916–10922.

[29] Mester, L. J. (1997). What's the point of credit scoring? Business Review, 3, 3–16.

[30] Nie, G., Rowe, W., Zhang, L., Tian, Y., & Shi, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38(12), 15273–15285.

[31] OCB. (2014). Organização das Cooperativas Brasileiras. Números. Retrieved February 20, 2014, from http://www.ocb.org.br/site/ramos/credito_numeros.asp

[32] Oliveira, D. P. R. (2001). Manual de gestão de cooperativas: Uma abordagem prática. São Paulo: Atlas.

[33] Oreski, S., & Oreski, G. (2014). Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Systems with Applications, 41(4), 2052–2064.

[34] Pidd, M. (1998). Modelagem empresarial: Ferramentas para tomada de decisão. São Paulo: Atlas.

[35] Pinho, D. B. (1982). O pensamento cooperativo e o cooperativismo brasileiro. CNPq/BNCC.

[36] Pinho, D. B. (2004). O cooperativismo no Brasil: Da vertente pioneira à vertente solidária. São Paulo: Saraiva.

[37] Portal do Cooperativismo de Crédito. (2014). Dados consolidados dos sistemas cooperativos. Retrieved February 20, 2014, from http://cooperativismodecredito.coop.br/cenario-brasileiro/dados-consolidados=_dos-sistemas-cooperativos/

[38] Saberi, M., Mirtalaie, M. S., Hussain, F. K., Azadeh, A., Hussain, O. K., & Ashjari, B. (2013). A granular computing-based approach to credit scoring modeling. Neurocomputing, 122(25), 100–115.

[39] Wang, G., Ma, J., Huang, L., & Xu, K. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge-Based Systems, 26, 61–68.

[40] Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Elsevier.

[41] Xiong, T., Wang, S., Mayers, A., & Monga, E. (2013). Personal bankruptcy prediction by mining credit card data. Expert Systems with Applications, 40(2), 665–676.

[42] Yap, B. W., Ong, S. H., & Husain, N. H. M. (2011). Using data mining to improve assessment of credit worthiness. Expert Systems with Applications, 38(10), 13274–13283.

[43] Yin, R. K. (2010). Estudo de caso: Planejamento e métodos (4th ed.). Porto Alegre: Bookman.

[44] Zhong, H., Miao, C., Shen, Z., & Feng, Y. (2014). Comparing the learning effectiveness of BP, ELM, I-ELM, and SVM for corporate credit ratings. Neurocomputing, 128(27), 285–295.

[45] Zhou, X., Jiang, W., Shi, Y., & Tian, Y. (2011). Credit risk evaluation with kernel-based affine subspace nearest points learning method. Expert Systems with Applications, 38(4), 4272–4279.

[46] Zhu, X., Li, J., Wu, D., Wang, H., & Liang, C. (2013). Balancing accuracy, complexity and interpretability in consumer credit decision making: A C-TOPSIS classification approach. Knowledge-Based Systems, 52, 258–267.

Chart 01: Structure of the database.
CODE	VARIABLE	VALUE
01	Member code	Numerical key, unique for each member
02	Gender	1= male
02	Gender	2= female
03	Age	Numerical value
04	Level of education	1= graduate
		2= complete tertiary
		3= incomplete tertiary
		4= complete secondary
		5= incomplete secondary
		6= complete primary
		7= incomplete primary
05	City	Name of city of residence
06	Birthplace	Name of city of birth
07	Place of residence	1= urban area
07	Place of residence	2= rural area
08	Primary work/activity	Name of work/activity
09	Marital status	1= Married
		2= Single
		3= Widowed
		4= Legally divorced
		5= Other
10	Capital	Numerical value
11	Relationship	1= with the cooperative for more than 3 years
		2= with the cooperative for 1 to 3 years
		3= with the cooperative for less than 1 year
12	Transaction conduct	1= normal
		2= occasional delays
		3= constant delays/renegotiations
13	Years of experience in activity/work	1= more than 5 years
		2= from 3 to 5 years
		3= less than 3 years
14	Records check	1= no restrictions
		2= justified, irrelevant restrictions
		3= relevant or unjustified, irrelevant restrictions
15	Record information at the cooperative	1= up-to-date and reliable record
		2= up-to-date and unreliable record
		3= information is not up-to-date or is missing
16	Purpose of the operation	1= support and investment
		2= financing of assets
		3= personal credit/automatic loan
		4= debt renewal/composition
17	Operation guarantees	1= mortgage—social capital
		2= chattel mortgage/warrants
		3= pledge/collateral
		4= personal
18	Liquidity guarantee	1= high liquidity guarantee (sale in less than 6 months)
		2= moderate liquidity guarantee (sale in 6 to 12 months)
		3= personal or low liquidity guarantee (sale in more than 12 months)
19	Frequency with which the member performs (active) transactions	1= never
		2= frequently
		3= permanently
20	Operation value	1= up to 1% of adjusted net worth (ANW)
		2= from 1.01% to 2% of ANW
		3= from 2.01% to 3% of ANW
		4= more than 3% of ANW
21	Level of commitment—installments on member's net income	1= up to 20% of average net income
		2= from 20%-30% of average net income
		3= more than 30% of net income
22	Personal net assets minus total liability	1= more than 4 times
		2= from 2 to 4 times
		3= no personal equity or less than 2 times
23	Total liability in relation to annual net income	1= less than 2 times
		2= from 2 to 4 times
		3= more than 4 times
24	Total liability in relation to paid-in capital	1= less than 4 times
		2= from 4 to 8 times
		3= from 8 to 12 times
		4= more than 12 times
25	Profile of member's economical activity	1= excellent
		2= good
		3= regular
		4= poor
26	Risk attributed by the cooperative	1= AA
		2= A
		3= B
		4= C
		5= D
		6= E
		7= F
		8= G
		9= H
27	Result—July 2007	1= good standing
27	Result—July 2007	2= default
28	Result—August 2007	1= good standing
28	Result—August 2007	2= default
29	Result—September 2007	1= good standing
29	Result—September 2007	2= default
30	Result—October 2007	1= good standing
30	Result—October 2007	2= default
31	Result—November 2007	1= good standing
31	Result—November 2007	2= default
32	Result—December 2007	1= good standing
32	Result—December 2007	2= default
33	Result—January 2008	1= good standing
33	Result—January 2008	2= default
34	Result—February 2008	1= good standing
34	Result—February 2008	2= default
35	Result—March 2008	1= good standing
35	Result—March 2008	2= default
36	Result—April 2008	1= good standing
36	Result—April 2008	2= default
37	Result—May 2008	1= good standing
37	Result—May 2008	2= default
38	Result—June 2008	1= good standing
38	Result—June 2008	2= default
39	Aggregate result	1= good standing
39	Aggregate result	2= default

Chart 02: Confusion matrix of a model developed using the decision tree method.
Actual class	Predicted: Default	Predicted: Good standing
Default	121	11
Good standing	8	181

Chart 03: Confusion matrix of a model developed using the ANN method.
Actual class	Predicted: Default	Predicted: Good standing
Default	118	14
Good standing	13	176

Table 01: Comparative evaluation.
Algorithm	C4.5	MLP
Correct percentage	97.07%	95.58%
Standard deviation	2.76	3.47