Decision Tree
The decision tree owes its name to its tree-like appearance (Han & Kamber, 2006). It is built from root, decision, and leaf nodes, which represent the questions, and from branches, which represent the answers (Larose, 2005). The algorithms proceed in three steps (Groth, 2000): i) definition of the dependent and independent variables from a data source; ii) examination of the impact of each variable on the result; and iii) definition of the variable that predicts the results of the other variables. The algorithms suggested by Jurka et al. (2013) were BAGGING (Breiman, 1996), RF (Liaw & Wiener, 2002) and TREE (Breiman et al., 1984).
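As an illustration of this node-and-branch structure, a minimal sketch in Python (not RTextTools, which the cited authors use in R): the tree below is hard-coded rather than learned from data, and all feature names, thresholds, and class labels are hypothetical.

```python
# A decision tree: internal nodes hold questions (a feature test), branches
# hold the answers, and leaves hold class labels.

def classify(tree, sample):
    """Walk from the root node, answering each question, until a leaf is reached."""
    while isinstance(tree, dict):
        branch = "left" if sample[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree  # a leaf: a plain class label

# Root node asks about "word_count"; the left branch asks a second question.
tree = {
    "feature": "word_count", "threshold": 100,
    "left": {"feature": "exclamations", "threshold": 2,
             "left": "ham", "right": "spam"},
    "right": "ham",
}

print(classify(tree, {"word_count": 50, "exclamations": 5}))  # spam
```

In a learned tree, steps (i)–(iii) above would choose each node's feature and threshold automatically; here they are fixed to keep the structure visible.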
Bayesian Classification
Bayesian classifiers predict the probability that the data belong to given classes. The technique is based on Bayes' theorem and assumes that the attributes are independent of one another given the class (Han & Kamber, 2006). Jurka et al. (2013) suggest the use of the BOOSTING algorithm (Freund & Schapire, 1997).
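A minimal naive Bayes sketch in Python may help make the independence assumption concrete: the posterior of each class is the prior times the product of per-word likelihoods, as if the words occurred independently. The toy corpus, labels, and smoothing choice below are all illustrative assumptions, not drawn from the cited works.

```python
import math
from collections import Counter

# Hypothetical toy corpus of labeled documents.
docs = [("buy cheap pills now", "spam"),
        ("meeting agenda attached", "ham"),
        ("cheap pills cheap", "spam"),
        ("project meeting tomorrow", "ham")]

classes = {"spam", "ham"}
priors = {c: sum(1 for _, l in docs if l == c) / len(docs) for c in classes}
counts = {c: Counter(w for t, l in docs if l == c for w in t.split()) for c in classes}
vocab = {w for t, _ in docs for w in t.split()}

def log_posterior(text, c):
    """log P(c) + sum over words of log P(w | c), assuming independence given c."""
    total = sum(counts[c].values())
    score = math.log(priors[c])
    for w in text.split():
        # Laplace (add-one) smoothing avoids zero probabilities for unseen words.
        score += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return score

def predict(text):
    return max(classes, key=lambda c: log_posterior(text, c))

print(predict("cheap pills"))  # spam
```

Working in log space avoids numeric underflow when many word probabilities are multiplied.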
Neural Networks
This technique simulates the functioning of the human brain, with its large number of interconnected neurons, enabling learning based on experience (Larose, 2005). There are at least two types of neural networks: perceptron and multilayer (Tan et al., 2009). The SLDA algorithm (Blei & McAuliffe, 2010) is the one suggested by Jurka et al. (2013) for classification with this technique.
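To illustrate the simpler of the two network types, a single-layer perceptron sketch in Python: it learns from experience by nudging its weights after each mistake. The training data (the logical AND function), learning rate, and epoch count are illustrative assumptions.

```python
# A single-layer perceptron: one artificial neuron with two inputs, a bias,
# and a step activation. Weights are corrected whenever a prediction is wrong.

def train_perceptron(samples, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred          # 0 when correct, +/-1 when wrong
            w[0] += lr * err * x1       # error-driven weight updates
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Logical AND is linearly separable, so a single perceptron can learn it.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in data]
print(preds)  # [0, 0, 0, 1]
```

A multilayer network stacks such neurons with non-linear activations, which is what allows it to learn patterns a single perceptron cannot.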
SVM
The support vector machine maps the training data into a higher dimension, where it searches for an optimal separating hyperplane, i.e., the one with the greatest distance between the different classes (Feldman & Sanger, 2007; Han & Kamber, 2006). The SVM algorithm (Fan et al., 2005) is the one indicated by Jurka et al. (2013).
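The "optimal hyperplane" idea can be sketched without a full solver: among hyperplanes that separate the classes, the SVM objective prefers the one whose nearest training point is farthest away (the maximum margin). The 2-D points and candidate hyperplanes below are illustrative assumptions; a real SVM optimizes over all hyperplanes rather than a short list.

```python
import math

# Two hypothetical classes of 2-D points.
pos = [(2.0, 2.0), (3.0, 3.0)]
neg = [(0.0, 0.0), (1.0, 0.0)]

def margin(w, b):
    """Smallest distance from any training point to the hyperplane w.x + b = 0."""
    norm = math.hypot(*w)
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in pos + neg)

def separates(w, b):
    """True if the hyperplane puts the two classes on opposite sides."""
    return (all(w[0] * x + w[1] * y + b > 0 for x, y in pos)
            and all(w[0] * x + w[1] * y + b < 0 for x, y in neg))

# Among separating candidates (w, b), pick the one with the widest margin.
candidates = [((1.0, 0.0), -1.5), ((1.0, 1.0), -3.0), ((0.0, 1.0), -1.0)]
best = max((c for c in candidates if separates(*c)), key=lambda c: margin(*c))
print(best)
```

The kernel trick mentioned in the text replaces the dot product so that this same linear search happens implicitly in a higher-dimensional space.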
Logistic Regression
Logistic regression is a special type of regression in which the dependent variable is categorical (Groth, 2000). For binary classes, probabilities greater than 50% indicate membership in class "1" and probabilities below 50% indicate membership in class "0" (Fuller et al., 2011). Jurka et al. (2013) suggest the use of the GLMNET algorithm (Friedman et al., 2010).
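The 50% decision rule can be shown with a short sketch: the linear predictor is pushed through the sigmoid to yield a probability, which is then thresholded. The intercept and coefficient below are hypothetical, fixed values; in practice they are estimated from data (GLMNET does so with regularization).

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-fitted model with a single predictor x.
intercept, coef = -4.0, 1.5

def predict(x):
    p = sigmoid(intercept + coef * x)
    # Probabilities above 50% map to class 1, below 50% to class 0.
    return (1 if p > 0.5 else 0), p

label, p = predict(3.5)
print(label)  # class 1 (probability above 0.5)
```

With these coefficients the decision boundary sits where intercept + coef * x = 0, i.e., at x ≈ 2.67; smaller inputs fall into class "0".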