A Survey on Feature Extraction Techniques, Classification Methods and Applications of Sentiment Analysis

Abstract

Rapid developments in the era of IoT technologies, coupled with the adoption of social media tools and applications, have promoted the use of data analytics as a means to gain significant insights from unstructured data. Sentiment analysis is an approach that identifies data polarity to classify a text as positive, neutral, or negative. Also referred to as opinion mining or subjective mining, sentiment analysis has applications that range from marketing and customer service to clinical medicine. The application of sentiment analysis in the epoch of big data has proved invaluable in classifying sentiment and, in general, determining opinions from the average person’s frame of mind. Several sentiment analysis techniques have been developed over the years. In this regard, this article presents a brief survey of sentiment analysis applications, as well as feature extraction and sentiment classification techniques. The survey of feature extraction techniques concludes that each technique has its own pros and cons, and that techniques can be combined for better results. The survey of classification methods suggests that hybrid methods provide better results than individual ones. The survey of applications suggests that sentiment analysis, as applied to different sectors, helps expand business opportunities. The paper also presents a few open challenges in carrying out sentiment analysis.

Keywords:
Sentiment analysis; big data; social media; feature extraction; sentiment classification; application

HIGHLIGHTS

• Surveyed various feature extraction and classification techniques that can be used for sentiment analysis.

• Surveyed various applications of sentiment analysis.

• Surveyed various issues and challenges.

INTRODUCTION

Sentiment analysis (SA), an application of natural language processing (NLP), is a technique that determines the emotional context of a piece of text. It is commonly referred to as opinion mining [1]. This method is valuable for businesses as it allows them to identify and categorize customer opinions on their products, services, or ideas. In addition to identifying sentiment, sentiment mining can extract polarity, subject, and opinion holder information from text. Sentiment analysis can be applied at various levels, including sub-sentence, sentence, paragraph, and document levels.

SA has various applications and benefits many fields and industries, including marketing and advertising, customer service, politics, healthcare, finance, and social sciences. It allows businesses, organizations, and governments to understand public sentiment and make informed decisions based on consumer behavior, brand reputation, political issues, patient feedback, market trends, and societal attitudes.

SA comes in a variety of forms: intent analysis, emotion detection SA, fine-grained SA, and aspect-based SA. Fine-grained SA determines the polarity of an opinion, which may require only a binary distinction between positive and negative sentiment. A classic finer-grained example would be a rating along the lines of very good, good, average, bad, and very bad, as in a typical five-star Amazon review. Emotion detection SA allows for the detection of emotions such as anger, happiness, anxiety, frustration, and sadness. Aspect-based SA designates a viewpoint on a particular product feature, such as the camera quality of a particular phone. Consumer assistance systems frequently employ intent analysis to ascertain the type of intention indicated in a message.

SA is possible at four different levels: the concept, aspect, document, and sentence levels. In document-level SA, the whole document is analysed and its polarity ascertained [2, 3]. The goal of sentence-level SA, also termed subjectivity classification [4], is to classify the opinions voiced in every sentence. Aspect-level (also known as entity-level, phrase-level, or feature-based) SA identifies constructs and devotes adequate attention to the opinions/sentiments articulated [3]. Concept-level SA examines concepts that do not overtly convey emotion and is focused on the semantic evaluation of content [5]. It is the next level of understanding emotions in feedback data.

Two principal SA techniques are the subjective lexicon and machine learning (ML) approaches. The lexicon approach is further classified as Dictionary-based (DB) and Corpus-based (CB). In turn, the CB approach is further classified as statistical and semantic. The ML approach is, likewise, further classified into supervised and unsupervised learning. Supervised learning includes the Maximum Entropy (ME), Neural Network (NN), Naïve Bayes (NB), and Support Vector Machine (SVM) classifiers.

SA undertakes sub-tasks like data collection, preprocessing, feature extraction, and classification. The data is taken from a slew of open-source datasets, as well as from Twitter, Facebook, and other social media. Preprocessing, the next step, cleans the data and readies it to be fed to the model. This is accomplished by a series of steps like eliminating unnecessary characters, tokenization, capitalization/de-capitalization, removing stopwords, lemmatizing, stemming, and, finally, correcting spelling and grammar.
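The preprocessing steps described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the stopword list and the `preprocess` function are assumptions made for the example, and lemmatizing, stemming, and spelling correction are left out because they normally rely on external resources.

```python
import re

# Tiny illustrative stopword list (an assumption; real pipelines use much larger lists).
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "to", "of"}

def preprocess(text):
    """Lowercase, strip URLs and non-letter characters, tokenize, drop stopwords."""
    text = text.lower()                          # de-capitalization
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)        # remove punctuation, digits, symbols
    tokens = text.split()                        # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The camera is GREAT!!! See https://example.com :)")
print(tokens)  # ['camera', 'great', 'see']
```

The resulting token list is what a feature extraction step would typically receive as input.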

This paper briefly reviews applications and recent trends in SA and related areas. Further, it presents the process and methodology of SA, discusses feature extraction and classification techniques, and examines issues and challenges.

Motivation and Justification

The motivation for the survey is to help researchers with their work on SA and related areas, and it is hoped that this paper will prove to be an invaluable reference in those directions. The paper provides insights into SA techniques that can be applied together. SA is particularly important in industries such as marketing, customer service, and public opinion research, where understanding customer opinion is critical for success. In addition, carrying out SA on raw data highlights the opinion of the general public on reviews, products, and brands. Such exchanges give businesses useful insights into customers’ perceptions of the brand and allow them to make dynamic trade decisions to sustain their image. These factors justify the undertaking of the present survey on SA.

Surveying feature extraction and classification techniques yields new features that will be a linear combination of the existing ones. Such a combination of two or more methods helps overcome the individual drawbacks of a single method. The survey on SA applications will help business professionals beat the competition by giving their trade decisions the backing needed.

MATERIAL AND METHODS

Issues and Challenges

This section deals with the issues and challenges listed below, as they relate to SA and its applications.

Tone

Tone can be difficult to clarify verbally, and even more so where the written word is concerned. Complications arise in the analysis of voluminous data containing both subjective and objective responses.

Polarity

Easy-to-understand words such as “good” and “bad” receive strongly positive (+1) and negative (-1) polarity scores. However, in-between combinations of words such as “not so good”, meaning average, fall in mid-polarity. Phrases like these are occasionally missed, which distorts the sentiment score.

Irony and Sarcasm

In sarcastic text, negative sentiments are expressed using positive words [6] or pseudo-compliments, making it difficult for SA tools to detect what the response actually implies in context. This frequently results in a higher volume of responses scored as positive that are actually negative.

Negations

Negations reverse the polarity of words, phrases, and sentences. Negation words such as no, not, never, neither, cannot, hardly, barely, nowhere, or were not can confuse the ML model.
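To make the polarity and negation issues concrete, the following sketch scores a token list with a toy lexicon and flips the sign of words that follow a negation. The lexicon, the negation set, and the two-token window are all assumptions chosen for illustration, not a method taken from the surveyed papers.

```python
# Hypothetical toy lexicon: word -> polarity score.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}
NEGATIONS = {"no", "not", "never", "neither", "cannot", "hardly", "barely"}

def score(tokens, window=2):
    """Sum word polarities, flipping the sign of the `window` tokens after a negation."""
    total = 0.0
    remaining = 0  # number of upcoming tokens still under the scope of a negation
    for tok in tokens:
        if tok in NEGATIONS:
            remaining = window
            continue
        polarity = LEXICON.get(tok, 0.0)
        if remaining > 0:
            polarity = -polarity
            remaining -= 1
        total += polarity
    return total

print(score(["good"]))               # 1.0
print(score(["not", "bad"]))         # 1.0
print(score(["not", "so", "good"]))  # -1.0
```

Note that simple sign flipping scores "not so good" as fully negative, whereas a reader would place it nearer the middle of the scale; this is the kind of mid-polarity phrase that a plain lexicon handles poorly.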

Word Ambiguity

Word ambiguity is another issue faced while working on SA. Word ambiguity creates problems, owing to the impracticability of defining polarity ahead of time, given that the polarity of certain words is firmly dependent on their context in the sentence.

Multi-polarity

At times, a given sentence, document, or other unit of text to be analyzed reveals multi-polarity. In such cases, relying solely on the overall result may be misleading, much as an average may conceal important information about the values that went into its calculation.

Feature Extraction Techniques

The preprocessed dataset has distinctive properties, and the feature extraction method extracts these aspects from the processed dataset [7]. Various feature extraction techniques are listed below.

Bag of Words

The Bag of Words (BoW) is an NLP technique that extracts features from documents simply and flexibly [8]. BoW represents a text by the occurrence of terms within the document. Because the model ignores any information regarding word order or word formation, it is known as a "bag" of words. The model considers the occurrence of recognised terms in the document, not their location.

Text generally lacks structure and organisation, which is a key issue for ML algorithms, which require organised, well-defined, fixed-length inputs. The bag of words method converts texts with variable lengths into vectors with fixed lengths.
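As a minimal sketch of this conversion, the example below builds a vocabulary from two short documents and maps each document to a fixed-length count vector. The function names and sample documents are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the sorted set of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.split()})

def bag_of_words(document, vocabulary):
    """Return one count per vocabulary word; word order in the document is ignored."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

docs = ["good camera good battery", "bad battery"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]

print(vocab)    # ['bad', 'battery', 'camera', 'good']
print(vectors)  # [[0, 1, 1, 2], [1, 1, 0, 0]]
```

Both documents end up as vectors of length four even though they contain different numbers of words, which is the fixed-length property that ML algorithms need.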

TF-IDF

Term Frequency-Inverse Document Frequency, or TF-IDF, is a measure of a term's importance in a particular document [9]. The idea behind TF-IDF is to give more weight to terms that appear frequently in one document but rarely in the others, since they are better suited for classification [9]. The term "term frequency" (TF) refers to the frequency with which a term appears in a certain text [10]. For a given term, the inverse document frequency (IDF) is the logarithm of the ratio of the total number of documents to the number of documents containing that term.

$$\mathrm{TF} = \frac{\text{Frequency of a word in the document}}{\text{Total words in the document}}$$

$$\mathrm{IDF} = \log\left(\frac{\text{Total number of docs}}{\text{Number of docs containing the word}}\right)$$

The major pitfall of TF-IDF is that it does not capture textual position, semantics, co-occurrences across documents, etc. Thus, it is used only as a lexical-level feature.
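The TF and IDF definitions above translate directly into code. The sketch below is illustrative only: it assumes the natural logarithm (the formula does not state a base), tokenized documents, and a term that occurs in at least one document; the function names and the sample corpus are hypothetical.

```python
import math

def tf(term, doc_tokens):
    """Frequency of the term in the document divided by total words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log of (total number of docs / number of docs containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)  # assumes the term occurs somewhere

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["good", "camera"],
    ["good", "battery"],
    ["bad", "battery"],
    ["good", "screen"],
]

rare = tf_idf("camera", corpus[0], corpus)   # appears in 1 of 4 docs
common = tf_idf("good", corpus[0], corpus)   # appears in 3 of 4 docs
print(rare > common)  # True
```

In the first document, "camera" and "good" have the same term frequency, but "camera" receives the larger weight because it occurs in fewer documents.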

Word2Vec

Word2Vec is an algorithm that uses an NN model to learn term interrelations from a voluminous text corpus and construct word embeddings [11]. The Word2Vec model is used to extract the relatedness across words, including synonym detection, analogies, semantic relatedness, preference selection, and concept categorization. It learns significant relations and encodes similarities as vector similarity. It takes a huge text corpus as input and creates a vector space that spans hundreds of dimensions [11]. Word2Vec uses one of two architectures: Skip-gram or Continuous Bag of Words (CBOW).

The CBOW predicts the word under consideration, given the context words within a specific window. The skip-gram model works in the opposite direction: given the current word, it predicts the surrounding context words within a defined window.
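The difference between the two architectures is easiest to see in the training examples each one would be given. The sketch below only generates those (input, output) examples from a tokenized sentence; it does not train an NN or produce embeddings. The function names and the window size of 1 are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=1):
    """(context words, target word): CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(current word, context word): skip-gram predicts each context word from the current word."""
    pairs = []
    for i, current in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs

sentence = ["the", "camera", "is", "great"]
print(cbow_pairs(sentence)[1])       # (['the', 'is'], 'camera')
print(skipgram_pairs(sentence)[:3])  # [('the', 'camera'), ('camera', 'the'), ('camera', 'is')]
```

For the word "camera", CBOW produces one example with both neighbours as input, while skip-gram produces one example per neighbour with "camera" as input.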

Part-of-Speech Tagging

The practice of Part-of-Speech (POS) tagging, which dates back to the 1960s, has garnered renewed interest from NLP researchers due to its ability to extract product features, as these features tend to be expressed through nouns or noun phrases [12]. Also known as grammatical tagging, POS tagging labels each word in a phrase with its corresponding part of speech, such as noun, pronoun, verb, adverb, adjective, conjunction, or preposition. POS tagging also creates a synonym feature list, i.e., a list of related words for certain keywords, which increases the accuracy of the model. An example of synonym features is “good”, “great”, “excellent”, “fabulous”.

A major limitation of POS tagging is ambiguity, owing to the occurrence of numerous common words with different meanings, resulting in multiple POS.

Classification Techniques

Sentiment classification is an automated technique used to detect opinions within text and classify them as positive, neutral, or negative, based on the underlying emotions conveyed. Sentiment classification techniques are of three types: ML, lexicon-based, and hybrid, as depicted in Figure 1.

Machine learning techniques

The ML approach refers to an artificial intelligence technique that enables computers to learn through supervised, semi-supervised, or unsupervised methods [13].

Supervised learning

In supervised learning, the ML algorithm is trained on a small labelled dataset that represents the bigger dataset to be worked with. This gives the algorithm a basic idea of the problem to be dealt with.

Supervised learning algorithms include SVM, NB, Decision Tree (DT), Random Forest (RF), Linear Regression, etc.
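As a hedged illustration of supervised learning, the sketch below trains a small multinomial Naïve Bayes classifier on four labelled token lists and then labels unseen text. It is a from-scratch toy with add-one smoothing; the function names and the training data are made up for the example and do not come from any of the surveyed studies.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """labelled_docs: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Pick the label with the highest log prior plus smoothed log likelihood."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # add-one (Laplace) smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    (["good", "camera"], "positive"),
    (["great", "battery"], "positive"),
    (["bad", "screen"], "negative"),
    (["terrible", "battery"], "negative"),
]
model = train_nb(train)
print(predict_nb(["good", "battery"], model))  # positive
print(predict_nb(["bad", "screen"], model))    # negative
```

The labelled examples play the role of the small labelled dataset described above: the classifier only ever sees word counts per class, and generalizes from them to new combinations of words.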

Unsupervised learning

Unsupervised ML involves algorithms that train on unlabeled data and are permitted to act on the data with no supervision. It aims to find the intrinsic structure of the dataset, group the data based on similarities, and represent the data in a compressed format.

Unsupervised ML uses algorithms such as the Apriori algorithm, NN, the K-means clustering algorithm, Gaussian mixture models, etc.

Figure 1
Sentiment Classification Techniques

Semi-supervised learning

The semi-supervised ML algorithm, as its name implies, sits between the supervised and unsupervised learning algorithms. This type of learning approach employs both labeled and unlabeled datasets to train the algorithm.

Semi-supervised learning relies on assumptions such as the continuity assumption and the cluster assumption, and uses generative models and heuristic approaches.

Lexicon-based approach

A lexicon-based approach uses lexicon features. Lexicon features are derived from analyzing the words in the text using a lexicon, which is a collection of words where every word has a specific score indicative of its polarity. Lexicon features include the frequency of certain words and their emotional tone in a text, such as positive connotation count, positive word count, negative word count, etc. A lexicon-based approach combines the scores of all the words in the document, using adjectives and adverbs to find the sentiment polarity of the text [14]. The lexicon-based approach is of two types, DB and CB.

Dictionary-based (DB) approach

A DB approach initially creates a dictionary by taking a few words, following which a thesaurus is used to expand the dictionary by incorporating the synonyms and antonyms of the words taken [15]. The process is carried on until no new words are found, and the dictionary is then refined through manual inspection.
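The expansion loop described above can be sketched as follows. The thesaurus here is a hypothetical hand-written dictionary standing in for a real resource, and the rule that synonyms inherit a word's polarity while antonyms receive the opposite polarity is an assumption of this example; the manual inspection step is not modelled.

```python
# Hypothetical toy thesaurus: word -> (synonyms, antonyms).
THESAURUS = {
    "good": (["great", "fine"], ["bad"]),
    "great": (["excellent"], ["terrible"]),
    "bad": (["poor"], ["good"]),
}

def expand_lexicon(seeds, thesaurus):
    """Grow a polarity dictionary from seed words until no new words are found."""
    lexicon = dict(seeds)
    frontier = list(seeds)  # words whose synonyms/antonyms are still to be looked up
    while frontier:
        word = frontier.pop()
        synonyms, antonyms = thesaurus.get(word, ([], []))
        related = [(s, 1) for s in synonyms] + [(a, -1) for a in antonyms]
        for other, sign in related:
            if other not in lexicon:
                lexicon[other] = sign * lexicon[word]
                frontier.append(other)
    return lexicon

lexicon = expand_lexicon({"good": 1}, THESAURUS)
print(sorted(lexicon.items()))
```

Starting from the single seed "good", the loop reaches seven words in total and stops once every newly added word has been looked up.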

Corpus-based (CB) approach

A corpus is a collection of writings, often on a specific topic [16]. A CB approach finds the polarity of context-specific words. The approach is of two types, statistical and semantic. The statistical approach finds co-occurring words in a corpus [16]. A word that appears mostly in positive text has positive polarity, and one that occurs largely in negative text has negative polarity [16]. The semantic approach calculates sentiment values by using the principle of word similarity [16]. The synonyms and antonyms of a given word are found using a thesaurus and its sentiment value is calculated.
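One simple way to picture the statistical variant is to count how often each word occurs in positive versus negative texts. The sketch below does this for a tiny hypothetical corpus that is assumed to be already labelled; the scoring formula (positive minus negative occurrences, divided by total occurrences) is an illustrative choice rather than the formula used in the cited work.

```python
from collections import Counter

def corpus_polarity(labelled_docs):
    """Score each word in [-1, 1] by where it occurs: +1 only in positive texts, -1 only in negative."""
    pos, neg = Counter(), Counter()
    for tokens, label in labelled_docs:
        (pos if label == "positive" else neg).update(tokens)
    words = set(pos) | set(neg)
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w]) for w in words}

corpus = [
    (["battery", "lasts", "long"], "positive"),
    (["long", "battery", "life"], "positive"),
    (["long", "wait"], "negative"),
    (["wait", "forever"], "negative"),
]
polarity = corpus_polarity(corpus)
print(polarity["battery"], polarity["wait"])  # 1.0 -1.0
```

The word "long" ends up with a weakly positive score because it occurs in both positive and negative texts, which reflects the context dependence that a CB approach is meant to capture.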

Hybrid approach

A hybrid approach combines ML and lexicon-based approaches, getting the best of both worlds [15], particularly in terms of improved accuracy. While the lexicon-based approach has a high level of precision but low recall, it can enhance both recall and accuracy when paired with an ML classifier.

Qualitative Analysis

In this paper, a qualitative analysis is made of the works of various authors on SA. The analysis provides information on the dataset used, the strengths and weaknesses, and the techniques applied for pre-processing, feature extraction, and classification. The findings are listed in Table 1 for easy reference.

Table 1
Qualitative Analysis of Sentiment Analysis Techniques

Quantitative Analysis

A quantitative analysis is carried out on SA methodology by computing and comparing the values of the precision, recall, accuracy, and F-score performance metrics [19]. A few formulas for computing the performance metrics are listed in Table 2.

Accuracy

Accuracy is how closely a measured value matches a reference or true value; here, it establishes how often a sentiment rating is correct.

Precision

Precision is a metric that gauges the degree of exactness of a classifier. A higher precision score indicates fewer false positives (FP), while a lower score indicates a greater number of FP.

Recall

Recall gauges the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives (FN), while lower recall means more FN.

F-score

The F-score is a metric that assesses the accuracy of a test by taking into account both its precision and recall values. It is computed as the harmonic mean of precision and recall.
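The four metrics can be computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below uses the standard textbook definitions, which may differ in notation from the formulas in the paper's own table; the function name and the example counts are invented for illustration.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, F-score) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-score: harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical counts for 100 classified reviews.
accuracy, precision, recall, f_score = metrics(tp=40, fp=10, tn=45, fn=5)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f_score, 3))
# 0.85 0.8 0.889 0.842
```

With these counts, precision is lower than recall because there are more false positives than false negatives, and the F-score falls between the two.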

Table 2
Metric Formula

The quantitative analysis of the studies related to SA, in terms of performance metrics, has been carried out and the findings are presented in Table 3 for easy reference.

Table 3
Quantitative Analysis of Sentiment Analysis Techniques

Applications of Sentiment Analysis

SA has a wide range of applications, listed in Table 4, in healthcare; finance; politics; sports; hospitality and tourism; marketing and sales; and assessment and evaluation, as well as in user reviews, as depicted in Figure 2.

Figure 2
Applications of Sentiment Analysis.

Health Care Sector

In their study, Korkontzelos and coauthors [20] sought to evaluate the impact of SA features on the detection of adverse drug reactions (ADRs) through an enhancement made to ADRMine, an existing ADR detection technique. To evaluate the effectiveness of SA features in identifying ADR mentions, 81 drug-related posts with manual annotations were collected from the DailyStrength forum and Twitter. The SA features were then incorporated, and the study found a slight improvement in the detection of ADR mentions in both tweets and healthcare forum posts. The results suggest that SA features could be utilized in pharmacovigilance in the future.

Na et al. [10] developed a rule-based linguistic technique to classify sentiment in drug reviews, with the objective of creating an effective technique for sentiment analysis of social media material. The researchers leveraged SentiWordNet and the Subjectivity Lexicon [21], two widely used SA resources, to create linguistic rules for classification.

User Reviews

Ding, Liu, and Yu [11] put forward a lexicon-based approach at the sentence level. They addressed the challenge of determining the binary sentiment orientation of opinions regarding product features/aspects, without aiming to assign sentiment scores. The approach involved sentiment summarization based on the number of negative and positive opinions, but it did not explore the degree to which the opinions themselves were negative or positive.

In their research, Kim et al. [12] explored the use of social media data mining to predict the box office success of movies. Their study revealed that combining viewer comments with marketing properties resulted in more accurate box office revenue predictions.

Financial Sector

Tetlock [22] conducted a study on the sentiment of Wall Street Journal (WSJ) reports and quantified their level of optimism and pessimism. The results revealed that after pessimistic reports, trading volume tended to increase, and highly pessimistic reports often led to a decline in market prices. Tetlock and his team also utilized the Harvard IV-4 psychological dictionary [23] to analyze the negative word count in the Dow Jones News Service and WSJ stories related to Standard and Poor's 500 (S&P 500) companies from 1980 to 2004, focusing solely on the positive and negative dimensions of the dictionary.

Jaiwang and Jeatrakul [24] developed a model to predict stock prices using an SVM after applying a majority voting algorithm to select key technical and fundamental indicators for each stock. They evaluated the model's effectiveness with different kernel functions within the SVM, such as the dot, RBF, sigmoid, and polynomial functions. The study showed that the dot function was the most effective kernel function. However, they also noted that using too many features could result in significant demand for storage space and computational processing power, potentially affecting the impact of critical technical indicators on the predicted price.

Politics

The cross-domain SA technique of Wu and Tan [25] implemented a two-stage approach. In the initial stage, they established a relationship between the source and target domains by utilizing a graph-ranking algorithm to select some of the best seeds from the target domain. In the later stage, the basic structure was utilized to determine the sentiment value of each document, followed by the labeling of target domain documents based on the values.

Liu and Zhao [26] also suggested a two-stage approach. In the initial stage, a feature translator was used to transform a feature in the source domain to a feature in the target domain. In the later stage, the source domain data were employed to fit a classifier to classify the unlabeled data in the target domain.

Park et al. [27] developed a method to classify news articles based on the aspects they covered. However, their method was limited to certain types of articles that were classified in an unsupervised manner, preventing the establishment of specific political orientations.

Table 4
Areas of Sentiment Analysis Applications

DISCUSSION

Based on the data presented in Tables 1, 3, and 4, this section summarizes some of the key findings of the survey and their implications. One of the primary takeaways from the survey is that sentiment analysis is a challenging task due to the complexity and variability of human emotion. Additionally, several factors, including tone, polarity, negation, multi-polarity, irony, sarcasm, and ambiguity, can influence SA. Of these factors, sarcasm and irony are particularly significant because they can convey emotions that are different from what is literally being expressed. Unlike verbal communication, which relies on additional cues like tone and facial expressions, textual communication provides fewer indicators of emotional tone, making sentiment analysis more difficult. However, despite these challenges, sentiment analysis is increasingly being used across various domains and applications. Quantitatively, the survey found that the SVM classifier outperformed other classifiers, achieving an accuracy of 87.27%.

CONCLUSION

This article has reviewed certain applications of SA and provided a few open challenges. Further, it has reviewed the processes and methods of SA, as well as those of major feature extraction and classification techniques. Every paper studied has advantages and disadvantages, as well as a particular problem-solving approach. This survey has attempted a theoretical study of several applications and of feature extraction and classification techniques that include the BoW, TF-IDF, supervised learning, unsupervised learning, and DB approaches, among others. The survey showed that, although the algorithms developed have shown promising results, none is superior enough to resolve every single problem. The SVM, which has delivered the best performance to date, still needs considerable further enhancement. One possible solution to overcome the limitations of individual algorithms is to combine them to enhance SA performance. Also, SA techniques typically consider only the overall sentiment expressed in the text, without distinguishing between different aspects of the product or service being discussed. This is where the need for aspect-based SA arises. It is hoped that the survey of SA processes, levels, applications, feature extraction techniques, classification techniques, and issues and challenges carried out in this paper will help further future research.

Acknowledgments

We thank the Department of Computer Science and Engineering of MSU.

REFERENCES

  • 1
    Park S, Kang S, Chung S, Song J. NewsCube: delivering multiple aspects of news to mitigate media bias. InProceedings of the SIGCHI conference on human factors in computing systems 2009 Apr 4 (pp. 443-452).
  • 2
    Shirsat VS, Jagdale RS, Deshmukh SN. Document level sentiment analysis from news articles. In2017 international conference on computing, Communication, Control and Automation (ICCUBEA) 2017 Aug 17 (pp. 1-4). IEEE.
  • 3
    Wagh R, Punde P. Survey on sentiment analysis using twitter dataset. In2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA) 2018 Mar 29 (pp. 208-211). IEEE.
  • 4
    Shivaprasad TK, Shetty J. Sentiment analysis of product reviews: a review. In2017 International conference on inventive communication and computational technologies (ICICCT) 2017 Mar 10 (pp. 298-301). IEEE.
  • 5
    Hemmatian F, Sohrabi MK. A survey on classification techniques for opinion mining and sentiment analysis. Artif. Intell. Rev. 2019 Oct 1;52(3):1495-545.
  • 6
    Eremyan R. Four pitfalls of sentiment analysis accuracy [Internet]. Toptal Engineering Blog. Toptal; 2018 [cited 2023 Jun 2]. Available from: https://www.toptal.com/deep-learning/4-sentiment-analysis-accuracy-traps
  • 7
    Kharde V, Sonawane P. Sentiment analysis of twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971. 2016 Jan 26.
  • 8
    Agarwal B, Mittal N. Prominent feature extraction for review analysis: an empirical study. J. Exp. Theor. Artif. Intell. 2016 May 3;28(3):485-98.
  • 9
    Basarslan MS, Kayaalp F. Sentiment analysis with machine learning methods on social media.
  • 10
    Na JC, Kyaing WY, Khoo CS, Foo S, Chang YK, Theng YL. Sentiment classification of drug reviews using a rule-based linguistic approach. In The Outreach of Digital Libraries: A Globalized Resource Network: 14th International Conference on Asia-Pacific Digital Libraries, ICADL 2012, Taipei, Taiwan, November 12-15, 2012, Proceedings 14 2012 (pp. 189-198). Springer Berlin Heidelberg.
  • 11
    Ding X, Liu B, Yu PS. A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 international conference on web search and data mining 2008 Feb 11 (pp. 231-240).
  • 12
    Kim D, Kim D, Hwang E, Choi HG. A user opinion and metadata mining scheme for predicting box office performance of movies in the social network environment. New review of hypermedia and multimedia. 2013 Dec 1;19(3-4):259-72.
  • 13
    Singh NK, Tomar DS, Sangaiah AK. Sentiment analysis: a review and comparative analysis over social media. JAIHC. 2020 Jan;11:97-117.
  • 14
    Rana TA, Cheah YN. Aspect extraction in sentiment analysis: comparative analysis and survey. Artif. Intell. Rev. 2016 Dec;46:459-83.
  • 15
    Jain AP, Dandannavar P. Application of machine learning techniques to sentiment analysis. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT) 2016 Jul 21 (pp. 628-632). IEEE.
  • 16
    Kaur H, Mangat V. A survey of sentiment analysis techniques. In 2017 International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) 2017 Feb 10 (pp. 921-925). IEEE.
  • 17
    Tamrakar ML. An Analytical Study Of Feature Extraction Techniques For Student Sentiment Analysis. Turkish Int J Comput Math (TURCOMAT). 2021 May 10;12(11):2900-8.
  • 18
    Harish BS, Kumar K, Darshan HK. Sentiment analysis on IMDb movie reviews using hybrid feature extraction method.
  • 19
    Alfaro C, Cano-Montero J, Gómez J, Moguerza JM, Ortega F. A multi-stage method for content classification and opinion mining on weblog comments. Ann. Oper. Res. 2016 Jan;236:197-213.
  • 20
    Korkontzelos I, Nikfarjam A, Shardlow M, Sarker A, Ananiadou S, Gonzalez GH. Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts. J. Biomed. Inform. 2016 Aug 1;62:148-58.
  • 21
    Wilson T, Wiebe J, Hoffmann P. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of human language technology conference and conference on empirical methods in natural language processing 2005 Oct (pp. 347-354).
  • 22
    Tetlock PC. Giving content to investor sentiment: The role of media in the stock market. The J. Financ. 2007 Jun;62(3):1139-68.
  • 23
    Tetlock PC, Saar‐Tsechansky M, Macskassy S. More than words: Quantifying language to measure firms' fundamentals. The J. Financ. 2008 Jun;63(3):1437-67.
  • 24
    Jaiwang G, Jeatrakul P. A forecast model for stock trading using support vector machine. In 2016 International Computer Science and Engineering Conference (ICSEC) 2016 Dec 14 (pp. 1-6). IEEE.
  • 25
    Wu Q, Tan S. A two-stage framework for cross-domain sentiment classification. Expert Syst. Appl. 2011 Oct 1;38(11):14269-75.
  • 26
    Liu K, Zhao J. Cross-domain sentiment classification using a two-stage method. In Proceedings of the 18th ACM conference on Information and knowledge management 2009 Nov 2 (pp. 1717-1720).
  • 27
    Park S, Lee S, Song J. Aspect-level news browsing: Understanding news events from multiple viewpoints. In Proceedings of the 15th international conference on Intelligent user interfaces 2010 Feb 7 (pp. 41-50).
  • 28
    Schumaker RP, Jarmoszko AT, Labedz Jr CS. Predicting wins and spread in the Premier League using a sentiment analysis of twitter. Decision Support Systems. 2016 Aug 1;88:76-84.
  • 29
    Marrese-Taylor E, Velásquez JD, Bravo-Marquez F. A novel deterministic approach for aspect-based opinion mining in tourism products reviews. Expert systems with applications. 2014 Dec 1;41(17):7764-75.
  • 30
    Chung W, Zeng D. Social‐media‐based public policy informatics: Sentiment and network analyses of US Immigration and border security. JASIST. 2016 Jul;67(7):1588-606.
  • 31
    Jiang H, Lin P, Qiang M. Public-opinion sentiment analysis for large hydro projects. J. Constr. Eng. Manag. 2016 Feb 1;142(2):05015013.
  • 32
    Zavattaro SM, French PE, Mohanty SD. A sentiment analysis of US local government tweets: The connection between tone and citizen involvement. Government information quarterly. 2015 Jul 1;32(3):333-41.
  • 33
    Stavrianou A, Brun C. Expert recommendations based on opinion mining of user‐generated product reviews. Comput. Intell. 2015 Feb;31(1):165-83.
  • 34
    Li N, Wu DD. Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems. 2010 Jan 1;48(2):354-68.

Funding:

This research received no external funding.

Edited by

Editor-in-Chief:

Alexandre Rasi Aoki

Associate Editor:

Fabio Alessandro Guerra

Publication Dates

  • Publication in this collection
    17 July 2023
  • Date of issue
    2023

History

  • Received
    01 Sept 2022
  • Accepted
    17 May 2023
Instituto de Tecnologia do Paraná - Tecpar Rua Prof. Algacyr Munhoz Mader, 3775 - CIC, 81350-010 Curitiba PR Brazil, Tel.: +55 41 3316-3052/3054, Fax: +55 41 3346-2872 - Curitiba - PR - Brazil
E-mail: babt@tecpar.br