Removal of stopwords (filter)
Linguistic |
Filtering process that removes words of little relevance, discarding information that does not contribute knowledge to the text. The idea behind this filter is to remove words that carry little or no content, such as articles, prepositions, pronouns, conjunctions, adverbs, numerals and interjections. Additionally, terms that appear very commonly or very rarely are probably not significantly relevant and can also be removed.
Luhn (1958), Salton and McGill (1983), Frakes and Baeza-Yates (1992), Lui, Li, and Choy (2007) and De Oliveira Júnior and Esmin (2012). |
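A minimal sketch of the filter described above; the stopword list is an illustrative sample, not a complete list for any language:

```python
# Illustrative stopword list (a real system would use a full list
# for the target language, plus corpus-specific frequency cutoffs).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the cat sat on the mat".split()))
# ['cat', 'sat', 'mat']
```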
|
TF-IDF (Term Frequency - Inverse Document Frequency)
Statistical |
Term Frequency (TF) is based on the premise that the importance of a term in a document is directly proportional to its frequency of occurrence. Inverse Document Frequency (IDF) is based on the premise that the specificity of a term can be measured by an inverse function of the number of documents in which it occurs. The technique therefore weights the importance of each term against a background corpus, normally consisting of documents from the same domain, and in doing so discounts very common words.
Luhn (1958), Jones (1972), Bhatia and Jaiswal (2015), Liu, Li, and Feng (2017) and Rocha and Guelpeli (2017). |
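A small sketch of the weighting scheme above, using one common TF-IDF formulation (length-normalized counts times log N/df; other variants add smoothing). The toy corpus is an illustrative assumption:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term of each tokenized document.
    TF = count / doc length; IDF = log(N / df). A term occurring in
    every document gets weight 0, i.e. it is 'very common'."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["data", "mining"], ["data", "analysis"], ["data", "text", "mining"]]
w = tf_idf(docs)
print(w[1])  # "analysis" is rare, so it gets a positive weight
```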
|
Latent Semantic Analysis (LSA) |
Hybrid |
A method that accounts for synonymy and polysemy to extract and represent the semantic meaning of words in context. This representation is obtained from mathematical operations, typically a singular value decomposition of the term-document matrix, that analyze the relationships between terms and documents and decompose them into index vectors.
Landauer, Foltz, and Laham (1998) and Scarton and Aluísio (2010). |
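A minimal LSA sketch: build a term-document count matrix, truncate its SVD to rank k, and compare documents in the latent space. The tiny corpus and k = 2 are illustrative assumptions:

```python
import numpy as np

docs = [
    "human machine interface",
    "user interface system",
    "graph tree minors",
    "graph minors survey",
]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document matrix of raw counts (terms as rows, documents as columns).
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the rank-k latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 2 and 3 share "graph"/"minors", so they land close together,
# while documents 0 and 2 share no terms and stay far apart.
print(cos(doc_vecs[2], doc_vecs[3]), cos(doc_vecs[0], doc_vecs[2]))
```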
|
N-grams
Statistical |
This technique is based on word co-occurrence and permits statistical prediction of whether two or more terms appear in a certain sequence in a text. An n-gram is a contiguous substring of n items from a given sequence of text or speech.
Cohen (1995), Liu, Webster, and Kit (2009), L. F. de Alencar (2010), A. F. de Alencar (2013a) and Tonelli and Pianta (2011).
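Extracting the contiguous n-grams of a token sequence can be sketched in a few lines:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("to be or not to be".split(), 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

Counting these tuples over a corpus gives the co-occurrence statistics used for sequence prediction.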
|
Segmentation
Hybrid |
Segmenting the content of the text into individual sentences, each of which represents a minimal semantic unit defining a proposition.
Lin, Hsieh, and Chuang (2009), Sousa, Kepler, and Faria (2010) and A. F. de Alencar (2013b).
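A naive sentence segmenter as a sketch; splitting on terminal punctuation followed by whitespace is an assumption that breaks on abbreviations, decimals and ellipses, which real segmenters handle with additional rules or trained models:

```python
import re

def split_sentences(text):
    """Split text at ., ! or ? followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello world. How are you? Fine!"))
# ['Hello world.', 'How are you?', 'Fine!']
```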
|
Tokenization (text segmentation into words)
Hybrid |
The process of segmenting a sequence of text characters into a sequence of meaningful units (words) that compose the text. Spaces and punctuation are generally adopted as token delimiters for Western languages.
Webster and Kit (1992), Sousa, Kepler, and Faria (2010), A. F. de Alencar (2013b) and Silva, Trindade, et al. (2015).
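A simple tokenizer following the delimiting convention above: word characters form tokens, and each punctuation mark becomes its own token. This regex-based rule is an illustrative simplification (it splits contractions, for instance):

```python
import re

def tokenize(text):
    """Return runs of word characters, or single non-space punctuation
    marks, as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```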
Lemmatization and stemming
Linguistic |
Lemmatization is the representation of each word of the input text in its primitive form (lemma). Stemming has the purpose of removing suffixes and prefixes from a term so that it is reduced to its root (stem).
Lovins (1968), Sousa, Kepler, and Faria (2010) and Rolim, Ferreira, and Costa (2016).
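A toy suffix-stripping stemmer illustrating the idea; the suffix list and minimum-stem-length condition are illustrative assumptions, whereas real stemmers such as Lovins's apply ordered rule sets with recoding conditions:

```python
def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 chars.
    Longer suffixes are tried first, as in rule-ordered stemmers."""
    for suffix in ("ational", "ization", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "jumped", "cats", "relational"]])
# ['runn', 'jump', 'cat', 'rel']
```

Note the over-stemming ("runn", "rel"): this is exactly the kind of error that the extra conditions in production stemmers exist to control.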
|
Part-of-Speech (POS) Tagging
Linguistic |
It consists of tagging the words of the input text with their respective grammatical classes and syntactic distributions. The main morphosyntactic tagging approaches are: rule-based, which follows tagging rules manually coded by linguists; probabilistic, which employs statistical tagging methods in which each word has a finite set of possible tags and is labeled with the most probable one; and hybrid, which combines rule-based and probabilistic techniques.
Lau et al. (2008), Domingues, Favero, and De Medeiros (2008), Sousa, Kepler, and Faria (2010), A. F. de Alencar (2013b) and Santos and Zadrozny (2014).
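A hybrid tagger in miniature, combining the probabilistic and rule-based approaches described above: known words take their most frequent training tag, unknown words fall back to a crude rule. The tiny training corpus and tagset are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Toy tagged corpus (illustrative; real taggers train on large treebanks).
training = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("a", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
]

counts = defaultdict(Counter)
for word, t in training:
    counts[word][t] += 1

def tag(words):
    """Label each known word with its most frequent training tag;
    tag unknown words NOUN as a simple rule-based fallback."""
    out = []
    for w in words:
        if w in counts:
            out.append((w, counts[w].most_common(1)[0][0]))
        else:
            out.append((w, "NOUN"))
    return out

print(tag("the dog sleeps".split()))
# [('the', 'DET'), ('dog', 'NOUN'), ('sleeps', 'VERB')]
```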
Text genre tagging
Linguistic |
It consists of tagging the main characteristics of the input text's genre. This technique permits building the structural model in a tree format and adding linguistic data; information about the relationships between elements of the production context, or between sentences or sentence fragments of the overall text infrastructure; and a view of the constitutive dimensions of the base genre. The tagging can delimit various constitutive elements of the text genre, such as bibliographic references, sections, abstract, paragraphs, tables, figures, funding, title, subtitles, authorship and keywords, among others. The technique can recover the basic structure of the input text by mapping the root nodes and their possible children, which represent the textual infrastructure. The preprocessing of a genre is thus influenced by the recognition of the superstructure and infrastructure of its compositional organization.
Fonseca (2018). |
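A rough sketch of delimiting constitutive elements with pattern rules, assuming a scientific-article genre; the labels and regexes are illustrative, and actual genre taggers build a richer tree-structured model:

```python
import re

# Illustrative patterns for a few structural elements of an article genre.
PATTERNS = [
    ("references", re.compile(r"^references\b", re.I)),
    ("abstract", re.compile(r"^abstract\b", re.I)),
    ("section", re.compile(r"^\d+(\.\d+)*\s+\S")),  # e.g. "1 Introduction"
]

def tag_lines(lines):
    """Label each line with the first matching structural element,
    defaulting to 'body'."""
    tagged = []
    for line in lines:
        label = "body"
        for name, pat in PATTERNS:
            if pat.match(line.strip()):
                label = name
                break
        tagged.append((label, line))
    return tagged

doc = ["Abstract", "We study...", "1 Introduction", "Text mining...", "References"]
for label, line in tag_lines(doc):
    print(label, "|", line)
```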