Removal of stopwords - (filter) |
Linguistic |
A filtering process that removes words of little relevance, in an attempt to identify the information that does not constitute knowledge in the text. The idea behind this filter is to remove words that carry little or no content, such as articles, prepositions, pronouns, conjunctions, adverbs, numerals and interjections. Additionally, terms that appear very frequently or very rarely are probably not significantly relevant and can also be removed. |
Luhn (1958), Salton and McGill (1983), Frakes and Baeza-Yates (1992), Lui, Li, and Choy (2007) and De Oliveira Júnior and Esmin (2012). |
TF-IDF - (Term Frequency - Inverse Document Frequency) |
Statistical |
Term Frequency (TF) is based on the premise that the importance of a term in a document is directly proportional to its number of occurrences. Inverse Document Frequency (IDF) is based on the premise that the specificity of a term can be measured by an inverse function of the number of documents in which it occurs. This technique therefore consists of weighting the importance of each term against a background corpus, normally made up of documents belonging to the same domain, and then eliminating the very common words. |
Luhn (1958), Jones (1972), Bhatia and Jaiswal (2015), Liu, Li, and Feng (2017) and Rocha and Guelpeli (2017). |
Latent Semantic Analysis (LSA) |
Hybrid |
A method that accounts for synonymy and polysemy to extract and represent the meaning of words in context. This representation is obtained through mathematical computations that analyze the relationships between terms and documents, decomposing them into index vectors. |
Landauer, Foltz, and Laham (1998) and Scarton and Aluísio (2010). |
N-grams |
Statistical |
This technique is based on word co-occurrence and allows a statistical prediction that two or more terms will appear in a certain sequence in a text. An n-gram is a contiguous subsequence of n items from a given sequence of text or speech. |
Cohen (1995), Liu, Webster, and Kit (2009), L. F de Alencar (2010), A. F. de Alencar (2013a) and Tonelli and Pianta (2011). |
Segmentation |
Hybrid |
Segmentation of the text content into individual sentences, each of which represents a minimal semantic unit for defining a proposition. |
Lin, Hsieh, and Chuang (2009), Sousa, Kepler, and Faria (2010) and A. F. de Alencar (2013b). |
Tokenization (Text segmentation in words) |
Hybrid |
The process of segmenting a sequence of text characters into the sequence of meaningful units (words) that compose the text. Spaces and punctuation are generally adopted as token delimiters for Western languages. |
Webster and Kit (1992), Sousa, Kepler, and Faria (2010), A. F. de Alencar (2013b) and Silva, Trindade, et al. (2015). |
Lemmatization and stemming |
Linguistic |
Lemmatization is the representation of each word of the input text in its base form (lemma). Stemming removes suffixes and prefixes from a term so that it is reduced to its root (stem). |
Lovins (1968), Sousa, Kepler, and Faria (2010) and Rolim, Ferreira, and Costa (2016). |
Part-of-Speech (POS) Tagging |
Linguistic |
Consists of tagging the words of the input text with their respective grammatical classes and syntactic distributions. The main morphosyntactic tagging techniques are: rule-based, which follows tagging rules manually coded by linguists; probabilistic, which employs statistical tagging methods in which each word has a finite set of possible tags and is labeled with the most probable one; and hybrid, which combines rule-based and probabilistic techniques. |
Lau et al. (2008), Domingues, Favero, and De Medeiros (2008), Sousa, Kepler, and Faria (2010), A. F. de Alencar (2013b) and Santos and Zadrozny (2014). |
Text genre Tagging |
Linguistic |
Consists of tagging the main characteristics of the genre of the input text. This technique makes it possible to build the structural model in a tree format and to add linguistic data; information about the relationships between elements of the production context, or sentences or sentence fragments of the general text infrastructure; and a visualization of the constitutive dimensions of the base genre. This tagging can delimit various constitutive elements of the text genre, such as bibliographic references, sections, abstract, paragraphs, tables, figures, funding, title, subtitles, authorship, keywords, and many others. The technique can retrieve the basic structure of the input text by laying out the root nodes and their possible child nodes, which represent the textual infrastructure. The preprocessing of a genre will be influenced to some degree by the recognition of the superstructure and infrastructure of its compositional organization. |
Fonseca (2018). |
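
Several of the techniques in the table above (segmentation, tokenization, stopword removal, n-grams and TF-IDF) can be illustrated with a minimal Python sketch that uses only the standard library. This is not the implementation used by any of the cited works: the function names, the tiny English stopword list, the regular expressions and the sample corpus are all assumptions made for illustration, and the TF-IDF variant shown (relative term frequency multiplied by log(N / df)) is only one of several common formulations.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are language-specific.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "on", "to"}


def sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase the text and keep runs of word characters as tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens):
    """Filter out tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical two-document corpus.
corpus = ["The cat sat on the mat. The cat slept.", "A dog sat on the log."]
docs = [remove_stopwords(tokenize(d)) for d in corpus]
weights = tf_idf(docs)

print(sentences(corpus[0]))
print(docs[0])
print(ngrams(docs[0], 2))
print(weights[0])
```

In this sketch a term that occurs in every document, such as "sat", receives a weight of zero, which matches the table's remark that very common words can be eliminated after weighting.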