Representation of structured data of the text genre as a technique for automatic text processing

Fonseca, Claudia Aparecida; Guelpeli, Marcus Vinícius Carvalho; Souza Netto, Rafael Santiago de


Articles • Texto livre 15 • 2022 • https://doi.org/10.35699/1983-3652.2022.35445


Abstract

This article is situated at the intersection of Natural Language Processing (NLP) and Language Studies and is based on a corpus compiled with computational tools. It assumes that corpus generation and annotation should be closely tied to an assessment of the constitutive elements of the source text genre. Through specific studies of structured data from the text genre ‘scientific article’, it aims to demonstrate alternatives to automatic text processing techniques. To this end, the authors created a computational model for compiling CorpACE, a specialized linguistic corpus representative of the scientific article genre. The object of study comprises the constitutive elements of scientific articles, marked up in XML and collected from the SciELO (Scientific Electronic Library Online) database. The final product is a database of information extracted and structured in XML format, which designates and identifies the markup of the genre under analysis and is available to many tools and applications. The results demonstrate how representing the constitutive elements of the genre can condense the available information through hierarchical and dynamic processes built during compilation. The study concludes that more research is needed to bring Language Science and Computer Science closer together, with emphasis on NLP, in order to represent and manipulate linguistic knowledge at its many levels (morphological, syntactic, semantic and discursive) and thereby improve the implementation of automatic text processing.

Keywords:
Corpus linguistics; Natural language processing; Scientific article; Text genre; Corpora annotation

Technique	Approach	Description	Publications
Removal of stopwords (filter)	Linguistic	Filtering process for removing words with little relevance, in an attempt to measure all the information that does not constitute knowledge in the text. The idea behind this filter is to remove words that carry little or no content, such as articles, prepositions, pronouns, conjunctions, adverbs, numerals and interjections. Additionally, terms that appear very commonly or very rarely are probably not significantly relevant and can also be removed.	Luhn (1958), Salton and McGill (1983), Frakes and Baeza-Yates (1992), Lui, Li, and Choy (2007) and De Oliveira Júnior and Esmin (2012).
TF-IDF (Term Frequency - Inverse Document Frequency)	Statistical	Term Frequency (TF) is based on the premise that the importance of a term in a document is directly proportional to how often it occurs there. Inverse Document Frequency (IDF) is based on the premise that the specificity of a term can be measured by an inverse function of the number of documents in which it occurs. This technique therefore consists in weighing the importance of each term against a background corpus, normally made up of documents from the same domain, and then eliminating the list of very common words.	Luhn (1958), Jones (1972), Bhatia and Jaiswal (2015), Liu, Li, and Feng (2017) and Rocha and Guelpeli (2017).
Latent Semantic Analysis (LSA)	Hybrid	A method that draws on synonymy and polysemy to extract and represent the semantic meaning of words in a context. This representation is obtained from mathematical calculations and applications that analyze the relationship between terms and documents, decomposing them into index vectors.	Landauer, Foltz, and Laham (1998) and Scarton and Aluísio (2010).
N-grams	Statistical	This technique is based on word co-occurrence and permits a statistical prediction of two or more terms appearing in a certain sequence in a text. An n-gram is a contiguous subsequence of n items from a given sequence of text or speech.	Cohen (1995), Liu, Webster, and Kit (2009), L. F de Alencar (2010), A. F. de Alencar (2013a) and Tonelli and Pianta (2011).
Segmentation	Hybrid	Segmenting the content of the text into individual sentences, each of which represents a minimal semantic set for defining a proposition.	Lin, Hsieh, and Chuang (2009), Sousa, Kepler, and Faria (2010) and A. F. de Alencar (2013b).
Tokenization (text segmentation into words)	Hybrid	The process of segmenting a sequence of text characters into the sequence of significant units (words) that compose the text. Spaces and punctuation are generally adopted as token delimiters for western languages.	Webster and Kit (1992), Sousa, Kepler, and Faria (2010), A. F. de Alencar (2013b) and Silva, Trindade, et al. (2015).
Lemmatization and stemming	Linguistic	Lemmatization is the representation of each word of the input text in its primitive form (lemma). Stemming removes suffixes and prefixes from a term so that it is reduced to its root (stem).	Lovins (1968), Sousa, Kepler, and Faria (2010) and Rolim, Ferreira, and Costa (2016).
Part-of-Speech (POS) tagging	Linguistic	Tagging the words of the input text with their respective grammatical classes and syntactic distributions. The main morphosyntactic tagging techniques are: rule-based, which follows tagging rules manually coded by linguists; probabilistic, which employs statistical methods in which each word has a finite set of possible tags and is labeled with the most probable one; and hybrid, which combines rule-based and probabilistic techniques.	Lau et al. (2008), Domingues, Favero, and De Medeiros (2008), Sousa, Kepler, and Faria (2010), A. F. de Alencar (2013b) and Santos and Zadrozny (2014).
Text genre tagging	Linguistic	Tagging the main characteristics of the input text genre. This technique permits building the structural model in a tree format and adding linguistic data; information about the relationships between elements of the production context, or between sentences or sentence fragments of the general text infrastructure; and the visualization of the constitutive dimensions of the base genre. The tagging can delimit various constitutive elements of the text genre, such as bibliographic references, sections, abstract, paragraphs, tables, figures, funding, title, subtitles, authorship, keywords, and many others. It can retrieve the basic structure of the input text by laying out the root nodes and their possible child nodes, which represent the textual infrastructure. The preprocessing of a genre will be influenced by the recognition of the superstructure and infrastructure of its compositional organization.	Fonseca (2018).
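
Several of the techniques in the table above (tokenization, stopword removal, n-grams and TF-IDF) can be illustrated in a few lines of Python. The sketch below is only a minimal illustration under stated assumptions: the stopword list is a tiny hypothetical sample, the tokenizer simply treats non-word characters as delimiters, and the IDF term uses the natural logarithm of N/df, which is one common variant and not necessarily the one used by the authors or the cited works.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}


def tokenize(text):
    """Split text into lowercase word tokens; spaces and punctuation act as delimiters."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Weight each term of each document (a list of tokens) by TF * IDF.

    TF is the relative frequency of the term in the document;
    IDF is log(N / df), where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {term: (c / total) * math.log(n_docs / df[term]) for term, c in counts.items()}
        )
    return weights


if __name__ == "__main__":
    texts = ["The corpus of the article", "The corpus is annotated in XML"]
    docs = [remove_stopwords(tokenize(t)) for t in texts]
    print(docs)
    print(ngrams(docs[1], 2))
    print(tf_idf(docs))
```

In this toy example the term "corpus" occurs in both documents, so its IDF is log(2/2) = 0 and its weight vanishes, which matches the premise that terms spread across the whole collection are not specific to any one document.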

Value	Description
Cases	Cases/case studies
Conclusions	Conclusions/final considerations/final remarks
Discussion	Discussions
Introduction	Introduction/synopsis
Materials	Materials used
Methods	Methodology/methods
Results	Results
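
To make the role of these values concrete, the sketch below groups the paragraphs of an XML-marked article by section type. It is a hedged illustration only: it assumes the values in the table are carried by a `sec-type` attribute on `<sec>` elements, in the style of JATS markup, and the sample document, the function names `sections_by_type` and `word_count`, and the element layout are invented for the example rather than taken from the CorpACE corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element and attribute names are assumptions.
SAMPLE = """<article>
  <body>
    <sec sec-type="Introduction"><title>Introduction</title><p>Background of the study.</p></sec>
    <sec sec-type="Methods"><title>Methodology</title><p>Corpus compilation steps.</p></sec>
    <sec sec-type="Results"><title>Results</title><p>Main findings.</p><p>Further findings.</p></sec>
  </body>
</article>"""


def sections_by_type(xml_text):
    """Map each section type to the list of paragraph texts found in it."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for sec in root.iter("sec"):
        sec_type = sec.get("sec-type")
        if sec_type is None:
            continue  # skip sections that carry no type
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        grouped.setdefault(sec_type, []).extend(paragraphs)
    return grouped


def word_count(paragraphs):
    """Count whitespace-separated words across a list of paragraphs."""
    return sum(len(p.split()) for p in paragraphs)


if __name__ == "__main__":
    grouped = sections_by_type(SAMPLE)
    for sec_type, paragraphs in grouped.items():
        print(sec_type, word_count(paragraphs))
```

Grouping by section type in this way is what allows later steps to process, count or discard each constitutive element of the genre separately.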

Statistics	Total
Number of articles	87
Total word count	750380
Word count of smallest article	6144
Word count of largest article	18758
Average word count per article	8625
Word count filtered from constitutive elements of genre	130128
Word count of Corpus EdReal	383506
Word count of Corpus EdPesq	366874
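
The figures in this table can be cross-checked with simple arithmetic. The short sketch below assumes that Corpus EdReal and Corpus EdPesq are the only two components of the full corpus, which the table suggests but does not state outright; the variable names are illustrative.

```python
# Figures copied from the statistics table above.
n_articles = 87
total_words = 750380
edreal_words = 383506
edpesq_words = 366874

# Assumption: the two named corpora together make up the whole corpus.
subcorpora_total = edreal_words + edpesq_words
average_words = round(total_words / n_articles)

print(subcorpora_total == total_words)  # True
print(average_words)                    # 8625
```

Both checks agree with the reported values: the two corpora sum to the total word count, and the rounded mean matches the stated average per article.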



[1] Corresponding author: Cláudia Aparecida Fonseca