Computational tools development for the dialectal and lexicographical data processing

Abstract

This paper is situated at the intersection of Corpus Linguistics (O’KEEFFE; MCCARTHY, 2010), Computational Linguistics (KEDIA; RASU, 2020; SRINIVASA-DESIKAN, 2018; MANNING, 2008; MANNING; SCHUTZE, 1999; CHOMSKY, 1965), Dialectology (CARDOSO, 2010; RADTKE; THUN, 1996; CHAMBERS; TRUDGILL, 1994), and Lexicography (TARP, 2008, 2011, 2015; FUERTES-OLIVEIRA; BERGENHOLTZ, 2015; LEROYER, 2011). It presents the development of computational tools capable of processing dialectal and lexicographic data, using a methodology that does not require hiring programming services and instead invites researchers to learn the computing resources needed to manipulate information in a database automatically. The corpus used was that of the Atlas Linguístico do Brasil (ALiB) Project (COMITÊ NACIONAL DO PROJETO ALIB, 2001), specifically the data from the interior municipalities of the ALiB network located in the country’s North region. The construction of these small programs was motivated mainly by two goals: i) to provide lexicographical and electronic treatment of ALiB dialect data; and ii) to develop in-house computational tools that meet the needs of the ongoing doctoral research to which this article is linked. To that end, a database in Extensible Markup Language (XML) was built to store dialectal information in lexicographical format. By executing a few lines of code, it was possible to retrieve specific data from the corpus electronically and to filter the results by the ‘gender’, ‘age’, and ‘location’ variables recorded in the ALiB data.
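
The kind of retrieval and filtering described above can be illustrated with a minimal sketch in Python, using only the standard-library `xml.etree.ElementTree` module. The abstract does not specify the XML schema or the programming language used, so everything below is an assumption made for illustration: the element names (`corpus`, `entry`, `occurrence`), the attribute names, the placeholder values, and the helper `filter_occurrences` are hypothetical and are not taken from the ALiB database.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: element names, attribute names, and values are
# illustrative placeholders, not the schema or data used in the paper.
SAMPLE_XML = """\
<corpus>
  <entry headword="variant-a">
    <occurrence gender="F" age="group-1" location="locality-1"/>
    <occurrence gender="M" age="group-2" location="locality-2"/>
  </entry>
  <entry headword="variant-b">
    <occurrence gender="F" age="group-2" location="locality-1"/>
  </entry>
</corpus>
"""


def filter_occurrences(root, **criteria):
    """Return (headword, attributes) pairs whose attributes match all criteria."""
    matches = []
    for entry in root.iter("entry"):
        for occ in entry.iter("occurrence"):
            # Keep the occurrence only if every requested attribute matches.
            if all(occ.get(key) == value for key, value in criteria.items()):
                matches.append((entry.get("headword"), dict(occ.attrib)))
    return matches


root = ET.fromstring(SAMPLE_XML)

# Example query: occurrences from female informants in one locality.
for headword, attrs in filter_occurrences(root, gender="F", location="locality-1"):
    print(headword, attrs)
```

Passing the filter variables as keyword arguments keeps the query open-ended: any combination of ‘gender’, ‘age’, and ‘location’ can be requested without changing the function, and calling it with no criteria returns every occurrence.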

Keywords:
Dialectology; Lexicography; Computational tools; Programming languages; Database

Universidade Federal de Minas Gerais - UFMG Av. Antônio Carlos, 6627 - Pampulha, Cep: 31270-901, Belo Horizonte - Minas Gerais / Brasil, Tel: +55 (31) 3409-6009 - Belo Horizonte - MG - Brazil
E-mail: revistatextolivre@letras.ufmg.br