Portuguese corpus-based learning using ETL

Milidiú, Ruy Luiz; Santos, Cícero Nogueira dos; Duarte, Julio Cesar

doi:10.1007/BF03192569

Abstract

We present Entropy Guided Transformation Learning (ETL) models for three Portuguese Language Processing tasks: Part-of-Speech Tagging, Noun Phrase Chunking and Named Entity Recognition. For Part-of-Speech Tagging, we separately use the Mac-Morpho Corpus and the Tycho Brahe Corpus. For Noun Phrase Chunking, we use the SNR-CLIC Corpus. For Named Entity Recognition, we separately use three corpora: HAREM, MiniHAREM and LearnNEC06. For each of the tasks, the ETL modeling phase is quick and simple. ETL requires only the training set and no handcrafted templates. ETL also simplifies the incorporation of new input features, such as capitalization information, which are successfully used in the ETL-based systems. Using the ETL approach, we obtain performance competitive with the state of the art in all six corpus-based tasks. These results indicate that ETL is a suitable approach for the construction of Portuguese corpus-based systems.

Keywords: Entropy Guided Transformation Learning; transformation-based learning; decision trees; natural language processing

Ruy Luiz Milidiú^I; Cícero Nogueira dos Santos^I; Julio Cesar Duarte^{I, II}

^I Departamento de Informática, Pontifícia Universidade Católica, PUC-Rio, Rua Marquês de São Vicente, 225, Gávea, CEP 22453-900, Rio de Janeiro - RJ, Brazil. Phone: +55 (21) 3527-1500. {milidiu | nogueira | jduarte}@inf.puc-rio.br

^II Centro Tecnológico do Exército, Av. das Américas, 28705, Guaratiba, CEP 23020-470, Rio de Janeiro - RJ, Brazil. Phone: +55 (21) 2410-6200. jduarte@ctex.eb.br

1. INTRODUCTION

Over the last decade, Machine Learning (ML) has proven to be a powerful tool for building Natural Language Processing systems, which would otherwise require an infeasible amount of time and human resources. Applying supervised learning schemes to language processing tasks requires a corresponding annotated corpus. Corpus-based learning is an attractive strategy, since it makes efficient use of fast-growing data resources [15, 13, 14, 22]. For the Portuguese language, many tasks have been approached using ML techniques, including Part-of-Speech Tagging [1, 10], Noun Phrase Chunking [27], Named Entity Recognition [20, 17], Machine Translation [19] and Text Summarization [11].

Tagged Portuguese corpora are a scarce resource. We therefore focus on tasks for which corpora are available, and select the following three Portuguese Language Processing tasks: Part-of-Speech Tagging (POS), Noun Phrase Chunking (NP) and Named Entity Recognition (NER). These tasks are considered fundamental for more advanced computational linguistics tasks [26, 33, 34, 32]. The three tasks are usually solved sequentially. First, we solve POS tagging. Next, using the POS tags as an additional input feature, we solve NP chunking. Finally, using both the POS tags and the NP chunks as additional input features, we solve NER.
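The sequential scheme described above can be illustrated with a minimal sketch, in which each stage writes its output as a new token feature that later stages can read. This is not the ETL implementation: the function `run_cascade`, the three toy taggers and the label names are hypothetical stand-ins for trained models, chosen only to show how features flow from POS tagging to NP chunking to NER.

```python
from typing import Callable, Dict, List, Tuple

Token = Dict[str, str]                       # one token: feature name -> value
Tagger = Callable[[List[Token]], List[str]]  # sentence -> one label per token


def run_cascade(words: List[str], stages: List[Tuple[str, Tagger]]) -> List[Token]:
    """Run each (feature_name, tagger) stage in order, storing its output
    as a new feature that every later stage can use as input."""
    sentence: List[Token] = [{"word": w} for w in words]
    for feature_name, tagger in stages:
        labels = tagger(sentence)
        for token, label in zip(sentence, labels):
            token[feature_name] = label
    return sentence


# Toy stand-ins for trained models (hypothetical rules, illustration only).
def toy_pos(sentence: List[Token]) -> List[str]:
    # Uses only the word form, here its capitalization.
    return ["NPROP" if t["word"][:1].isupper() else "OTHER" for t in sentence]


def toy_np(sentence: List[Token]) -> List[str]:
    # Reads the POS feature produced by the previous stage.
    return ["I" if t["pos"] == "NPROP" else "O" for t in sentence]


def toy_ner(sentence: List[Token]) -> List[str]:
    # Reads both the POS and the NP features.
    return ["ENT" if t["pos"] == "NPROP" and t["np"] == "I" else "O"
            for t in sentence]


result = run_cascade(
    ["Maria", "mora", "em", "Lisboa"],
    [("pos", toy_pos), ("np", toy_np), ("ner", toy_ner)],
)
for token in result:
    print(token)
```

The order of the stages matters: `toy_ner` would fail if it ran before the `pos` and `np` features had been written, which mirrors the dependency between the three tasks.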

In Table 1, we enumerate the six Portuguese corpora used throughout this work. For each corpus, we indicate its corresponding task and size. This work extends our previous findings on Portuguese Part-of-Speech Tagging [28].
