Original Article
Tool for validation and import in a herbarium database

Many biological collection databases suffer from data quality problems. Building on existing computational resources, we present a data import and validation tool. The program applies filters to data submitted as a spreadsheet at import time, streamlining the error-checking process. The validations are divided into three categories: taxonomic, geographical, and general specimen collection data. Its implementation eliminated errors in the entry of new vouchers in the Herbarium of the Botanical Garden of Rio de Janeiro.


Introduction
Given the challenges of protecting natural resources, the generation of knowledge from large biodiversity databases has attracted the attention of society as a whole. These data are essential for taxonomic research and conservation actions, among others (Donaldson 2009; Lavoie 2013; Wen et al. 2015), providing relevant and useful information for preservation policies. Various herbarium management information systems and data portals have been developed in recent years to facilitate access to and integration of collections. Prominent among them is the Global Biodiversity Information Facility, GBIF (GBIF 2010), which serves as an aggregator of data from herbaria around the world. Despite the significant volume of data, even a superficial analysis of the records available at GBIF shows a large difference between the number of flora and fauna records, spatial data whose quality needs to be improved, and visibly low data quality overall. Among the possible explanations, determining the scientific names of plant species is more complicated and requires knowledge of taxonomy. In the case of geographical coordinates, it should be considered that a significant number of collections are old and the collector did not have sophisticated equipment such as GPS.
In addition to the problem of data quality (Chapman 2005b), the handling of large volumes of data has also been the subject of studies (Howe et al. 2008). Such difficulties led to the search for alternative methodologies, such as data mining, one of the stages of the Knowledge Discovery in Databases (KDD) process (Fayyad et al. 1996), which aims to discover patterns in databases. The KDD process has been used in various areas such as marketing, medicine, economics, engineering, management, agriculture, social networks, geography, and the Earth Sciences (Han et al. 2011).
Rodriguésia 70: e03222017. 2019
The cost of developing these databases is high. It includes, for example, the organization of field trips, the storage of plant specimens in the herbarium, computing equipment, and specialized staff for the development and support of information systems. These costs are significant and justify actions both to improve the quality of the data from these collections and to provide more efficient forms of access. We highlight three points for discussion: 1) Is the quality of the data keeping pace with the increasing amount of data available in flora databases? 2) Are the information systems used in herbaria preventing the entry of new data with errors and saving time on the necessary corrections? 3) Is it possible to block the entry of new errors, or what can be done to reduce the number of errors in the data?

Material and Methods
The use of spreadsheets to enter specimen data into herbarium databases is a very common choice among botanists. Entering one record per row, with the different attributes in separate columns, mirrors the format of the books used in herbaria to register collections. The tabular structure of spreadsheets is so typical that some software has layouts that resemble them, for example, Brahms (<http://herbaria.plants.ox.ac.uk/bol>). Maintaining user-friendliness for the end user was therefore the main objective, and the spreadsheet-based data import model was kept in the development of the information system. In addition, editing options such as copy-and-paste, dragging, and replacing values throughout the document allow the user to type data faster and more efficiently. Moreover, considering that a field expedition yields dozens of plant specimens, entering records through a form is a tiresome activity that can be performed more quickly with a spreadsheet.
This article presents a tool whose primary objective is to analyze data from new collections (Chapman 2005a). This software is part of the scientific collections management system known as Jabot (Silva et al. 2017). Our premise is that current computational resources are sufficient to eliminate new entry errors. The import tool, presented at a high level in Figure 1, receives data in spreadsheets.
Figure 1 - System macro view of spreadsheet import. Green lines indicate that the validation result was correct; red lines show the flow of errors, indicating that a new test is necessary.
During the import process, the tool applies 81 filters to identify errors in the primary occurrence data. We divided the validations into three categories, following a study conducted to evaluate the major types of errors encountered in the data of scientific flora collections. The validations for each category are described in more detail next.

Taxonomic
The taxa provided are checked against official lists (Kennedy et al. 2005) available in systems such as Flora 2020 (<http://floradobrasil.jbrj.gov.br>) for species that occur in Brazil and The Plant List (<http://www.theplantlist.org/>) for non-Brazilian species. The main errors in this category are typos, caused by lack of training in the area and the difficulty of reading old identification labels on vouchers, among other reasons, which justifies using the lists as dictionaries. For this check, the tool compares each part of the scientific name, as well as the full name, with the taxa of Flora 2020. When configured for Jabot, the system allows the automatic replacement of the name, including the taxon author, with the official scientific name. This is currently the most advanced category, with interesting work supported by taxonomic tools such as Taxamatch (Rees 2014), a system for approximate string matching in taxonomy. Taxamatch was not used in this project because the names are compared directly with Flora 2020.
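The dictionary-style comparison described above can be illustrated with a minimal sketch. It assumes a small in-memory set standing in for the official list; the names `ACCEPTED_NAMES` and `check_scientific_name` are hypothetical and are not part of Jabot, and only exact comparison is shown, since the text states that approximate matching was not used.

```python
# Hypothetical stand-in for an official checklist; a real list is far larger.
ACCEPTED_NAMES = {
    "Euterpe edulis",
    "Euterpe oleracea",
    "Cedrela fissilis",
}


def check_scientific_name(name, accepted=ACCEPTED_NAMES):
    """Return a list of error messages for one binomial; empty means valid."""
    parts = name.split()
    if len(parts) != 2:
        return [f"expected 'Genus epithet', got {name!r}"]
    genus, epithet = parts
    # Check each part of the name separately so the message can point at
    # the part that failed, then implicitly the full name.
    known_genera = {n.split()[0] for n in accepted}
    known_epithets = {n.split()[1] for n in accepted if n.split()[0] == genus}
    errors = []
    if genus not in known_genera:
        errors.append(f"unknown genus {genus!r}")
    elif epithet not in known_epithets:
        errors.append(f"unknown epithet {epithet!r} in genus {genus!r}")
    return errors


if __name__ == "__main__":
    for candidate in ("Euterpe edulis", "Euterpe eduliz", "Euterpi edulis"):
        print(candidate, "->", check_scientific_name(candidate) or "ok")
```

Checking the genus before the epithet keeps the message specific: a typo in the genus is reported as such instead of as an unknown full name.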

Geographical
Georeferencing is very important for a variety of researchers, and there is a quest for accuracy in the geographical location of collections (García-Roselló et al. 2015). In validating the quality of spatial data, before the validation itself, the tool checks whether the values entered for degrees, minutes and seconds are within their valid intervals. Once they are checked, the tool converts the values to decimal degrees and compares them with the raster of district limits contained in the vector base BC250-IBGE2014 (
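The interval check and the conversion to decimal degrees mentioned in this section can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name `dms_to_decimal` and the separate hemisphere letter are hypothetical, and the later comparison with district limits is not shown because it depends on the cartographic base.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Validate a degrees/minutes/seconds value and return decimal degrees.

    `hemisphere` is assumed to be one of N, S, E, W (an assumption of this
    sketch; the spreadsheet layout is not described in the text).
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"invalid hemisphere: {hemisphere!r}")
    # Latitude is limited to 90 degrees, longitude to 180 degrees.
    max_degrees = 90 if hemisphere in ("N", "S") else 180
    if not 0 <= degrees <= max_degrees:
        raise ValueError(f"degrees out of range: {degrees}")
    if not 0 <= minutes < 60:
        raise ValueError(f"minutes out of range: {minutes}")
    if not 0 <= seconds < 60:
        raise ValueError(f"seconds out of range: {seconds}")
    decimal = degrees + minutes / 60 + seconds / 3600
    if decimal > max_degrees:
        raise ValueError(f"coordinate exceeds {max_degrees} degrees")
    # South and west are negative by convention.
    return -decimal if hemisphere in ("S", "W") else decimal


if __name__ == "__main__":
    print(dms_to_decimal(22, 58, 12, "S"))
    print(dms_to_decimal(43, 13, 30, "W"))
```

Rejecting out-of-range parts before converting means the error message can name the offending field instead of reporting only an implausible decimal value.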

Miscellaneous
This category covers general errors found in specimen data. It contains filters to identify errors caused by the lack of standardization in collector names, errors in collection dates, and incorrectly filled fields, for example, altitude values containing the unit of measure in different formats. The identification of the collector and the collection number are essential for finding duplicates in other systems, for example. The tool also identifies and prevents duplicate collection entries, which occur mainly with large amounts of data. Table 1 gives the primary fields, their descriptions, the validations, and whether each attribute is required. The import and validation tool is available to the public as a way to promote the improvement of data quality, through the Jabot link: <http://jabot.jbrj.gov.br/v2/validarplanilha_externo.php>. Figure 2 presents the result of a spreadsheet parsed by the tool. The mechanism indicates the line and the type of error, to speed up the review of the errors encountered. In the case of the author name, the system suggests the name found in Flora 2020.
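To make the line-and-error-type report concrete, the sketch below applies a few filters of this kind to spreadsheet rows. The column names, the accepted altitude units, the `dd/mm/yyyy` date format and the function `validate_rows` are all assumptions made for illustration; they are not taken from Table 1 or from the tool itself.

```python
import re
from datetime import datetime

REQUIRED = ("collector", "number", "date")  # hypothetical column names
# A number optionally followed by a metre unit, e.g. "850", "850m", "850 m."
ALTITUDE = re.compile(r"^\d+(?:[.,]\d+)?\s*(?:m|msm|meters|metros)?\.?$", re.IGNORECASE)


def validate_rows(rows):
    """Return (line, error_type) pairs; line 1 is assumed to be the header."""
    errors = []
    seen = set()
    for line, row in enumerate(rows, start=2):
        for field in REQUIRED:
            if not str(row.get(field, "")).strip():
                errors.append((line, f"missing {field}"))
        date = str(row.get("date", "")).strip()
        if date:
            try:
                datetime.strptime(date, "%d/%m/%Y")  # assumed date format
            except ValueError:
                errors.append((line, "invalid date"))
        altitude = str(row.get("altitude", "")).strip()
        if altitude and not ALTITUDE.match(altitude):
            errors.append((line, "invalid altitude"))
        # Collector plus collection number identifies a possible duplicate.
        key = (str(row.get("collector", "")).strip().lower(),
               str(row.get("number", "")).strip())
        if all(key):
            if key in seen:
                errors.append((line, "duplicate collection"))
            seen.add(key)
    return errors


if __name__ == "__main__":
    sample = [
        {"collector": "A. Collector", "number": "101", "date": "12/03/2016", "altitude": "850 m"},
        {"collector": "a. collector", "number": "101", "date": "31/02/2016", "altitude": "850 ft"},
        {"collector": "", "number": "102", "date": "01/04/2016"},
    ]
    for line, error in validate_rows(sample):
        print(f"line {line}: {error}")
```

Collecting every error for a row, instead of stopping at the first one, lets the user correct a spreadsheet in fewer attempts.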

Results and Discussion
Information systems often create unexpected difficulties for their users, and one of the leading complaints relates to the system interface. Long forms require more time for the user to enter data into the system. One of the perceived advantages of spreadsheet data entry is therefore directly related to the speed of data entry. Considering the cost of hiring labor to type in samples, this can lead to a considerable reduction in project costs (Gonzalez 2009). The automatic check against the official lists saves the researcher a great deal of time, since otherwise this comparison could only be done individually, i.e., name by name.
Experience with the tool in the Jabot system (<http://jabot.jbrj.gov.br>) of the Rio de Janeiro Botanical Garden Research Institute has shown that users understand the process and regard the tool as a resource that speeds up the entry of data into the information system and the subsequent printing of labels.
Regarding the elimination or reduction of data entry errors in the system, the tool proved to be very efficient. Users ran the import tool 934 times and made 3,548 attempts until the data were fully validated, which represents an average of about 3.8 attempts per spreadsheet. Only 428 imports were completed on the first try, a rate of 46%. Figure 3 shows a chart with the distribution of the number of imported spreadsheets according to the number of attempts required.
One can observe that, as users became acquainted with the system, the number of attempts decreased. In the first two years, use of the system doubled, indicating that the resource has become an easy-to-use tool that streamlines the collector's work of importing samples. The primary factor in achieving this goal was the study and creation of filters for the key fields used in the spreadsheet.

Conclusions
Much attention has been given to the analysis of errors contained in databases, but systems do not adequately prevent the entry of new low-quality data. Even with various techniques to assess data quality, the time required to clean these data carries a high cost for herbaria and publishers. Users should be aware that other researchers also use the data they enter. Even if a researcher does not need precise coordinates for their own work, they should keep in mind that these data will be used by other researchers in studies of ecology, conservation, predictive modeling and climate change, among others.
The motivating factor for its use is that the tool streamlines the import of specimens by performing batch checks on spreadsheets, sparing the user from having to look up scientific names individually in official sources, and suggesting corrections.
In the current phase of development, data mining is being evaluated to identify outliers in collections and to standardize collector names (Silva 2016). The association analysis identified suspicious collector names. Even though the tool requires extra time from the user to correct errors in the data, the experience of the first three years of use leads to the conclusion that users adapt quickly to its application.