Evaluation of partial 12S rRNA, 16S rRNA, COI and Cytb gene sequence datasets for potential single DNA barcode for hylids (Anura: Hylidae)

: We evaluated the extent of intraspecific and interspecific genetic distances and the effectiveness of predefined threshold values using the main genes for estimates of biodiversity and specimens’ identification in anurans. Partial sequences of the mitochondrial genes for small (12S) and large (16S) ribosomal subunits, cytochrome c oxidase subunit I (COI) and Cytochrome b (Cytb) of the family Hylidae were downloaded from GenBank and curated for length, coverage, and potential contaminations. We performed analyses for all sequences of each gene and the same species present in these datasets by distance and tree (monophyly)-based evaluations. We also evaluated the ability to identify specimens using these datasets applying “nearest neighbor” (NN), “best close match” (BCM) and “BOLD ID” tests. Genetic distance thresholds were generated by the function ‘threshVal’ and “localMinima” from SPIDER package and traditional threshold values (1%, 3%, 6% and 10%) were also evaluated. Coding genes, especially COI, had a better identification capacity than non-coding genes on barcoding gap and monophyly analysis and NN, BCM, BOLD ID tests. Considering the multiple factors involved in global DNA barcoding evaluations, we present a critical assessment of the use of these genes for biodiversity estimation and specimens’ identification in anurans (e.g. hylids).


INTRODUCTION
DNA barcoding aims to use large-scale tracing of a reference gene to assign unknown individuals to species and increase the discovery of new species (Hebert et al. 2003).For animals, the use of the cytochrome c oxidase subunit I mitochondrial gene (COI) has been effective for several taxa, such as ground beetles of Germany (Raupach et al. 2018), flies and dragonflies of Brazil (Koroiva et al. 2017, 2018) and anurans in central South America (Koroiva et al. 2020).However, its effectiveness seems to vary depending on the group analyzed (Waugh 2007).In this sense, the use of other genes for identification has been suggested, such as the mitochondrial gene 16S ribosomal subunit (16S) (Vences et al. 2005b).One of the groups in which the best gene to use has remained ambiguous is amphibians.
Within Amphibia, anurans are the richest order and are considered the most threatened vertebrate group in the world; about 32% of their species are at risk of extinction (Stuart et al. 2004).As stated by Lyra et al. (2017), Anura has the potential to greatly benefit from the use of DNA barcoding once this approach provides information on populations that may be cryptic species and assists in taxonomic identification.These procecsses can be quite difficult-even for specialists-because both parallel and convergent evolution have led to an extremely conserved morphology.Some genetic regions, especially within mitochondrial DNA, have been suggested for the molecular identification of anurans.Hebert et al. (2003), Che et al. (2012), and Murphy et al. (2013) defended the use of COI given its capability of identifying many species.Vences et al. (2005b) discusses that the priming sites of COI are highly variable in anurans and suggested that 16S may represent a better alternative because its universality and robustness of primers.Mitochondrial 12S rRNA (12S) and Cytochrome b (Cytb) genes are standard markers for phylogeny reconstruction in amphibians (Vences et al. 2005a) and have also been used to assist molecular identification of species.Nevertheless, COI and 16S are considered the main genes for the molecular identification of anuran species (Murphy et al. 2013).
Currently, many researchers use 16S as a DNA barcode marker for biodiversity estimates in amphibians.However, the use of COI has been increasing in recent years (e.g.Lehr et al. 2017, Zhao et al. 2017, Gao et al. 2019, Koroiva et al. 2020).As evidenced by Lyra et al. (2017), amphibian taxonomists have used different thresholds for COI and 16S to flag candidate species in the absence of a more resolved taxonomy.Vences et al. (2005a) provided an initial conservative threshold for the identification of potential candidate species from 5% divergence in 16s and 10% divergence in COI.Fouquet et al. (2007) suggested a threshold of 3% for the 16S marker for Neotropical frogs, which has become the standard value applied for the preliminary hypotheses of many candidate species in the amphibian integrative taxonomy (Vieites et al. 2009).Crawford et al. (2010) working in a more restricted geographical area, used an 8.6% limit for COI and more than 2% for 16S.Lyra et al.
(2017) suggested 3% for 16S and 6% for COI using their own database of anurans.For 12S, the threshold values at species-level are commomly the same as those used in 16S (eg. Howlader et al. 2016), while in Cytb, the threshold values are those evaluated to large vertebrate datasets such as 5.52 ± 1.34 for sibling species and 10.69 ± 1.34 for species within a genus (Kartavtsev & Lee 2006).Despite the lack of specific studies, they are widely used in phylogeny studies and therefore have a large number of deposited sequences that make it a potential gene on delimiting species based on genetic distances.
Few studies have directly compared the relative efficacy of these molecular markers within a single group of anurans.In general, these studies compare only regional or local databases (e.g.Vences et al. 2005b, Lyra et al. 2017).The presence of a gap between the maximum intraspecific genetic distance value and lowest genetic distance to the nearest neighbor species within a study group-called the barcoding gap-is considered a key factor in the selection of a gene for the rapid identification of unknown samples (Meyer & Paulay, 2005).At regional levels, there are favorable analyses for both COI and 16S (e.g.Vences et al. 2005b, Lyra et al. 2017).Despite this, an overall association between the markers and even between the same species is still non-existent.
Therefore, in order to clarify the effectiveness of these mitochondrial genes for species identification in anurans, the present study aimed to comprehensively sample 12S, 16S COI and Cytb from online repository (GenBank).Here, we evaluate the presence or absence of a global barcoding gap among taxa and the ability to identify them as unique species using thresholds.Considering the pronounced genetic variation within the Anura order, we decided to analyze the Hylidae family.This family is one of the most speciose among the anurans (about Cienc (2022) 94(4) e20200825 3 | 15 10% of all anurans described; 730 species) and is found worldwide (see Frost 2020).This kind of study is important for assessing the ability to identify anura species with molecular tools as well as their usefulness for biodiversity estimation and species differentiation.The other purpose of the study was to indicate the most favorable choices of genes and primers for taxon identification and high-throughput applications.I).

MATERIALS AND METHODS
The datasets were further manipulated using Geneious v.9.0.5 (Kearse et al. 2012) in combination with Microsoft Excel™ (Microsoft Corporation, Redmond, WA).To enable comparisons of verified data and to ensure intraspecific and interspecific comparisons, all undeterminated sequences were removed; i.e., sequences with imprecise taxonomic labels (e.g."sp.", "cf", "gr", "aff") were excluded from the alignements to be studied.Note that these instances only composed a small fraction of the full dataset (Table I).Valid names and taxonomic synonyms were checked against a standardized list of Frost (2020) on April 13, 2020.
Potential contaminants were then controlled by a BLASTn search of the genes' sequences against the non-redundant sequence database on GenBank (Koroiva & Kvist 2018).In addition to the evaluation of all sequences ("All sequences" in the Results section), we also analyzed only the species that had sequences for 12S, 16S, COI and Cytb ("Species in common" in the Results section).Considering the limited number of species present in the Cytb dataset, we also performed comparing the database of the other three genes ("Species in common without Cytb" in the Results section), which have more three times the number of species in common.When comparing only datasets of common species between 12S and 16S, the results are close to the analysis of "All sequences" datasets (data not shown).

Target region and directionality
First, the 5'-3' direction of the sequences was confirmed by comparison to sequences that were previously determined to be in the correct direction (KY385762 Dryophytes japonicus for COI, FJ784413 Smilisca phaeota for 16S, MG282247 Dryophytes immaculatus for 12S and AF205095 Dryophytes japonicus for Cytb).This was controlled through direction plots using MAFFT ver.7.309 (Katoh & Standley 2013) implemented in Geneious v.9.0.5.Also, in order to increase robustness in the homology statement and elevate matrix occupancy, long sequences were truncated to cover only commonly used regions for DNA barcoding and phylogenetic studies.
For hylids (Vences et al. 2012), the COI region can be amplified using the "Folmer" primer pair (HCO2198 and LCO1490) and the 16S rRNA region using the "Palumbi" primer pair (16SA-L and 16SB-H).The 12S and Cytb region were truncated to cover the region from the primer pairs t-Phe-frog/MVZ59 and 12Sb-H (Goebel et al. 1999) and L14850 and H15502 (Tanaka-Ueno et al. 1998), respectively.Several articles used these primers to amplify Hylidae species (e.g.Vences et al. 2012, Jeong et al. 2013, Orrico et al. 2017).The truncation was carried out using Geneious v.9.0.5, following the positioning of the aforementioned primers.Sequences of the four genes with a length below 400 basepairs (bp) were removed.

Genetic analysis
MAFFT ver.7.309 was also used to align and realign sequences within each of the datasets separately by employing the FFT-NS-1 strategy with a gap opening penalty of 3.0, the 200PAM/K = 2 scoring matrix, and an offset value of 0.0.These alignments were inspected and manually refined.We use genetic distance to assess the ability to correctly identify and the effectiveness of using the proposed thresholds for the identification of specimens.First, the identification capacity followed the design of Yodphaka et al. (2018), where it was assessed whether there was a sufficient gap between intraspecific and interspecific distances, called "barcoding gap" and whether the closest neighbor was co-specific (the test of the closest neighbor; Meier et al. 2006).The genetic distance between individual sequences was calculated using the p-distance model in the "dist.dna"function in the "SPIDER" package (Brown et al. 2012) of the R software (R Development Core team 2020).Traditionally, many studies use the K2P distance model in barcode analyzes, however, this has been challenged and the p-distance has been proposed to be a better model (Collins et al. 2012, Srivathsan & Meier 2012).
The quantification of the barcoding gap was calculated from the difference between the shortest interspecific distance and the longest intraspecific distance.These distances were determined with the "nonConDist" and "maxInDist" functions in the "SPIDER" package (Brown et al. 2012).The graphs are plotted as violin plots using PlotsOfData (https://huygens.science.uva.nl/PlotsOfData/).The Kruskal-Wallis and Mann-Whitney tests were used to determine differences in the mean number of genetic distances by gene (P<0.01)using PAST v.3.18software (Hammer et al. 2001).The percentage An Acad Bras Cienc (2022) 94(4) e20200825 5 | 15 of correct identification was calculated from the number of sequences with a specific closest neighbor, divided by the total number of sequences.The test was performed with the "nearNeighbor" function on the "SPIDER" package (Brown et al. 2012).We also evaluated which species had barcoding gap.A marker with high discriminatory power must have a high percentage of correct identifications from the nearest neighbor test and a positive value for the barcode gap.
For threshold analysis, we evaluated the quality of the data set by simulating a specimen identification scenario based on sequence-sequence using R software with the APE and SPIDER packages (Brown et al. 2012, Popescu et al. 2012).The identification was provided following two different criteria: Best closed correspondence (BCM) and barcode of all species (BOLD ID).The BCM criterion assigns identifications to the closest match with a distance below a defined limit.The BOLD ID criterion through the "threshID" function simulates the BOLD ID mechanism (http://www.boldsystems.org/index.php/IDS_OpenIdEngine)applying the predetermined limit and consulting all the sequences of a single species below it.The results are reported as correct when they correspond to previous morphological identifications, otherwise, a result is considered incorrect.A query can provide ambiguous results if the sequences of different species are those that correspond below the limit (BCM) and if the divergences of sequences of different species are below the limit (BOLD ID).A query that results in "no ID" does not match below the defined threshold.
We used six different threshold values: the value obtained by using the "ThreshVal" function in the "SPIDER" package (from now on "ThreshVal"), which minimizes cumulative identification errors, that is, the sum of the false positive (without specific matches within the query limit) and the negative-negative (sequences of several species within the limit).The value obtained by the 'localMinima' function (SPIDER package), which indicates the minimum in density, corresponds to the transition between intra and interspecific distances.The value of 1%, which is the default used by the BOLD ID mechanism.The value of 3%, which is used in studies of species delimitation with the 12s and 16s genes in anurans (e.g.Howlader et al. 2016), and the value of 6%, which is the standard suggested by Lyra et al. (2017) for the delimitation of species for COI and close to the value suggested by Kartavtsev & Lee (2006) for Cytb.Finally, we use the value of 10%, as suggested by Vences et al. (2005a) and Kartavtsev & Lee (2016) for COI and to congeneric species for Cytb, respectively.
We perform the counting of monophyletic groups in each database using the "Monophyly" function in the "SPIDER" package.This metric was used only and simply to determine the ability of a marker to recover monophyly among sequences of the same species.As presented by Knowles & Carstens (2007), the use of a single gene does not fully qualify for species identification.In this sense, the high number of species with reciprocal monophyly in a gene may be indicative that the time needed to coalesce has elapsed, these being preferably chosen in multiloci studies.

Datasets
The "All sequences" datasets for 12S, 16S, COI and Cytb consisted of a total of 1,690, 2,712, 1,737 and 2,425 individual sequences from 259, 276, 156 and 53 species, respectively (Fig. 1a; nucleotide alignments provide in Supplementary Appendix S1-S4).The "Species in common" datasets for 12S, 16S, COI and Cytb included 444, 499, 504 and 1,449 individual sequences from 31 species, respectively.Finally, the "species in common without Cytb gene" datasets for 12S, 16S and COI included 1,036, 1,341 and 1,113 individual sequences from 100 species, respectively.Detailed information about the sequence curation and alignment size are presented in Table I.

Distance-based evaluations
In "All sequences" datasets, on average (mean ± SD), the distribution of the DNA barcoding gap showed that the interspecific distances were greater than the intraspecific distances in 12S (0.00±0.03),COI (0.03±0.05) and Cytb (0.04±0.06), while the intraspecific distances were greater than the interspecific distances in 16S (negative barcoding gap; -0.01±0.06,Fig. 1b).Considering each gene dataset, the percentage of species with barcoding gap was 72.43% for COI, 68.72% for 12S, 63.40% for 16S and 62.26% for Cytb (see full list of species that presented barcoding gap in Supplementary Material -Table SI).

Monophyly analysis
COI recovered the highest percentage of monophyletic groups in the reconstructions of both "All sequences" (75.00%) and "Species in common without Cytb" (80.00%) datasets.In "Species in common" dataset, Cytb produced the largest number of monophyletic groups (77.42%).All markers recovered a percentage of monophyletic groups above 50% (Tables II-IV).
Table II.Results of the identification simulations from "All sequences" datasets using Best Close Match (BCM) and BOLD ID criteria based on SPIDER and tree-based comparison of efficiency among the studied barcoding markers using the percentage of monophyletic groups recovered from the neighborjoining phylogenetic reconstructions.

DISCUSSION
In this work, the efficiencies of four barcode markers species identification within hylids were evaluated.The coding genes, COI and Cytb, showed greater discriminatory power than the ribosomal genes.However, the advantages and disadvantages of each marker must be considered.
Despite the curation of the sequences, an intrinsic limitation of this study is the use of the public database.Genbank is the leading public sequence database with more than 216 million sequences deposited (April 2020).It uses basic checks such as vector contamination analysis, adequate translation of the contamination regions, and genetic annotation.However, unlike the main DNA barcoding database, the BOLD system, GenBank does not store chromatograms or collection metadata, which increases the risk of errors regarding each sequence.Note that GenBank and BOLD system exchange sequences.
Regardless sequences deposition, the generation and submission of incorrect sequences must occur mainly due to incorrect identification of the original material.Murphy et al. (2013) indicated that about 5.00% of the specimens deposited in reference collections of anurans and reptiles have some identification error.Other factors should also be considered for analyzes with pre-defined thresholds such as variability or heterogeneity in the rates of evolution and the presence of cryptic species.(e.g.Hoskin et al. 2005, Murphy et al. 2013, Caminer & Ron 2014).In addition, we cannot rule out the potential for heterotachy, heteroplasmy, pseudogenes, and other changes in evolution rates.We emphasize that, unlike other groups In the proposal for standardized barcode regions for animals, COI was indicated as the main marker (Hebert et al. 2003).Among the main attributes for the use of the COI gene were universality and high rate of substitution in the third codon position (Frati et al. 2000).The great development of public genetic databases for this gene since the publication of Hebert et al. (2003) has provided an important framework for its use in specimen identification.Recent evaluations have advocated its use compared to other genes (e.g.Lyra et al. 2017).The development of new primers and traceable databases, such as the BOLD platform, has been supporting its effectiveness and choice.However, the COI gene is still highly questioned in its application capacity, especially in highthroughput applications, which has been defended in recent years due to the choice of new primers and methodologies for this type of sequencing (see Andújar et al. 2018, Pierce 2019).
Among the most used genes for identification in high-throughput applications is 16s.Vences et al. (2005b) in their seminal article indicated the use of 16S as a true DNA barcode for Amphibia (and even for vertebrates) evidenced its use for pooled samples.At this time until today, as we found out in our database analysis for Hylidae, this gene has the largest number of sequences and species deposited.It is also the main gene used in anuran studies of species description, phylogeny and phylogeography (e.g.Neves et al. 2017, Mângia et al. 2018, 2020, Andrade et al. 2019).Another advantage of this gene is the numerous metabarcoding protocols.Several examples already allow and validate its use in the identification of anurans (e.g.Bálint et al. 2018).However, the use of 16S for species identification has important limitations for interpretation due to its hypervariable domains.The choice of the alignment method should result in different levels of genetic divergence estimates, which can result in a strong effect in analyzes downstream of phylogenetic inference (see Lyra et al. 2017).
The other two genes evaluated in this work have received attention for their use in species identification.12S has been used in the identification by metabarcoding (e.g.Lopes et al. 2017), besides being historically used in phylogenetic and phylogeographic analyzes (e.g.Garda & Cannatella 2007, Andrade et al. 2019).Similar to 16S, 12S has a problem with hypervariable regions.In the case of Cytb, the use of DNA barcoding for vertebrates has been advocated for nearly two decades (e.g.Bradley & Baker 2001).Cytb is considered more variable than the COI (Vences et al. 2005a) and has been used even in population analyzes (Recuero et al. 2006).However, for other taxonomic groups, the large number of highly conserved gene regions may limit the use of this gene in species identification (e.g.Satoh et al. 2016).Furthermore, for Cytb in hylids, the number of species present in the public databases is still far from those found for the other evaluated genes.
The high efficiency in the identification of COI and Cytb samples was corroborated by extent of the so-called barcoding gap and our simulations using BCM and BOLD ID with different tolerance levels.As evidenced by Chapple & Ritchie (2013), the presence of barcoding gap is essential to species delimitation and is the basis for specimen identification (local gap) and species discovery (global gap).For all datasets, the barcoding gap value in coding genes values were significantly higher than those of non-coding genes.Regarding the threshold values, this work An Acad Bras Cienc (2022) 94(4) e20200825 12 | 15 did not aim to define the best threshold value of molecular divergence for hylids.The accuracy of species delimitation using this methodology is influenced by factors such as the quality of the reference database, the geographic extent of the sample, the representativeness of the intraspecific variation, nucleotide replacement rate of the species, among others (see Chapple & Ritchie 2013), which were not evaluated in this work.The threshold values evaluated here are the most commonly used in taxonomic studies and allowed us to simulate their results to sequence-based identification process using our broad genetic databases.The use of these threshold values also allows us to highlight low interspecific distances ("ambiguous" entries in Tables II-III) and to exclude correspondences with high distances ("no ID" entries in Tables II-III).The Bold ID is considered the most rigorous, which does not provide correct identification if all sequences of a given species are not below the proposed limit (see Raupach et al. 2015).In these evaluations, the ability to identify the coding genes was even more prominent compared to the "near neighbor".Finally, the monophyly analysis showed the COI gene as the best indication of the taxa, with the other genes showing similar capacities.
Therefore, our results indicate that the coding genes, especially COI, have a better identification capacity than the non-coding genes, 16S and 12S, for hylids.It is noted that there is still a limitation in the number of species of hylids with coding gene sequences in relation to non-coding genes.This is a fundamental topic to defend the continued use of non-coding genes for species identification.In addition, as evidenced in DNA barcoding analysis of global databases, the use of local/regional databases should always be prioritized considering their best identification precision (see Bergsten et al. 2012), which can change the efficiency in the use of each gene.Besides, these results provide a guide to the use of a single gene in studies to assess biodiversity.Considering that amphibians are facing a series of severe threats and the limitation of taxonomists and financier resources for conservation, we provide in this study an indicative for a better allocation of resources in genetic studies (e.g.construction of DNA barcoding reference databases, specimen identification work) for this representative group.

Figure 1 .
Figure 1.Comparison among the studied barcoding markers for Hylidae from "All sequences", "Species in common" and "Species in common without Cytb" datasets.(a) Number of sequences, species and genera per marker.(b) Distribution of barcoding gap, as defined by the difference between the minimum non-conspecific distance and the maximum conspecific distance.Different letters represents a statistically significant difference in means based on the Mann-Whitney test (P<0.01).(c) The percentage of correct identifications from the nearest neighbor test.

Table III .
Results of the identification simulations from "Species in common" datasets using Best Close Match (BCM) and BOLD ID criteria based on SPIDER and tree-based comparison of efficiency among the studied barcoding markers using the percentage of monophyletic groups recovered from the neighbor-joining phylogenetic reconstructions.

Table IV .
Results of the identification simulations from "Species in common without Cytb" datasets using Best Close Match (BCM) and BOLD ID criteria based on SPIDER and tree-based comparison of efficiency among the studied barcoding markers using the percentage of monophyletic groups recovered from the neighbor-joining phylogenetic reconstructions.