Low STR variability in the threatened marsh deer, Blastocerus dichotomus, detected through amplicon sequencing in non-invasive samples

Abstract Blastocerus dichotomus is the largest deer in South America. We have used 25 microsatellite markers detected and genotyped by Next Generation Sequencing to estimate the genetic variability of B. dichotomus in Argentina, where most of its populations are threatened. Primer design was based on the sequence of a shallow partial genome (15,967,456 reads; 16.66% genome coverage, mean depth 1.64) of a single individual. From the thousands of microsatellite loci found, even under high stringency selection, we chose and tested a set of 80 markers on 30 DNA samples extracted from tissue and feces from three Argentinean populations. Heterozygosity levels were low across all loci in all populations (H=0.31 to 0.40). Amplicon sequencing is a fast, easy, and affordable technique that can be very useful for the characterization of microsatellite marker sets for the conservation genetics of non-model organisms. This work is also one of the first ones to use amplicon sequencing in non-invasive samples and represents an important development for the study of threatened species.


Blastocerus dichotomus is the largest deer in South
America, that inhabits all types of wetlands, such as flooded grasslands, lagoons and swamps with floating marshes (Duarte and González, 2010), which gives it the common name of "marsh deer". It is listed as Vulnerable by the IUCN. In Argentina, most of its populations are threatened (Pereira et al., 2019) which is why starting to address its situation from a genetic perspective is imperative.
Microsatellite loci are widely distributed in the genomes of eukaryotes and have been used without major difficulties in many species (for example, Dayon et al., 2020). The development of markers with the modern Next Generation Sequencing (NGS) approach, which involves the partial sequencing of a genome, makes the analysis of thousands of microsatellite loci possible, allowing them to be chosen under much more stringent conditions. Moreover, NGS technologies can also be used to genotype microsatellites, through PCR amplicon sequencing, which allows for a faster, easier, and more affordable process (Andrews et al., 2018). This approach helps to overcome some of the limitations of former genotype-by-fragment-size microsatellite analyses: unambiguous allele identification, additional information besides number of repeats, and possibility of comparing genotypes across different laboratories (Andrews et al., 2018). Also, the application of NGS techniques for both the design and genotyping of microsatellites allows the analyses of short target sequences (from 40 / 50 bp), which can be particularly important when analyzing low quality or degraded DNA.
The objective of this study was to estimate the genetic variability of the marsh deer in Argentina using a considerable number of new microsatellite markers. We used NGS techniques for the development and characterization of the markers and estimated genetic variability in tissues and noninvasive samples.
We used a blood sample from a marsh deer from the Paraná River Delta for an initial genome analysis. DNA extraction was performed with a Zymo Research kit (Quick DNA Microprep kit) and sonicated with a Covaris M220 Focused-Ultrasonicator for 120 s. The resulting fragments had a mean size of 950 bp. For genomic library construction, we used the Illumina TruSeq kit, following the manufacturer's instructions. Sequencing was done in an Illumina MiSeq, using MiSeq Reagent kit v2 2x250 paired ends following the manufacturer's instructions. After sequencing, demultiplexing and read quality analysis were performed using the Illumina Base Space software. Ninety-eight percent of the resulting reads were used to assemble a partial genome with the software ABySS (Simpson et al., 2009). Genome coverage, understood as the percentage of the genome that had at least one nucleotide sequenced, was estimated with Sequencing Coverage Calculator (Illumina). Mean depth, estimated as the average number of times that each nucleotide was sequenced (Sims et al., 2014) was calculated as L * N / G, with L being the length of the reads, N being the total number of reads and G the length of the reference genome. Since no genome is available for the species we studied, data from the phylogenetically close white-tailed deer, Odocoileus virginianus, whose genome size is 2.4 GB (London et al., 2022) was used. The assembled contigs were used as an input for the microsatellite search using MSATCOMMANDER 1.0.8 MacOS version (Faircloth, 2008). We searched for di, tri and tetranucleotides with a minimum of seven, seven and five repeats respectively. We then designed the primers with the same program, using the following parameters: final product length 30-130bp, primer length 18-23bp, melting temperature between 57 °C and 65 °C, GC content 30-70%, and a GC clamp in the 3' end. From the thousands of primer pairs found, we selected a set of 80 markers at random, that were synthetized with the addition of a sequencing tail (Left sequencing tail: CCCTACACGACGCTCTTCCGATCT, right sequencing tail: GTTCAGACGTGTGCTCTTCCGATCT). These were ordered from Macrogen Korea.
All chosen primers were screened together with the software AutoDimer (Vallone and Butler, 2004) to check for the possibility of heterodimers and hairpins, and then divided into four multiplexes of 20 primer pairs each. Amplicon libraries were built in the Marine Gene Probe Lab at Dalhousie University using the Qiagen Multiplex reagent kit, following manufacturer's instructions. Final volume was scaled to 5 µl per reaction. The cycling program consisted of 94 °C for 15 min, 20 cycles at 94 °C for 30 s, 57 °C for 180 s, 72 °C for 60 s, and a final extension at 68 °C for 30 min. The final products were diluted with 20 µl of ultrapure water (see Zhan et al., 2017 for more details). These products were used as a template for the second PCR, which adds the indices, and consists of 2.15 µl of ultrapure water, 0.5 µl 10x buffer, 0.2 mM of each dNTP, 0.2 µM of indexed oligo, 0.3 µl of diluted PCR product and 0.25 U of TSG DNA polymerase (Bio Basic, Markham, ON, Canada), in a final volume of 5 µl per reaction. The cycling program consisted of 95 °C for 120 s, 20 cycles of 95 °C for 20 s, 60 °C for 60 s, 72 °C for 60 s, and a 72 °C final extension for 10 min (Zhan et al., 2017). We screened all the PCRs, observing DNA bands by electrophoresis, using 2% agarose gels stained with gel green (Biotium, Fremont, CA, USA). Products were pooled in equal proportions and then purified using a 1.8:1 of Sera-Mag Speedbeads (GE Healthcare, Little Chalfront, UK). Library quantification was done with the Kapa Library Quantification kit (Roche, Pleasanton, California) following the manufacturer's protocol. The library was diluted to 15 pM and sequenced on an Illumina MiSeq, using the MiSeq 150 cycle V3 single-read kit. Amplicons from feces and tissues were sequenced independently using different flow cells. This prevents reads from good quality samples overwhelming reads from samples of lower quantity and quality DNA. DNA from feces was analyzed using seven replicates each to build a consensus genotype, as recommended for this type of material (Taberlet et al., 1996). After sequencing, indexed samples were demultiplexed automatically with MiSeq Sequence Analysis software. This resulted in the creation of one FASTQ file per individual, which contained the sequences of all the corresponding microsatellites. These files were used as input for the MEGASAT software (Zhan et al., 2017), which genotyped all the loci automatically, and generated histograms for manual visual genotyping. In addition to analyzing MEGASAT plots, we analyzed the sequences of reads of different lengths from each marker. We aligned the potential alleles with the program GENEIOUS PRIME 2020 (Kearse et al., 2012). This was done to verify the identity of the markers in relation to the original sequences used to design them, excluding markers whose polymorphism was due to indels in the flanking regions or the tandem repeat zone: a further advantage of this method compared to the genotyping by size on a capillary DNA sequencer, where size homoplasy may be a problem.
We used MICROCHECKER 2.2 (Van Oosterhout et al., 2004) to detect large allele dropout and null alleles. In the case of non-invasive samples, we calculated the rate of false alleles and allelic dropout within replicates with GIMLET 1.3 (Valière, 2002). We calculated polymorphic information content (PIC) with the online tool Gene Calc (See the Internet Resources Section). We used Arlequin to calculate H O , H E , number of alleles per locus and to test Hardy-Weinberg and linkage disequilibria.
The sequencing produced 15,967,456 reads. Genome coverage was 16.66%, and mean depth was 1.64. The assembly produced 1,347,182 contigs. The MSATCOMMANDER search found 12,658 microsatellites, from which 511 primer pairs were designed with the conditions described above and 80 were selected from those.
After the amplicon sequencing process, we retained 25 polymorphic markers out of the original 80 tested (Table 1). Null alleles were found for Bdi30, Bdi51, Bdi57, Bdi59 and Bdi65 loci. The allele dropout rate varied between 0.01 and 0.014. In the case of non-invasive replicates, the false allele rate varied between 0.01 and 0.20 and the allelic dropout rate between 0.01 and 0.14, which shows the importance of making replicates when working with this kind of samples. We calculated diversity indices for the three populations (Parana River Delta (PRD): N=19, Ho = 0.38±0.19, He=0.31±0.22; El Bagual (EB): N=5, Ho=0.38±0.24, He=0.40±0.32; Esteros del Ibera (EI): N=6, Ho=0.39±0.24, He=0.38±0.32). Detailed loci data are presented for the Paraná River Delta, the population with the largest N ( Table 2). None of the markers showed deviations from Hardy-Weinberg equilibrium after applying the False Discovery Rate approach (Pike, 2011) to the p-values. There was no evidence for linkage disequilibrium for any pairs of loci.
The use of amplicon sequencing for genotyping made it possible to affordably test a greater number of loci and samples than a research project that uses capillary electrophoresis genotyping reducing time and effort (for example, Latorre-Cardenas et al., 2020;Lim and Ab Majid, 2021). The variability of the microsatellite markers was generally low, with several loci having only two alleles. This is likely due to the drastic reduction in the census sizes of the populations of the species in Argentina because of hunting and habitat degradation in recent decades (Pereira et al., 2019). The observed heterozygosity results are lower than those obtained for other cervid species (Table 3), except the South Andean deer, which has slightly lower heterozygosity levels, probably resulting from very low (<2000 individuals) population size (Corti et al., 2011). Curiously, the highest reported heterozygosity (mean He=0.765) is for the Pantanal population of the same species we studied (Oliveira et al., 2009). We tried to reproduce the analyses using the same nine loci described by those authors, but most loci failed to amplify with our tissue and feces samples even after many rounds of optimization. The only locus that did amplify (Bdc65) was monomorphic in our samples. It is unclear if the differences between our results and those from Oliveira et al. (2009) reflect actual biological differences (related, for example, to larger populations sizes of deer populations in the Pantanal) or to technical problems related to the set of microsatellites described by them.
Another important result of this work was that it showed how amplicon sequencing can be very useful for the characterization and analysis of microsatellites in the Conservation Genetics of non-model organisms, since a large number of markers can be simultaneously analyzed. This can be particularly useful for organisms with low levels of heterozygosity, such as land vertebrates, whose current distribution is only about 5% or their original one (Li et al., 2015) with the consequent loss of gene variation (Willoughby et al., 2015). Over the last few years, analyses of single nucleotide polymorphisms (SNPs) began to replace microsatellites in studies of genetic variability, since they required the amplification of shorter fragments (~50bp versus ~100bp) (Weinman et al., 2015) and presented similar or even greater resolution to that provided by microsatellites (Weinman et al., 2015). Furthermore, although microsatellite markers have been widely used in population genetic studies, they do not provide sequence information and present inconveniences such as homoplasies and null alleles (Estoup et al., 2002). However, the use of SNPs has its own disadvantages. The platforms used to genotype SNPs require expensive and specialized equipment and may not be available to every research laboratory (Andrews et al., 2018). They are also biallelic and may be worse predictors of genetic variability at the genomic level than microsatellites (Smitz et al., 2016). It has been shown that the number of SNP loci needed to achieve equivalent statistical power is between  four and twelve times higher than for microsatellites (Liu et al., 2005). More importantly, many of the currently available SNPs methodologies (such as those from the RADSeq family; Andrews et al., 2018) require genomic DNA of good quality, which is not always available in conservation genetic studies, particularly of rare or elusive species.
This work is one of the first to use amplicon sequencing in non-invasive samples, in addition to Eriksson et al. (2020) and De Barba et al. (2017). This is particularly important considering that the low quality and quantity of the DNA from fecal samples hinders their analyses by mainstream SNP methods like RADseq, GBS or WGS (Andrews et al., 2018).

Acknowledgements
We thank Frederico Henning for the enormous help in the partial genome sequencing. We also thank Javier Pereira for all the support and Proyecto Pantano in which this project is framed. This work was supported by Fundación Ambiente y Recursos Naturales, Agencia de Promoción Científica y Tecnológica (PICT 2016 Nro. 4087 to PM), the Unidad para el Cambio Rural, Ministerio de Agricultura, Ganadería y Pesca de la Nación (Argentina) through GEF (Global Environmental Facility) 090118 TF to Proyecto Pantano, by the Brazilian Research Council (CNPq) and by a Natural Sciences and Engineering Research Council (Canada) Discovery Grant (RGPIN-2019-04679) to DER. LIW was supported by a scholarship from Consejo Nacional de Investigaciones Científicas y Técnicas as a Doctorate Candidate and the Department of Foreign Affairs, Trade and Development (Canada) during her stay at Dalhousie University, and GRMC was supported by a Strategic (NSERC) grant to DER.

Conflict of interest
The authors declare that there is no conflict of interest that could be perceived as prejudicial to the impartiality of the reported research.

Authors Contributions
LIW and AMSC designed the study; LIW collected the samples; LIW and GRMC did the lab work, in PM, AMSC and DR labs, under their supervision. LIW, GRMC and AMSC analyzed the data; All authors contributed to the results discussion and editing the manuscript; All authors have approved the submission of the manuscript to GMB.

Ethics Statement
There is no Ethics Committee for animals in Argentina. All samples were collected and transported according to the following permits and protocols: Research Permit by Organismo Provincial para el Desarrollo Sostenible (Disp. no. 068/16), Authorization 011/15 by Dirección General de Recursos Naturales de la provincia de Entre Ríos, Proyecto NEA N° 488 by Administración de Parques Nacionales, Exportation Permit N° 044869 (Argentina) and Importation Permit N° 19CA04350/CWHQ (Canada) by the Convention on International Trade in Endangered Species of Wild Fauna and Flora, and Exportation Permit N° IF-2019-95678169-APN-DNBI#SGP by Dirección Nacional de Biodiversidad (Ministerio de Ambiente y Desarrollo Sustentable).