SNP discovery and characterisation in White Rhino (Ceratotherium simum) with application to parentage assignment

Abstract The white rhino is one of the great success stories of modern wildlife conservation, growing from as few as 50-100 animals in the 1880s, to approximately 20,000 white rhinoceros remaining today. However, illegal trade in conservational rhinoceros horns is adding constant pressure on remaining populations. Captive management of ex situ populations of endangered species using molecular methods can contribute to improving the management of the species. Here we compare for the first time the utility of 33 Single Nucleotide Polymorphisms (SNPs) and nine microsatellites (MS) in isolation and in combination for assigning parentage in captive White Rhinoceros. We found that a combined dataset of SNPs and microsatellites was most informative with the highest confidence level. This study thus provided us with a useful set of SNP and MS markers for parentage and relatedness testing. Further assessment of the utility of these markers over multiple (> three) generations and the incorporation of a larger variety of relationships among individuals (e.g. half-siblings or cousins) is strongly suggested.


Introduction
Due to intensive protection and conservation efforts, the Southern white rhinoceros (Ceratotherium simum simum) have increased from a population of less than 100 at the end of the 19th century, to an estimated population of over 20,000 (Emslie, 2012). However, the illegal trade in rhinoceros horn in many parts of the world especially in Asia where the rhinoceros horns are used traditionally as material in sculptures or as drug products for medicinal purposes (Hsieh et al., 2003) is adding constant pressure on remaining populations. Currently, the remaining white rhino populations are being intensively managed as small isolated groups thus monitoring and maintaining genetic diversity is a key concern for long term survival of this species (Emslie and Brooks, 1999). Potential consequences of a reduction in genetic variability include (1) the inability of the species to adapt to changes in their environment and (2) inbreeding, whereby the expression of rare deleterious alleles may contribute to developmental, reproductive and immunological impairments (Pertoldi et al., 2007;Väli et al., 2008). In order to maintain genetic diversity and the species' evolutionary potential, a recovery strategy can be employed whereby gene flow amongst populations is enhanced (Lowe and Allendorf, 2010). Translocation can be considered as an option in in the case of the white rhino. However, an analysis of genetic structure is required in order to ensure that outbreeding depression due to the introduction of mal-adapted genes does not occur (Pertoldi et al., 2007;Väli et al., 2008).
Single nucleotide polymorphisms (SNPs) represent the most abundant type of DNA variation in the vertebrate genome and are distributed across the entire genome providing broader genome coverage as compared to mitochondrial DNA or microsatellites (MS) (Ryynänen and Primmer, 2006;Pertoldi et al., 2007). In addition, SNPs offer higher recovery of information from degraded DNA samples since the DNA target sequence in SNP-based genotyping is appreciably shorter (50-70 bp) than that in microsatellite-based genotyping (80-300 bp) (Morin et al., 2004;Ryynänen and Primmer, 2006;Butler et al., 2007;Pertoldi et al., 2007). In contrast to microsatellites, SNP genotyping reveals polymorphisms directly on the DNA sequence, and thus data is automatically standardized across chemistries, hardware platforms and laboratories (Smith et al., 2005;Glover et al., 2010). Furthermore, the development of high through-put genotyping platforms permits simultaneous genotyping of thousands of loci, enabling the identifications of highly diagnostic panels (Glover et al., 2010).
In this study, we compare the power of parentage assignment of 33 SNPs and 9 MS markers in isolation and in combination in a captive population of white rhinoceros. Development of a marker set that accurately determines parentage will provide information on the relationships and relatedness among individuals, contribute to the management of captive white rhinoceros worldwide, and additionally provide insight into mating systems in wild populations.

Materials and Methods
Blood samples were collected from 32 white rhinoceros in South Africa. Blood aliquots were first treated by mixing 100 mL of blood with 1000 mL nuclease free water followed by centrifugation at 1500 x g for 2 min to reduce the number of red blood cells and improve DNA yields. Genomic DNA was extracted from the resulting pellet using the ZR Genomic DNA TM -Tissue Mini-Prep kit (Zymo Research) following the manufacturer's instructions. A SNP enriched library was constructed using DNA from 5 individuals and digestion with Endonuclease V as previously described (Labuschagne et al., 2015). This protocol was used without any changes. Subsequent SNP enriched amplicons were cloned into pJET vector using the CloneJET PCR Cloning Kit (Thermo Scientific) and Z-Competent JM109 E. coli cells (Zymo Research). Clones containing fragments ranging from 200-700 bp were selected and sequenced utilising a Big Dye V3.1 Terminator Kit and an ABI 3500XL genetic analyser. The potential SNP loci were amplified in the 5 isolates used for the initial DNA pool. Amplification reactions were done in a final volume of 25 mL containing 30 ng DNA, 25 pM of each primer and 2X DreamTaq ® Green Master Mix (Thermo Scientific). Thermal cycling consisted of initial denaturation at 95°C for 5 min, 45 cycles of denaturation at 95°C for 30 s, annealing at 55-59°C for 30 s, extension at 72°C for 90 s, followed by final extension at 72°C for 10 min. Resulting amplicons were inspected on 1% agarose gels followed by purification and sequencing as described above. Sequences were inspected and aligned in CLC Bio Genomics work bench 8.0.1 (CLC bio, Denmark). Twelve resulting SNP markers were further typed in the remaining 27 isolates. GENEPOP version 4.0.10 (Raymond and Rousset, 1995) was used to test for deviations from expected Hardy-Weinberg (HW) proportions, to evaluate loci for gametic disequilibrium and to determine allelic richness. Differences in mean observed heterozygosity (Ho), mean expected heterozygosity (He) and mean number of alleles was determined using Cervus v3.03 (Kalinowski et al., 2007). All 32 samples were further typed for 21 previously described SNP markers through Sanger sequencing (Labuschagne et al., 2013(Labuschagne et al., , 2015. Nine microsatellite loci: BR6 (Cunningham et al., 1999), DB44, DB66, DB49, DB1 (Brown and Houlden, 1999), RHI7C, RHI32A, RHI7B (Florescu et al., 2003), SW35 (Rohrer et al., 1994) were also used. Markers were selected based on previously reported polymorphism in white rhinoceros. The PCR optimization for each locus was as follows: 2 ng of template DNA, 1.5-2.5 mM MgCl 2 , 2 mM dNTP's, 1 mM forward and 1 mM reverse primer, 0.10 UTaqDNA polymerase and ddH 2 O to a final volume of 15 mL. PCR cycles were as follows: initial denaturing stage at 94 ºC for 3 min, followed by 30 cycles of 94 ºC for 30 s, annealing at 50-55°C for 30 s, extension at 72 ºC for 30 s and a final step of 72 ºC for 20 min. Products were electrophoresed on an ABI Prism 3130 DNA sequencer (Applied Biosystems). Allele sizes were estimated by comparison with a Genescan TM 500 LIZ TM internal size standard (ABI, Foster City, CA) using the ABI programs GENESCAN (version 1.2.2.1) and GENOTYPER (version 1.1). Sanger sequencing was performed in both directions and SNP calls were only made on bases with quality scores higher than Q > 20. The SNPs all fall within the central region of the fragments where sequencing quality is the highest. In order to ensure accurate genotyping, the samples were repeated if they were homozygous, aberrant stutter patterns or spurious peaks were observed or if the profiles were below the quality score. Differences in mean observed heterozygosity (Ho), mean expected heterozygosity (He) and mean number of alleles was determined as mentioned above.
Parentage assignment was evaluated using the MS and SNP data sets individually and as a combined dataset. The software program Cervus v3.03 (Kalinowski et al., 2007) was implemented for parentage assignment using likelihood. The program uses multilocus parental exclusion probabilities (Selvin, 1980) and pair-wise likelihood to assign parent pairs to offspring. Cervus calculates the loglikelihood of each candidate parent being the true parent relative to an arbitrary individual and then calculates the difference between the two most likely parents (Delta, D). Critical values of D are determined by computer simulation. Using the real data for allele frequencies, simulation parameters were set at 10,000 offspring, with 100% of candidate parents sampled and a total proportion of loci typed over all individuals of 0.99, mistyping error rates = 0.01 and likelihood calculation error rates = 0.01, permitting two unscored loci. Strict confidence was set to 95% while the relaxed confidence level was 80%.

Results
Twelve SNPs (GenBank accession numbers 1416044499-1416044509) were identified in this study across 11 loci (WR1-WR11). The primer sequences and allele frequencies of the 12 SNPs developed here together with a further 21 SNPs for the 32 individuals are listed in Table 1. The PIC ranged from 0.060 to 0.396 with a mean of 0.2742. The observed and expected heterozygosity Labuschagne et al. 85 86 SNP discovery in White Rhino ranged from 0.065 to 0.656 and from 0.063 to 0.520, respectively. Marked BGN deviated from Hardy-Weinberg equilibrium. Large differences between the observed and expected heterozygosity were also observed for four markers namely: WR1, WR8-Y, WR11 and Tru-3. The observed deviations may be attributed to small sample size. Linkage disequilibrium was observed between markers ACTC-2/ACTC-3, GLUT2F-1/GLUT2F-2 and Tru2-1/Tru2-2/Tru2-4/Tru2-5. Such linkage is not unexpected since these SNPs are in close proximity on the same locus. The nine MS markers, primer sequences and allele frequencies for the 32 individuals are listed in Table 2. The PIC ranged from 0.259 to 0.578 with a mean of 0.4282, while observed and expected heterozygosity ranged from 0.273 to 0.654 and from 0.298 to 0.655, respectively. None of the MS loci deviated significantly from Hardy-Weinberg equilibrium and no linkage disequilibrium was observed. Only two alleles were observed in four of the markers, while four markers exhibited three alleles and one marker, five alleles, resulting in a mean allele number (Na) of 2.7.
The 32 individuals consisted of 11 known mother/offspring groups with two mothers having two offspring as illustrated in Figure 1. There was a further seven juvenile samples, which did not have known mothers in the data set as well as one adult female without any offspring. The data set included four adult male samples presumed from observational data to be the possible paternal candidates for all 11 juvenile samples with known mothers. Parentage analysis was conducted with all ten adult females as maternal candidates group, all four adult males as paternal candidate group against all 18 juveniles as offspring set. The summary of parentage assignment for maternal candidates is given in Table 3 and paternal candidates in Table 4. The SNP dataset achieved a combined first parent nonexclusion probability of 0.0889, the MS data set 0.2755 and the combined data sets 0.0153. The combined second parent non-exclusion probabilities were 0.0072, 0.0735 and 0.0001 for SNP, MS and combined data sets respectively. Using the SNP data set, all 11 juveniles were correctly as-signed to their mothers with no pair loci mismatching noted. The MS data set correctly assigned ten out of the 11 parent offspring pairs. No pair loci mismatching was noted in any MS assignments including the wrong assignment of maternal candidate WR-22 to juvenile WR-110. In order to assess the effect of missing MS loci on the assignment of parentage, analysis on the assignment of mothers to a subset of samples; WR101 and WR44.1 was conducted. In both cases a reduction of MS loci from nine to five maintained correct assignment (positive LOD scores), however an absence of three markers resulted in a drop of the pair confidence from 95% to 80%. All 11 juveniles were correctly assigned using the combined data set with ten assignments having confidence of 95% and one with 80%.
Using the SNP data set, six paternal allocations could be made with 95% confidence. Using the MS data set, two paternal allocations can be made with 95% confidence and six with 80% confidence. Two of the allocations with positive scores correspond between the two data sets. Paternal allocations on the combined dataset were conducted in two ways; where maternal assignments were fixed and without known mothers. Using the combined data set with unknown mothers, five paternal allocations could be made with 95% confidence (Table 4). In the case of fixed maternal assignment, six paternal allocations could be determined with 95% confidence. In this analysis, all allocations determined with unknown mothers remained, however an additional assignment could be made with 95% confidence (WR-17 assigned as the father of WR-5, pair LOD score 2.63). Table 5 includes the parentage assignments when siblings WR-101/WR-5 and WR-15/WR-106 are included in the pool of maternal candidates. Using only SNP data, the correct maternal candidates are assigned to WR-5 and WR-106. WR-5 was wrongly assigned as best maternal candidate for both WR-101 and WR-15. Using only MS data the assignments are correct except for WR-101 which has a higher LOD score than the true mother for WR-106. Using the combined data sets the assignments are all correct with 95% confidence. All other assignments remained as stated in Table 3. Labuschagne et al. 87 Figure 1 -Diagram illustrating the known relationships between 24 rhino samples. Rectangles indicate potential paternal candidates, ovals the maternal samples and hexagons the offspring.

Discussion
Together with the 12 new SNPs identified in this study, 33 SNPs are now available for white rhino (Labuschagne et al., 2013(Labuschagne et al., , 2015. The SNPs, were discovered through random selection and sequencing of cloned fragments from a SNP enriched library. Ascertainment bias is often a concern when using SNPs in population studies, with bias introduced by heterogeneity in the SNP discovery process, varying sample sizes or differences in sample composition leading to underestimation or overestimation of the frequency of SNPs (Nielsen and Signorovitch, 2003;Clark et al., 2005). Ascertainment of SNPs through discovery in particular populations or genomic regions does not bias the results of parentage inference in any way since the parentage analysis is not concerned with the inference of evolutionary history (Anderson and Garza, 2006). In effect, SNP ascertainment leads to an advantage in parentage inference, since ascertainment typically leads to an overrepresentation of intermediate allele frequency SNPs, the type of loci that are most powerful for parentage (Anderson and Garza, 2006). The SNP loci presented here contain extra flanking data to allow for Sanger sequencing. Shorter amplicons may be designed in the future to optimise their utility in degraded DNA samples.
To our knowledge this is the first study to employ SNP and MS markers for parentage analysis in white rhino. The current SNP set outperform the MS set during maternal assignment, where all assignments were correct while the MS data set allocated one maternal sample incorrectly. In general, assignments made with the SNP data set had higher confidence than those with the MS set. Confidence levels increased when combining the two data sets. The increased accuracy of the SNP markers in this study over MS markers can be attributed to the greater marker numbers in the SNP data set and low allele numbers of the MS markers. It is apparent that low levels of genetic diversity characterise white rhino populations and the results from our study (Na=2.7; PIC=0.4282) are consistent with other studies making use of MS markers. Harper et al. (2013) reported Na=2.722 and PIC=0.329 for 367 rhino samples, while Guerier et al. (2012) reported Na=2.72 and PIC=0.357 in a sample set of 31 individuals. Florescu et al. (2003) observed higher values, Na=2.8 and PIC=0.4812 in a sample set of 30 individuals, but selected specifically for highly polymorphic loci, which may account for the elevated genetic diversity estimates in their data. The low levels of genetic diversity observed in white rhinos may be attributed to the small (20-40 individuals) founder population and subsequent bottleneck (Walker and Walker, 2012), Challenges to parentage assessment can arise when family members other than the parents of the offspring are included in the pool of candidate parents (Jones and Ardren, 2003). Inclusion of either half-or full-siblings in the pool of candidate parents may pose the most problematic situations (Jones and Ardren, 2003). In the current study 88 SNP discovery in White Rhino    two pairs of siblings were available to evaluate the effect on assignment when included in the parental pool. Inclusion led to some wrong assignments when using the two marker sets separately, but not in the combined data set. It would seem that the combined data set has enough discriminating power for accurate assignment in the current population even when siblings are included in the parental pool, however as relatedness levels increase so should the number of markers. In extremely inbred populations this may reach prohibitive numbers. Further assessment of the utility of these markers over multiple (> three) generations and the incorporation of a larger variety of relationships among individuals (e.g. half-siblings or cousins) as well as a larger set of samples is strongly suggested.