Mendelian breeding units versus standard sampling strategies: Mitochondrial DNA variation in southwest Sardinia

We report a sampling strategy based on Mendelian Breeding Units (MBUs), each representing an interbreeding group of individuals sharing a common gene pool. The identification of MBUs is crucial for case-control experimental design in association studies. The aim of this work was to evaluate whether the severe sample selection required by the MBU approach biases genetic variability and haplogroup frequencies in the MBU sample. To this end, the MBU sampling strategy was compared to a standard selection of individuals according to their surname and place of birth. We analysed mitochondrial DNA variation (first hypervariable segment and coding region) in unrelated healthy subjects from two different areas of Sardinia: the area around the town of Cabras and the western Campidano area. No statistically significant differences were observed when the two sampling methods were compared, indicating that the stringent sample selection needed to establish an MBU does not alter original genetic variability and haplogroup distribution. Therefore, the MBU sampling strategy can be considered a useful tool in association studies of complex traits.


Introduction
Population definition, sample selection and choice of markers are crucial points in human population genetics studies, and the sampling strategy depends principally on the questions being asked. In addition to biological aspects, such studies should also take into account important sociocultural parameters, such as language and religion, along with social and self-identity affiliation. If a human population is clearly ethnically-identified and recent admixture is negligible, sampling strategies based only on surname (whenever distinctive) and place of birth are preferred, since they allow exclusion of recent immigrants, not yet blended into the gene pool, from the analysis. Moreover, surname and place of birth criteria can be extended from the DNA donors to their ancestors, provided that genealogical information is available.
A more stringent sampling strategy is required in studies based on genome-wide association scans, which look for different allele distributions between individuals with (cases) or without (controls) a phenotype of interest. The case-control experimental design is expected to be appropriate in surveys on homogeneous populations, whereas both false-positive and false-negative results may occur in heterogeneous or substructured populations, if cases and controls are not carefully sampled according to their origin. This scenario is likely to occur in an island like Sardinia, where the majority of the present population is distributed among 363 isolated villages (Siniscalco et al., 1999) which, while sharing common ancestry, might have diversified during many centuries of isolation. Therefore, it is important to identify true Mendelian Breeding Units (MBUs), i.e. interbreeding groups of individuals sharing a common ancestral gene pool. In Sardinia, the most practical way to define an MBU is to derive a direct estimate of the percentage of endogamous mating occurring in the last 200 years. This information was obtained anonymously from municipal and ecclesiastical marriage registers (Siniscalco et al., 1999). However, rigorous sample selection for reconstructing MBUs leads to a conspicuous reduction in sample size, which might significantly skew haplotypic or allelic frequencies. In a previous paper (Siniscalco et al., 1999), we reported a pilot study on 55 unrelated controls belonging to the MBU of Carloforte, who were genotyped at six markers. We showed there that the allele frequencies, and therefore the genomic profile, remained constant even when only a subset of 20 individuals was analysed.
The main goal of this work was to evaluate the reliability of the MBU approach in describing genetic variation in human populations, particularly regarding its application to association studies of complex traits.
We compared genetic variability in two sets of samples which included different individuals recruited from the same areas, using two different sampling strategies. With the Standard (STD) Method, individuals unrelated for at least two generations were selected on the basis of the surname and place of birth of their grandparents, depicting present-day genetic variation, with the sole exclusion of the most recent immigrants. Using the MBU Method, the selected DNA donors were proven to be descendants of individuals present in the 17th century archives, with no common ancestors for at least five generations. This was ascertained by means of a complete check of their genealogical history, based on the official records made available to us by the City Halls. Samples collected using the latter method, being representative of population settlements before the migratory events of the last few centuries, allow an extension of the temporal resolution of genetic variability. Therefore, comparison of the two sampling methods might also reveal possible occurrences of diachronic genetic variation in the analysed areas, due to micro-evolutionary dynamics such as drift or gene flow from neighbouring populations.
The analysed samples belong to two different sociocultural areas, Cabras and western Campidano, whose cultural traits differentiated around the second half of the 19th century: the former, and its neighbouring area, became a flourishing fishing centre, while the latter consists of rural villages whose economy is based on farming and sheep raising.
We studied mitochondrial DNA (mtDNA), since it has been extensively used as a molecular marker during the past 20 years, is maternally inherited, does not recombine and is in a haploid state; thus it is more sensitive than nuclear DNA to the effects of genetic drift and gene flow, and any discrepancy between the two sampling methods is expected to be enhanced.

Sample selection
Using the MBU strategy, we analysed 85 unrelated healthy subjects from two areas located in southwestern Sardinia: 35 individuals from Cabras and 50 individuals from western Campidano (Figure 1). Using the STD strategy, we analysed 71 unrelated individuals coming from the same areas. Comparison was performed between 48 samples from Cabras and its neighbouring area (up to 50 km) and 23 samples from the western Campidano area. 188 Sanna et al.

mtDNA analysis
Whole genomic DNA was extracted using standard procedures. For each individual, mitochondrial haplogroup affiliation was determined by both sequencing of the first hypervariable segment (HVS-I) of the control region from position 15997 to 16399 bp (Anderson et al., 1981) and RFLP (Restriction Fragment Length Polymorphism) analysis of the coding region for the presence/absence of haplogroup diagnostic markers (see Table 1 for details).

Data analysis
BioEdit software 7.0.5.2 (Hall, 1999) was used to align the sequences obtained. To characterise genetic variation among sampling sites, estimates of the number of polymorphic sites (S), the number of haplotypes (h), the nucleotide diversity (Pi), and the haplotype diversity (Hd) were obtained using the DnaSP 4.10 software (Rozas and Rozas, 1999). Pearson chi-square (χ²) values (Pearson, 1900) were calculated in order to assess whether there was any difference between the haplotype frequency distributions obtained for the same areas by means of different sampling strategies (MBU and STD). Principal Coordinate Analysis (PCoA) was carried out on the matrix of DNA pairwise differences, using the GenAlEx 6.3 software (Peakall and Smouse, 2006). The method based on the covariance matrix with data standardisation was applied. In order to assess the occurrence of significant genetic structuring among samples, analysis of molecular variance (AMOVA) was performed on the matrix of pairwise DNA distances among haplotypes, using the Arlequin 3.1 computer package (Excoffier et al., 2005). Furthermore, genetic differentiation between pairs of samples was estimated by pairwise FST values, computed from the matrix of haplotype DNA pairwise differences. The significance of variance components and F-statistics was assessed by a random permutation test (10,000 replicates).
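As an illustration of the diversity indices named above, the following minimal sketch computes S, h, Hd and Pi for a set of aligned sequences. It assumes Nei's unbiased estimator for haplotype diversity and the mean number of pairwise differences per site for nucleotide diversity; it is not DnaSP's implementation, the function names are hypothetical, and the toy sequences are invented for the example rather than taken from the HVS-I data.

```python
from collections import Counter
from itertools import combinations


def segregating_sites(seqs):
    """S: number of alignment columns showing more than one state."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def haplotype_diversity(seqs):
    """Hd = n / (n - 1) * (1 - sum(p_i^2)), with p_i the haplotype frequencies."""
    n = len(seqs)
    sum_sq = sum((count / n) ** 2 for count in Counter(seqs).values())
    return n / (n - 1) * (1.0 - sum_sq)


def nucleotide_diversity(seqs):
    """Pi: mean number of pairwise differences, divided by sequence length."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / (len(pairs) * len(seqs[0]))


# Invented toy alignment (4 sequences, 8 sites); not real HVS-I data.
toy = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACATACGA"]

print("S  =", segregating_sites(toy))
print("h  =", len(set(toy)))
print("Hd =", round(haplotype_diversity(toy), 4))
print("Pi =", round(nucleotide_diversity(toy), 4))
```

Sites with gaps or ambiguous bases are treated here as ordinary characters; dedicated software typically handles them separately, so values may differ on real alignments.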

Results
Nucleotide sequence analysis of HVS-I (GenBank accession numbers: HM584611-HM584695 for MBU samples, and HM594952-HM595022 for STD samples) combined with RFLP analysis allowed the clustering of samples from both MBU and STD strategies into nine main haplogroups. The number rose to eleven when sub-haplogroups K and U5b3 were also considered (Table 2). Haplogroup H, which includes the Cambridge Reference Sequence (CRS) (Anderson et al., 1981), proved to be the most common. Haplogroup U5b3, reported as Sardinian-specific (Fraumene et al., 2006; Pala et al., 2009), was found in Cabras MBU, western Campidano MBU and Cabras STD, and was absent only from western Campidano STD. The values of genetic diversity, calculated for the HVS-I dataset, were similar for all regions and sampling strategies considered, showing a high level of variability (Table 3). Furthermore, we found a total of 82 different haplotypes. Those whose occurrence was detected by both sampling methods (MBU and STD) showed comparable relative frequency distributions, with no significant Pearson chi-square values (Table 4). Nucleotide sequences from the control region were combined with RFLP data on the coding region to obtain a single dataset for the following analysis.
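The comparison of haplotype frequency distributions between the two sampling strategies can be sketched as a Pearson chi-square on a contingency table with one row per strategy and one column per shared haplotype. The example below is a minimal illustration only: the function name is hypothetical and the counts are invented, not those of Table 4. It returns the statistic and the degrees of freedom; the P value would then be read from the chi-square distribution.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of strategy and haplotype.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts of three shared haplotypes (columns)
# in MBU (first row) and STD (second row) samples.
counts = [[10, 20, 5],
          [12, 18, 5]]

chi2, dof = pearson_chi_square(counts)
print("chi-square =", round(chi2, 4), "with", dof, "degrees of freedom")
```

With many rare haplotypes the expected counts become small and the chi-square approximation less reliable, which is one reason to interpret such tests cautiously on sparse haplotype tables.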

Mendelian breeding units in Sardinia 189
The first two coordinates of PCoA, which account for 62.39% of the total variability, identify two main groups of haplotypes. However, haplotypes were grouped neither according to the geographic area of origin (Cabras or western Campidano) nor according to the sampling strategy adopted (MBU versus STD) (Figure 2).
Accordingly, the analysis of molecular variance (AMOVA) did not indicate significant genetic differentiation among samples (FST = 0.0096, p > 0.05). Indeed, nearly all variance was found within samples (99.04%), whereas differences among samples accounted for only 0.96% of the total variation. These results were further confirmed by the pairwise comparison of samples, which did not show any significant genetic differentiation (Table 5).
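The reported partition is internally consistent: with 0.96% of the variance among samples and 99.04% within samples, the fixation index is 0.0096 / (0.0096 + 0.9904) = 0.0096. The sketch below shows how such an index and its permutation P value can be derived from pairwise differences, assuming a one-way AMOVA design in which samples are the only grouping level. It is not Arlequin's code; the function names are hypothetical, the sequences are invented, and the P value is taken simply as the fraction of permutations with an index at least as large as the observed one.

```python
import random
from itertools import combinations


def n_diff(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def ssd(seqs):
    """Sum of squared deviations: summed pairwise differences divided by n."""
    return sum(n_diff(a, b) for a, b in combinations(seqs, 2)) / len(seqs)


def phi_st(groups):
    """Among-group variance divided by total variance (one-way AMOVA)."""
    pooled = [s for g in groups for s in g]
    n_total, n_groups = len(pooled), len(groups)
    ssd_within = sum(ssd(g) for g in groups)
    ssd_among = ssd(pooled) - ssd_within
    msd_within = ssd_within / (n_total - n_groups)
    msd_among = ssd_among / (n_groups - 1)
    # Average group size corrected for unequal sample sizes.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (n_groups - 1)
    var_among = (msd_among - msd_within) / n0
    var_within = msd_within
    return var_among / (var_among + var_within)


def permutation_p(groups, n_perm=1000, seed=0):
    """Shuffle individuals among groups and recompute the index each time."""
    rng = random.Random(seed)
    observed = phi_st(groups)
    pooled = [s for g in groups for s in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if phi_st(shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm


# Invented toy samples; the data must contain some variation overall.
sample_a = ["ACGT", "ACGT", "ACGA", "ACGT"]
sample_b = ["ACGT", "ACGA", "ACGT", "TCGT"]
index, p_value = permutation_p([sample_a, sample_b], n_perm=1000)
print("index =", round(index, 4), "P =", p_value)
```

The index can be slightly negative when samples differ less than expected by chance; such values are conventionally read as no differentiation.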
Furthermore, network analysis showed similar relationships among haplogroups without geographical structuring when the two sampling methods were compared (Figure 3).

Discussion
Estimates of genetic diversity (Table 3) obtained for the two sampling strategies were similar, and the STD samples did not show the higher levels of repeated haplotypes that might have been expected. This finding supports the possible occurrence of a homogeneous population shared by both the western Campidano and Cabras areas, with a constant high level of genetic variability in the samples obtained by the two sampling methods and low levels of stochastic forces.
The similarity of genetic diversity values between areas and sampling strategies may be explained by the lack of diachronic divergence between the present and past genetic settlement of the western Campidano and Cabras areas. Furthermore, this finding is attributable to the absence of genetic drift in the analysed areas. Indeed, this stochastic force, if present, could lead to genetic heterogeneity due to random loss of haplogroups and alteration of their frequencies. The absence of higher levels of identical haplotypes among the STD samples suggests that no significant founder effects affected the population recently. Consistently, PCoA applied to the combined dataset (control region + coding region) (Figure 2) grouped MBU and STD samples together without genetic structuring. Such similarity was also confirmed by the corresponding non-significant P values of FST.
Network analysis was also consistent with the results above. The two sampling strategies displayed similar global relationships among mitochondrial haplogroups without geographical structuring, showing that mtDNA haplogroup frequencies and distribution obtained by the MBU method were not skewed by the severe sample selection of the method used.
Overall, these results suggest a lack of genetic variation in southwest Sardinia, probably due to a continuous [...] and controls (Risch and Botstein, 1996; Terwilliger and Weiss, 1998), even in isolated populations like Finns and Sardinians (Eaves et al., 2000; Taillon-Miller et al., 2000).
Pooling individuals belonging to different breeding units may merge alleles that might have different frequencies in different villages, as we have previously reported for some common polymorphisms in Sardinian villages (Robledo et al., 2002).
As previously shown, in a well-defined breeding unit, a small sample was sufficient to describe the genomic profile of the population, which was not affected by severe reduction of sample size (Siniscalco et al., 1999). More importantly, the repeated application of our strategy in different MBUs offers the advantage of reducing the risk of false-positive results due to population stratification, since obtaining similar artifactual results in different MBUs is not anticipated.
In conclusion, the comparison of the variability detected by means of the MBU and STD sampling methods points to a diachronic continuity of the genetic structure of southwestern Sardinia. The benefit of the MBU sampling strategy lies in the possibility of: i) selecting the original population on the basis of written documents and not by inferring surname monophyletism, and ii) not excluding from the analysis unrelated individuals with polyphyletic surnames, when present, in the founder families.
Our results confirm that the MBU sampling strategy, despite the drastic reduction in sample size, does not introduce deviations in gene frequencies, even if haploid markers such as mtDNA are used. Therefore it can be considered a useful tool in association studies of complex traits, making it possible to infer the genetic settlement of the population, recovering the deepest branches of a genealogy and avoiding the recent contribution of foreign peopling.