Assessment of SNP-SNP interactions by using square contingency table analysis

The evaluation of SNP-SNP interactions has become an active field in genetic epidemiology. Most studies that analyze the relationship between genetic factors and a disease of interest focus on single-SNP associations. However, for quantitative traits influenced by the interplay of environmental factors and more than one genetic factor, the interaction between multiple factors should be taken into consideration. In this study, symmetry models for square contingency tables are applied to cross-classified SNP-SNP interaction data. Results from a genome-wide association analysis of blood pressure are used as prior evidence for the interacting SNPs.


INTRODUCTION
A single nucleotide polymorphism (SNP) is defined as a variation at a single position in a DNA sequence among individuals. A DNA sequence is formed from a chain of four nucleotide bases, namely A, C, G, and T. For example, a variation is classified as a SNP when T is substituted for A in the nucleotide sequence GGAATCG, yielding the sequence GGATTCG.
SNP-SNP interaction is generally defined as the interaction between different loci, and statistical interaction analysis is expected to help explain the etiology of many complex human traits such as diabetes, hypertension and asthma. In recent years, despite the increased number of genome-wide association analyses, interaction analyses of genome-wide data are still few in number. A major drawback of genome-wide association analysis is that the main focus is usually on single-SNP associations. However, the effect of one locus may be masked by the effects of another locus, or the joint effect of two SNPs may be significant even though each is ineffective separately. Thus, interactions play an important role in explaining the missing heritability of complex diseases.
A common and simple way of detecting the interaction between two SNPs is to use regression techniques, in which interaction analysis is straightforward to implement (Hartwing 2013). However, when there are more than two SNPs, classical statistical approaches may lack power due to the high dimensionality. Vaidyanathan et al. (2017) followed a clustering approach that pairs up SNPs by similarity of genetic identity, and then carried out a standard case-control association test using the Cochran-Mantel-Haenszel test to analyze the SNP-disease association.
On the other hand, a case-control study is a common design for testing association by comparing the frequency of SNP alleles in cases, who have been diagnosed with the disease under study, and controls, who are known to be unaffected (Lewis 2002, Clarke et al. 2011). Contingency table analysis methods allow alternative models of genetic relationship by summarizing the counts in different ways. For example, Haber (1982) performed intraclass contingency table analysis for testing independence under the assumptions of marginal homogeneity, quasi-independence and symmetry by cross-classifying maternally and paternally inherited genes. A comparative chi-square analysis was applied by Song et al. (2014) to screen large gene expression data for conserved and differential gene interactions.
The goal of this paper is to provide an alternative approach for analyzing cross-classified SNP-SNP interactions with symmetry models. The symmetry models described in the literature are the complete symmetry, quasi-symmetry and marginal homogeneity models. Using these models, we explore the most suitable symmetry structure among the SNP-SNP pairs that were found to be related in the preliminary study. The symmetry models are applied to 24 SNP-SNP pairs and the most relevant SNP-SNP pair is obtained.
The remainder of the paper describes genetic association and its theoretical properties, the symmetry models and their theoretical background, a real-life data application, and some concluding remarks, in that order.

GENETIC ASSOCIATION
Genetic association studies are used for testing the relationship between the phenotype of interest and a genetic variant. A phenotype can be defined as the observable properties of an organism that are produced by the interaction of genetic and environmental factors. In genetic epidemiology, the phenotype of interest is usually a disease status or a continuous indicator, and in most association studies SNPs are considered as the genetic variants. As mentioned above, a SNP is a variation at a single position in a DNA sequence among individuals and is usually coded in the genotype as a combination of alleles. Consider a SNP consisting of a single bi-allelic locus with alleles a and A. The SNP can then be characterized by three possible categories: aa, aA and AA.
Testing genetic association is performed by using different statistical methods depending on the structure of the phenotype. For continuously measured phenotypes such as blood pressure measurements, linear models are useful tools. When the phenotype is binary (0/1), as in case and control status, a logistic regression model can be used to detect the relationship between the trait and the genetic variation. In genetic epidemiology, case-control studies are widely used designs since the categorical structure of the genotype allows contingency table analysis (Slager & Schaid 2001, Velez et al. 2016). Under the null hypothesis of no association with the disease, the genotype frequencies are expected to be the same in cases and controls. A contingency table can be analyzed by using standard test statistics that measure the divergence of the observed frequencies from those expected under the null hypothesis of no association. For a single bi-allelic SNP with alleles a and A tested in a case-control study, the data consist of six counts of the numbers of genotypes (aa, aA, AA) in cases and controls. For the interaction of two SNPs, the structure of the table changes from 2x3 (Table I(a)) to 3x3 (Table I(b)), separately for cases and controls. In Table I, n is the number of cases, m is the number of controls and N = n + m is the total number of individuals.
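As a rough illustration of the single-SNP test described above, the following sketch computes a Pearson chi-square statistic for a 2x3 genotype-by-status table. The counts and the helper name `pearson_chi_square` are hypothetical and are not taken from the study; the Pearson statistic is used here only as one example of a "standard test statistic".

```python
import math

# Hypothetical genotype counts (aa, aA, AA); not taken from the paper.
cases = [30, 50, 20]      # n = 100
controls = [45, 40, 15]   # m = 100

def pearson_chi_square(table):
    """Pearson X^2 and df for an r x c table of counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count when genotype is unrelated to case/control status
            expected = row_totals[i] * col_totals[j] / total
            x2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

x2, df = pearson_chi_square([cases, controls])
# For df = 2 the chi-square upper-tail probability is exactly exp(-x/2).
p_value = math.exp(-x2 / 2)
print(f"X^2 = {x2:.3f}, df = {df}, p = {p_value:.4f}")
```

A small p-value would suggest that the genotype frequencies differ between cases and controls for this SNP.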
For the interaction analysis of SNPs, a square contingency table, which is a special case of a contingency table, can be obtained as in the above tables. Square contingency tables arise in dependent samples where the row and column variables have the same levels. Specific models should be used in the analysis of such tables. These models mostly describe a symmetric pattern that represents the symmetric structure of the table.
In this study, three different symmetry models are considered: complete symmetry, quasi-symmetry and marginal homogeneity.

SYMMETRY MODELS
Let $n_{ij}$ be the observed frequency in cell $(i,j)$ and let $p_{ij}$ denote the probability of the corresponding cell. Then the complete symmetry (S) model is defined by

$$p_{ij} = p_{ji}, \quad i \neq j,$$

and the S model has $R(R-1)/2$ degrees of freedom (df), where $R$ is the dimension of the square table (Goodman 1985, Bishop et al. 1975).
This model indicates that the probability that an observation falls in cell $(i,j)$ equals the probability that it falls in the symmetric cell $(j,i)$. As an extension of the S model, the Quasi Symmetry (QS) model is defined by

$$p_{ij}\, p_{jk}\, p_{ki} = p_{ji}\, p_{kj}\, p_{ik}, \quad i < j < k.$$

The QS model has $(R-1)(R-2)/2$ df (Yamaguchi 1990). It indicates that the odds ratios on one side of the main diagonal equal the corresponding odds ratios on the other side.
Another extension of the S model, the Marginal Homogeneity (MH) model, is defined by

$$p_{i\cdot} = p_{\cdot i}, \quad i = 1, \ldots, R$$

(… 1955, Tahata et al. 2008). This model indicates that the row marginal distribution is identical to the column marginal distribution.
In genetics, symmetry models can be used to test the interaction between two separate bi-allelic SNPs; a model that fits can be interpreted as a similar genotype distribution occurring in SNP-1 and SNP-2.
Let $p_{ij} = n_{ij}/n$ denote the probability of an individual having the $i$th genotype level for SNP-1 and the $j$th genotype level for SNP-2, $i, j = 1, 2, 3$, where levels 1, 2 and 3 correspond to genotypes "aa", "aA" and "AA". In terms of the S model, the null hypothesis states that $p_{21} = p_{12}$ for genotypes "aa" and "aA", $p_{31} = p_{13}$ for genotypes "aa" and "AA", and $p_{32} = p_{23}$ for genotypes "aA" and "AA". For the QS model, the null hypothesis of no difference is the statement $p_{12}\, p_{23}\, p_{31} = p_{21}\, p_{32}\, p_{13}$.
Let $p_{i\cdot} = n_{i\cdot}/n$ denote the probability of the $i$th genotype level for SNP-1 and $p_{\cdot i} = n_{\cdot i}/n$ denote the probability of the $i$th genotype level for SNP-2. The null hypothesis for the MH model states that the marginal probabilities are equal, namely $p_{\cdot 1} = p_{1\cdot}$ for genotype "aa", $p_{\cdot 2} = p_{2\cdot}$ for genotype "aA" and $p_{\cdot 3} = p_{3\cdot}$ for genotype "AA".
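The three null hypotheses can be inspected descriptively on a single table. The sketch below uses a hypothetical 3x3 cross-classification (the counts are illustrative only, not data from the study) and prints the quantities each hypothesis compares: mirrored cell proportions for S, the product ratio for QS, and the row and column margins for MH.

```python
# Hypothetical 3x3 cross-classification of SNP-1 (rows) by SNP-2 (columns),
# genotype levels ordered aa, aA, AA. Counts are illustrative only.
labels = ["aa", "aA", "AA"]
counts = [
    [40, 12, 5],
    [10, 30, 8],
    [6, 9, 20],
]
n = sum(sum(row) for row in counts)
p = [[c / n for c in row] for row in counts]

# S model: compare each off-diagonal cell with its mirror image.
for i in range(3):
    for j in range(i + 1, 3):
        print(f"p[{labels[i]},{labels[j]}] = {p[i][j]:.3f}  vs  "
              f"p[{labels[j]},{labels[i]}] = {p[j][i]:.3f}")

# QS model (R = 3): the ratio p12*p23*p31 / (p21*p32*p13) equals 1 under QS.
alpha = (p[0][1] * p[1][2] * p[2][0]) / (p[1][0] * p[2][1] * p[0][2])
print(f"alpha = {alpha:.3f}")

# MH model: compare row and column marginal proportions genotype by genotype.
row_marg = [sum(row) for row in p]
col_marg = [sum(col) for col in zip(*p)]
for g, r, c in zip(labels, row_marg, col_marg):
    print(f"{g}: row {r:.3f}  column {c:.3f}")
```

These are sample quantities only; whether the observed departures are compatible with each hypothesis is decided by the goodness-of-fit tests, not by inspection.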
The maximum likelihood estimates of the expected values $e_{ij}$ under the S model are

$$e_{ij} = \frac{n_{ij} + n_{ji}}{2}, \quad i, j = 1, 2, 3.$$

The likelihood equations for the QS model are defined as

$$e_{i\cdot} = n_{i\cdot} \quad \text{and} \quad e_{\cdot i} = n_{\cdot i}, \quad i = 1, 2, 3,$$

$$e_{ij} + e_{ji} = n_{ij} + n_{ji}, \quad i \neq j.$$

Note that marginal homogeneity is not equivalent to a log-linear model (Agresti 2002). The MH model assumes that the marginal totals are symmetric even though the structure of the table itself may be non-symmetric. When $\alpha = (p_{12}\, p_{23}\, p_{31})/(p_{21}\, p_{32}\, p_{13})$ equals 1, the QS model holds. Caussinus (1965) showed that S is equivalent to QS and MH holding simultaneously. Thus, the distribution of a SNP-SNP interaction that satisfies both QS and MH also satisfies S, and vice versa: $S \equiv QS \cap MH$.
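A minimal sketch of how these fitted values might be computed is given below. The S estimates are closed-form; for QS, the sketch assumes iterative proportional fitting, which repeatedly rescales a working table until the three sets of likelihood equations hold. The function names, the fixed iteration count and the counts are illustrative assumptions, not the procedure used in the study.

```python
def fit_symmetry(n):
    """ML expected counts under S: average each cell with its mirror image."""
    R = len(n)
    return [[(n[i][j] + n[j][i]) / 2 for j in range(R)] for i in range(R)]

def fit_quasi_symmetry(n, iterations=2000):
    """Expected counts under QS by iterative proportional fitting.

    Each cycle rescales the working table to match, in turn, the observed
    row totals, column totals, and symmetric sums n_ij + n_ji.
    Assumes every row total, column total and symmetric sum is positive.
    """
    R = len(n)
    e = [[1.0] * R for _ in range(R)]
    for _ in range(iterations):
        for i in range(R):                      # match row totals
            scale = sum(n[i]) / sum(e[i])
            e[i] = [v * scale for v in e[i]]
        for j in range(R):                      # match column totals
            observed = sum(n[r][j] for r in range(R))
            fitted = sum(e[r][j] for r in range(R))
            for r in range(R):
                e[r][j] *= observed / fitted
        for i in range(R):                      # match symmetric sums
            e[i][i] = float(n[i][i])
            for j in range(i + 1, R):
                scale = (n[i][j] + n[j][i]) / (e[i][j] + e[j][i])
                e[i][j] *= scale
                e[j][i] *= scale
    return e

counts = [[40, 12, 5], [10, 30, 8], [6, 9, 20]]   # hypothetical counts
e_s = fit_symmetry(counts)
e_qs = fit_quasi_symmetry(counts)
for row in e_qs:
    print([round(v, 3) for v in row])
```

Because the rescaling steps only multiply cells by row, column and symmetric factors, the fitted QS table satisfies $e_{12}\, e_{23}\, e_{31} = e_{21}\, e_{32}\, e_{13}$ by construction.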

ASSESSMENT OF SNP-SNP INTERACTIONS
The cell distribution of the parameters under the symmetry models can be represented in matrix format. Let $S_{ij}$ denote the element in row $i$ and column $j$ of a design matrix $S$, defined in terms of $k = |i - j|$ (Lawal & Sundheim 2002, Efendioğlu 2015). Thus, for a bi-allelic SNP-SNP interaction, the $S$ matrix has corresponding $R$ and $C$ matrices, where $R$ is the row matrix and $C$ is the column matrix.
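One common way to build such factor matrices is sketched below for a 3x3 table. The labelling scheme (numbering the upper triangle row by row and mirroring it) is an assumption made for illustration; the explicit formula for $S_{ij}$ is not reproduced here, and the function and variable names are hypothetical.

```python
def symmetry_factors(R):
    """Label cells so that (i, j) and (j, i) share the same level."""
    S = [[0] * R for _ in range(R)]
    level = 0
    for i in range(R):
        for j in range(i, R):
            level += 1          # next level for the pair {(i, j), (j, i)}
            S[i][j] = level
            S[j][i] = level
    row_f = [[i + 1] * R for i in range(R)]                  # row index of each cell
    col_f = [[j + 1 for j in range(R)] for _ in range(R)]    # column index of each cell
    return S, row_f, col_f

S, row_f, col_f = symmetry_factors(3)
for name, mat in (("S", S), ("R", row_f), ("C", col_f)):
    print(name, mat)
# S -> [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
# R -> [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# C -> [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
```

Treating the entries of S as a categorical factor gives one parameter per symmetric pair of cells, while R and C supply the row and column effects needed for the QS model.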
Under the null hypothesis that no interaction exists between the SNPs, the test statistic follows a chi-square distribution with the associated df. The likelihood ratio test statistic equals

$$G^2 = 2 \sum_{i} \sum_{j} n_{ij} \log\left(\frac{n_{ij}}{e_{ij}}\right).$$

Several models may fit the data in a square contingency table. In such cases, model selection refers to choosing the best-fitting model among the candidates, and ranking the models by an information criterion is a common approach. A well-known information criterion that can be used for model selection is Akaike's Information Criterion (AIC). The model with the smallest AIC value is the best-fitting model (Akaike 1974).
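The test and the model ranking can be sketched as follows for the S model. The counts are hypothetical, `chi2_sf` is a small helper written for this example, and the AIC form used (`G^2 - 2 * df`, a variant often seen for contingency-table models) is an assumption, since the exact expression used in the study is not shown here.

```python
import math

def g_squared(observed, expected):
    """Likelihood ratio statistic G^2 = 2 * sum n_ij * log(n_ij / e_ij)."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for n_ij, e_ij in zip(obs_row, exp_row):
            if n_ij > 0:                 # a zero count contributes nothing
                total += n_ij * math.log(n_ij / e_ij)
    return 2.0 * total

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:                      # even df: finite Poisson-type sum
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))   # odd df: erfc term plus a finite sum
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)
    return total

counts = [[40, 12, 5], [10, 30, 8], [6, 9, 20]]    # hypothetical counts
R = len(counts)
expected_s = [[(counts[i][j] + counts[j][i]) / 2 for j in range(R)]
              for i in range(R)]

g2 = g_squared(counts, expected_s)
df = R * (R - 1) // 2                    # df of the S model
p_value = chi2_sf(g2, df)
aic = g2 - 2 * df                        # ASSUMED AIC form; see the lead-in
print(f"G^2 = {g2:.3f}, df = {df}, p = {p_value:.3f}, AIC = {aic:.3f}")
```

A large p-value means the S model is not rejected for the pair; repeating the calculation for QS and MH and comparing the AIC values would rank the three models.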

REAL DATA APPLICATION
Hypertension and, relatedly, abnormal blood pressure levels are cardiovascular risk factors. In this paper, to evaluate the performance of the symmetry models in testing SNP-SNP interaction, the results of a genome-wide analysis are used as prior knowledge (Karadağ & Aktaş 2018). The 24 top associated SNPs were detected by using a multilevel latent class modelling approach considering familial and serial correlations. Table II includes the genetic position of the variants, given by chromosome number (Chr) and chromosomal position. The numbers of individuals with recoded genotype levels 1, 2 and 3 are $n_1$, $n_2$ and $n_3$, respectively. The association results are summarized in Table II. For the interaction analysis of SNPs, 3x3 square tables are generated by using the numbers of individuals $n_1$, $n_2$ and $n_3$ given in Table II for the variants. In Table III, the likelihood ratio test statistic $G^2$ and the p-values are summarized, for significant pairs only, under each of the three symmetry models.
According to the results in Table III, as an example of a significant interaction, the pair 1-24 fits all three symmetry structures, with p-values of 0.970, 0.900 and 0.892, respectively (p > 0.05). The distribution of observed counts for this pair is given in tabular form. The three symmetry models are compared for the significant interactions by means of the AIC values. $G^2$, df and AIC values are summarized for each model in Table V. The smallest values of $G^2$ and AIC indicate the SNP pair that best fits the model. SNP-SNP interactions are represented in 3x3 square contingency tables for cases and controls. The row and column variables are labelled "aa", "aA" and "AA", indicating the SNP genotypes.
The S, QS and MH models are applied to the 24 top associated SNP pairs, which are structured in the form of square contingency tables. Although the S model rarely fits data very well, for all the SNP pairs except the pairs 14-15 and 23-24 we see that the relation $S \equiv QS \cap MH$ is satisfied. This means that the goodness-of-fit test statistic for the S model is asymptotically equivalent to the sum of the test statistics for the QS and MH models. The $G^2$ values for the pairs 14-15 and 23-24 are calculated as zero because all off-diagonal elements of their contingency tables are zero.
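The zero $G^2$ values can be reproduced on a small example. The sketch below uses a hypothetical table whose off-diagonal cells are all zero (the counts are illustrative, not those of pairs 14-15 or 23-24) and shows that the S model then reproduces the observed counts exactly.

```python
import math

# Hypothetical table in which both SNPs always show the same genotype level,
# so every off-diagonal cell is zero.
counts = [
    [50, 0, 0],
    [0, 35, 0],
    [0, 0, 15],
]
R = len(counts)

# Expected counts under S: each cell averaged with its mirror image.
expected_s = [[(counts[i][j] + counts[j][i]) / 2 for j in range(R)]
              for i in range(R)]

# Likelihood ratio statistic; empty cells contribute nothing to the sum.
g2 = 0.0
for i in range(R):
    for j in range(R):
        if counts[i][j] > 0:
            g2 += 2 * counts[i][j] * math.log(counts[i][j] / expected_s[i][j])

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
print("G^2 under S:", g2)
print("row totals equal column totals:", row_totals == col_totals)
```

With only diagonal cells populated, the fitted and observed tables coincide and the row and column margins are identical, so there is no lack of fit left for any of the symmetry models to detect.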
For the interactions, data that fit the S model indicate that $p_{ij} = p_{ji}$ holds for $i, j = 1, 2, 3$. We could say that, for instance, $p_{aa,aA} = p_{aA,aa}$, $p_{aa,AA} = p_{AA,aa}$ and $p_{aA,AA} = p_{AA,aA}$ over alleles