Non-random pre-transcriptional evolution in HIV-1. A refutation of the foundational conditions for neutral evolution

The complete base sequences of the HIV-1 virus and of the GP120 ENV gene were analyzed to establish their distance from the expected neutral random sequence. A special methodology was devised for this purpose. Analyses included: a) proportion of dinucleotides (signatures); b) homogeneity in the distribution of dinucleotides and bases (isochores), obtained by dividing the two segments into ten and three sub-segments, respectively; c) probability of runs of bases and No-bases according to the Bose-Einstein distribution. The analyses showed a huge deviation from the random distribution expected from neutral evolution and neutral-neighbor influence of nucleotide sites. The most significant result is the tremendous lack of CG dinucleotides ($p < 10^{-50}$), a selective trait of eukaryote and not of single stranded RNA virus genomes. Results not only refute neutral evolution and neutral neighbor influence, but also strongly indicate that any base at any nucleotide site correlates with all the viral genome or sub-segments. These results suggest that evolution of HIV-1 is pan-selective rather than neutral or nearly neutral.


Introduction
The Neutral Theory of evolution is mostly based on the emergence of new alleles or nucleotide bases by random mutation and their subsequent random fixation, loss or polymorphic maintenance (Kimura, 1968; King and Jukes, 1969; Crow and Kimura, 1970; Kimura, 1979, 1991, 1993). Kimura (1957) based this theory on the random fluctuation of gene frequencies described by stochastic matrices or by the mathematics of Brownian motion. He followed the development performed by Wright (1931) and Feller (1951), and applied the Kolmogorov forward and backward equations (Crow and Kimura, 1970), elaborated for dealing with random motion, to describe the random variation of allele frequencies, in order to predict the random pathway of a new mutant allele.
Studies on codon usage and on synonymous and nonsynonymous substitutions made pure neutralism untenable, so it was replaced by nearly neutral evolution (Kreitman, 1996a, 1996b; Ohta, 1996; Hey, 1999). The status of the Neutral Theory has been extensively revised, and it is considered mostly refuted by phylogenetic analyses of codons, synonymous and non-synonymous substitutions and the different evolutionary behavior of the 1st, 2nd and 3rd codon positions (Nei, 2005). Most, if not all, studies performed to test neutral versus selective evolution compare amino acid or nucleotide variations (involved in protein synthesis and post-transcriptional events) among individuals or taxa. These studies cannot resolve the evolutionary condition of the genetic code itself, the acquisition and maintenance of genetic codes, pre-transcriptional evolution, genome sizes, maintenance of nucleotide isochores and signatures, chromosomal features, replication velocities, non-coding DNA and several other genome traits not related to transcription. Also, these analyses cannot inform on selective processes underlying pure base sequences and those related to the origin of life, when the four bases and the genetic code were established; they are blind to the most important part of evolution (Valenzuela, 2002a). Moreover, these studies have epistemic circularities from which they cannot escape. Studies on synonymous or nonsynonymous substitutions assume, without demonstration (creating a circular tautology), that synonymous substitutions are neutral or less selective than non-synonymous ones, but they cannot determine the absolute selective value of either type of substitution. Also, the strong selective coadaptation of bases at the 1st, 2nd and 3rd codon positions (otherwise they cannot code) is overlooked and dealt with as a necessary constraint of the genetic code.
The current position on the acquisitions of pre-transcriptional evolution is to take them as undebatable constraints (a negative heuristic protective belt). For example, a replacement of adenine by guanine could change the velocity of DNA replication, leading to a strong selective process that is not only invisible to codon analyses, but may also contradict them.
The foundations of the Neutral Theory were established after the discovery of the high frequency of polymorphisms that could not be maintained by balanced selection (heterozygous advantage; Kimura, 1968, 1979; Jukes and King, 1969; Nei, 2005). However, at the molecular level, present genome studies show that for each polymorphic nucleotide site there are hundreds of monomorphic sites, maintained as such for hundreds of millions of generations; such fixations are only possible under selective evolution (Valenzuela and Santos, 1996; Valenzuela, 1997, 2000, 2002b, 2007). The most important factual feature of evolution is the maintenance of genome sequences (the core of being a living being) for thousands of millions of cell cycle generations (think about unicellular and haploid organisms) through different taxa, and not polymorphisms or genetic variability. This is, perhaps, the most important restriction of present evolutionary studies based on comparative phylogenetic analyses, which need genome variability among taxa and are blind to the evolution of the invariant part of genomes, which is their largest proportion. The trans-taxa genome maintenance (fixation) contrasts with individual genome instability. Individual genomes cannot be maintained during their ephemeral life (DNA mutations, cancer, aging). Post-transcriptional neutral-selective analyses cannot be performed on the major part of eukaryote genomes, which have more than 95% of non-coding sequences. Furthermore, these studies cannot quantify the effect of selection and drift on current genomes, the only approach to answer the question on the amount of neutral, nearly-neutral, selective and eventually pan-selective evolution. Foundational errors of the neutral theory do not allow these insufficiencies to be resolved (Valenzuela and Santos, 1996; Valenzuela, 1997, 2000, 2002b, 2007).
The random condition of neutral evolution implies reversibility, that is the transformation of unicellular organisms into multicellular ones should be as probable as the reverse process (Valenzuela, 2007). No study has shown a similar situation. Evolution is directional; we see convergence, not reversibility. The question is, how distant are genome sequences from randomness? Neutral evolution is incompatible with nonrandom distribution of nucleotides, but a random distribution of nucleotides is compatible with selective evolution.
Selective and neutral evolution imply mutation as the origin of variability. While neutral evolution proposes that drift plays a fundamental role in the population destiny of mutations and selection rarely contributes to the process, selective evolution proposes selection as fundamental and drift as a marginal or rare evolutionary process. A quantitative definition of rare or marginal has never been proposed, so as to be tested scientifically.
A methodology based on the quantitative and qualitative deviations of nucleotide sequences from the random neutral expected distribution was developed. This methodology is independent of post-transcriptional processes, phylogenetic variability and comparative analyses, but it includes their molecular bases. It detects selective processes where codon analyses do not show them, and in genomes or genome segments whether or not they vary among taxa. These analyses are complementary to coding and non-coding-region analyses or comparative studies to understand some selective mechanisms. Our first discovery (Valenzuela, 1985; Valenzuela and López-Fenner, 1986) was that the distribution of nucleotides on chromosomes follows a Bose-Einstein distribution of indistinguishable balls (nucleotides) in distinguishable boxes (chromosomes). Then, the expected neutral (random) chromosome length and centromere position could be calculated, founding the mathematical basis for chromosome evolution (Valenzuela, 1985; López-Fenner, 1985, 1986; Valenzuela and López-Fenner, 1986).
The aim of the present study is to screen the whole genome of the HIV-1 virus and a segment that specifies the envelope (GP120 ENV, S-env hereafter), and to establish and quantify their deviations from a random (neutral) distribution, intentionally without any reference to transcriptional processes or phylogeny. This approach was preceded by Gatlin (1976), who used information theory to estimate the expected random sequence of coding DNA to test neutralism. She found a great deviation from randomness in DNA segments. Neutralists (Jukes, 1976; Kimura and Ohta, 1977) quickly counter-argued that a significantly non-random sequence does not necessarily refute neutralism, because the mutation rate at a site could be influenced (as a property of DNA or RNA polymerases) by the neighbor base context of this site. It was an undemonstrated negative heuristic protective hypothesis (an assumed neutral constraint that cannot be tested) to support the Neutral Theory. The debate closed without resolution.
The position of Gatlin was considered satisfied by the unfounded neighbor influence, and non-random sequences were so accepted. However, neutralists did not realize that the neighbor influence does not change the expected random distribution of bases' sequences, because as a permanent property of polymerases, the neutral neighbor influence should also be isotropically and randomly distributed. Due to recurrent mutation, bases at any site are continuously changing; if evolution and the neighbor influence are neutral, the expected base at a site is a vector where the four bases are represented by a probability that is equal for all the sites (isotropy). In a short period, we should expect that each base has a proper neighborhood distributed isotropically along the genome, and the expected base, dinucleotide, trinucleotide, or any nucleotide sequence composition of long segments, should be equal, independently of its genome location.
Dividing genomes into long segments and comparing their mono- or dinucleotide composition should test this isotropy of neutral evolution. Studies of base sequences have been performed and a great heterogeneity has been found. Significantly different isochores (genome segments with similar base composition), maintained over thousands of millions of generations, were found in every genome (Bernardi, 1993). These are macro-isochores (million bps), but micro-isochores (hundred or thousand bps) have also been found in fungi, bacteria and eukaryote organisms, both in coding and non-coding regions (Valenzuela, 1997; this article). Also, big genome segments with different signatures (di-, tri- or multi-nucleotide structures) seem to be the rule in genomes (Karlin and Mrazek, 1997; Mrazek and Karlin, 2007; this article).
Besides isochores and signatures, a great deal of highly or moderately, tandemly repetitive DNA (VNTR, STR) or dispersed (LINEs, SINEs) in eukaryotes show high intra and inter chromosome correlations. The acquisition and maintenance of isochores, signatures and repetitive DNA for hundreds of millions of generations, and their wide intra and inter-chromosome variability refute definitively neutral and nearly neutral evolution and the neutral neighbor influence. It astonishes how the scientific community seems blind or unaware of this conclusive refutation. The random motion of the sand (bases) may build a sand castle (genome), but it cannot maintain the castle, on the contrary, it is the main cause of its destruction (Valenzuela, 2007).
It is important to note that the neighbor influence hypothesis is also valid for selective evolution, because a DNA or RNA sequence could have higher adaptive values than other sequences, as will be shown in this article. So, the neighbor influence hypothesis blurs rather than helps to resolve the selective-neutral condition of evolution. In the present study, conclusive evidence is given on the existence of micro-isochores and micro-signatures among the HIV-1 and S-env base sequences. HIV-1 was chosen because viruses evolve fast (Drake, 1993, 1999; Drake et al., 1998). There are different lines of evidence showing that S-env is under selective pressure (Reiher et al., 1986; Serres, 2001; Yang, 2001; Mani et al., 2002; Kitrinos et al., 2003; Travers et al., 2005; MacNeil et al., 2007). On the other hand, neutral molecular evolution has also been proposed for this gene (Leigh-Brown, 1997; Zhang, 2004).
Here I propose a method to estimate the distance from randomness (neutralism) for any DNA, RNA or amino-acid sequence, independently of the taxon to which it belongs, to test how distant genomes or genome segments (the core of living beings) are from random processes. This method allows measurement of the distance from randomness of that part of living beings by which they stand as living beings (Valenzuela, 2002a).

Material and Methods
The complete cDNA sequence of the HIV-1 virus was obtained from Genbank (accession number AF005495, isolated in Brazil). Also, a cDNA sequence of the GP120 ENV gene (S-env) of HIV-1 was used (accession number AF119820, from Cyprus and Greece). The abbreviations A, T, G and C will be used for Adenine, Thymine, Guanine and Cytosine; their base frequencies will be denoted by $f_A$, $f_T$, $f_G$ and $f_C$, and their numbers by $N_A$, $N_T$, $N_G$ and $N_C$, respectively. Degrees of freedom (DF) for tests are given as subscripts. For huge values of the $\chi^2_k$ test ($k$ DF) an approximation was made taking into account that the $\chi^2_k$ distribution has mean $k$ and variance $2k$; an extrapolation may then be obtained for the decay of the probability according to the number of standard deviations between the $\chi^2$ value and the mean (as a z test, with a correction for the deviation of the $\chi^2$ from the Gaussian distribution according to DF, using known data of the $\chi^2$ distribution). For small expected numbers (< 5) the Poisson distribution was used to calculate significance. For large values of z, the proposition of Freund et al. (2000) for a one-tailed test was used: the area between the mean and z is 0.49997 for z = 4, 0.4999997 for z = 5 and 0.499999999 for z = 6, with extrapolation, according to this tendency, for larger z.

Rationale
Under neutral evolution, mutation and drift are the main evolutionary factors; the probability of finding any of the four bases at any nucleotide site is the same. This probability has been shown to be 0.25 for the four bases, accepting equal mutation rates among them (Jukes and Cantor, 1969; Valenzuela and Santos, 1996; Li, 1997). If transitions and transversions occur with different mutation rates, the expected base frequency will still be 0.25 for the four bases (Valenzuela and Santos, 1996). These probabilities change with different mutation rates among the bases; however, due to the complementarity of A-T and G-C, six parameters are sufficient to describe the system (Sueoka, 1995; Valenzuela, 1997); in this condition the expected $f_A$ equals $f_T$, and the same occurs with $f_G$ and $f_C$. These equalities are not expected for single stranded nucleic acids, where complementarity is not possible.

Analysis of the expected equal proportions of A-T and G-C
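If expected $f_A$ = expected $f_T$ and $f_G = f_C$, then $N_A = N_T$ and $N_G = N_C$. Both equalities can be tested by a $\chi^2_1$ test for equality where the expected numbers are $EN_{A-T} = (N_A + N_T)/2$ and $EN_{G-C} = (N_G + N_C)/2$, respectively. Thus,

$$\chi^2_1 = \frac{(N_A - EN_{A-T})^2 + (N_T - EN_{A-T})^2}{EN_{A-T}} = \frac{(N_A - N_T)^2}{N_A + N_T},$$

and analogously for $N_G$ and $N_C$ with $EN_{G-C}$.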
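The equality test reduces to a short calculation. A minimal Python sketch follows, using the HIV-1 base counts reported in the Results; the function name is hypothetical, and the resulting $\chi^2$ values are computed here for illustration, not quoted from the article.

```python
def equality_chi2(n1, n2):
    """Chi-square (1 DF) for H0: both counts share the expected value (n1 + n2) / 2."""
    expected = (n1 + n2) / 2.0
    return (n1 - expected) ** 2 / expected + (n2 - expected) ** 2 / expected

# HIV-1 whole-genome counts as reported in the Results section
chi2_at = equality_chi2(3236, 1964)  # N_A versus N_T
chi2_gc = equality_chi2(2173, 1581)  # N_G versus N_C
print(round(chi2_at, 2), round(chi2_gc, 2))
```

Both values lie far above 3.84, the 5% critical value for one degree of freedom.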

Analyses of the neutral expected homogeneity of di- and mononucleotide proportion
The influence of a base on the mutation rates of neighbor sites does not change the equal expectancy of the four bases at a site, because the historical average influence of the neighbor bases on a site is the same for every site. If the neighbor influence is real, it is expected that in short historical periods a base will be associated with a particular frequency vector of the four bases in the neighbor sites along the whole genome. For neutral evolution, this frequency vector should be stochastically invariant along the genome, and this can be tested by examining the homogeneity of bases or dinucleotides in sufficiently long sub-segments of a DNA (RNA) segment. If this influence is neutral, over evolutionary periods (millions of generations or more) it should be balanced by the turnover of the four bases at this site. Here, "long" depends on the extension of the influence; for our purpose, it is more than 10 sites, because we found that this influence for DNA genes is highly significant in consecutive bases (0 site separation), decays greatly for bases separated by one or two sites, and is not significant for separations equal to or longer than three sites. This occurs in DNA segments; RNA viruses are expected to have a wider neighborhood, because RNA must be processed and folded to be put into the envelope (capsid), requiring that any site correlate with any other.
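To make the notion of dinucleotides "separated by k sites" concrete, the sketch below counts overlapping ordered base pairs at a chosen separation and derives the counts expected if the two positions were independent. It is a minimal illustration: the function names and the toy sequence are made up, and taking the expectation as the product of the marginal totals is an assumption about how a random expectation would be formed.

```python
from collections import Counter

BASES = "ATGC"

def pair_table(seq, separation=0):
    """4 x 4 table of ordered pairs (seq[i], seq[i + separation + 1]), overlapping."""
    step = separation + 1
    counts = Counter(zip(seq, seq[step:]))
    return [[counts[(a, b)] for b in BASES] for a in BASES]

def expected_table(table):
    """Counts expected under independence: row total * column total / grand total."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]

toy = "ATGCATTGCAAGCTTAGGCA"  # made-up sequence, for illustration only
for sep in range(4):
    obs = pair_table(toy, sep)
    exp = expected_table(obs)
    # Row 3 is C and column 2 is G, i.e. the C..G pair at this separation
    print(sep, obs[3][2], round(exp[3][2], 2))
```

A Pearson $\chi^2$ over the 16 cells of the observed and expected tables has 9 DF.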

Analyses of base and no-base sequences
A base, for example A, may be consecutively present 0, 1 (A), 2 (AA), 3 (AAA), ..., n times in a DNA segment (Supplementary Material, S1). In the same segment "No-A" (Z = T, G or C) may be present 0, 1 (Z), 2 (ZZ), 3 (ZZZ), ..., n times. The set of A, with $N_A$ bases in a DNA segment, may be taken as a set of indistinguishable balls, and the set of Z, with $N_Z$ No-A bases, may be taken as the walls of $N_Z + 1$ distinguishable boxes where the balls are distributed, and vice-versa (with $N_Z$ balls and $N_A + 1$ boxes). The random distribution of indistinguishable balls in distinguishable boxes follows Bose-Einstein (B-E) statistics (Feller, 1968; Supplementary Material, S1). The observed distribution of bases and no-bases was tested against this expected random (neutral) distribution; the total comparison is obtained by a $\chi^2_{k-1}$, $k$ being the number of non-0 cases of numbers of balls in a box. We can also compare the observed variance with the expected B-E random variance by means of a specific test developed for this purpose (Supplementary Material, S1); this is the analysis of the variance of the variance. The number of base and no-base runs can be tested with the non-parametric run test (Supplementary Material, Appendix S2; Freund et al., 2000; Spiegel et al., 2001). These three analyses were applied to the total HIV-1 and to the S-env DNA segment. The three tests are based on the B-E distribution; their information overlaps partly, but they also inform on independent traits of deviations from randomness. The analyses of the number of consecutive bases and no-bases inform on the general and specific distribution of a base and no-base; the analysis of the variance of the variance informs on how clustered or widespread the sequences of bases or no-bases are (uni-, bi- or multiple modality); the run analysis informs on the tendency of bases and no-bases to cluster in series or to be isolated.
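The ball-and-box picture can be sketched in a few lines of Python. The sketch assumes the classical Bose-Einstein occupancy formula given by Feller (1968): with $n$ indistinguishable balls in $m$ distinguishable boxes, a given box holds exactly $j$ balls with probability $\binom{n-j+m-2}{m-2} / \binom{n+m-1}{m-1}$. The function names are hypothetical, the toy sequence is made up, and the variance-of-the-variance test of the Supplementary Material is not reproduced here.

```python
from collections import Counter
from math import comb

def occupancy_counts(seq, base):
    """Observed number of boxes holding j consecutive `base` symbols (j = 0, 1, 2, ...).

    Every other symbol acts as a wall, so N_Z walls delimit N_Z + 1 boxes.
    """
    marked = "".join("B" if s == base else "|" for s in seq)
    return Counter(len(box) for box in marked.split("|"))

def be_probability(j, balls, boxes):
    """B-E probability that a given box holds exactly j balls (requires boxes >= 2)."""
    if j > balls:
        return 0.0
    return comb(balls - j + boxes - 2, boxes - 2) / comb(balls + boxes - 1, boxes - 1)

def expected_counts(balls, boxes):
    """Expected number of boxes holding j balls, for j = 0 .. balls."""
    return [boxes * be_probability(j, balls, boxes) for j in range(balls + 1)]

toy = "AATATTGACAAAT"  # made-up sequence, for illustration only
n_a = toy.count("A")
n_boxes = len(toy) - n_a + 1
print(sorted(occupancy_counts(toy, "A").items()))
print([round(e, 3) for e in expected_counts(n_a, n_boxes)])
```

Observed and expected occupancy counts can then be compared class by class (using the Poisson distribution where expected numbers are small, as stated in Material and Methods) or jointly through a $\chi^2$.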
Base sequences can be also analyzed with the Geometric distribution, assuming p as the probability to find a base and q = (1-p) the probability of finding a no-base (Valenzuela and Santos, 1996). Here only B-E analyses are performed.

Results and Discussion
The number of nucleotide sites for the whole genome of HIV-1 was 8954; $N_A$ = 3236 (36.14%); $N_T$ = 1964 (21.93%); $N_G$ = 2173 (24.27%); $N_C$ = 1581 (17.66%). The number of sites for S-env was 2627; $N_A$ = 901 (34.30%); $N_T$ = 636 (24.21%); $N_G$ = 621 (23.64%); $N_C$ = 469 (17.85%). The difference in base composition between the two DNA segments was near the significance level ($\chi^2_3$ = 6.99, p = 0.072); this figure should be considered significant because the positive covariance between S-env and HIV-1 base composition was not considered. The isolated $f_T$ was significantly higher in S-env (z for proportion = 2.48, p = 0.013). Both base compositions are significantly different from the expected neutral distribution of 0.25 for each base (no test is needed).
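The comparison of the two base compositions can be checked with an ordinary Pearson $\chi^2$ for homogeneity on the 2 x 4 table of counts given above. The sketch below is a minimal version with a hypothetical function name; it assumes every column total is non-zero and, like the reported figure, ignores the covariance between the two segments.

```python
def homogeneity_chi2(rows):
    """Pearson chi-square for homogeneity of several rows of category counts.

    Returns (chi2, degrees of freedom); assumes every column total is non-zero.
    """
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    n = sum(row_tot)
    chi2 = 0.0
    for row, rt in zip(rows, row_tot):
        for obs, ct in zip(row, col_tot):
            exp = rt * ct / n
            chi2 += (obs - exp) ** 2 / exp
    return chi2, (len(rows) - 1) * (len(col_tot) - 1)

hiv1 = [3236, 1964, 2173, 1581]   # N_A, N_T, N_G, N_C of the whole genome
s_env = [901, 636, 621, 469]      # N_A, N_T, N_G, N_C of S-env
chi2, df = homogeneity_chi2([hiv1, s_env])
print(round(chi2, 2), df)
```

With these counts the statistic should land close to the reported value of 6.99 with 3 DF.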
Figures are different from the expected A-T and G-C equalities; this may be due to the fact that this is a single stranded RNA retrovirus, but an important part of its cycle occurs as DNA, in the host genome.

Homogeneity tests for proportions and distribution of di- and mono-nucleotides (bases)
HIV-1. Table 1 shows the random-Expected and Observed distribution of overlapping dinucleotides of HIV-1 separated by 0, 1, 2 and 3 nucleotide sites. The $\chi^2_9$ values decayed strongly from consecutive (0 separation) dinucleotides ($p < 10^{-80}$) to those separated by 1 ($p < 10^{-30}$), 2 (p = 0.000016) and 3 (p = 0.01736) nucleotide sites. An important part of the significance found in the matrices for 1 and 2 sites of separation may be due to the big deviation present in consecutive sites. This indicates that the neutral neighbor influence, if real, is mostly reduced to one or, at most and slightly, to two sites.
The study was carried out with separations of up to 33 sites, finding significance values that ranged between 1 and 5% for separations over three sites, with a few exceptions, such as that observed for 8 separation sites (p = 0.401). This is a mystery, because mononucleotide pairs with 2, 4 and 16 separation sites (which include those 8 sites away) were significantly correlated (deviated from randomness); with 32 separation sites, no deviation from randomness was found (p = 0.557), but a deviation was observed for 31 and 33 (p = 0.009, p = 0.0000046). The study of waves of correlations among sites is beyond the scope of this article. This agrees with our intuitive prediction for single stranded RNA segments that must be packed into a capsid. All these correlations cannot be due to random influences and refute neutralism, indicating that a nucleotide at any site must correlate with the whole context of a small single-stranded RNA genome to be maintained. Our analyses of eukaryote genomes show a different picture, where correlations of this type are restricted to one or at most two sites; separations of more than 2 sites yield non-significant values (unpublished results). The structure of dinucleotides (0 site separation) showed significant deviations from randomness, ranging from more to less significant as follows: lack of CG, excess of CA, excess of AG, excess of GG, excess of CC, lack of GT, excess of TT, excess of CT, lack of TC, and lack of GA. The lack of CG is widespread in eukaryote genomes, which inactivate genes by means of methylation of C in CpG dinucleotides (often in promoters). However, this is an RNA virus that can be incorporated into the host genome.
It is straightforward to propose that either this is a selective adaptation of the RNA viral genome to host CpG inactivation mechanisms, or HIV-1 or its ancestors were incorporated in the primate genome several million years ago and share the same inactivation mechanism with their hosts. In both cases, this is strong evidence for selective adaptation; it is still possible to invoke the neutral neighbor influence, but the level of significance (Expected 383.7, observed 79, $\chi^2_1$ = 241.99, $p < 10^{-50}$) makes this mechanism untenable. HIV-1 appeared in humans not more than 70 years ago, an insufficient time to produce such a deviation from the expected neutral random distribution; moreover, due to its high mutation rate (Drake, 1993, 1999; Drake et al., 1998), and if evolution is (mutations are) mostly neutral, this is a sufficient time to yield a nearly random neutral base distribution. Thus, the hypothesis that this dinucleotide structure appeared in primates several millions of generations ago and has been maintained by selection until the present human infection is strongly affirmed. Moreover, the observed number of the symmetric (main diagonal) GC pair, which theoretically must have the same frequency as CG (if evolution is neutral), was 421, not significantly different from the expected number (383.5). Thus, to maintain the neutral theory, it is necessary to add, besides a very special kind of neighbor influence, the hypothesis of polarity (5'-3') discrimination of both CG and GC pairs for mutation and neighbor influence. The existence of a very similar virus in chimpanzees is a well-known fact (Jern et al., 2006) corroborating these results, but the inferences of the present study do not need phylogenetic information and are founded only on its analyses and on deduction from the theoretical background. It is impressive that the dinucleotide structure found with 0 site separation (0SS) disappears and is reversed with one site separation (1SS).
The case of the severe lack ($p < 10^{-50}$) of CG in 0SS, reverted to a significant excess (p = 0.0015) in 1SS, is dramatic. Let us assume (better, imagine) that neutral evolution, with the addition of the neighbor influence and the 3'-5' discrimination, has produced and maintained these huge deviations from randomness (even though this is factually impossible). There is still an independent test for neutralism, because these deviations should be distributed homogeneously along the whole HIV-1 genome. Table 2 presents the division of the HIV-1 genome into 10 equal sub-segments and the analyses for di- and mononucleotide distributions. The huge heterogeneity of the dinucleotide ($p < 10^{-20}$) and mononucleotide ($p < 10^{-15}$) distributions definitively refutes neutral evolution and the neighbor influence.
Let us examine a case: in dinucleotides, the mononucleotide frequency vector associated with A (first four rows) in the 4th segment is ($f_A$ = 0.3924; $f_T$ = 0.1907; $f_G$ = 0.2643; $f_C$ = 0.1526) and in the 10th segment is ($f_A$ = 0.2784; $f_T$ = 0.1412; $f_G$ = 0.3725; $f_C$ = 0.2078). There is no known property of polymerases that enables them to distinguish an A of the 4th segment from an A of the 10th segment, so as to yield such different mutation rates leading to these different vectors of the contiguous nucleotide. The heterogeneities of the nucleotide frequency vector, in the 10 sub-segments, associated with A, T, G and C were: $\chi^2_{27}$ = 56.9, p = 0.00066; $\chi^2_{27}$ = 41.0, p = 0.0412; $\chi^2_{27}$ = 49.2, p = 0.0056; and $\chi^2_{27}$ = 90.9, p = 0.0000015, respectively. The same high heterogeneity occurs among nucleotide frequencies. It is impossible for neutral mutation rates, genetic drift and the neighbor influence to produce and maintain such deviations from the expected random distribution. Table 3 shows the dinucleotide distribution of S-env with 0, 1, 2 and 3 separation sites. The distribution is similar to that of the HIV-1 whole genome. Significances are smaller due to smaller numbers. The same similarity of both segments is found in Table 4, which presents di- and mononucleotides in three equal sub-segments of the S-env sequence (to work with S-env sub-segments similar to HIV-1 sub-segments). Even though S-env has near 25% of the total HIV-1 genome and a significant deviation from randomness of the mononucleotide distribution in the three sub-segments was expected, the data agreed with randomness instead. Also, the variance of the dinucleotide composition was higher (not significantly) in HIV-1 than in S-env [see the percentages in the last column of Tables 2 (from 12.4 to 0.9) and 4 (from 11.1 to 1.2), respectively]. This is a very interesting result that we have found consistently.
DNA segments subjected to known higher pressures of selection, for example coding regions, are not necessarily more deviated from randomness (in nucleotide sequences) than less selective segments (non-coding regions). This is expected due to the constraint of codons (triplets) in coding regions or to selective constraints that do not allow for a great variability of nucleotide sequences. Non-coding regions can accept a long repeat of mono-, di-, tri-, tetra- or multi-nucleotides that coding regions cannot. Evolutionary studies of post-transcriptional processes are blind to the evolution of pre-transcriptional ones that do not have a consequence on coding variability. Furthermore, studies on variable regions of genomes (polymorphisms) are blind to selective processes in non-polymorphic regions, which are by far more frequent than variable ones. As we indicated, the most important evolutionary problem is not variability or the maintenance of variability, but invariance or the maintenance of invariance over millions of generations. The maintenance (fixation) of similarities (invariants) is impossible for neutral or nearly neutral evolution (Valenzuela and Santos, 1996; Valenzuela, 2000, 2007). As remarked above, our individual genome is unstable; we die inexorably by mutation (cancer and aging), but the Homo sapiens genome is more stable than the individual one due to selection within the species and higher taxa.
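The order of magnitude of the CG figures can be checked by hand. The sketch below assumes that the random-expected count of an ordered dinucleotide is the product of the two base counts divided by the number of sites; this is an approximation, since the article's expected values (383.7 for CG and 383.5 for GC) come from the margins of its own dinucleotide table. The function names are hypothetical, and only the single-cell contribution to $\chi^2$ is computed.

```python
def expected_pair_count(n_first, n_second, n_sites):
    """Expected count of an ordered base pair if neighbouring bases were independent."""
    return n_first * n_second / n_sites

def cell_chi2(observed, expected):
    """Contribution of a single cell to a Pearson chi-square."""
    return (observed - expected) ** 2 / expected

# Base counts and genome length of HIV-1 as reported in this section
exp_cg = expected_pair_count(1581, 2173, 8954)  # N_C * N_G / N
print(round(exp_cg, 1))
print(round(cell_chi2(79, 383.7), 2))    # CG: observed 79 against expected 383.7
print(round(cell_chi2(421, 383.5), 2))   # GC: observed 421 against expected 383.5
```

The GC contribution stays below 3.84, the 5% critical value for one degree of freedom, in line with the statement that GC does not differ significantly from its expectation, while the CG contribution is of the order of the reported $\chi^2_1$.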
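The sub-segment test described above can be sketched as follows: cut the sequence into k equal consecutive pieces, tabulate the base counts of each piece, and compute a Pearson $\chi^2$ for heterogeneity. This is a minimal illustration under stated assumptions: the function names are hypothetical, leftover sites at the end of the sequence are simply dropped, and cells with zero expectation are skipped. With 10 sub-segments and four bases the test has (10 - 1) x (4 - 1) = 27 DF.

```python
from collections import Counter

BASES = "ATGC"

def split_equal(seq, k):
    """Cut seq into k consecutive sub-segments of equal length (tail sites are dropped)."""
    size = len(seq) // k
    return [seq[i * size:(i + 1) * size] for i in range(k)]

def composition_table(segments):
    """One row of base counts (A, T, G, C) per sub-segment."""
    table = []
    for segment in segments:
        counts = Counter(segment)
        table.append([counts[b] for b in BASES])
    return table

def heterogeneity_chi2(table):
    """Pearson chi-square across rows; DF uses only non-empty rows and columns."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for row, rt in zip(table, row_tot):
        for obs, ct in zip(row, col_tot):
            if rt and ct:
                exp = rt * ct / n
                chi2 += (obs - exp) ** 2 / exp
    rows = sum(1 for t in row_tot if t)
    cols = sum(1 for t in col_tot if t)
    return chi2, (rows - 1) * (cols - 1)

uniform = "ACGT" * 100  # made-up, perfectly homogeneous sequence
print(heterogeneity_chi2(composition_table(split_equal(uniform, 10))))
```

The same table-building step applies to dinucleotides if each row holds the 16 pair counts of a sub-segment instead of the four base counts.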

Analyses of sequences of isolated bases or no-bases (Bose-Einstein analyses)
HIV-1. Table 5 presents this analysis for the HIV-1 complete genome. The statistical significance of each isolated run length is given as a superscript. Adenine, A: only an excess of 1A was significant (p = 0.0019); however, the total distribution was significantly different from randomness (p = 0.00013), thus A showed no tendency to cluster and an observed variance of the A distribution (A-OVar) smaller than, but not significantly different from, the expected value (z = 0.94, p = 0.3472); both results indicate that A is more dispersed than expected. No-A: significant excesses of 1No-A (p = 0.006), 2No-A (p = 0.012), 9No-A (p = 0.012) and 23No-A (p = 0.038) were found, thus No-A showed a slight tendency to be both isolated and clustered in couples and in 23No-A tandems, the total distribution being significantly deviated from randomness (p = 0.0005); No-A-OVar was higher than expected, but not significantly (z = 1.4, p = 0.1585). The runs of A and No-A yielded z = 2.8, p = 0.0045; the positive value indicates that there were more runs than the expected mean, confirming that A and No-A are more dispersed than expected from a random B-E distribution.
Thymine: the frequency of 1T was significantly lower than expected (p = 0.0054) and an excess was found for 4T (p = 0.018), giving a significant total (p = 0.0014), thus T showed a mild tendency to cluster; the observed T-OVar was significantly greater than expected (z = 2.83, p = 0.0047). No-T: as expected from the T distribution, the category 0No-T showed a significant excess (p = 0.00075), because T showed a tendency to be clustered; other excesses were found in 22No-T (p = 0.00995), 39No-T (p = 0.027) and 40No-T (p = 0.022); 3No-T presented a slight deficit (p = 0.03); the total was also significant (p = 0.00017), in favor of clusters of No-T; No-T-OVar was larger than expected (z = 3.5, p = 0.00045). The run test for T and No-T yielded z = -4.3, p = 0.000015, indicating fewer runs than randomly expected; this confirms the tendency of T and No-T to cluster.
Guanine: G showed fewer 1G than expected ($p = 2.9 \times 10^{-6}$) and excesses of 4G (p = 0.00032) and 6G (p = 0.0001), the total deviation from randomness being highly significant ($p = 6 \times 10^{-8}$); G-OVar was significantly larger than expected (z = 4.4, p = 0.00001), thus G showed a strong tendency to cluster. No-G: 0No-G was more frequent than expected ($p < 10^{-8}$) and 1No-G less frequent than expected ($p < 10^{-10}$); there was an excess of 36No-G (p = 0.023); the total deviation was also highly significant ($p < 10^{-8}$); the tendency to cluster was not as marked as for G; No-G-OVar was greater than expected, but only close to significance (p = 0.074). There were fewer G and No-G runs than the expected mean (z = -6.8, $p < 10^{-8}$), confirming the tendency of G and No-G to cluster.
S-env. It is important to remark that this S-env came from a different HIV-1 strain than the HIV-1 whole genome. However, the general structure of deviations from randomness of the sequences of bases and No-bases was similar to that of the complete HIV-1, as expected, with less significant figures due to the smaller number of nucleotide sites. A few disagreements between S-env and HIV-1 values should be remarked. Adenine: in S-env there were more 2A dinucleotides than expected (p = 0.0028); in HIV-1 there were fewer. Thymine: S-env showed more 1T than expected (non-significant); HIV-1 had fewer observed than expected 1T (p = 0.007). No-T: a high excess of 0No-T (p < 0.001) in HIV-1 contrasted with a small, non-significant deficiency (p = 0.8) in S-env. A significant excess of 2No-T (p = 0.013) in S-env was not found in HIV-1, which instead presented a non-significant deficiency of 2No-T. G, No-G, C and No-C did not show differences between the two segments. The results of the analyses of OVar and runs were consistent with those found in HIV-1. This last agreement between the HIV-1 whole genome and S-env, in addition to the similar distribution of dinucleotides, allows assigning S-env to HIV-1 (or to a similar retrovirus) with high confidence, even ignoring its real origin.

Table 4 - Di- and mono-nucleotides of three segments of S-env.
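The sign convention of the run statistics quoted above (positive z for more runs than expected, negative z for fewer) can be illustrated with the usual two-category run test. The sketch assumes the standard large-sample mean and variance of the number of runs, which may differ in detail from the version in Appendix S2; the function names and the toy sequences are made up.

```python
import math

def count_runs(seq, base):
    """Number of maximal runs after recoding the sequence as base / no-base."""
    flags = [s == base for s in seq]
    if not flags:
        return 0
    changes = sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    return changes + 1

def runs_z(n_runs, n1, n2):
    """z score of the observed number of runs given n1 bases and n2 no-bases."""
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    return (n_runs - mean) / math.sqrt(var)

dispersed = "AT" * 50             # A never clusters: many short runs
clustered = "A" * 50 + "T" * 50   # A fully clustered: only two runs
for toy in (dispersed, clustered):
    n_a = toy.count("A")
    print(count_runs(toy, "A"), round(runs_z(count_runs(toy, "A"), n_a, len(toy) - n_a), 2))
```

The dispersed toy sequence gives a positive z and the clustered one a negative z, matching the interpretation used for A, T and G above.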