Transcription factors and molecular markers revealed asymmetric contributions between allotetraploid Upland cotton and its two diploid ancestors

Three Gossypium species have been used to breed cotton as they vary in their fiber production and resistance to stresses. Transcription factors (TFs) mostly are present in different copies or isoforms by which they conduct their regulation. Their copy number can determine organism behavior to a cue. Simple sequence repeats (SSRs) are one of the most informative and versatile molecular markers. Transcription factors of three Gossypium species were compared in silico. Seventy eight percent of TFs were common between the three species. Single copy for each species were 6057 TF. Gossypium hirsutum and G. raimondii shared the most common interspecific TF. Gossypium arboreum species-specific TF were the least. MYB TF family with its subfamilies is the most abundant followed by bHLH and AP2/ERF family. Gossypium hirsutum generally possesses more TF copies compared to other two species. The 2109 single-copy clusters indicate that G. hirsutum has received one copy from only one parent. The five most abundant SSR markers of TF were dinucleotides AT, TA, TC, CT and TG belonging to G. raimondii. For G. arboreum and G. hirsutum they were trinucleotides CAA, CGA, TGA, GAA (CAT: G. hirsutum) and TCA. The findings suggest that there is regulatory difference between the three Gossypium species for fiber production and insect attack response. The differences may be due to some adaptive deletion events during speciation of G. hirsutum from its parents G. arboreum and G. raimondii.


INTRODUCTION
Cotton plant belonging to Gossypium genus is one of the principal providers of cellulosic fiber and of great economic importance. In addition, it is one of the most important oleaginous crops worldwide because of cottonseed, which makes it an economically and agronomically important crop, mainly for the textile industry where it is usually utilized as raw material for the manufacture of garments. According to an estimated approximation concerning the world production of cotton, 80% of cotton fiber comes from Brazil, China, India, Pakistan, Turkey, USA and Uzbekistan. This crop represents a crucial percentage to the gross national product in many countries (Zahid et al. 2016). Cotton is a differentiated epidemic cell produced by Gossypium species (John and Crow 1992). However, cotton production is usually vulnerable to biotic and abiotic stresses. Therefore, breeding for stress-resistant cotton is a fundamental issue for plant biotechnology programs. This importance makes it mandatory to advance the application of biotechnological tools for cotton improvement (Rathore et al. 2015) in order to provide solutions to common problems that cotton producers are facing quotidianly.
The temperature ranges for cotton production are around 20 and 30 °C (Reddy et al. 1991). The effects of temperature on some developmental stages (germination, seedling growth, vegetative production, morphological development and maturity attributes) are very important for ensuring a correct boll production. For this, it is well known that cotton is enormously susceptible to temperatures throughout all developmental stages. However, flowering stage is when cotton plant shows more vulnerability, mainly to high temperature. Regarding biotic stress, cotton plant is mainly attacked by pests, particularly arthropods. These herbivores are housed mainly on foliage, flower buds and capsules (Peterson et al. 2016;Renou et al. 2012). However, cotton also suffers in lesser proportion from the attack of diseases and pests housed on soil, in which the nematodes are included (Greenberg et al. 2012). Cotton is considered a very high consumer of synthetic products, especially pesticides, which translates into high production costs. All these characteristics make it a crop with great difficulties, which endanger harvests, human health and the environment.
The genome of three species including G. raimondii (Wang et al. 2012), G. hirsutum (Li et al. 2015) and G. arboretum (Li et al. 2014) have been published. Genomic gene-rich regions responsible for fiber development have been disclosed (Xu et al. 2008) on four chromosomes involved in initiation (chr5), early, middle and late elongation (chr10 and 14), secondary cell wall deposition (SCWD), maturation (chr15) and finally verification (chr10 and 14). Transcription factor (TF) importance for cell wall process involved in cotton fiber production has been revealed recently (MacMillan et al. 2017).. However, due to their impacts on cotton production, the genes that are involved in fiber and cellulose production as well as biotic and abiotic stresses is under more consideration.
Transcription factor (TF) is one of the most important regulatory elements among plants. They can control the whole plant ontogeny and change expression of other key genes when needed like different stages of growth and development, response to stresses, production and reproduction. They can make networks collaborating with other TF or genes to control a mechanism, pathway, or production and help plant respond to stresses (Jazayeri et al. 2018;2019). Transcription factor importance for cell wall process involved in cotton fiber production has been recently revealed (MacMillan et al. 2017).
Rice produces terpenes for defense against pathogens and herbivores. OsWRKY45 TF was identified as a regulator in the production of diterpenoid phytoalexins like momilactone, phytocassane and oryzalexin by priming the expression of biosynthetic genes (Akagi et al. 2014). The expression of HbWRKY1 has been associated with increased biosynthesis of natural rubber, a polyisoprenoid derived from wounding the bark of the tropical tree Hevea brasiliensis . WRKY TFs family directly or indirectly regulate plant defense response altering the secondaries metabolites biosynthesis (Schluttenhofer and Yuan 2015). All these advantages have been taken into account by breeders in other crops due to the inherent role that TF play on these biomolecular processes. However, despite their importance, they have not well been studied in Gossypium and are still under development.
Molecular markers are DNA fragments that colligate with a particular region of the genome. Molecular markers in sequence motifs repeat in tandem include from 1 to 6 base pairs generally. They are grouped in different classes as microsatellites or short tandem repeats (STRs), simple sequence repeats (SSRs), simple sequence length polymorphism (SSLP), sequence tagged microsatellite site (STMS) and single nucleotide polymorphism (SNP). Among them, SSRs and SNPs are under S. M. Jazayeri et al. more consideration as they are useful informative markers for various studies on evolution, biodiversity, phylogeny and molecular breeding (MAS: marker-assisted selection) and have importantly been employed for constructing genetic linkage maps. They can be transferred between different species. They are reproducible, economic, polymorphic, heterozygous and allele specific and are known robust markers. Simple sequence repeats are evaluated by PCR due to flanking sequences by primer design. To study them, it is possible to disclose SSRs in a genome wide range and then see if the genes of interest possess any SSRs to use them in subsequent plant selection and breeding program (Jazayeri and Villamar-Torres 2017). Simple sequence repeats are important for plant breeding and selection.
This article aimed to comparatively study TFs in three species of Gossypium to reveal how similar and different they are in terms of regulatory system. Transcription factors related to fiber and stress are disclosed. In addition, the SSR markers are reported for TFs of three species. The results of this article provide a base for cotton plant breeding and selection in terms of regulatory system and TFs.

MATERIAL AND METHODS
The protein sequences of TF families from PlantTFDB (Jin et al. 2017) and iTAK database (Zheng et al. 2016) for three species of Gossypium including G. arboretum, G. hirsetum and G. raimondii. PlantTFDB presents 320,370 TFs from 165 green plant species in its last version underlying TF motifs, regulatory elements and literature. iTAK identifies TF based on the consensus rules of TF protein domains of each family using other databases as evidence.
The TF sets from two databases were combined for each species to generate an integrated TF list. The TF number of each family were compared for three species. The combined TF set for each species were subject to cluster using CD-HIT (Xiong et al. 2012) by 90% similarity threshold to remove redundancy. CD-HIT can group the input sequences based on their similarity and overlap. Each cluster is a representative of the sequences according to the threshold similarity. The final clustered TF list for each species was used for subsequent analyses.
In order to extract gene nucleic acid sequences and genome position for the clustered TFs of three species CottonGen (Cotton Database Resources) ) was employed.
Three TF lists were compared by OrthoVenn . It is a web-based server program and clusters protein sequences up to six lists to find their common and different proteins. It compares different protein sequence datasets, analyzes their annotation and generates Venn diagram and Gene Ontology (GO) enrichment for the clustered proteins. The default parameters were used.
Transcription factors of each species were classified using PlantTFcat (Dai et al. 2013). It groups the input sequences, compares them against its TF database and generates the information about the likely TF found in the sequences such as their family and domain.
GSEAServer was employed to annotate TFs of the three species. It is an online web-based and high-performance pipeline to perform fast functional annotation and gene/transcript set enrichment analysis for de novo assembled transcripts from non-model plant. It does blastx-search against nonredundant NCBI protein database based on available plant species and then annotates the blast hits using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes reference databases.
The protein clusters generated by OrthoVenn were visualized by TreeView of Environment for Tree Exploration (ETE) toolkit (Huerta-Cepas et al. 2016). It was also used to construct the gene tree for three Gossypium species TF sets. MEME was applied to find motifs for the selected TF clusters. UniProt was employed to reveal pertinent function of each protein cluster. iTOL (Letunic and Bork 2016) was utilized to construct the phylogenetic tree of three studied species for the chosen TF families. MEGA7.0.21 (Kumar et al., 2016) was employed to align, find and visualize the common sequence domain and repeats.
Simple sequence repeats were identified by the Perl script MISA (MIcroSAtellite identification tool. The parameters established for MISA were set to identify SSRs of di-, tri-, tetra-, penta-and hexanucleotides. These SSRs were selected because of their more polymorphic character than SSRs of 12 to 20 bp. Mononucleotide SSRs were not considered as they are prone of errors due to assembly and sequencing. A distance of 100 bp between two SSRs was adjusted for compound SSRs (distinct and adjacent SSRs).
GOMapMan (Ramšak et al. 2014) was employed to extract the genes related to stress based on Arabidopsis map. The genes were subject to search in Bulk Data Retrieval in TAIR to generate fasta file containing gene sequences. Blast+ 2.6.0 (Camacho et al. 2009) was used to find homologs for TFs.

Transcription factor of Gossypium species
Transcription factors for each Gossypium species were compiled from two main plant TF databases and clustered to obtain TF species-specific lists (Table 1). Gossypium hirsutum possesses more TFs than the other two species comparing the final TF clustered integrated datasets. The integrated TF list of each species that was subject to group the TF families show that TFs, in terms of family number and member, were most in G. hirsutum followed by G. raimondii and G. arboreum, respectively. The numbers of nonredundant TFs were 2441, 3303 and 2967 for G. arboretum, G. hirsutum and G. raimondii, respectively. Gossypium hirsutum gains 61 TF families followed by G. raimondii with 60 and G. arboreum with 59 families ( Table 2). The remaining proteins were categorized into other group since they are not TFs and belong to other type of regulatory elements.
Based on PlantTFcat classification, the most abundant family is MYB-HB-like followed by AP2-EREBP, bHLH, C2H2, NAM, Homobox-WOX, WRKY and bZIP. If MYB, MYB-HB and MYB/SANT were considered as a whole group, i.e., MYB family, then the family number increases for three species of Gossypium species as the biggest family. However, the position of the top families might move a little for the above-mentioned families.
The least abundant families were LFY, NOZZLE, STAT and WD40-like having at least one copy in each species. However, G. arboreum lacks two families including BED-type(Zn) and MYB-related and G. raimondii lacks MYB-related family based on the classification done by PlantTFcat. These results are not in accordance with PlantTFDB and iTAK databases where MYB-related members are very abundant for all three species (more than 90 members). This is due to the categorization system followed by PlantTFcat that classifies MYB-related TF under MYB family. BED-type(Zn) family has 4 and 9 copies in G. raimondii and G. hirsutum, respectively.
All total TFs of three species including 9306 proteins were clustered by 90% similarity. A list of 4260 clustered proteins was generated as the reference for the Gossypium genus 10 out of which 4112 proteins were annotated and classified by PlantTFcat. They were categorized as TF (4341), chromatin regulator (35), chromatin remodeler (2), chromatin remodeling (71), response regulator (44), transcription regulator (5).

Common and species-specific transcription factors between the three Gossypium species
The common TFs between the three species included 7280 proteins out of 9306, which is 78% of all TFs. Between G. hirsutum and G. raimondii 7857 TFs (84%), between G. arboreum and G. hirsutum 7447 (80%) and finally between G. arboreum and G. raimondii 7388 TFs (79%) were found common from the most to the least. Gossypium arboreum did not have any specific TF cluster while G. hirsutum and G. raimondii possess 16 and 12 TF clusters covering 33 and 25 genes, respectively. In addition, 2109 single-copy gene clusters including 6057 TFs (93%) that comprise three copies (one copy for each species) were disclosed. The singleton TFs were 136, 595 and 385 as species-specific for G. arboreum, G. hirsutum and G. raimondii, respectively (Table 3). As shown in Fig. 1, the most abundant clusters belong to G. hirsutum followed by G. raimondii and G. arboreum. The species-specific singletons belonged to different families and were not completely specific. However, they could not match the clusters due to the blast parameters (more than 90%), although they belong to the families those were clustered 11 .
The most abundant TF families between the three species were MYB, bHLH, Homeobox, NAC and WRKY whose clusters possess more than 100 members. Homeobox-leucine zipper protein HOX1 was the first common cluster with 20 proteins between the three species followed by Homeobox-leucine zipper protein HDG9 and Calmodulin-binding transcription activator 2. The phylogenetic tree for the three species, as shown in Fig. 2, demonstrates that the TFs of G. hirsutum and G. raimondii are closer to each other than to G. arboreum.  (2) and specific for one species (1).  Considering other clusters with more than three copies, the authors of this work could generally find that for clusters containing for example 4, 5, 6 proteins, only one copy belongs to G. arboreum generally while G. hirsutum and G. raimondii share the other copies. In the clusters with more TF numbers, this trend is seen that G. arboreum possesses less number than the other two. As an example, for BES1/BZR TFs (cluster 237) four proteins were clustered with two copies from G. arboreum and only one copy from G. hirsutum and one copy from G. raimondii.

Gene Ontology annotation analysis of transcription factors
In order to disclose in a better manner TF functionality of Gossypium species and analyze more effective in silico, the authors tried to mine information based on Gene Ontology (GO) terms from different sources. First, the clustered TF set of each species was subject to compare with Arabidopsis to annotate their function based on Arabidopsis annotation and GO terms. More TF genes in KOBAS matched with Arabidopsis homologs compared to GSEAServer as shown in Table 4. However, it is not significant and might be due to differences between both programs used. *These genes were annotated against Arabidopsis in the program.
As reported, genes associated with GO terms of "membranes" are involved in fiber initial and development in cotton (Taliercio and Boykin 2007). Based on membrane-related GO terms on GSEAServer, it was found that G. hirsutum showed more membrane-related GOs (8411 hits belonging to 68 nonredundant terms) while G. raimondii possesses 7553 hits of 65 GO terms and G. arboreum 6194 hits of 53 terms. Interestingly, G. hirsutum has 3389 membrane-related genes among its TF dataset followed by G. raimondii with 3068 genes and G. arboreum with 2542 genes. However, the relationship of these TFs with fiber production needs further analyses to disclose.
Cell wall GO terms were analyzed between the three species and G. hirsutum showed 1684 cell wall associated GO for its TFs followed by G. raimondii and G. arboreum with 1522 and 1215 hits, respectively. The cell wall-associated genes were 259, 229 and 180 for G. hirsutum, G. raimondii and G. arboreum, respectively. Among the TFs related to secondary cell wall (SCW) that are responsible for fiber production in cotton, G. hirsutum showed many species-specific TFs. However, the members of WRKY, NAC and MYB for G. raimondii and arboreum as well as common genes between the three species were found.
Based on OrthoVenn analyses, TFs related to cell wall were mined by pairwise comparisons as they function in fiber production and biosynthesis (MacMillan et al., 2017). Interestingly, between G. hirsutum and G. raimondii 8 clusters including 17 genes belong to cell wall biogenesis, cell wall modification and other cell wall related processes while for G. arboreum vs G. hirsutum and G. arboreum vs G. raimondii there were 2 genes for each comparison. The TFs including NAC43, ROOT HAIR DEFECTIVE 3, MADS-box 21, SOMBRERO, Vegetative cell wall protein gp1, MADS-box 27 were common between G. hirsutum and G. raimondii.
Among the genes involved in fiber production, like SCW and PCW formation, many are common between G. hirsutum and G. raimondii and no copy is common between the three of them or between G. arboreum and the two other species. Some examples are WUSCHEL, MYB48, WRKY3, WRKY 22, WRKY 29, WRKY 75. These TFs were found as common TFs only between G. hirsutum and G. raimondii.
As previously revealed, membrane-associated GO include genes involved in fiber initiation and formation. There were 23 genes of membrane GO between G. hirsutum and G. raimondii. Interestingly, the most abundant common cluster, 60S ribosomal protein L6-1, was associated with membrane in 7 genes of which 6 belonged to G. hirsutum and only one copy was found in G. raimondii. The other membrane associated genes were ROOT HAIR DEFECTIVE 3, Transcription repressor OFP7, Auxin response factor 5, Probable inactive receptor kinase At2g26730, NAC69, Thylakoid ADP, ATP carrier protein, L-arabinokinase and Tubby-like F-box protein 7. In the comparison between G. arboreum with G. hirsutum and G. raimondii, there were only two common genes, i.e., Tubby-like F-box protein 3 and Eukaryotic translation initiation factor 3 subunit A.
Due to the importance of climate change and stresses that can have impact on cotton cultivation and production, GO related to response to stress were mined. Stress-related GO terms were distributed on 931, 1234 and 1127 genes for G. arboreum, G. hirsutum and G. raimondii, respectively. Defense related genes were compared between the three species as 412, 532 and 499 G. arboreum, G. hirsutum and G. raimondii, respectively. Comparing three species, 178 clusters were determined as stress-based GO terms. Comparison of G. hirsutum and G. raimondii showed 19 clusters covering 39 genes that are involved in stress response. These genes belong to the TF families of MYB, WRKY, AP2, DRE, DELLA, protein phosphatase 2C 27, REVEILLE 5 and Homeobox-leucine zipper protein ATHB-40.
For the comparisons between G. arboreum and the two other species, 2 and 5 clusters were found under stressrelated terms. GO related to response to insects were two members of MYB family including MYB76 and MYB51, that were common between G. hirsutum and G. raimondii, while no stress-GO term was found in other pairwise comparisons.
In addition, due to its importance and to have a better knowledge on TFs related to insect attack in cotton, genes with insect-related GO were extracted, which were 14, 16 and 15 for G. arboreum, G. hirsutum and G. raimondii, respectively ( Table 5). The insect-related TFs of the three species shared MYC2, WRKY70, MYB51, dof zinc To develop enriching study on stress, stress-related genes from GOMapMan were compared with each TF dataset. Six common TFs of different members of heat shock TF family related to stress for three species were found, including HSFA2 (AT2G26150), HSFA7B (AT3G63350), HSFA1E (AT3G02990), A6Bb (AT3G22830), HSFA1B (AT5G16820) and HSFB4 (AT1G46264). The species-specific TF which belongs to G. arboreum was HSFA4A (AT4G18880). For G. raimondii, HSFA6B (AT3G22830) was the species-specific stress related TF. For G. hirsutum no stress-related TF was found as species-specific TF.

Simple sequence repeats identifications
Transcription factors dataset of each species was subject to identify simple sequence repeats (SSRs). Interestingly no 4 and 5-unit-size SSRs were found for G. arboreum and G. hirsutum while G. raimondii showed all different unit size (from 2 to 6), as shown in Fig. 3. The SSR markers were distributed on 346, 503 and 1287 TFs of G. arboreum, G. hirsutum and G. raimondii, respectively. G. raimondii possesses the most SSRs. The trinucleotide SSRs were most abundant for G. arboretum and G. hirsutum while dinucleotide SSRs were most abundant in G. raimondii followed by dinucleotide SSRs (Fig. 3). Density of SSRs distributed among TF datasets showed that G. raimondii has denser SSR per 1Mbp as 234/Mbp, while for G. hirsutum and G. arboreum they were almost close and 153 and 145 SSRs/ Mbp, respectively. In terms of density of SSR per TF, G. raimondii showed 66.4 SSRs for each 100 TFs, while for G. arboreum and G. hirsutum it was around one fifth of G. raimondii (Fig. 4).    Table 6 for 20 the most abundant SSRs for the three Gossypium species, CAA, CAG, CAC were three SSR markers at top in G. arboreum and G. hirsutum while the first one, i.e., CAA, covered 9 and 8% of SSR markers, respectively. CAA covered 2.8% of SSRs in G. raimondii ranked in the ninth level. In G. raimondii AT, TA, TC, CT, TG were the most abundant markers followed by TTC, AG, TCA, CAA, CTT, ATC, CAG.

Comparison of transcription factors of Gossypium species revealed phylogenic relation
Gossypium hirsutum possesses more TFs than G. arboreum and G. raimondii in terms of family number and members. This might be due to the fact that this species is tetraploid, while the other two are diploid. The most abundant family is MYB for all three species, which is in accordance with Arabidopsis genus and species (Jazayeri et al., 2018). However, due to varying categorization systems used by different TF sources and databases, some little changes in the TF ranking based on their number can be seen. For example, PlantTFDB and iTAK database classify AP2/ERF as two separate families, i.e., AP2 and ERF, then bHLH is the second most abundant family followed by ERF, NAM, bZIP, C2H2, WRKY. The family members might vary a little, but evidently MYB and bHLH are the most abundant TF families as seen in Arabidopsis (Jazayeri et al., 2018) and almost similar pattern exists between both genera and their species. Interestingly, many TFs members of MYB family as well as WRKY, bHLH and NAC are involved in regulation of cell wall in fiber production (MacMillan et al. 2017). The two most-numbered families, i.e., MYB and bHLH, have physically and functionally interaction to achieve their function (Pireyre and Burow 2015). This makes them important for any kind of interaction involved in biological processes and suggest their abundancy in plants.
Gene ontology terms related to membrane cover almost all TFs in the three species (more than 96%). This result may not be justified by GO terms to elucidate candidate TFs associated to fiber production considering their membrane compartmentalization. However, they might be involved in fiber production mechanisms. This remains to disclose how they might be involved in fiber production.
BED-type (Zn) is the only family that has no member in G. arboreum. In G. hirsutum, its number is twice as higher as G. raimondii. This finding suggests that this family has appeared during the evolution of G. raimondii and transferred and duplicated in G. hirsutum, bearing in mind that G. hirsutum is the progeny of G. arboreum and G. raimondii. BED finger proteins are known as either transcription activators or repressors. It regulates transcription by changing local chromatin structure on binding to GC-rich sequences. BED-type(Zn) is involved in plant disease resistance, for example blast resistance and bacterial blight in rice (Kroj et al. 2016). It is a domain type (CC)-NBS-LRR found in many R-genes among plants that functions in chromatin-boundary proteins and transposases (Aravind 2000).
Common TFs between the three species and pairwise comparisons showed that G. raimondii and G. arboreum share less TFs, but G. hirsutum has most common TFs with each of them. The most common TFs were found between G. hirsutum and G. raimondii, while the least common belonged to G. arboreum and G. raimondii. Additionally, they did not have the same contribution to G. hirsutum, especially in fiber-and stress-associated TFs. The findings in this study for TFs are in accordance with another that reported that A and D genomes do not equally contribute and D genome (G. raimondii) provides more genes than A genome (G. arboreum) . Therefore, they might share less genes with each other than with G. hirsutum. The results of this work suggested that G. hirsutum shares more TFs with G. raimondii and it might duplicate the G. raimondii originated TFs during its evolution. On the other hand, the TFs number of G. arboreum is less than in the other two species. These discrepancies and similarities between the three species might occur due to less TFs numbers of G. arboreum compared with the other species, causing less shared proteins. In addition, G. hirsutum may have received more TF genes from G. raimondii than G. arboreum or duplicated the genes of G. raimondii during its evolution. This is in accordance with a previous report (Shan et al. 2016).
A large number of single-copy genes suggests that G. hirsutum have not received these TFs from both parents, as only one copy exists for it. This imply that these genes might have been transmitted from G. arboreum, G. raimondii or other mechanisms that cause this single-copy occurrence like deletion of TFs transmitted from one parent. For other clusters with four and more copies, G. hirsutum possesses generally more copies.
The representative classified TF proteins of genus Gossypium showed other function than transcription factor. This indicates that there are some motif and sequence fragments that function differently. This can be one of the advantages that TFs possess by which they can regulate in different manners their target genes. However, to assure their varying functions by their motifs, further studies using functional genomics on each of them are required.
The findings of this study can justify that the principal species of Gossypium genus, i.e., G. hirsutum, employs more regulatory genes associated to fiber production and stress response than its parents.
The first cluster with the most common number of TFs belongs to Homeobox family. HOX1 is probably a transcription repressor involved in leaf development, it binds to the DNA sequence 5'-CAAT[GC]ATTG-3' and may act as a regulatory switch to specify provascular cell fate. Interestingly, it belongs to HOX family, whose members control fiber elongation in cotton and are expressed in cotton fiber cells (Shan et al. 2014).

Species-specific genes disclosed regulatory system difference between the three Gossypium species
The TFs including NAC43, ROOT HAIR DEFECTIVE 3, MADS-box 21, SOMBRERO, Vegetative cell wall protein gp1, MADS-box 27 were common between G. hirsutum and G. raimondii. However, it remains to reveal if these TFs are key fiber production regulatory elements in cotton in order to use them in breeding and MAS programs.
For example, WUSCHEL related homeobox 4 is one of the master key TFs involved in vascular differentiation and procambium development (Ji et al., 2010). It functions between IAA and downstream molecules, i.e., jasmonate (JA), salicylic acid (SA) and ethylene to initiate subsequently fiber formation (Gorshkova et al. 2012).
MYB48 is differentially expressed in G. raimondii as one of PCW formation genes that is involved in fiber biosynthesis of cotton (MacMillan et al. 2017). WRKY TFs, WRKY6, WRKY75, WRKY3, WRKY29, WRKY22, during SCW and PCW formation comparing G. hirsutum with G. raimondii showed differential expression profiles as reported by MacMillan et al. (2017).
Basically, this research is an in silico study but the findings reported in it are in accordance with genes involved in fiber formation and production confirmed by other studies and used in cotton breeding programs successfully. Thus, they have been validated as candidate genes toward more fiber production in cotton.

Insect related transcription factors are common between the three species
The common insect related TFs between the three species, including MYC2, WRKY70, MYB51, dof zinc finger DOF2.2, bHLH3 and bHLH28, have been reported that are involved in the processes related to defense against insect attacks. They are related to JA and SA balance. MYC2 regulates jasmonate-dependent functions (Dombrecht et al. 2007). WRKY70, as a crosstalk, acts as a repressor for JA and an activator for SA in favor of modulating signaling network as a defense mechanism (Li et al. 2006). On the other hand, WRKY70 is involved in negatively glucosinolate (GLS) biosynthesis which causes accumulation of GLS (Li et al. 2004) . Different members of MYC family, MYC2, MYC3 and MYC4 are involved in GLS biosynthesis  and are classified in bHLH group with bHLH3 and bHLH28 (Niu et al. 2011). Interestingly, they have physical interaction with MYB family in order to regulate biosynthesis of GLS (Frerigmann et al. 2014;). However, no species-specific TF was found in Gossypium species and the authors of this study were not able to conclude if any species-specific mechanism exists or not.
These TFs function in regulation of GLS biosynthesis while MYB 51 involved in indole GLS biosynthesis and MYB76 involved in aliphatic GLS biosynthesis. They increase insect resistance in crops (Malka and Cheng 2017) and can be considered as two key regulatory TFs in resistance to insect attacks (Gigolashvili et al. 2009) in cotton, as G. hirsutum and G. raimondii show more resistance to such stress. However, further expression analysis to disclose their function in response to insect attack in cotton species is necessary.
The discrepancies among species-specific TFs between the three Gossypium species suggest that the speciesspecific TFs belong to other isoforms or paralogs of TF families. Further functional genomic analyses are necessary to demonstrate how each species can respond to insect attack considering their specific TFs.
On the other hand, comparative analyses revealed some TFs in response to insect attack. HSFs are involved in herbivory and insect attack mechanism (Kuśnierczyk et al. 2007). Gossypium raimondii and G. arboreum possess species-specific HSF. These HSF can be used for plant screening with the aim of resistance to insect attack and herbivory.

Simple sequence repeats are sufficiently different to be used as markers for Gossypium species
For G. raimondii, the SSR polymorphism trend is in accordance with the whole genome SSR analysis in plants, while for G. arboretum and G. hirsutum this ranking is different. This might be due to the input data as TFs that are part of the genome and more conservative than other functional genes. Multi-allelic SSRs are good tools to detect genetic structure and can demonstrate any likely relation between two genotypes, species, genera, ecotypes, etc.
Gossypium hirsutum is tetraploid and a descendant of G. arboreum and G. raimondii, therefore, it is assumed that it possesses a complex dataset of SSRs, but the findings of this study are not in accordance with expectation. This suggest that during its evolution G. hirsutum has experienced deletion events that has decrease its SSRs and polymorphism level. As previously declared, upland cotton has narrow genetic background toward low polymorphism rate (Khan et al. 2016) and this study confirm it.
The SSRs presented in this study are potential molecular markers in order to use in subsequent breeding program. As this study presented only SSRs within the TF genes, these markers can provide a powerful resource for searching markers associated to known genes and TFs. Due to regulatory function of TFs, these markers can be sued in association mapping research as well as finding gene networks. On the other hand, they can help to understand the likely phylogenetic links among different species of Gossypium and be useful for diversity and evolution analyses. They can provide knowledge of how mutation and changes in Gossypium species have been occurring over time.

Simple sequence repeats for the genes involved in stress and insect attack response
The SSRs of the genes of interests in fiber, stress and insect attack were mined. Interestingly, there are SSRs present for the genes of interest that make such SSRs worthy to work on. These SSRs could be used in subsequent studies as a dataset to choose the genes for plant selection and breeding in fiber and defensive response. Among them, CAA and CAG are more abundant for G. hirsutum and G. arboreum. For G. raimondii, AT and TA appear in more numbers. However, considering the genes of interest in fiber production and in stress and insect response, relevant SSRs may attract attention as they share more genes as markers in their sequences. These markers are presented based on in silico analysis as useful markers for Gossypium species.

CONCLUSION
Three Gossypium species were analyzed based on their TFs. The similarities among TFs showed that G. hirsutum and G. raimondii share more TFs and, therefore, are closer to each other. Gossypium hirsutum is a descendant of the other two, but it seems that its TF set has experienced deletion events, since in terms of number and common TFs it showed different behavior from what expected. The TFs related to SCW, fiber, membrane and response to insects and diseases showed that the members of NAC, WRKY, MYB families are present in more copies in G. hirsutum than in the other two species, suggesting its ability to produce more fiber. However, some other TFs like BES1/BZR, that is involved in fiber production via amylase activity, are present in one/two more copies in G. arboreum.
Simple sequence repeats markers showed that G. arboreum and G. hirsutum have more dense number of TFs per bp, while for G. raimondii 3-5 times more SSRs were seen for each 100 TFs. Between the three species, G. raimondii show more SSRs per Mbp comparing to the other species. However, the average SSR density for TF can be a representative of the whole genome. Comparing with other plants, Gossypium show denser SSR distribution.
These findings suggest that G. raimondii could cause more variations in genes of interest and is a source of polymorphism markers that can be used in breeding programs. The SSRs of TFs associated to fiber, stress and insect attack should be further studied for plant selection purposes. The details on the functionality of such TFs families and their members remain to disclose.