Incorporating symmetrical and asymmetrical dispersal into Ecological Niche Models in freshwater environments

Abstract: Aim Ecological niche models (ENMs) are based mainly on environmental (mostly climatic) and occurrence data to predict the potential distribution of species. In freshwater habitats, species dispersal is not restricted only by physical barriers but also by the directional movement of the hydrographic network, which can be considered through spatial predictors. Here, we aim to evaluate the effect of including asymmetrical and symmetrical spatial predictors in the potential geographic distribution of a freshwater fish in the Tocantins-Araguaia River basin, Brazil. Methods For this, we built models with seven variable sets representing the climatic and spatial models, as well as their interactions. Results We found that the overall best models (higher evaluation and lower variation among modeling methods) are those built using AEM (asymmetrical dispersal [i.e., dispersal along the river flow path]), either alone or in combination with environmental variables (ENV). Moreover, the inclusion of asymmetrical dispersal variables, taking into account dispersal limitations of species, decreased the overprediction to climatically suitable but disconnected areas through rivers. Conclusions Therefore, future ENM studies, especially those using species groups with directional dispersal, should consider the inclusion of asymmetrical spatial predictors to increase the model’s accuracy and ecological reality.

Asymmetrical movement is the occurrence of a preferential direction of migration, i.e., the probability of moving in one direction is not the same as moving in the opposite direction (Pringle et al., 2011;Acevedo & Fletcher Junior, 2017). Asymmetrical dispersal has been reported in various ecological systems and for a variety of groups. For example, asymmetrical dispersal due to oceanic and stream currents for many species (Riginos et al., 2019), and wind patterns generating directional dispersal in fungal, orchid, and zooplankton species (Rieux et al., 2014;Acevedo et al., 2015). Also, asymmetric dispersal can arise or intensify due to landscape fragmentation and climate change (Pavlacky Junior et al., 2012;Acevedo et al., 2020;Dalui et al., 2020). Specifically in freshwater environments, species dispersal is constrained not only by geographical distance but also by the hierarchical structure and flow direction of rivers (Domisch et al., 2015). Consequently, in these systems, there is a predominant downstream direction of migration (Asymmetrical dispersal: Pringle et al., 2011;Altermatt et al., 2013). Thus, dispersal proxies that account for asymmetrical (directional) movements are more appropriate to characterize species movement in freshwater systems (Altermatt, 2013;Mozzaquattro et al., 2020). Although some recent studies have addressed the impacts of dispersal limitation in ENMs in freshwater environments (Bush & Hoskins, 2017;Perrin et al., 2020), few studies explicitly have included flow direction in their dispersal metrics [e.g., Ver Hoef et al. (2006)].

Introduction
Ecological Niche Models (ENMs) use environmental (mostly climatic -abiotic) and species distribution (mostly occurrences) data to predict climatically suitable areas for species survival Araújo & Peterson, 2012;Peterson & Soberón, 2012). ENMs have been widely used for predicting the species' potential distribution as a species conservation tool in poorly known areas (Guisan et al., 2013). Moreover, they have been used in different research areas, such as climate change (Nabout et al., 2011;Anderson, 2013;Ruaro et al., 2019;Ferreira et al., 2021), invasive species , disease transmission (Peterson et al., 2005), delimitation of conservation areas (Carvalho et al., 2017), the effect of habitat loss , among others [see Peterson (2006) for more applications].
The factors determining species distribution regarding their niche are represented in the BAM diagram (Biotic, Abiotic, and "Movement") (Soberón, 2007). However, correlative ENMs often use only species occurrence data and environmental factors (abiotic) . Therefore, including Biotic and Movement factors in ENMs are current knowledge frontiers to improve the predictions of species distributions, and some recent papers have used these new predictors in ENMs [see, for example, Barve et al. (2011);Cardador et al. (2014); Cunha et al. (2018); Gherghel et al. (2018)].
Movement constraints can be inserted as a component of the ENMs by including areas environmentally accessible or inaccessible to species, functioning as a proxy of dispersal limitation or migration rate (Miller & Holloway, 2015). In this sense, identifying relevant movement constraints and successfully incorporating them into ENMs are crucial to understanding landscape-habitat connectivity and species dispersal (Vasudev et al., 2015;Perrin et al., 2020). Some studies have successfully included dispersal proxies as predictors in ENMs, generally through distance functions (e.g., Euclidean and Kernel distances) and fixed dispersal rate (Barbet-Massin et al., 2012b;Holloway et al., 2016;Monsimet et al., 2020), either a priori (i.e., as explanatory variables) or a posteriori (i.e., by overlapping accessible and futuros estudos de ENM, especialmente aqueles que envolvem grupos de espécies com dispersão direcional, devem considerar a inclusão de preditores espaciais assimétricos para aumentar a precisão do modelo e sua aplicabilidade ecológica. Palavras-chave: Mapas Assimétricos de Autovalores; Modelos de Nicho Ecológico; dispersão direcional; Coordenadas Principais de Matrizes de Vizinhos; modelagem espacial. be used by inserting the directional effect of species dispersal in many macroecological approaches and at many scales (Blanchet et al., 2011). Moreover, recent studies have shown the importance of AEM filters as a surrogate for dispersal in freshwater metacommunity structuring, which, sometimes, contributes more than environmental processes [e.g., Dong et al. (2016); Mozzaquattro et al. (2020); Rocha et al. (2020)]. Besides the potential of such spatial predictors to account for the dispersal structure of species, for our knowledge, there is no study using them to evaluate dispersal constraints (either asymmetrical or symmetrical) in ENMs.
Therefore, in this study, we aim to evaluate the effect of including asymmetrical and symmetrical dispersal predictors in the potential geographic distribution of Aspidoras eurycephalus, a Neotropical fish endemic to the studied area. Considering the study objective, we selected one Neotropical basin (Tocantins-Araguaia River basin) and one fish species occurring in this basin. This basin is located in the central portion of Brazil and has a bioclimatic variability mainly along the latitudinal gradient. Moreover, the basin is divided into two major sub-basins (Tocantins and Araguaia) connected in the northernmost part of the basin, where some species have been recorded only in one sub-basin (e.g., our model species). Therefore, the latitudinal gradient of climatic variables and the longitudinal dispersal limitation (except for one connection in the north region) support this basin as a suitable model region to evaluate the influence of directional spatial predictors on ENMs. We hypothesize that models built with variables including the directional dispersal effect (AEM) through the rivers will produce more realistic and accurate models since freshwater fish species have directional dispersal routes through the hydrographic network.

Study area
The study area of the ENMs is the Tocantins-Araguaia River basin, covering the entire hydrographic network. This basin has two major rivers (Tocantins and Araguaia), forming two subbasins that merge in the north region, close to the mouth ( Figure 1). Therefore, a species occurring exclusively in one sub-basin needs to disperse a long way through the main river course in that sub-basin to occupy the other sub-basin. Thus, this basin is an interesting area for studies on dispersal limitation because it shows the species effort to disperse between the two basins (Tocantins and Araguaia). The entire hydrographic network was rasterized into grid-cells with 0.5º resolution (latitude and longitude), totaling 282 cells for the Tocantins-Araguaia basin ( Figure 2A).

Species occurrence data
We obtained the fish occurrence data from specific online databases: Species Link, Global Biodiversity Information Facility (Gbif ), and FishBase (Froese & Pauly, 2019). All records are from fish collections and museums, such as the UNT (Fish collection of the Federal University of Tocantins) and the collection time frame varies from 2002 to 2012. We used Aspidoras eurycephalus Nijssen & Isbrücker, 1976 as our model species to evaluate the contribution of spatial predictors in the potential distribution. This small catfish of the order Siluriformes is endemic to the upper Tocantins River sub-basins, with a maximum size of 3 cm and wide distribution in this river (Reis et al., 2003). We removed the duplicated occurrence points (more than 1 point in the same grid-cell), resulting in 23 unique occurrence points of this species ( Figure 1). Moreover, because presence-absence methods (e.g., GLM) require absence records, we generated 56 pseudo-absences (~20% of study area cells) randomly distributed in all extent area (1 per grid-cell), avoiding grid-cells containing occurrence records (Barbet-Massin et al., 2012a).

Environmental data
The environmental variables (ENV) used were the 19 bioclimatic variables from the online database Worldclim (Hijmans et al., 2005). We rescaled the variables to a resolution of 0.5º (~ 55 km) using the function aggregate from the raster package (Hijmans, 2019), and then used a factorial analysis (FA) using varimax rotation to remove collinearity among variables. For this, we used the functions fa and fa.parallel of psych package (Revelle, 2019) of the R software (R Core Team, 2019) to determine the adequate number of non-orthogonal axes. We determined four axes (i.e., the number of factors with eigenvalues higher than simulated eigenvalues) through a screen plot of actual data and simulated data eigenvalues, and then selected the variables with the highest loading (-or +) in each axis. The environmental variables selected were: BIO1 = Annual Mean Temperature, BIO2 = Mean Diurnal Temperature Range, BIO13 = Precipitation of Wettest Month, and BIO15 = Precipitation Seasonality. It is important to note that, whenever possible, predictors should be chosen considering the species biology, however in the absence of such information, statistical selection (as used here) have been widely used. In this sense, the choice of environmental variables based on species biology can lead to an increase in the importance of environmental variables.

Asymmetrical binary matrix
The binary matrix was created through the grid-cells of the Tocantins-Araguaia basin with a resolution of 0.5º (latitude e longitude) (Figure 2A). This matrix consists of river connections (edges) in the columns and numbers of each cell (nodes) in the rows ( Figure 2B and 2C). We started counting the grid-cells containing rivers upstream, with the first grid-cell corresponding to the mouth of the basin. The first connection (E1) received value 1 for all nodes of the basin. The subsequent connections (E2, E3, ...) also received value 1 for all nodes, except for the first node with 1 in this connection. We repeated this step until we reached a tributary (lower order) river. In this case, we started a new edge (e.g., E7 in our study), assigning value 1 for all nodes in this connection and subtracting the first node in the following edges of this tributary river. Once we finished the last edge in the tributary, we returned to the main connection (the entire basin), attributing value 1 to the remaining nodes until we reached another tributary. We repeated this process until we reached the last edge, which usually contains the source furthest from the mouth of the basin ( Figure 2C). We ended up with an asymmetrical (directional) matrix with 282 grid-cells (nodes) and 420 connections (edges). This methodology is based on Blanchet et al. (2008).

AEM and PCNM calculation and axes selection
We used asymmetrical and symmetrical spatial filters as dispersal predictors in the construction of ENMs. The spatial filters used were PCNM (Principal coordinates of neighbor matrices; Borcard & Legendre, 2002) and AEM (Asymmetric eigenvector maps; Blanchet et al., 2008). Each filter (orthogonal axes) is a representation of the geographic space, where the first axes represent a large-scale variation, and the last axes represent a fine-scale variation (Borcard & Legendre, 2002). Specifically, in our study, AEM filters represent the flow path of rivers throughout the basin, where the river connections and directionality are translated from a binary matrix of connectivity. PCNM filters, on the other hand, represent the non-directional (symmetrical) effect of dispersal into the species distribution, such as a distribution by air. The spatial filters can be incorporated into approaches of multiple regressions (e.g., niche models), inserting spatial autocorrelation into the models (Diniz-Filho & Bini, 2005). We removed the spatial filters (AEM and PCNM) with low importance considering the comparison with null models (broken-stick), their low spatial structure, and high correlation with bioclimatic variables (redundant spatial filter).
The AEM calculation [function aem of adespatial package (Dray et al., 2019)] is based on the SVD (Singular Value Decomposition) analysis using the asymmetrical binary matrix as the input matrix. Similar results can be obtained through a PCA (Principal Component Analysis) using the same binary matrix or a PCoA (Principal Coordinates Analysis) using a Euclidean distance matrix from this binary matrix ( Figure 2D) (Blanchet et al., 2008). The AEM calculation generated eigenvalues and eigenvectors (axes) that correspond to the dispersal predictors. Since here we are using standardized river connections (edges), which are grid-cells with the same size throughout the basin, it was not necessary to include weights in into AEM calculation, different from using river segments as edges that may have different lengths. We selected the axes sufficient to explain the total variance (i.e., the sum of the eigenvalues) using the broken stick method for AEM based on SVD. We found no correlation between axes with any bioclimatic variable. Therefore, we retained the first seven axes through the broken stick, which were used posteriorly as predictors to construct the niche models (Jackson, 1993) ( Figure 2E).
The construction of PCNMs [function pcnm of package vegan (Oksanen et al., 2017)] was performed using the centroid coordinates (latitude and longitude) of the 282 grid cells. The analysis resulted in 281 eigenvectors (axes) corresponding to the dispersal predictors. We used the Moran's I test (Cliff & Ord, 1981) to select the axes sufficient to explain the total variation. To calculate the Moran's I, we used the function moran.randtest of package adespatial (Dray et al., 2019), using the eigenvectors (axes) generated by the PCNM analysis as input. This function computes the Moran's I for each axis generating significant (p<0.05) correlation values ranging from 0 to 1 for each axis, where the axes selected had values above 0.7. Finally, through the Moran's I approach, we retained the first 20 axes. However, we removed the first axis (PCNM1) due to its collinearity with the climatic variables. Therefore, 19 axes were posteriorly used as symmetrical dispersal predictors (PCNM) in the niche model construction ( Figure 2F).
Models were built using 75% of presence and absence points for training (model's construction) and 25% for testing the models' performance (Guisan & Zimmermann, 2000), randomly chosen. Thus, we have the same set of pseudo-absences for all models, but with random subsets for each run. In this sense, we can compare the different sets of variables among models that have the same initial datasets. All models generated were evaluated using the Area Under the 'receiver operating characteristic' Curve (AUC; Swets, 1988), which is an evaluation metric threshold-independent (the limit for determination of presences/absences) that compare predict with observed values. The final map of potential distribution is a consensus map (ensemble) built using the mean suitability values of all models with AUC > 0.7 weighted by the AUC values. The models were generated and evaluated using the dismo package (Hijmans et al., 2016) available in the R software (R Core Team, 2019).

Data analysis
The eigenvectors (axes) generated by the asymmetrical (AEM) and symmetrical (PCNM) spatial filters, as well as the environmental variables, were used as predictors in the ENMs. These models were evaluated and compared using different statistics for models' performance (AUC, TSS [Sensitivity + Specificity -1], Sensitivity [True Positive Rate], and Specificity [True Negative Rate]) to assess the effect of including spatial predictors in the niche models. For this, we compared the mean values of the evaluation metrics between the models. For this, we used the mean metric values of the 50 models, built using the six modeling methods for each of the seven sets of variables. Therefore, we ended up with a mean value for the 300 (50 models x 6 algorithms) models for each variable set, representing the set of models with the highest performance among the different sets of variables. Metric values were compared using Venn diagrams, where it is possible to visualize the values in the individual (ENV, AEM, and PCNM) and interacting sets (ENV+AEM, ENV+PCNM, AEM+PCNM, and FULL), through connected circles.
Besides comparing the evaluation metrics for each set of variables, we tested the variation of ENM evaluation metrics (AUC, TSS, Sensitivity, and Specificity) among the variables and methods through an interaction plot between the set of variables and modeling methods used. Since each model built used one set of variables and one modeling method, we arranged the output metric values corresponding to each method and variable. For this, we obtained the mean metric values of all 50 models for each interaction between the set of variables and modeling methods (6 x 7 = 42 points plotted). In this sense, we generated two interaction plots between variable types and modeling methods, using the ENM evaluation metrics of these interactions.
All analyses in this study were performed in the R software version 3.5.1 (R Core Team, 2019).

Results
The consensus maps (ensembles) generated using the sets of variables individually (ENV, AEM, and PCNM), and in addition to other variables (ENV+AEM, ENV+PCNM, AEM+PCNM, and FULL), showed similar predicted distribution outputs. The models predicted suitable areas close to the known species core region. However, it is possible to notice that AEM-based maps limited the species` potential distribution for the Tocantins River sub-basin where the species currently occurs (Figure 3).
The models generated using the combination of climate and asymmetrical dispersal variables (AEM+ENV) had the best performance for all evaluation metrics, except for sensitivity. Besides, models using only asymmetrical spatial predictors (AEM) also had a high performance in their evaluations, showing the highest sensitivity of all models. On the other hand, models using only PCNM and together with other variables (ENV+PCNM and AEM+PCNM) generated the models with the lowest performance for both metrics (Figure 4).
The evaluation metrics of models' performance (AUC, TSS, Sensitivity, and Specificity) showed high variability among modeling methods and variable types (environmental, asymmetrical, and symmetrical spatial variables). In general, AEMbased models had the highest evaluation metrics and the lowest performance variation, especially for models with AEM only and AEM+ENV. On the contrary, models built using ENV and PCNM (alone or together with other variables) generated high variability among methods, especially for PCNM-based models, which had the worst performances. This reduction in performance is accentuated when combined with Bioclim and  GLM. The remaining model combinations had consistent performances ( Figure 5).

Discussion
Our results showed a significant influence of asymmetrical spatial predictors in freshwater ENM performance. Moreover, the inclusion of asymmetrical predictors (AEM) accounted for dispersal limitations, informing the models about areas with difficult access to the species, unlike models based only on climatic aspects. In this sense, AEM-based models limited overprediction in areas climatically similar but disconnected through rivers, such as in the upper Tocantins-Araguaia River basin.
The predictors considered here generated models with similar spatial distribution. However, models based on AEM (alone or together with other predictors) showed the highest restriction to the Tocantins River basin, the known species distribution area. This restriction demonstrates the effect of the dispersal path the species has to travel to reach other regions. Therefore, the representation of directional dispersal along the river course through asymmetrical spatial predictors (AEM) is more suitable to include dispersal routes of rivers in niche models. Besides, different methods have been used to incorporate the dispersal effect in ENMs (Engler et al., 2012). However, dispersal limitation in those studies is often based on symmetrical dispersal or migration rates for areas surrounding the current species distribution (Miller & Holloway, 2015).
The AEM predictors utilized in the study presented are widely employed in aquatic metacommunity research as a proxy for species dispersion [e.g., Rocha et al. (2020)]. The AEM has shown particular promise for aquatic environments due to its consideration of connectivity and direction within the hydrographic network, which are crucial for dispersal processes in such environments [see, for instance, Heino et al. (2015)]. In this paper, we applied the AEM on a broader spatial scale. Despite this expansion, the effectiveness of the AEM as a statistical model was confirmed, thereby establishing its potential for niche modeling research involving species from aquatic environments.
The models built using the set of climatic and asymmetrical spatial predictors (ENV+AEM) had the best evaluations, similar to models built with AEM only, which had the highest sensitivity of all models (i.e., more accurately predicted occurrences). The evaluation is directly linked to the model's performance; therefore, the best-evaluated models represent more reliable predictions of the species` potential distribution (Allouche et al., 2006). Conversely, models using PCNMs showed the worst evaluations, either alone or together with other predictors, for all tested combinations of variables. Therefore, models built using only the symmetrical space (PCNM) are not reliable to predict the potential distribution of this freshwater fish species. However, hitting or missing presence and absence areas of species may not be enough to assess the quality of the models, as they can be inflated by the lack of extrinsic characteristics of the species (e.g., dispersal), generating overpredicted and unrealistic models (Uribe-Rivera et al., 2017). Therefore, besides increasing the model's performance, asymmetrical spatial predictors in ENMs insert dispersal routes (i.e., rivers and streams) that species must travel to be present in other locations, increasing the ecological reality of ENMs.
The models' performance also varied depending on the combination of predictor variables (i.e., spatial and climatic) and modeling methods used to build the ENMs. Some modeling methods showed more variation in their evaluations than others, especially those combined with symmetrical predictors (PCNM). Those variations (uncertainties) in ensemble models among modeling methods are widely known and studied (Diniz-Filho et al., 2009), and can also change depending on the combinations of variables used (Parreira et al., 2019). Among all combinations, models built using only asymmetrical dispersal (AEM) or with other predictors (i.e., AEM-based models), had the lowest variation among methods, showing overall satisfactory performances. It is expected that models with spatial restrictions generate less variation among modeling methods as these "barriers" or dispersal paths may limit the species to occur in other climatically suitable locations, hindering access to inaccessible or less accessible areas (Uribe-Rivera et al., 2017). For example, the upper basin elevation between Araguaia and Tocantins tributaries limits the species dispersal between these two sub-basins, except for their only downstream connection. Therefore, this dispersal restriction in ENMs limits possible overpredictions to climatically similar but disconnected areas (Mendes et al., 2020).
Spatial predictors in freshwater environments can be a valuable tool to build models with lower overprediction of suitable areas for species. In these models, the dispersal routes along the river course, through spatial filters, are included to represent the species movement throughout the basin hydrographic network (Blanchet et al., 2008), improving the accuracy of ENMs predictions for these environments. Moreover, we emphasize that the models built using AEM (alone and together with other predictors) showed higher accuracy than the traditional climatic-only models. Nevertheless, new approaches to generate the asymmetrical binary matrix automatically for the AEM analysis are necessary since manually designing this matrix for large extensions, such as biogeographic regions (e.g., Neotropics) or the entire globe, is yet very demanding. Furthermore, we expect future ENM studies to assess the insertion of this spatial direction in other basins, at different spatial scales, and with a larger species pool, which could allow assessing the effect of different dispersal abilities.
In fact, the dispersal of the species can be strongly related to intrinsic morphological traits (e.g. body size) (Bie et al., 2012), life-history [e.g. Tamme et al. (2014)] and fundamental niche breadth [e.g. Arribas et al. (2012)]. Therefore, the use of AEM to predict the potential species distribution is "context-dependent" and understanding dispersion dynamics is important for considering the use of AEM. For example, it is possible that asymmetrical predictors are less relevant for species with higher dispersal abilities and/or frequent upstream movement, as AEM could be less effective in capturing the dispersal behavior of such species. However, organisms can present different rates of dispersion throughout their life cycle. Therefore, introducing dispersion predictors together with the different factors that are associated with the ability of organisms to disperse is one of the frontiers of research on ecological niche modeling.
In conclusion, our results show an emerging potential for the use of asymmetrical dispersal filters (AEM) as variables for the construction of aquatic ENMs, either alone or together with climate-based variables. Besides, other approaches could be developed following this methodology of inserting asymmetrical (directional) dispersal in freshwater ENMs. For example, 1) more accurately predicting the dispersal pathways of freshwater invasive species in new habitats, or 2) evaluating the effect of dispersal interruption, such as current or planned hydropower plants (HPP), on the dispersal of fish species by inserting the disconnection into ENMs built using AEM filters through disconnections in the directional matrix in the HPP areas. Therefore, this approach of using asymmetrical dispersal in freshwater ENMs may contribute to new insights regarding the potential distribution of freshwater species by considering their dispersal routes in the models. (CAPES -finance code 001) and PNPD/CAPES, respectively. Our work on Ecological Niche Model has been continuously supported by different grants: CNPq, Fundação de Amparo à Pesquisa do Estado de Goiás (FAPEG/process 201710267000519), and Institutos Nacionais de Ciência e Tecnologia (INCT) in Ecology, Evolution and Biodiversity Conservation (MCTI/CNPq/ FAPEG/465610/2014-5), and Rede Brasileira de Pesquisas sobre Mudanças Climáticas Globais (Rede CLIMA). GT was supported by DTI scholarship from CNPq/Rede CLIMA. JCN was supported by the CNPq productivity fellowship.

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.