Distribution of Chinese traditional villages and influencing factors for regionalization

ABSTRACT: Traditional Villages (TVs) are typical and representative of the agricultural civilization in millions of Chinese villages. The distribution of TVs shows spatial heterogeneity, based on the complexity and diversity of several influencing factors. In this study, 6,819 Chinese TVs were identified and the influencing factors that affect their distribution were screened in terms of three indicator groups: climatic, geographic, and humanity-related factors. Additionally, the K-means clustering algorithm clustered the TVs into different distribution regions. The quantitative relationships between the dominant influencing factors of different distribution regions were revealed to ensure a lucid understanding of the regional distribution of TVs. The results indicated that 1) climatic factors have the greatest impact on the spatial distribution of TVs, followed by geographic factors, particularly the elevation, and then by human factors, of which ethnic distribution played a relatively important role; and 2) twenty-one TV clustering distributions were obtained, which were classified into eight regions of TV distribution with different dominant influencing factors. Management and protective strategies were formulated based on the attribute analysis of influencing factors in each region. The obtained results delineated homogeneous TV distribution regions via the clustering method to achieve an accurate statistical analysis of the influencing factors. This study proposes a new perspective and reference for managing and protecting the diversity, continuity, and integrity of TVs across administrative regions.


INTRODUCTION
Urbanization, producer and consumer behavior, and extensive economic growth have resulted in the serious degradation of the living environment in villages (ANTROP, 2004; TAN & LI, 2013), leading to the disappearance of important natural and cultural heritage specific to certain villages (YANG et al., 2018). In China, there are approximately 10 million rural-to-urban migrants each year, and the number of villages has decreased by 900,000 over the past decade (YANG, 2013). In 2012, the Chinese government launched a collaborative investigation into Traditional Villages (TVs) to identify and protect villages with valuable historical and cultural heritage. TVs, also known as 'ancient villages,' formed early; have rich traditional resources and certain historical, cultural, scientific, artistic, social, and economic values; and still retain relatively well-preserved village style features, architectural structures, and unique folk customs that must be protected (MINISTRY OF HOUSING AND URBAN-RURAL DEVELOPMENT OF THE PEOPLE'S REPUBLIC OF CHINA et al., 2012).
Although there is no uniform definition of TVs, traditional/historical rural settlements with homology have long been studied (PARSONS, 1972). Previous studies have examined the villages' locations in relation to several factors, such as the geographical environment, economics, and population demographics (GROSSMAN, 1987; DIXON, 1996) as well as immigration (MCLEMAN, 2011; YAWORSKY & CODDING, 2018). These studies either focused on a single factor or used a qualitative approach. Several studies have evaluated the influencing factors and laws that govern the distribution of ancient settlements from an archaeological perspective (PARSONS, 1972; DEMJÁN & DRESLEROVÁ, 2016; FANTA et al., 2020). These studies have emphasized the relationship between early complex rural settlements and specific environments (GREEN & PETRIE, 2018), often focusing on a fixed historical period, with an emphasis on the protection of relics. With the development of computer and remote sensing technologies, spatial analysis technology based on geographic information systems (GIS) and mathematical modelling have been used to quantitatively analyze the spatial patterns of TVs (SEVENANT & ANTROP, 2007; BRAGA et al., 2016; YANG et al., 2018; FANTA et al., 2020). These studies often adopt the case study method and lack quantitative analysis of large samples of TVs. Additionally, even macroscale studies have not analyzed in depth the internal mechanisms that drive the differences and similarities in the TV distribution due to different dominant factors.
In this study, based on the 13 influencing factors of 6,819 TVs in China, we used cluster analysis to divide the TVs into 21 groups. Regression analysis of each cluster was then used to group the TVs into 8 distribution regions. We focused on three questions: What is the spatial distribution of TVs under the influence of differences and synergy among the influencing factors? What are the dominant factors? How can the scientific management and protection zoning of TVs be carried out based on the dominant factors? The purpose of this study was to 1) further understand the distribution status and formation environment of TVs and 2) identify the distribution of TV management regions under the influence of similar environmental factors and determine the main influencing factors within each region to guide subsequent planning and policy formulation. This study seeks to overcome limitations of previous studies, namely their over-dependence on administrative boundaries and qualitative methods and their lack of integrated analysis of the environmental factors affecting TVs. The research framework proposed in this study is conducive to the clarification of regional studies of TVs.
The structure of this article is as follows. First, we introduce the data collection and research methods. We then present the results of this study from two perspectives: the attributes of the 21 clusters and the characteristics of the 8 distribution regions. Finally, we discuss our results and propose management and protection suggestions for different distribution regions.

Data collection
According to an evaluation of TV Protection and Development, a total of 6,819 TVs from five approval batches dated up to June 2019 were selected, excluding those in Hong Kong, Macao, and Taiwan (MINISTRY OF HOUSING AND URBAN-RURAL DEVELOPMENT OF THE PEOPLE'S REPUBLIC OF CHINA et al., 2019). Spatial attributes such as the geographical coordinates of TVs were obtained manually using Google Earth to draw a spatial distribution map of the selected TV points (Figure 1).
The influencing factors for the TV distributions were divided into three categories: climate, geography, and humanities.  LINGUISTICS CASS, 2012], in combination with the distribution of the Han nationality, and vectorized using ArcGIS), and cultural relics and historical sites (obtained from Baidu Maps).

Data processing
The 13 influencing factors included continuous and categorical variables. The categorical variables were the climatic zone, aspect, and soil type. Because categorical variables cannot be used directly in regression analysis and K-means clustering analysis, these variables were transformed into multiple binary dummy variables. For a categorical variable with n categories, n - 1 dummy variables were defined. If a TV belonged to the category represented by a dummy variable, the value of that variable was set to 1; otherwise, the value was set to 0 (TSUNODA et al., 2012).
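The dummy coding described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `dummy_encode`, the choice of reference category, and the sample aspect classes are all assumptions made for the example.

```python
def dummy_encode(values, reference=None):
    """Encode a categorical column as n - 1 binary dummy columns."""
    categories = sorted(set(values))
    # The reference category is dropped; it is represented by all zeros.
    if reference is None:
        reference = categories[0]
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Hypothetical aspect classes for four villages (illustrative values only).
aspect = ["north", "south", "east", "south"]
columns, encoded = dummy_encode(aspect, reference="north")
print(columns)
print(encoded)
```

Dropping one reference category keeps the dummy columns from being perfectly collinear with the intercept in a regression model.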
Ethnicity is representative of religion, beliefs and customs, and it has numerous classifications. Therefore, identifying and processing this variable is difficult. We adopted the sixth census statistics from 2010 and set the county scale as the calculation unit. The number of dominant ethnic groups determined the influence of ethnic diversity, which was then classified as 'high,' 'medium,' or 'low' diversity, assigned as values of 3, 2, and 1, respectively by ArcGIS, which were then transformed into a map (Figure 1-b).
The traffic network and the cultural relics and historic sites were special continuous variables that required additional data processing. Using the macro traffic system of the main Chinese railways, highways, and highway service points and the multiple ring buffer module of the analysis toolbox in ArcGIS, a three-layer valuation buffer zone was established (Figure 1-c) to represent the radiation attenuation of traffic influence. Buffer zones closer to the traffic network were given higher traffic influence values, which were then assigned to the TV points. We used a kernel density analysis tool to establish a map of the cultural relics and historic sites and determine their influence, which ranged from high to low based on the distance from the sites (Figure 1-d).
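The three-layer buffer valuation can be illustrated with a short sketch. The ring radii (5, 15, and 30 km) and the scores (3, 2, 1) are hypothetical values chosen for the example; the text does not state the thresholds used in ArcGIS, and `buffer_score` is an illustrative name.

```python
def buffer_score(distance_km, rings=((5.0, 3), (15.0, 2), (30.0, 1))):
    """Return the influence value of the innermost ring containing the point.

    `rings` lists (outer_radius_km, value) pairs from the innermost ring
    outwards. The radii and values here are assumptions for illustration.
    """
    for outer_radius_km, value in rings:
        if distance_km <= outer_radius_km:
            return value
    return 0  # outside every buffer ring


# Hypothetical distances from four villages to the nearest traffic line.
for d in (2.0, 10.0, 30.0, 45.0):
    print(d, buffer_score(d))
```

Each TV point receives the value of the ring it falls in, so the score decreases stepwise with distance from the network.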
The obtained values were also assigned to the TV location points. For the remaining continuous variables, we only removed the outliers and missing values. Finally, a database of the 13 influencing factors of 6,834 TV points was constructed.

Cluster analysis of influencing factors
K-means clustering, one of the classic and most commonly used clustering methods, is a simple unsupervised learning algorithm (MACQUEEN, 1967). This algorithm is widely used to cluster large-scale data (JAIN & DUBES, 1988). The number of clusters was chosen using the 'elbow method,' based on the sum of squared errors (SSE) computed for each candidate number of clusters. The point after which the SSE stopped dropping sharply and changed only slightly was designated as the elbow point, which represented the optimal K value of the clustering number (KODINARIYA & MAKWANA, 2013). The SSE was calculated as follows:
$$SSE = \sum_{i=1}^{k} \sum_{P \in L_i} \left| P - q_i \right|^2$$
where L_i is the i-th cluster, P is a sample point in L_i, q_i is the centroid of L_i (the mean of all samples in L_i), and k represents the number of clusters.
In this study, the 13 factors and geographic coordinate information of TVs were selected as the clustering database, where the optimal cluster number range was identified using the 'elbow method.' Within this range, the optimal K value was determined through experiments and combined with manual selection. The obtained optimal K value and the clustering database were input into the grouping analysis module of the spatial statistical tool in ArcGIS. The K-means clustering algorithm was selected to obtain the clustering distribution diagram of TVs' influencing factors.
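The elbow procedure can be sketched in plain Python as follows. This is a toy version of Lloyd's K-means algorithm on made-up two-dimensional points, not the ArcGIS grouping analysis used in the study; the function names, the random seed, and the sample coordinates are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical points forming three loose groups (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
for k in range(1, 7):
    centroids, labels = kmeans(points, k)
    print(k, round(sse(points, centroids, labels), 2))
```

Plotting the printed SSE values against k gives the elbow curve; the K at which the curve flattens is the candidate cluster number. Because the result depends on the initial centroids, several seeds are usually compared in practice.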

Analysis of the spatial density distribution of TVs using the kernel density tool
In kernel density estimation, a moving cell is used to estimate the density of point or line patterns (ALLEN, 2016). A continuous density surface was generated from the discrete point elements in the region to visually show the agglomeration status of points within the entire region, as follows:
$$f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
where x and x_i are the spatial points, f_n(x) is the estimated kernel density at point x, K is the kernel function, n is the number of points, and h is the bandwidth. The bandwidth was set to the system default.
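As a rough illustration of kernel density estimation, the sketch below evaluates the standard Rosenblatt-Parzen estimator, f_n(x) = 1/(nh) Σ K((x - x_i)/h), in one dimension. The Gaussian kernel, the bandwidth of 1.0, and the sample positions are assumptions made for the example; the ArcGIS tool used in the study works on two-dimensional point data with its own kernel and default bandwidth.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density(x, samples, h):
    """Estimate the density at x from 1-D samples with bandwidth h."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


# Hypothetical 1-D village positions: a tight group near 0 and one outlier.
samples = [0.0, 0.2, -0.1, 0.1, 8.0]
print(kernel_density(0.0, samples, h=1.0))  # inside the dense group
print(kernel_density(4.0, samples, h=1.0))  # in the sparse gap
```

A larger bandwidth h smooths the surface and merges nearby groups, whereas a smaller h produces sharper, more localized peaks.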

Multiple regression analysis
SPSS (v25.0) was used to conduct multiple linear regression analysis, where the kernel density value of TV points in each cluster was the dependent variable and the 13 influencing factors were the independent variables. The statistically significant factors influencing the TV density distribution of each cluster were determined. The areal units in close proximity to each other (GUO et al., 2000) were then grouped into eight regions representing different dominant factors.
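The core of a multiple linear regression can be sketched without any statistics package. The example below solves the ordinary least-squares normal equations for a small made-up data set; it is not a reproduction of the SPSS analysis and does not compute the p-values or standardized β coefficients reported in the results. The function name `ols` and the two hypothetical predictors are assumptions for illustration.

```python
def ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp] via normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Build the augmented system (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]


# Hypothetical data: density = 1 + 2 * factor_a + 3 * factor_b (no noise).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
coefficients = ols(X, y)
print([round(c, 6) for c in coefficients])
```

In the study's setting, y would hold the kernel density value at each TV point in a cluster and each row of X the values of the influencing factors at that point, with one such regression fitted per cluster.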

Cluster analysis of TVs with similar influencing factors
According to the 13 influencing factors and the longitude and latitude information of the TV points, the optimal cluster number k value was determined to be 21 using the elbow method (Figure 2). The grouping analysis tool in ArcGIS was used to generate the TV cluster distribution map, which is shown in Figure 3.
The majority of the cluster areas showed strong agglomeration in space, including clusters 2, 3, 5, 8, 9, 10, 11, 12, 14, 20, and 21. For example, cluster 10 contained 459 TVs that were predominantly distributed in the temperate monsoon climate zone and at high elevations in north-central China. Dense areas of cultural relics (p = 0.000, β = 0.205) had a positive influence on TV distribution, whereas the wetness index (p = 0.008, β = -0.222) had a significantly negative effect. Additionally, cluster 11 had a medium density of TVs but exhibited strong characteristics. Its TVs were mainly distributed in eastern Yunnan and had a high density of cultural relics and historic sites, which can be attributed to the high ethnic diversity (p = 0.000, β = 0.537), as Yunnan is the province with the most ethnic minorities. In addition, cluster 11 had low population and traffic densities, and its TVs tended to be distributed in plain areas with high elevations and slopes. These findings suggested that this cluster was typically protected by the natural environment from the adverse effects of social factors, such as the economy and traffic network.
Certain clusters were multi-regional, including clusters 15, 18, and 19, which were widely distributed in central and southern China. Notably, cluster 17 had 193 TVs with no significant influence from any of the 13 factors.

TV regions with different dominant influencing factors
Combining clusters and our interpretation of clustering attributes, we summarized the eight regions of TV distribution (Figure 4). Region Ⅰ, climate-dominant type, includes clusters 5 and 9. This region, located in southern Tibet and the northeast of Sichuan, has low precipitation, wetness, and temperature, but high elevations. TVs in Region Ⅰ tend to be distributed in areas with suitable climatic conditions. Region Ⅱ, humanity/climate-dominant type, includes clusters 1, 2, and 12. Region Ⅱ is located in northern China, southern Inner Mongolia, and Gansu, and has a low population and few cultural relics. TVs are typically distributed in places with a relatively high concentration of cultural relics and historical sites, and they are significantly influenced by ethnic groups.
Region Ⅲ, humanity-dominant type, includes clusters 10 and 20. This area is located in Shaanxi, Shanxi, Hebei, Shandong, and Henan, and has a high density of cultural relics and historic sites, with the strongest influence on the distribution of TVs. In addition, TVs are distributed in areas with improved economic conditions. Region Ⅳ, geography/humanity-dominant type, includes clusters 11 and 21. This region is mainly located in Yunnan. As one of the provinces with the highest density of TVs, Yunnan exhibits typical ethnic diversity characteristics. TVs are mostly distributed in high elevation areas, which are strongly correlated with a high density of cultural relics and historic sites.
Region Ⅴ, geography/climate-dominant type, includes clusters 8 and 14. This region is mainly located in Guangxi, Hainan, and southern Yunnan provinces. The distribution of TVs was evidently polarized according to elevation. Owing to the high temperature in this region, TVs are distributed in areas with high rainfall and relatively low temperature.
Region Ⅵ, climate/humanity-dominant comprehensive type, is mainly represented by cluster 6, but also includes clusters 13, 15, and 18. This region is located in the eastern coastal region of China, and is significantly affected by the distribution density of cultural relics and historic sites, as well as climate. Region Ⅶ, geography/climate-dominant comprehensive type, is mainly represented by cluster 4, but also includes clusters 3, 13, and 14. This region is mainly located in Guizhou, Yunnan, and Hunan, with minor regions in additional places. TVs are mainly distributed in areas with low temperature and high humidity, which is typical of residential settlement distribution under similar climatic conditions. Region Ⅷ, the hybrid type, is primarily represented by cluster 16 but also includes clusters 7, 15, 18, and 19. It is predominantly located in Sichuan and Chongqing. TV distribution in this region is significantly affected by cultural relics and historic sites. TVs are distributed in areas with high values of climate factors, slope, and economy, which may be closely related to the unique micro-climate environment of the mountainous terrain in this region.

DISCUSSION
Different countries have different definitions of historic rural settlements. For example, in Scotland, historic rural settlements include all habitations from the fifth to the twentieth century (HARRISON et al., 2008). In China, to be rated as a traditional village, a village must meet the requirements of 20 indicators in three aspects of the 'Traditional Village Evaluation and Identification Index System': architecture, site selection and layout, and intangible cultural heritage (Traditional village evaluation and identification index system (trial implementation), 2012). This provides a clearer definition of the concept of traditional Chinese villages. Recently, discourses promoting the enhancement and development of villages have become a recurrent theme across the globe; the experience in China presents unique characteristics and also has a certain reference significance (POLA, 2019).
An improved understanding of how environmental factors contributed to regional variation in historic settlement organization can improve our understanding of the significance of variations in rural settlements and their historic character as well as the overall setting of numerous types of rural buildings (LOWERRE, 2014). Generally, in China, the climate factors have the most significant influence on a TV's distribution, followed by geographic factors, among which elevation has the greatest impact, and then by human factors, among which ethnic diversity plays a relatively important role. These three factors are spatio-temporally interrelated, and all of them contribute to the living environment in TVs. Our intent was not to advocate a completely 'environmentally determinist' approach determining the variations in historic settlement organization. Rather, it was to explore the strengths (or weaknesses) of the relationships between environmental factors and settlement organization and how those relationships vary across the country (LOWERRE, 2014). After clustering and drawing the set of settlements and environmental variables, it is possible to more directly explore the relationship between the environmental factors and TVs, and determine the environmental variables that have the greatest impact on the regional changes of TVs.
Although there are numerous spatial clustering methods, they are not applicable in regionalization (GUO, 2008). The K-means statistical method has proven to be an effective method to determine settlement groups (ÄYRÄMÖ & KÄRKKÄINEN, 2006; KOWALEWSKI, 2008). The resulting 21 types of clusters clarified the relationship between the TVs and their environments by reducing the statistical scale. Results from the clustering showed that the TV groups with similar environmental factors were distributed across administrative boundaries. Cluster types located close together had more similar dominant influencing factors than those further apart (TSUNODA et al., 2012). For the 21 clusters, the multiple regressions fitted better than the regression on the overall sample. The significant impact factors of each group are different, which also confirms the complexity of TV distribution factors; hence, unified protection standards may not apply. Interestingly, some groups in south-eastern China differ relatively little in natural factors such as topography and geomorphology. However, owing to the wide-ranging and deep cultural exchanges in this region and the complicated interaction between socio-economic factors and changes in the bio-physical environment, some clusters show a complex pattern of large-scale cohesion and partial dispersion. To uncover a pattern within the data and render the problem size manageable, it is common in most spatial studies to further aggregate the data into larger areal units or zones (GUO et al., 2000). We summarized the results of the multiple linear regressions for the cluster types and considered their geographical proximity, which provided a further characteristic basis for the regional construction of TVs. The boundaries of the eight regions were somewhat permeable or vague, but the regions are internally meaningful (KOWALEWSKI, 2008). This can facilitate the adoption of a rational and prudent approach to space and strategic planning.
Crop and livestock production are heavily dependent on rainfall. Temperature and rainfall in general are the most important climatic factors influencing the availability of natural resources and livelihood strategies (ZAMPALIGRÉ et al., 2014). For the TVs within the climate-dominant region, it is necessary to strengthen the long-term and dynamic monitoring of climatic factors; the local residents' perception of climatic factors should also be considered (BYG & SALICK, 2009). Additionally, some secondary factors of climate, which may cause floods, soil and water loss, and drought, should be emphasized and corresponding measures adopted over time.
Early settlements produced complex interactions between culture and nature (MCGOVERN et al., 2007). These complex interactions are reflected in ethnic diversity as influenced by geographic diversity (PARSONS, 1972;PAIK & SHAWA, 2013). Chinese TVs dominated by humanity factors have a strong ethnic diversity and mature cultural landscapes, necessitating their protection along with natural territories. Additionally, economic development and convenient transportation are also necessary to maintain the authenticity and sustainability of TVs.
TVs dominated by geographic factors are mainly located in areas with high elevation, large slope, and an evident topographic relief. Restricted by topography, TVs were scattered on multi-directional slopes where soil is suitable for cultivation (YANG et al., 2016). Based on these results, we suggested regular maintenance of high-quality agricultural land and the living conditions of TVs. These regions are more conducive to the preservation of multiethnic TVs and have highly prominent regional characteristics (PAIK & SHAWA, 2013).
TVs dominated by hybrid factors exhibited distinct geographic characteristics and a diversified distribution that correspondingly made their landscapes diverse. Therefore, it is necessary to strengthen the perception of TV landscape characteristics and further investigate the interactions among various influencing factors of the TV distribution to formulate more detailed management and protection measures (ANTROP, 2004).

CONCLUSION
The 13 influencing factors used in this study were closely related to the distribution of TVs. For different regions, there are differences in the significance of the influencing factors and their positive and negative effects, but there are also significant similarities in these factors, which form the 21 TV clusters and 8 types of regions based on the dominant factors. Based on the characteristics of these 8 TV regions, we can formulate diversified, targeted, and possibly even cross-regional joint management and protection strategies. These results further indicated that the combination of unsupervised classification clustering technique and multiple regression is an effective regional analysis method, and it has been possible to develop new, national-scale characterizations of historic settlement organization and their key environmental variables (LOWERRE, 2014) and effectively manage and protect rural settlements and landscapes with a long history (HARRISON et al., 2008).
This study has the following limitations: 1) the selection of the influencing factors was limited by data availability, human resources, and technological conditions; and 2) during data processing, it was difficult to avoid the intervention of subjective factors, such as the definition of the traffic network radiation range, the determination of the ethnic value, and the division of the eight regions. In future studies, the influencing factors should be further enriched such that the content of the analysis will not be solely limited to the influence of the environmental factors on TVs. Instead, future analyses should also include the interactions between the factors that influence the distribution of the TVs. The environmental tendencies of TV distribution can be used as a basis for the classification of TV typology and as a reference for the identification and mining of TVs.