Classical statistical inference for the reliability of a co-authorship network with a focus on the vertices

A research group may be considered a social network, which may be modeled by a graph G with k nodes and m edges. The researchers who make up this network can be interpreted as its nodes or actors, and the connections or links between those researchers (represented by co-authored papers) can be considered its edges. The aim of this study was to measure the reliability of networks considering unreliable nodes or researchers and perfectly reliable edges or connections. Specifically, a statistical analysis based on classical inference for network reliability was proposed, obtaining the maximum likelihood estimators and confidence intervals for the individual components (researchers) and for the co-authorship network; the methodology was applied to a UNESP research group registered in CNPq; and node centrality measures were obtained to help identify situations where the insertion of an edge or connection between two researchers of the group could significantly increase the reliability of this co-authorship network. The results showed the usefulness of statistical inference in the context of social network reliability, noting that the contribution of each researcher is extremely important for the maintenance of a research group. It was also found that calculating the reliability of a co-authorship network can be quite laborious and that centrality measures are a viable tool when the intent is to increase the reliability of this network.


Introduction
Reliability is the ability of an item to perform a function successfully under specified operating conditions. The term network reliability refers to the calculation of the reliability of any general configuration of items (or components) when the reliability of each item is given.
Networks are physical, biological or social systems characterized by a large set of well-defined items that interact dynamically. Physical networks comprise electricity and water distribution, transport, telecommunications, radio and TV, and others; social networks may be networks of personal and/or thematic relationships, communities, emails, blogs; biological networks may include food chains and disease transmission (LYRA & OLIVEIRA, 2011).
Maintaining the functionality of a network requires information on its structure, functions and characteristics. Since a network's structure may be represented by a graph, the Theory of Graphs is fundamental for determining the properties that refer to the network's topological aspects.
Network reliability is the probability that a network remains functioning even though a failure demands the removal of one or more subsets of its components (edges and/or nodes). Highly reliable networks are robust structures. Moreover, a network is more reliable than another if its probability of becoming disconnected is lower than that of the other.
It is widely held that the world has become more complex and that knowledge is increasingly difficult to construct on an individual basis.
The encouragement to form research groups in universities and development agencies reflects this fact. The institutionalization of research groups in Brazil by the National Council for Scientific and Technological Development (CNPq), coupled with their dissemination and constant updating, is a practice that foregrounds research in Brazil (MARAFON apud Miorin, 2008). Highly reliable research groups with a strong collaboration structure may contribute intensely towards the emergence and/or realization of ideas. In fact, these groups perform most current research and are responsible for the training of countless researchers.
A research group is a social network and may be modeled by a graph. The researchers that form the network may be called its vertices or nodes, and the connections and bonds between these nodes (for example, joint publications) are the edges. Current analysis studies the reliability of networks when edges are unreliable or prone to failure and the nodes are totally reliable. In other words, current research proposes (a) a statistical analysis based on classical inference for a network's reliability, with maximum likelihood estimates and the respective confidence intervals for the reliability of each edge (co-authorship bond) individually and for the reliability of the network (the probability that the research group continues to function) at a given time t; (b) the development of an analysis for a specific research group of São Paulo State University (UNESP) registered with CNPq; (c) the provision of centrality measures of nodes to identify situations in which the insertion of an edge (a co-authorship bond between two researchers of the team) may significantly increase the network's reliability.
When scientific production is registered by the researchers and published on the CNPq Lattes Database (Lattes CV), the data may contain several types of imprecision (misspelled names causing ambiguities and incorrect identification of authors; scientific articles listed under the name of one author but missing under the names of the other co-authors; papers which inadvertently were not registered under any author, and others). An inferential approach is, therefore, highly important for the reliability of a co-authorship network.

Social networks
Social networks are structures composed of people, organizations, territories or others, connected among themselves by one or several types of relationships (friendship, family, commercial etc.) through which information, knowledge, interests, values and aims are shared. Social networks reveal the development of the team's activity and indicate the efforts of the group and of each person. Knowledge of the structure, function and traits of a social network is extremely relevant for its functionality. Since social networks may function at different levels, such as networks of relationships, networks of professionals, community networks, political networks etc., co-authorship networks are included in this context. In fact, they are made up of researchers and shared tasks. These networks are symmetrical in the sense that researcher A is a collaborator of researcher B at a given time t exactly as many times as researcher B is a collaborator of A.

Since shared work or co-authorship saves time and financial and material resources, it is encouraged by research funding agencies in Brazil. These factors contribute towards the recognition of researchers that are capable of forming efficient and productive work teams (MAIA & CAREGNATO, 2008). Regardless of certain particularities, co-authorship of the products of scientific activity, with a special mention to scientific publications, indicates collaboration. Results of studies dealing with co-authorship reveal that collaboration among authors has increased significantly in all areas of knowledge and underscore the importance of co-authorship in the maintenance of research groups.

Basic concepts of the Theory of Graphs
The basic concepts of the Theory of Graphs were presented by Boaventura Netto & Jurkiewicz (2009), Silva (2010) and Lyra & Oliveira (2011). A graph is a simple, abstract and intuitive notion which represents a relationship between items. It is drawn with nodes or vertices, which signify the items, joined by lines, called edges, which denote the relationship. The mathematical representation of a simple undirected graph is $G=(V,E)$, where $V$ is the finite, non-empty set whose items are the nodes and $E$ is a set of two-item subsets of $V$ whose items are the edges. The set of nodes $V$ has cardinality (number of items) $|V|=m$; the set of edges $E$ has cardinality $|E|=k$; each edge is denoted by $\{v_i, v_j\}$, in which $v_i, v_j \in V$.
The degree of node $v_i$, denoted by $d(v_i)$, is the number of edges incident to the node. Two nodes are adjacent if an edge exists between them. A walk is a sequence of successively adjacent edges. When the last edge of the sequence is adjacent to the first, the walk is closed and called a circuit; otherwise, it is open. When all the edges of a walk are distinct, it is called a path. When every node is reachable from any other node, the graph is connected; otherwise, it is called disconnected. Edge connectivity, denoted by $\lambda(G)$, is the least number of edges whose removal transforms G into a disconnected graph. Node connectivity, denoted by $\kappa(G)$, is the least number of nodes whose removal (together with the edges incident to them) transforms G into a disconnected graph. A spanning sub-graph of G is a graph obtained from G through the mere elimination of some of its edges; here, only eliminations that keep the graph connected are of interest. A graph G with m nodes may be represented by a matrix of order m, denoted by $A(G)$ and called the adjacency matrix of G, in which entry $a_{ij}$ is equal to 1 if $v_i$ and $v_j$ are adjacent and equal to zero otherwise, for all $i, j = 1, 2, \ldots, m$. Two graphs $G=(V_1,E_1)$ and $H=(V_2,E_2)$ are equal when $V_1=V_2$ and $E_1=E_2$. Isomorphic graphs have the same structure; in other words, they have the same number of nodes and edges and the same adjacency pattern, albeit possibly a different labeling or drawing.
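The definitions above (adjacency matrix, degree, connectedness) can be illustrated with a short Python sketch. The three-node graph and the function names are hypothetical and serve only as an example; they are not part of the study.

```python
from collections import deque


def adjacency_matrix(nodes, edges):
    """Return the m x m adjacency matrix A(G) as a list of lists."""
    index = {v: i for i, v in enumerate(nodes)}
    a = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        a[index[u]][index[v]] = 1
        a[index[v]][index[u]] = 1  # undirected graph: A(G) is symmetric
    return a


def degrees(a):
    """Degree d(v_i) is the sum of row i of the adjacency matrix."""
    return [sum(row) for row in a]


def is_connected(a):
    """Breadth-first search from the first node; G is connected if all nodes are reached."""
    m = len(a)
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(m):
            if a[i][j] and j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == m


# Hypothetical example: three researchers, each pair linked by an edge (a triangle).
nodes = ["v1", "v2", "v3"]
edges = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]
A = adjacency_matrix(nodes, edges)
```

Removing edges until only `("v1", "v2")` remains leaves `v3` unreachable, so `is_connected` returns `False` for that sub-graph.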

Basic concepts for the Social Network Analysis
Since studies on social networks are interdisciplinary, several methodologies of analysis based on network structures exist. One such methodology is Social Network Analysis (SNA), whose concepts and mathematical language are very similar to those of the Theory of Graphs. Some concepts relevant to SNA, provided by Hayashi, Hayashi & Lima (2008) and Silva (2010), are given below.

Agents, items or nodes may be individual social units (people or firms) or collective social units (institutions, organizations, nations), and bonds establish relationships between the agents. Bonds may be classified as absent, weak or strong and are due to any type of liaison, such as consanguinity, friendship, professional ties and others. A relationship is a set of bonds with the same bonding criterion. Relationships have two important features that condition the methods of data analysis available: direction and value. A relationship may be directional, when one agent is the transmitter and the other is the receiver (friendship, citation etc.), or nondirectional, when the relationship is reciprocal (acquaintance, co-authorship etc.). With regard to value, a relationship may be dichotomous (indicating the presence or absence of a given bond between two nodes) or valued, with discrete or continuous values (a weight attached to the relationship; for instance, the number of scientific papers published in co-authorship by a certain pair of researchers). The agent's attributes are his/her individual characteristics, such as name, gender and age.

The tools most used in SNA comprise descriptive statistics (graphs, tables, frequency distributions, descriptive measurements and others); centrality measurements (information degree, closeness and betweenness); and cluster analysis (division of the network into subsets of agents constructed from the bonds and the positions they occupy).

Calculation of the network's reliability
Let a network be modeled by a simple undirected graph $G=(V,E)$ with m nodes and k edges. For the network to function (or be active) at time t, every pair of nodes should be connected by at least one path. Suppose that the nodes are perfectly reliable and only the edges are prone to failure. Each edge i ($i = 1, \ldots, k$) then has a functioning probability (the reliability of edge i) denoted by $p_i$. There are instances in which all the edges of the graph that models the network have the same functioning probability, simply denoted by $p$. Further, the edges are pairwise independent; in other words, the failure of one does not imply the failure of another.

To calculate the reliability of a network (the probability that the graph G modeling the network remains connected even given the failure of one or more edges), the probability of each functioning state of the network must first be determined:

$$P(E') = \prod_{i \in E'} p_i \prod_{i \in E \setminus E'} (1 - p_i), \qquad (1)$$

where $E$ is the set of edges of graph G and $E'$ is the set made up of the functioning edges of graph G.

When the edges of the graph G that models the network have the same functioning probability $p$, the network's reliability is given as:

$$R_G(p) = \sum_{i=m-1}^{k} S_i \, p^i (1 - p)^{k-i}, \qquad (2)$$

where G is the graph that models the network with m nodes and k edges and $S_i$ is the number of connected sub-graphs of G with i edges (KELMANS, 1966). The sum starts at $i = m - 1$ because a connected graph on m nodes has at least $m - 1$ edges.

When the edges of the graph that models the network have different functioning probabilities $p_i$, the reliability of the network $R_G(p)$ is calculated similarly to expression (2); that is, once the connected sub-graphs of G with i edges are obtained, the probability of each functioning state of the network is calculated and the results are added.
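For a common edge reliability p, the counts $S_i$ and the resulting reliability can be obtained by brute-force enumeration. The sketch below is only illustrative: it assumes nodes labeled 0 to m-1, uses hypothetical function names, and enumerates all edge subsets, which is feasible only for small graphs.

```python
from itertools import combinations


def is_connected(m, edges):
    """True if the graph on nodes 0..m-1 with the given edges is connected."""
    adj = {v: set() for v in range(m)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == m


def connected_subgraph_counts(m, edges):
    """S_i: number of connected sub-graphs with i edges, for i = m-1, ..., k."""
    k = len(edges)
    return {
        i: sum(is_connected(m, sub) for sub in combinations(edges, i))
        for i in range(m - 1, k + 1)
    }


def reliability(m, edges, p):
    """R_G(p) = sum over i of S_i * p**i * (1 - p)**(k - i)."""
    k = len(edges)
    counts = connected_subgraph_counts(m, edges)
    return sum(s * p**i * (1 - p) ** (k - i) for i, s in counts.items())


# Hypothetical example: a triangle (m = 3, k = 3).
triangle = [(0, 1), (1, 2), (0, 2)]
```

For the triangle, any two edges keep the three nodes connected, so $S_2 = 3$ and $S_3 = 1$, which gives $R_G(p) = 3p^2(1-p) + p^3$.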

Collection of data and the construction of the co-authorship network
The group called Research Center in Administration and Agribusiness (CEPEAGRO), from the Applied Social Sciences area, was selected so that a network of scientific co-authorship formed by researchers from a research group of UNESP could be constructed. If each researcher is represented by a node, and two nodes are linked by an edge if, and only if, the researchers have at least one publication in common, then the reliability of the research group during time t (represented by an undirected graph that models the co-authorship network among the researchers of this group) is the probability that the team continues active during time t, even though one or more flaws (changes in the number of co-authorship relations) cause the removal of one or more subsets of the graph's edges. The following methodological procedures were undertaken in current analysis:
I. Survey of the scientific production (articles in scientific journals, books, papers read in scientific events) published and listed by the researchers on the Lattes database from the moment of their insertion in the research group. The set of data on the scientific production of each researcher required for the proposed analyses was composed of:
a. the number of publications of each researcher, attributed to the research group;
b. the number of common (or co-authored) publications between research peers, attributed to the research group.
II. Organization and systematization of the collected data, coupled with the representation and analysis, by a graph, of the characteristics of collaboration (co-authorships in scientific publications) among the researchers under analysis.
III. Calculation of two centrality measurements of nodes: 1) measurement of closeness; and 2) measurement of information degree. These measurements identify situations in which the insertion of an edge, or of a bond between two researchers of the group, may significantly increase the network's reliability.

Calculation of reliability of co-authorship network
As discussed above, a research group may be treated as a social network which may be modeled by a simple undirected graph $G=(V,E)$ with m nodes (the researchers that compose the research group) and k edges (co-authorship bonds). Since nodes are perfectly reliable and only the edges are prone to failure, the reliability of the network, or the probability that the team remains active during time t even though one or more flaws remove one or more subsets of the graph's edges, is provided as follows.

Edges with the same functioning probability $p$:

$$R_G(p) = \sum_{i=m-1}^{k} S_i \, p^i (1 - p)^{k-i},$$

where $S_i$ is the number of connected sub-graphs of G with i edges.

Edges with different functioning probabilities $p_i$: $R_G(p)$ is calculated as in the previous expression; in other words, when the connected sub-graphs of G with i edges are obtained, the probability of each functioning state of the network must be calculated, and the results added.
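When the edges have different functioning probabilities, the calculation amounts to summing the probabilities of all functioning states in which the working edges keep the graph connected. The following Python sketch shows one way to do this by enumeration; the function names, the node labeling 0 to m-1 and the example probabilities are assumptions made for illustration.

```python
from itertools import combinations
from math import prod


def is_connected(m, edges):
    """True if the graph on nodes 0..m-1 with the given edges is connected."""
    adj = {v: set() for v in range(m)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == m


def reliability_general(m, edges, probs):
    """Add the probability of every state whose functioning edges connect all m nodes."""
    k = len(edges)
    total = 0.0
    for size in range(m - 1, k + 1):  # fewer than m - 1 edges cannot connect m nodes
        for working in combinations(range(k), size):
            if is_connected(m, [edges[i] for i in working]):
                up = set(working)
                # State probability: p_i for working edges, 1 - p_i for failed ones.
                total += prod(probs[i] if i in up else 1 - probs[i] for i in range(k))
    return total


# Hypothetical example: a triangle whose edges have reliabilities 0.9, 0.8 and 0.7.
triangle = [(0, 1), (1, 2), (0, 2)]
r = reliability_general(3, triangle, [0.9, 0.8, 0.7])
```

In this example the four connected states contribute 0.504, 0.216, 0.126 and 0.056, so `r` is 0.902. The number of states grows as $2^k$, which is why the calculation becomes laborious for larger networks.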

Statistical inference (or maximum likelihood)
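Supposing a co-authorship network modeled by a simple undirected graph $G=(V,E)$ with m nodes and k edges, let $Y_i$ be an indicator variable for the functioning of the i-th edge (or rather, the i-th relation of co-authorship), with $P(Y_i = 1) = p_i$, $i = 1, \ldots, k$. For the edge linking researchers r and s, $n_i$ is the total (sum) of publications of researchers r and s for the research group and $x_i$ is the number of co-authored publications of researchers r and s for the research group. The log-likelihood function for $p = (p_1, \ldots, p_k)$ is

$$\ell(p) = \sum_{i=1}^{k} \left[ x_i \log p_i + (n_i - x_i) \log (1 - p_i) \right],$$

with first-order derivatives

$$\frac{\partial \ell}{\partial p_i} = \frac{x_i}{p_i} - \frac{n_i - x_i}{1 - p_i}, \quad i = 1, \ldots, k. \qquad (5)$$

Making equation (5) equal to zero, the maximum likelihood estimators (MLE) of $p_i$ are

$$\hat{p}_i = \frac{x_i}{n_i}, \quad i = 1, \ldots, k. \qquad (6)$$

Once more deriving equation (5) with regard to $p_i$, derivatives of second order are obtained which correspond to the diagonal of the Fisher information matrix

$$I(p) = \left[ -E\left( \frac{\partial^2 \ell}{\partial p_i \, \partial p_j} \right) \right]_{i,j = 1, \ldots, k}, \qquad (7)$$

given by:

$$\frac{\partial^2 \ell}{\partial p_i^2} = -\frac{x_i}{p_i^2} - \frac{n_i - x_i}{(1 - p_i)^2}.$$

It should be noted that the second order derivatives (in the above-mentioned matrix) corresponding to $i \neq j$ are given by:

$$\frac{\partial^2 \ell}{\partial p_i \, \partial p_j} = 0.$$

The asymptotic confidence intervals for $p_i$ are

$$\hat{p}_i \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p}_i)}, \qquad \widehat{\mathrm{Var}}(\hat{p}_i) = \frac{\hat{p}_i (1 - \hat{p}_i)}{n_i},$$

where $\widehat{\mathrm{Var}}(\hat{p}_i)$ is the i-th term of the diagonal of the inverse of matrix (7) evaluated at $\hat{p}$ and $100(1-\alpha)\%$ is the level of confidence of the interval. To obtain the point estimate of $R_G(p)$, the invariance property of the estimators of maximum likelihood is employed. In other words, it is sufficient to take the estimators of maximum likelihood of $p_i$, expressed in (6), and substitute them in $R_G(p)$.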
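The estimation step for a single edge can be sketched in Python. The sketch assumes a binomial-type likelihood in which $x_i$ co-authored publications are observed out of $n_i$ publications, so that the MLE is $x_i / n_i$ and the asymptotic (Wald) interval uses the variance $\hat{p}_i(1-\hat{p}_i)/n_i$. The function names and the data values are hypothetical, and clipping the interval to [0, 1] is an added convenience rather than something stated in the text.

```python
from math import sqrt
from statistics import NormalDist


def edge_mle(x, n):
    """Maximum likelihood estimate of an edge reliability: x co-authored out of n publications."""
    return x / n


def wald_interval(x, n, alpha=0.05):
    """Asymptotic 100(1 - alpha)% confidence interval for the edge reliability."""
    p_hat = edge_mle(x, n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    # Clip to [0, 1] (an assumption of this sketch), since p is a probability.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical data for one edge: 12 co-authored publications out of 40.
p_hat = edge_mle(12, 40)
lower, upper = wald_interval(12, 40)
```

With these illustrative numbers the estimate is 0.30 and the 95% interval is roughly (0.158, 0.442). By the invariance property, a point estimate of the network reliability follows from plugging the edge estimates into the reliability expression.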

Measurements of the nodes' centrality
Centrality measurements are employed in SNA to verify the relevance of a node with regard to the others in a network. Through centrality measurements, nodes may be ordered according to their relative importance. Since power is a relation-derived characteristic, it may be associated with centrality measurements, which show the distribution of power within a network and the capacity of nodes to dominate or influence other nodes. Different centrality measurements are used for different types of relevance (position, flow, influence and others). Among extant measures, the following were employed (SILVA, 2010; LYRA & OLIVEIRA, 2011):

Closeness measurement relates the total distance of a node to the other nodes of the network, or rather, it indicates how quickly a node reaches another one in the network and shows the nodes that need improvement. The closeness measurement of node i ($v_i$) is calculated by

$$C_c(v_i) = \sum_{j=1}^{m} d(v_i, v_j),$$

where $d(v_i, v_j)$ is the distance between node i ($v_i$) and node j ($v_j$) and m is the number of nodes in the network. The most central item of the network has the lowest rate, or rather, it is the item that communicates with the highest speed with the other items of the network due to its structural position.

Information degree measurement gives relevance to a node due to the number of direct bonds that it establishes with the other nodes of the network. In other words, it evaluates the direct interference (or immediate effect for time $t + 1$) of a node on the others by the number of paths of unit length originating from the node. The information degree measure of node i ($v_i$) is calculated by

$$C_d(v_i) = \frac{d(v_i)}{m - 1},$$

where $d(v_i)$ is the degree of node i and m is the number of nodes in the network.

1 If the limiting distribution of an estimator of a parameter is known, then, by the Delta method, the limiting distribution of a function of this estimator is also known. The function should satisfy certain conditions, such as being continuous and differentiable.
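Both measures can be computed with a breadth-first search. The sketch below takes closeness to be the sum of shortest-path distances from a node (so the lowest value marks the most central node) and information degree to be the degree divided by m - 1; both definitions, the function names and the star-shaped example are assumptions made for illustration.

```python
from collections import deque


def distances_from(source, adj):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness(adj):
    """Sum of distances from each node to all others; lower means more central."""
    return {v: sum(distances_from(v, adj).values()) for v in adj}


def information_degree(adj):
    """Number of direct bonds of each node divided by m - 1 (assumed normalization)."""
    m = len(adj)
    return {v: len(adj[v]) / (m - 1) for v in adj}


# Hypothetical star network: researcher "a" co-authors with "b", "c" and "d".
adj = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
```

In this star, "a" has closeness 3 and information degree 1.0, while each leaf has closeness 5 and information degree 1/3, so "a" is the most central node under both measures.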

Application, Results and Discussion
The graph modeling the scientific co-authorship network under analysis (CEPEAGRO) was automatically generated by script Lattes V7.02 according to the characteristics of collaboration between the researchers of the group (co-authorship in scientific publications). Only articles in scientific journals, books and papers in scientific events were listed. They were filed and published by CEPEAGRO researchers on the Lattes Database from the date of their inclusion in the group up to August 2012 (time t). Researchers who left the group (at any moment since its establishment) were not taken into account. A network of scientific co-authorship modeled by an undirected, simple, connected graph G was obtained, with k = 8 edges or co-authorship relations and m = 7 nodes or researchers.
Figure 1 - Graph G modeling the scientific co-authorship network.
According to Figure 2, eighteen connected sub-graphs may be formed from graph G of Figure 1. Given the configuration of graph G, it is impossible to form connected sub-graphs with five or fewer edges. Therefore, when all edges have the same reliability $p$, the reliability of the network is given by

$$R_G(p) = S_8 \, p^8 + S_7 \, p^7 (1 - p) + S_6 \, p^6 (1 - p)^2,$$

where $S_i$ is the number of connected sub-graphs with i edges and $S_8 + S_7 + S_6 = 18$. Table 1 shows the results of simulations for different values of $p$. The reliability of the co-authorship network increases with the reliability of each edge or co-authorship relation, that is, with $p$. Due to the configuration of this group and the existing co-authorship relations, a probability of edge failure over 0.7 ($p < 0.3$) causes the network reliability (the probability that the group remains active during time t, August 2012) to be close to zero.
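The behavior described above can be reproduced on a stand-in graph. The graph in the sketch below is hypothetical: it was chosen only because it has 7 nodes, 8 edges and eighteen connected sub-graphs with six or more edges, and it is not claimed to be the graph of Figure 1. Node labels and function names are likewise illustrative.

```python
from itertools import combinations

# Hypothetical stand-in graph: two nodes u, v joined by paths of lengths 1, 2 and 3,
# plus one pendant node attached to each of them (7 nodes, 8 edges).
EDGES = [("u", "v"), ("u", "x"), ("x", "v"), ("u", "y"),
         ("y", "z"), ("z", "v"), ("u", "s"), ("v", "t")]
NODES = sorted({node for edge in EDGES for node in edge})


def is_connected(nodes, edges):
    """True if every node is reachable from the first node using the given edges."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)


def connected_counts(nodes, edges):
    """S_i for i = m-1, ..., k, obtained by enumerating edge subsets."""
    return {
        i: sum(is_connected(nodes, sub) for sub in combinations(edges, i))
        for i in range(len(nodes) - 1, len(edges) + 1)
    }


def reliability(p, counts, k):
    """R_G(p) for a common edge reliability p."""
    return sum(s * p**i * (1 - p) ** (k - i) for i, s in counts.items())


S = connected_counts(NODES, EDGES)
table = {p: reliability(p, S, len(EDGES)) for p in (0.3, 0.5, 0.7, 0.9)}
```

For this stand-in graph the enumeration gives $S_6 = 11$, $S_7 = 6$ and $S_8 = 1$, and the reliability at $p = 0.3$ is about 0.005, in line with the observation that the network reliability is close to zero when $p < 0.3$.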

Statistical inference
Let each edge i, $i = 1, 2, \ldots, 8$, of the graph G that models the co-authorship network have its reliability denoted by $p_i$.
According to the eighteen connected sub-graphs from G, the generic reliability expression of the network is given by:

$$R_G(p) = \sum_{E' \in \mathcal{C}} \; \prod_{i \in E'} p_i \prod_{i \in E \setminus E'} (1 - p_i),$$

where $\mathcal{C}$ is the collection of edge sets of the eighteen connected sub-graphs. The estimation of the reliability of each edge or relation of co-authorship i ($p_i$, $i = 1, 2, \ldots, 8$) and of the reliability of the network $R_G(p)$ (research group in activity in August 2012) was undertaken by the maximum likelihood method.
Consequently, the set of data for the estimation process was assembled from the information on scientific publications (papers in scientific journals, books and papers in events) of the research group CEPEAGRO, obtained from script Lattes V7.02 and directly confirmed by the researchers. The maximum likelihood estimates $\hat{p}_i$, $i = 1, 2, \ldots, 8$, and their respective $100(1-\alpha)\%$ confidence intervals (asymptotic) are expressed by

$$\hat{p}_i = \frac{x_i}{n_i}, \qquad \hat{p}_i \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p}_i)},$$

where $\widehat{\mathrm{Var}}(\hat{p}_i)$ are the i-th terms of the diagonal of the inverse Fisher information matrix of expression (7), given by:

$$\widehat{\mathrm{Var}}(\hat{p}_i) = \frac{\hat{p}_i (1 - \hat{p}_i)}{n_i}.$$

According to expressions (14) and (15), the MLE and the respective 95% confidence intervals for $p_i$, $i = 1, 2, \ldots, 8$, are given in Table 3.

Centrality measurements of closeness and information degree for the nodes of graph G (Figure 1) were calculated, as Table 6 demonstrates. According to these measurements, the most central nodes or researchers of graph G are respectively "a" and "c", that is, the researchers with the highest speeds of access and with the greatest influence on the others. Although statistical inference shows that the co-authorship relation between researchers "a" and "c" is the least reliable, if it were removed from the graph, the scientific co-authorship network would be less connected and, consequently, its reliability would be severely compromised, since some paths would no longer exist.
The least central nodes of graph G are respectively "e" and "f". According to tests by some authors (LYRA & OLIVEIRA, 2011; OLIVEIRA, BRIGANTINI & UEHARA, 2013; SILVA, 2010), if the aim is to make the research group more reliable with the insertion of a new edge or co-authorship relation, the centrality measurements indicate that a bond between researchers "e" and "f" may bring such improvement. When edge i = 9 with fictional reliability $p_9 = 0.20$ is inserted between nodes "e" and "f" (Figure 3), with the other edges having reliabilities $p_i$, $i = 1, 2, \ldots, 8$, equal to their respective maximum likelihood estimates in Table 3, and expression (14) is recalculated, the network's reliability increases approximately 3.21-fold compared with the reliability of the network without the insertion of the above-mentioned edge.

Final Considerations
Studies on the reliability of scientific co-authorship networks identify which networks are reliable from different approaches (edges and/or nodes) according to the participation of researchers and the intensity of existing co-authorship relations. Current investigation proposes a classical inference approach for the reliability of a co-authorship network with a specific focus on edges (co-authorship relations), or rather, taking into consideration perfectly reliable nodes (researchers). Further, centrality measurements of nodes were obtained that identified the situation in which the insertion of an edge between two researchers provided a significant increase in the reliability of the network, that is, in the probability that the research group remains active during a given time t. The example provided showed that the calculation of the reliability of a co-authorship network may be laborious, whether executed manually or by computer. The employment of centrality measurements may be considered a feasible alternative. However, some studies have shown that such measures may be an auxiliary alternative but are not entirely reliable when investigating an increase in a network's reliability (LYRA & OLIVEIRA, 2011; OLIVEIRA, BRIGANTINI & UEHARA, 2013; SILVA, 2010). Consequently, the use of other centrality measurements and the execution of simulations for more trustworthy results are recommended in addition to the employment of these measurements.
i-th term of the diagonal of the inverse Fisher information matrix (CASELLA & BERGER, 2010). Since the reliability of the network $R_G(p)$ is a function of the reliability of the individual components
information expressed by (7). By the Delta method 1 (SEM, SINGER & PEDROSO-
distance between node i ($v_i$) and node j ($v_j$); m is the number of nodes in the network. The most central item of the network has the lowest rate

Figure 2 - Connected sub-graphs of G (Figure 1) with eight, seven and six edges.
Let us consider a (fictional) situation where all the edges of graph G have the same reliability (denoted by p), or rather, all co-authorship relations contribute equally to the group.
reliability of the i-th edge and may be estimated by the maximum likelihood method, an estimation technique very common in statistical inference. The likelihood principle holds that, if the model is correctly specified, all information from the data on the parameters is contained in the likelihood function. The method, therefore, selects the estimators of the model's parameters that maximize the probability of obtaining the sample actually observed.
$R_G(p)$ (OLIVEIRA & ACHCAR, 2000)
of the publications of researcher r and of researcher s for the research group and $x_i$ is the number of co-authored publications of researchers r and s for the research group, and m is the number of nodes of graph G, or rather, of researchers that form the network (OLIVEIRA & ACHCAR, 2000).
equal to zero, the maximum likelihood estimators (MLE) of

regard to researchers of the research group CEPEAGRO.
is the total (sum) of publications of researchers r and s for the research group; $x_i$ is the number of co-authored publications of researchers r and s for the research group, in which