Neither Better nor Worse, Simply Different

Cláudia Medina Coeli 1, Rejane Sobrino Pinheiro 1, Marilia Sá Carvalho 2
Have you ever suffered discrimination because you used secondary data in your research? Since the principal area of research for this article's three authors involves the development and application of techniques for using secondary data, our answer is definitely no. However, we frequently hear complaints by colleagues who have encountered barriers to developing their theses or obtaining research funding because they opted to use secondary data.
A recent article by Rothman 1 discusses six erroneous perceptions regarding aspects of epidemiological research that are often reinforced in classrooms and textbooks. Although the author did not discuss data sources, we believe that the list should add a seventh misconception: the notion that primary data are the only valid source for epidemiological studies.
Population, vital, epidemiological, administrative, and clinical data have undergone important changes in their production and dissemination. They are now available in online databases that include millions of individual micro-data. In addition to the above-mentioned traditional sources, other modalities have emerged. The digital trails produced in accessing different web-based communication platforms and mobile phones have been used in studies about how patterns of behavior and mobility influence the determination and spread of diseases 2 .
Secondary data have the potential to back studies on highly relevant public health issues, particularly due to their wide availability, scope, and coverage. They are actually the best data to answer questions on the determinants of incidence rates in populations, as suggested by Rose 3 . Even so, it is important to discuss how the two worlds are brought together. For example, gene-environment interaction requires the use of increasingly larger study populations. The context of "big epidemiology" 4 stimulates the practice of "data sharing", whereby the data collected for specific studies are used by researchers not originally involved in their planning and execution.
The age of "big data" has brought about the recommendation of using this wealth of data in research 5 , including population health research 6 . However, several authors have emphasized the need for responsible use of such databases 7 . The main criticisms aimed at secondary data sources are the absence of mechanisms for data quality assurance and control and the lack of necessary variables for adequately testing causal hypotheses at the individual level.
Quality is a crucial issue. One should evaluate the different dimensions of quality 8 before using a secondary data source. Meanwhile, database custodians should employ techniques to prevent, detect, and repair errors 9 and make extensive documentation available on their data collections. Financing infrastructure for data management and access is an essential element in policies to encourage the use of secondary data 5,6 . In relation to the variables available for analysis, the integration of databases through record linkage techniques 10 can contribute to better specification of exposure and outcome variables, in addition to expanding the number of variables available for adjustment for confounding. In addition, some methodological solutions have been proposed to mitigate the problem of unmeasured confounding factors 11 . Finally, interest has grown in answering non-etiological questions, which do not require adjustment for confounding. One example is the evaluation of public health interventions, which can be addressed using different types of data together with new analytical techniques such as data mining and computational modeling of complex systems 2,6,10 .
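To make the idea of record linkage concrete, the following is a minimal sketch of deterministic linkage between two toy data sets, assuming that both share a name and a date of birth. The field names, the records, and the helper functions (`normalize`, `link_key`, `link_records`) are hypothetical and serve only as illustration; applications with real databases often rely on probabilistic methods and require careful handling of typing errors and homonyms.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse whitespace in a name."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def link_key(record):
    """Build the linkage key: normalized name plus date of birth."""
    return (normalize(record["name"]), record["birth_date"])


def link_records(exposure_records, outcome_records):
    """Join two lists of records on the linkage key.

    Only unambiguous matches (exactly one candidate) are kept, so
    homonyms born on the same day are left unlinked rather than guessed.
    """
    index = {}
    for rec in outcome_records:
        index.setdefault(link_key(rec), []).append(rec)

    linked = []
    for rec in exposure_records:
        matches = index.get(link_key(rec), [])
        if len(matches) == 1:
            # Fields from the exposure record take precedence on conflicts.
            linked.append({**matches[0], **rec})
    return linked


# Hypothetical toy data: hospital admissions and a mortality registry.
hospital = [
    {"name": "Maria  José da Silva", "birth_date": "1950-03-12", "diagnosis": "I21"},
    {"name": "João Pereira", "birth_date": "1962-11-30", "diagnosis": "J18"},
]
deaths = [
    {"name": "MARIA JOSE DA SILVA", "birth_date": "1950-03-12", "death_date": "2011-07-01"},
    {"name": "Ana Souza", "birth_date": "1971-01-05", "death_date": "2012-02-10"},
]

linked = link_records(hospital, deaths)
for row in linked:
    print(row["name"], row["diagnosis"], row["death_date"])
```

In this sketch the linked record carries both the exposure-side variable (diagnosis) and the outcome-side variable (date of death), which is the sense in which linkage expands the set of variables available for analysis.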
Beyond the methodological issues, responsible use should also contemplate respect for privacy. This requires the development of an ethical framework that considers the specificities of research based on secondary data, especially informed consent 12 . Brazil recently passed Law no. 12,527, regulating access to public information 13 . Care should be taken to prevent overly conservative interpretations of the law from resulting in unnecessary restrictions on the disclosure of anonymous database contents or on access to identified databases (while maintaining the necessary safeguards). According to a study by the U.S. National Research Council, the American legislation governing health information transfer (HIPAA Privacy Rule) had negative impacts on relevant research for public health 14 . In Brazil, the legislation should seek a balance between individual rights and collective interests to avoid jeopardizing studies that aim to improve health, healthcare, and living conditions for users of the Unified National Health System.
The use of secondary data in research requires investments in human resource training. If, on the one hand, research teams increasingly need to incorporate information technology professionals, on the other, we need public health researchers capable of interacting with them, as interactive experts as defined by Collins et al. 15 . The necessary skill set and minimum expected level of expertise remain open questions. Relevant contents include SQL (Structured Query Language), record linkage, unstructured data integration, data mining, and computational modeling of complex systems.
We finished this paper in Rio de Janeiro during Carnival, which features the parade of samba schools as one of the city's most important tourist events. The article's title was inspired by a samba refrain coined in the 1960s by Nelson de Andrade, then-president of the Salgueiro samba school. The original refrain, "Neither better nor worse, simply a different School", was meant to highlight a creative revolution in Rio's Carnival led by Fernando Pamplona and Arlindo Rodrigues 16 . Secondary data represent a valuable source for research in public health. Taking maximum advantage of the data also requires a revolution: thinking differently, training differently, and doing differently.