The experimental method in Public Administration: lessons from replication in Psychology

Christina W. Andrews, Michiel S. de Vries

Abstract

In recent years, an increasing number of studies adopting the experimental method have appeared in Public Administration journals. It has been argued that the advantage of experiments in behavioral sciences is that researchers can control contextual factors while investigating the effect of manipulations on a variable of interest. Another point is that experiments can be replicated and, thus, increase confidence in research results. However, replications are rarely undertaken, especially in the behavioral sciences. This article examines the results of the “Open Science Reproducibility Project: Psychology,” which replicated 100 experiments previously published in leading Psychology journals. Based on the findings of this project, we present seven recommendations to Public Administration scholars that can improve the quality of their experiments.

Keywords:
experimental method in the behavioral sciences; research methods in public administration; reproducibility

1. INTRODUCTION

Grimmelikhuijsen, Jilke, Olsen and Tummers (2017) noted that eminent Public Administration scholars such as Herbert Simon and Dwight Waldo stressed the importance of Psychology in Public Administration research, but that this has been acknowledged more widely only recently. The authors also observed that “between 3 percent and 11 percent of all published articles [in Public Administration Journals] are informed by Psychology, a share that has been increasing in recent years” (Grimmelikhuijsen et al., 2017, p. 46). Social Psychology has an especially close affinity to Public Administration because both fields seek to understand how the social context influences individual behavior (Sobis & De Vries, 2014).

More recently, Psychology has inspired Public Administration research not only regarding theoretical perspectives but also regarding method. For example, Tepe and Prokop (2017, p. 159) argue that “[...] the setup, conventions, and measurement techniques in psychological experiments provide optimal conditions to explore cognitive evaluation and decision processes” in public management. It should be noted that Social Psychology is split between two competing epistemologies: “social constructionism” and “experimental social psychology” (Jost & Kruglanski, 2002). The latter has been the dominant methodological approach and the one that has influenced Public Administration research; some scholars even speak of an “experimental turn” in the field (Jilke, Van de Walle & Kim, 2016). According to Bouwman and Grimmelikhuijsen (2016), the reason why Public Administration research could benefit from experimental designs is the possibility of controlling for endogeneity and simultaneity. The first problem emerges in the analysis of observational data due to intervening variables that usually remain unaccounted for. The simultaneity problem arises when the researcher cannot determine the direction of the cause-effect relationship.

Experiments have long been regarded as advantageous because they allow the researcher to control the variables of interest.[1] This is an important point, but one has to acknowledge that the experimental method in the behavioral sciences presents some challenges. As we will discuss in detail below, recent attempts to replicate experiments in the field of Psychology have raised a number of questions about the external validity of research outcomes, i.e., whether results hold when measurements are made in similar but not identical situations, and about whether variables have been kept sufficiently under control.

[1] In this article we consider only the experimental methods that involve a control group and the manipulation of one or more variables, including laboratory and survey experiments. Although field experiments also involve the manipulation of one or more variables, they are not discussed here due to the specific features of the method and its relatively limited use vis-à-vis other experimental methods. For a discussion of field experiments, see Baldassarri and Abascal (2017).

Recent recommendations for conducting experiments in Public Administration have focused mainly on the advantages of the method (see Grimmelikhuijsen et al., 2017; James, Jilke & Van Ryzin, 2017). Although some limitations of the experimental method have been discussed (e.g., Van de Walle, 2016), as more Public Administration scholars consider the experimental method to advance the field’s knowledge, it is important to discuss this methodological approach in light of recent findings in the field of Psychology. Thus, our main reference for this discussion will be the “Open Science Reproducibility Project: Psychology” (Open Science Collaboration, 2015), a collaborative project that replicated 100 Cognitive and Social Psychology studies.

This article aims to answer the following questions: What lessons can Public Administration scholars learn from the successes and failures of experiments in Social Psychology? Which precautions should these researchers take when opting for the experimental method to investigate problems in Public Administration?

Before we begin our discussion, it is important to clarify the meanings of key terms used throughout this article. The term “replication” or “replicated” refers to experimental studies that follow the same methodological procedures as the original study, regardless of the outcome. The term “reproduction” or “reproduced” means that a study was replicated and yielded the same results as the original, i.e., that the replication was successful (see Open Science Collaboration, 2015).

In the next section, we examine the methodological characteristics of the experimental method in the context of the behavioral sciences. In section 3, we discuss the results of the “Open Science Reproducibility Project: Psychology” (Open Science Collaboration, 2015) and outline the lessons they bring to Public Administration research. Finally, in section 4, we present a synthesis of our recommendations and final remarks.

2. THE PRESUMED MERITS OF THE EXPERIMENTAL METHOD

The experimental method is now dominant in Social Psychology research and is gaining traction in Public Administration. It is a classic and well-developed method that has amassed countless successes in several scientific fields. Experiments enable the researcher to hold many contextual and interfering factors under control and thus to assess the impact of one or more factors on the variable of interest (see Shadish, Cook & Campbell, 2002). The classic experimental design, the “posttest-only” design, has at least one experimental group and one control group. Of utmost importance in this method is the control group, which should be identical to the experimental group in all relevant aspects in order to ensure that changes in the outcome variable are due only to the experimental manipulation of the factor. Researchers try to achieve this by selecting participants who share the same demographic characteristics (age, gender, education, race) and by randomly assigning subjects to either the experimental or the control group.
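The random assignment step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the participant labels, the function name `randomly_assign`, and the fixed seed are hypothetical and are not taken from any of the studies discussed here.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two equally sized groups.

    The seed is optional; fixing it makes the assignment repeatable,
    which helps when documenting the procedure for a later replication.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Hypothetical pool of 20 participants labelled P01..P20.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participants, seed=42)
print(len(groups["experimental"]), len(groups["control"]))
```

Because chance alone decides who lands in which group, observed and unobserved characteristics are expected to balance out across groups on average, which is the property the posttest-only design relies on.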

A second merit of the experimental method is that it requires operational definitions, that is, the researcher needs to use well-defined variables. The underlying research problem might appear abstract, e.g., whether training increases public servants’ empathy towards service users. That is why the researcher must first specify what is to be regarded as “training” and “empathy” and how these factors are to be measured before testing her hypothesis. This operational definition is what allows the replication of experiments. It is expected that the replication will provide the same outcome as the original experiment when the variables are operationally defined and the experimental design is the same.

The characteristics of the experimental method (controlling for confounding variables, operational definition of variables, and enabling replication) can only work appropriately on the assurance that research findings are objective and unrelated to the personal opinion, bias, or prejudice of the researcher.

Notwithstanding their advantages, experiments are liable to some pitfalls. First, although experiments presuppose that results can be reproduced, replication is rarely pursued. We will discuss why this is the case further below.

Brandt et al. (2014, p. 218) have presented the following steps for replicating experiments in the field of Social Psychology:

  1. “Carefully defining the effects and methods that the researcher intends to replicate;

  2. Following as exactly as possible the methods of the original study (including participant recruitment, instructions, stimuli, measures, procedures, and analyses);

  3. Having high statistical power;

  4. Making complete details about the replication available, so that interested experts can fully evaluate the replication attempt (or attempt another replication themselves);

  5. Evaluating replication results, and comparing them critically to the results of the original study.”

Even when such procedures are followed, the outcomes of replications can be disappointing. The academic literature offers a few explanations of why the reproduction of experiments is often frustrated. One explanation is known as the “Experimenter Bias Effect” (EBE). Rosenthal and Fode (1963) noted that experiments in which the researcher was convinced of the correctness of the underlying hypothesis corroborated it more often than experiments run by researchers who were doubtful about it. However, failures to replicate the EBE have also been reported (Barber et al., 1969; Jacob, 1968).

Another, and more important, explanation for the failure to reproduce experimental results is “publication bias”. Because academic journals understandably seek to publish innovative findings, studies that confirm a given hypothesis find their way into academic journals more easily than null results (Ioannidis, Munafo, Fusar-Poli, Nosek & David, 2014; Nosek, Spies & Motyl, 2012). However, novel and unexpected results are also more likely to be statistical flukes (see Backer, 2016). A more general problem is that replications of experiments, whether successful or unsuccessful, are unlikely to appear in the pages of well-regarded research journals.

Replications may show that results do not hold in a context other than that of the original experiment, exposing the lack of external validity of the findings. Experiments involving subjects with specific features (e.g., undergraduate students) may yield results that are not valid for other types of subjects (such as public servants). Moreover, in the behavioral sciences, experimental findings may be valid within a specific cultural setting but not in other contexts.

Notwithstanding the benefits that the experimental method can bring to Public Administration research, it is worth looking at some of the issues that the “Open Science Reproducibility Project: Psychology” has exposed (Open Science Collaboration, 2015).

3. LESSONS FROM THE “OPEN SCIENCE REPRODUCIBILITY PROJECT: PSYCHOLOGY”

In recent years, the field of Social Psychology has witnessed several reports of research misconduct that resulted in a number of publication retractions and the destruction of academic careers (see Van Kolfschooten, 2014). Since the scandals broke out, researchers have begun to wonder whether “a scientific culture that too heavily favors new and counterintuitive ideas over the confirmation of existing results” was to blame (Carpenter, 2012, p. 1558). This context paved the way for a large-scale replication project known as the “Open Science Reproducibility Project: Psychology” (Open Science Collaboration, 2015). Launched in 2012, this project involved more than 270 researchers from several institutions around the world and sought to replicate 100 Psychology studies published in 2008 in three of the field’s most respected journals.[2] In order to assess the success or failure of the attempts to replicate the original experiments, the replication teams adopted several criteria, including “significance and P values, effect sizes, subjective assessments of replication teams, and meta-analyses of effect sizes” (Open Science Collaboration, 2015, p. aac4716-2).[3]

[2] The reports for all replications included in the Reproducibility Project are available at the Open Science Framework website (Retrieved from https://osf.io/ezcuj/).

[3] More details on the statistical methods used to evaluate the results of the replication effort can be found in the “statistical analysis” section (Open Science Collaboration, 2015, pp. aac4716-2-aac4716-4).

The results of the Reproducibility Project sparked a heated debate among Psychology scholars and beyond. While 97% of the original studies had significant results (P < 0.05), only 36.1% of the replications reached this standard. The investigation teams also found that the mean effect size in the replications was about half of that found in the original studies (M = 0.197, SD = 0.257 vis-à-vis M = 0.403, SD = 0.188). Replication results showed that Cognitive Psychology experiments were far more “reproducible” than Social Psychology studies. While 50% of the Cognitive Psychology experiments were reproduced at the P < 0.05 criterion, only 25% of the Social Psychology experiments met this criterion. The results of the replications of the original studies published in the Journal of Personality and Social Psychology (JPSP) were even more disappointing: in the original experiments the mean effect size was 0.29 (SD = 0.10), while for the corresponding replications it was only 0.07 (SD = 0.11), i.e., more than four times smaller than in the original studies.[4]

[4] A more recent study that replicated 21 social science experimental studies previously published in Nature and Science found that 62% of the replications were in the same direction as the original studies and that the average effect sizes were 50% of those of the original studies (Camerer et al., 2018).
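As a quick arithmetic check on the effect sizes reported above, the two comparisons ("about half" and "more than four times smaller") can be recomputed from the published means. The variable names below are ours; only the four mean values come from the text.

```python
# Mean effect sizes as reported in the text (Open Science Collaboration, 2015).
original_all, replication_all = 0.403, 0.197    # all studies
original_jpsp, replication_jpsp = 0.29, 0.07    # JPSP studies only

# Share of the original mean effect size that survived replication.
retained_share = replication_all / original_all

# How many times larger the original JPSP effect was than its replication.
jpsp_shrinkage = original_jpsp / replication_jpsp

print(f"Replication effect as share of original: {retained_share:.2f}")
print(f"JPSP original-to-replication ratio: {jpsp_shrinkage:.2f}")
```

The first ratio is just under one half and the second is slightly above four, which is consistent with the wording used in the paragraph.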

It is important to stress that the Reproducibility Project: Psychology was the first replication effort to be conducted at this scale. Thus, there is no previous benchmark against which the outcomes of these replications could be compared. The jolt that followed the publication of the project’s results is likely to have more to do with the unrealistic expectations of researchers than with a sober-minded assessment of the matter. The authors of the Reproducibility Project were quick to clarify this point:

Because reproducibility is a hallmark of credible scientific evidence, it is tempting to think that maximum reproducibility of original results is important from the onset of a line of inquiry through its maturation. This is a mistake. If initial ideas were always correct, then there would hardly be a reason to conduct research in the first place. A healthy discipline will have many false starts as it confronts the limits of present understanding (Open Science Collaboration, 2015, p. aac4716-7).

As the authors of the replication effort pointed out, neither a successful nor an unsuccessful replication could provide definitive answers regarding the original experimental results (Open Science Collaboration, 2015). Reproduction does not imply that the theoretical interpretation is correct, but only that the results appear to be reliable. On the other hand, failure to reproduce does not mean that the original finding is a false positive. “Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect”; in addition, “unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes” (Open Science Collaboration, 2015, p. aac4716-6). In the end, the authors of the replication effort concluded that the project could not establish whether any of the studies’ effects were true or false, adding that only the cumulative results from multiple replications could validate the effects of the original studies.

This brings us to the first recommendation for Public Administration scholars emerging from the outcomes of the Reproducibility Project: considering that reproducibility is an essential component of experimental research, it should be a component of the research design from the start. This means that scholars should engage in collaborative research projects in which different teams of researchers conduct the same experiment using the same methodological procedures. The findings emerging from collaborative investigations would be more likely to find their way into respected academic journals. In addition, because researchers in collaborative projects need to agree on what to investigate, these studies would be more likely to focus on issues relevant to the field of Public Administration.

As mentioned above, Social Psychology studies were less likely to be reproduced than Cognitive Psychology studies (Open Science Collaboration, 2015). One possible reason for this outcome is that the latter tended to use within-subjects research designs and repeated measurements more often than the former. It may be too early to conclude that a within-subjects design is the best approach for experimental studies in Public Administration, but the matter should not be ignored altogether. Therefore, a second recommendation is that Public Administration scholars should investigate the effect of different research designs (between-subjects versus within-subjects, posttest-only versus pretest/posttest)[5] on research outcomes.

[5] In the posttest-only experimental design, the variable of interest is measured in the control group and in the experimental group after the experimental manipulation takes place. In the pretest-posttest experimental design, the variable of interest is measured in the control group and in the experimental group before and after the manipulation in the experimental group takes place. This allows the experimenter to assess the baseline measurement of the experimental group, as well as to identify any influence of the experiment on the data, increasing the reliability of the results. See American Psychological Association (APA, n.d.).

We will now examine the outcomes of a selection of studies included in the “Open Science Reproducibility Project: Psychology”. We selected 35 replication reports using the following criteria: (a) all the replications of studies originally published in the Journal of Personality and Social Psychology, due to the connections between Social Psychology and Public Administration; and (b) replications of studies investigating topics that are relevant to the field of Public Administration (values, optimism, communal responsiveness, and conflict). The Excel file listing the 35 original studies of our sample, the replications’ outcomes, and excerpts from the replication reports is available at the Open Science Framework storage website: https://osf.io/ta746/.

One aspect worth noticing in this sample is that, in many cases, the replication experimenters did not have a straightforward answer as to whether the study was indeed reproduced or not. Although the replication teams applied the evaluation criteria established by the Reproducibility Project, the comments included in the reports show that the picture is less “black-and-white” than one would initially expect. Sometimes the team was able to replicate the main effect, but not some of the additional effects (most studies included multiple experiments). In other replications the effect was in the same direction as in the original study, but the results did not pass the significance criterion (P < 0.05). The replication teams usually used larger samples than the original studies, ensuring enough power to detect the alleged effect; in two cases, however, experimenters admitted that their replications lacked sufficient power and, for this reason, considered the replication results inconclusive.

This brings us to the debate about P values. This debate is not new, going back to Rozeboom’s criticism of the null-hypothesis significance test (NHST) (Rozeboom, 1960); the controversy seems far from being settled (see Harlow, Mulaik & Steiger, 2016). The concern over P values escalated to the point of moving the American Statistical Association to issue guidelines for their use; it was the first time that the association, founded in 1839, had issued such guidelines (Wasserstein & Lazar, 2016). The editors of the journal Basic and Applied Social Psychology went as far as banning the use of P values from the articles published in the journal (Trafimow, 2014; Trafimow & Marks, 2015).

Many statisticians have argued that the P value cannot tell us anything about the veracity of a given hypothesis. According to Goodman (2008, p. 136), this is due to the very definition of the P value: “The probability of the observed result, plus more extreme results, if the null hypothesis were true”. The P value can only support a statement about whether the null hypothesis is to be rejected or not, but not about the actual veracity of the alternative hypothesis. According to Goodman, this is just one of the many misinterpretations involving the P value. He argues that Fisher, the mathematician who introduced NHST, used the term “significance” to mean “worthy of attention in the form of meriting more experimentation, but not proof in itself” (Goodman, 2008, p. 135). Thus, P < 0.05 does not warrant that H1 is true. Goodman maintains that the “marriage” between the P value and hypothesis testing was an “unnatural union”. Benjamin et al. (2018) argue that the P < 0.05 threshold yields too many false positives and that the standard for claims of new discoveries should be tightened to P < 0.005. Amrhein and Greenland (2018, p. 4) replied that this would only aggravate current problems and proposed instead that “[…] statistics reform should involve completely discarding ‘significance’ and the oversimplified reasoning it encourages”. If one wants to demonstrate that a reliable effect exists, then one should show that there is a robust effect size.

Therefore, our third recommendation for conducting experiments in the field of Public Administration is to focus on the effect size and not on the P value (relevance versus significance). This recommendation may not be easy to follow in the current academic context due to the widespread use of NHST. Experiments cost time and money, and scholars hold the reasonable expectation that if the P < 0.05 criterion is satisfied, then there is sufficient reason to publish the results. However, this expectation is also what nudges researchers to engage in P-hacking (see Lindsay, 2020). In the end, only a strong effect size, along with replications, can support a given hypothesis, unless the main hypothesis is that there is no effect, which raises yet another problem.
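The distinction between relevance and significance can be illustrated with a small calculation. The sketch below is ours and rests on simplifying assumptions: two equally sized groups, a standardized mean difference d, and a normal approximation in place of the t distribution. The function name `two_sided_p` and the chosen values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Two-sided P value for a standardized mean difference d.

    Normal approximation for two equal-sized groups with unit variance;
    this is an illustrative simplification, not a full t-test.
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same very small effect (d = 0.05) evaluated at three sample sizes.
tiny_effect = 0.05
for n in (50, 500, 5000):
    print(f"n per group = {n:5d}  d = {tiny_effect}  P = {two_sided_p(tiny_effect, n):.4f}")
```

Under these assumptions the effect size never changes, yet the P value drops below 0.05 once the groups are large enough. A significant result may therefore say more about the sample size than about the practical relevance of the effect.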

Howell (2012, p. 230) noted that experimenters “[…] have only a small chance of finding the effect they are looking for, even if such an effect does exist in the population”. This is the reason why it is important that experiments have enough power. Sufficiently powered experiments are more likely to correctly reject H0, reducing the occurrence of type II error. However, if it is not easy to detect an effect, it might be tempting to engage in the “proof of the null hypothesis”, which is another issue that springs from NHST. It is worth illustrating this point with an example from the field of Public Administration.
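To make the notion of power concrete, the following sketch approximates the power of a two-sided comparison of two equally sized groups. It is a simplified illustration based on a normal approximation, not the procedure used in any of the studies discussed; the function name `approx_power` and the example values are assumptions of ours.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized mean difference d.

    Two-sided test, two equal-sized groups, normal approximation.
    Power is the probability of rejecting H0 when the true effect is d.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # critical value of the test
    shift = d * sqrt(n_per_group / 2)    # expected value of the test statistic
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# A medium effect with a moderate sample versus a small effect with a small sample.
print(f"d = 0.5, n = 64 per group: power = {approx_power(0.5, 64):.2f}")
print(f"d = 0.2, n = 20 per group: power = {approx_power(0.2, 20):.2f}")
```

With a small true effect and few participants per group, the approximate power falls far below the conventional 0.80 target, which is the situation Howell describes: the effect may exist and still go undetected.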

Moynihan (2013) designed an experiment to investigate whether higher levels of Public Service Motivation (PSM) (see Perry & Wise, 1990) were associated with budget maximization. According to the author, if this association could be demonstrated, Niskanen’s budget maximization theory would be vindicated (see Niskanen, 1968). It should be noted, however, that Niskanen did not assume that PSM was behind budget maximization. He argued that bureaucrats maximize their budgets because this would bestow prestige on them, which is another way of saying that bureaucrats are as self-interested as everyone else. Moynihan explained the twist in his assumption in the following terms: “Why is a public-spirit budget maximizer more plausible? Bureaucrats might seek to maximize budgets because they sincerely believe in the benefits of their programs” (Moynihan, 2013, p. 182). Although Moynihan’s assumption is disputable, we will focus only on the terms of his experiment. He recruited undergraduate students as subjects and applied the required experimental manipulations. The results did not show a significant linear correlation between budget allocation and levels of PSM, even when outliers were removed from the regression model; thus, H0 could not be rejected at P < 0.05. Moynihan stated that he had conducted a power analysis and that the sample size was sufficient to detect an effect, although he did not report the actual power of the experiment. Nevertheless, his conclusion went a long way: he argued that the experiment’s results were “[…] a significant non-significant finding” (Moynihan, 2013, p. 190) that dismantled “[…] another pillar for the budget maximization model” (Moynihan, 2013, p. 190). This is a classic case of the “proof of the null hypothesis”.

Fisher, in his classical 1935 book, The Design of Experiments, argued that “[...] the null hypothesis is never proved or established, but it is possibly disproved, in the course of experimentation.” (Fisher as cited in Lehmann, 2011, p. 64). Since then, statisticians have been warning against the “proof of the null hypothesis” misinterpretation (see M. P. Lecoutre, Poitevineau & B. Lecoutre, 2003). Failing to detect an effect when there is one is the definition of type II error, but this is not a proof of the null hypothesis. An underpowered experiment will increase the chances of type II error, but this, as expected, cannot prove that the null hypothesis is true. Even if the experimenter is committed to conducting a sufficiently powered experiment, it is still not possible to prove that the null hypothesis is true, due to the very definition of power, which is: “[…] the probability of correctly rejecting a false H0 when a particular alternative hypothesis is true” (Howell, 2012, p. 230). Thus, a powered experiment is one that may reject the null hypothesis, but it cannot prove that there is no effect. It is not even possible to calculate a sample size powered to detect a null effect, because the calculation requires assuming an effect size that is larger than zero (see Howell, 2012). The best alternative to the “proof of the null hypothesis” is to show that the effect size is too small to have any relevant implication for the research problem at hand. Equivalence and noninferiority testing may also be an option (see Streiner, 2003). Thus, our fourth recommendation is to avoid the “prove the null hypothesis” as an experimental design.
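The dependence of a power calculation on a nonzero effect size can be made concrete with a minimal sketch. It assumes a two-sided comparison of two group means and uses the common normal approximation n = 2((z_{1-alpha/2} + z_{power}) / d)^2 per group, where d is the standardized effect size. The function name and the default values for alpha and power are illustrative assumptions and are not taken from Howell (2012).

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    `d` is the assumed standardized effect size (a hypothetical input).
    """
    if d == 0:
        # With an assumed effect of zero, no finite sample reaches the target power.
        raise ValueError("effect size must differ from zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2, 0.05):
        print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because d appears in the denominator, the required sample grows without bound as the assumed effect approaches zero, which is why a sample cannot be "powered" to establish a null effect.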

What are the characteristics that make experiments in the behavioral sciences more likely to be reproduced? All the replications included in our purposive sample adopted a posttest design; thus it is not possible to infer whether this design had any influence on either a favorable or unfavorable outcome of the replication. However, as will be discussed below, there are reasons to recommend the use of pretest-posttest designs whenever possible.

Among the 35 replications we examined, 12 experiments were successfully replicated (34%), 21 were not reproduced, and two were considered inconclusive due to the small sample of the replication. Three of the successful replications used survey designs, while only one among those not replicated adopted this design. This may suggest that survey designs are more reliable than other experimental designs, but survey experiments can also be disappointing. Nosek et al. (2012) describe a survey experiment that included 1,979 participants. As the authors report, the initial results were the dream of any researcher; the hypothesis was supported and the results appeared to be robust and reliable. But, as a matter of caution, the authors decided to replicate the experiment, collecting data from another 1,300 participants. This time around the results were disheartening: “[t]he effect-size had vanished (P = 0.59)” (p. 616). Large samples can detect weak effect sizes (Streiner, 2003) but because effect sizes can be very weak, large samples are not necessarily a guarantee for successful replications.
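The remark that large samples can detect weak effects can be illustrated with a short sketch. It assumes two equal-sized groups and a large-sample z approximation for the standardized mean difference; the effect size and sample sizes are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Two-sided p-value for an observed standardized mean difference `d`
    between two equal-sized groups (large-sample z approximation)."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    d = 0.05  # hypothetical, very weak effect
    for n in (100, 1_000, 10_000):
        print(f"n = {n:>6} per group: p = {two_sided_p(d, n):.4f}")
```

The same tiny difference moves from clearly non-significant to highly significant as n grows, which is one reason to report the effect size alongside the P value.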

Overall, successfully replicated experiments in our sample tended to present simpler research designs, with fewer experimental conditions being tested. Nonetheless, there were cases of more complex designs that were reproduced and of simpler designs that were not.

Social Psychology experiments were more likely to be reproduced when the original effect sizes were robust and the experimental design used high-powered within-subjects manipulations and repeated measurements. Nevertheless, the fact that the mean effect size in the replication experiments was much smaller than in the original studies, especially for the Social Psychology studies, casts a shadow over large effect sizes as well. This is why one experimental study alone cannot yield a definitive answer; neither can one replication alone.

Overall, the Reproducibility Project showed that counter-intuitive results and studies that required several experiments and complicated manipulations were correlated with a lower likelihood of a successful replication (Open Science Collaboration, 2015). This leads to our fifth recommendation: whenever possible, Public Administration researchers should design experiments that are easy to replicate and avoid experimental studies that require complicated steps or too stringent assumptions. This recommendation can also lead to more straightforward results. Reality is complex and investigators strive for realistic results, which may require complex research designs. However, given that reproducibility is an important component of the experimental method, researchers need to balance complexity against reproducibility.

The failure to reproduce the original results in the replications included in our sample was attributed to the following reasons: difficulty in interpreting the original methods (Lewis & Pitts, 2015); differences in sample size (Reinhard, 2014); the effect of unknown moderators (Kelson, 2015; Johnson, Hayes & Graham, 2015); effects of small differences in the experimental procedures (Baranski, 2015; Holubar, 2015; Lane & Gazarian, 2015; Lin, 2013); and differences in participant demographics/profile/context (Brown et al., 2013; Embley, Johnson & Giner-Sorolla, 2015; Lemn, 2013; Lin, 2013; Marigold, Forest & Anderson, 2015; Mechin & Gable, 2015; Talhelm, Lee & Eggleston, 2015). Therefore, the two main reasons offered to explain the failures to reproduce the original results were different characteristics of participants and differences in the conditions of the experiments. The first issue is not necessarily a problem for Public Administration research; it is even important to know that subjects with different characteristics react differently to experimental manipulations. The second issue, however, is worrisome. Replication of Social Psychology experiments looks like a frail cairn: one has to carefully balance all the components to keep the whole structure from tipping over. Because Public Administration has a practical outlook, this can be a major problem. Experiments that cannot withstand the smallest variation in experimental procedures have little if any relevance for the field.

It should be noted that Social Psychology experiments rarely adopt pretest-posttest research designs; posttest-only designs predominate even in experiments that involve several manipulations. However, pretest measurements provide baseline information such as, for example, the initial level of Public Service Motivation in both the control and experimental groups (see Bellé, 2013). The importance of the pretest-posttest design can be illustrated by the replication of Vohs and Schooler’s (2008) original study. The original study, which argued that belief in free will enhances moral behavior, received great attention within the Social Psychology community for its far-reaching consequences. The news that it could not be reproduced was a disappointment among researchers. Neither the original study nor its replication included a pretest measurement. The two questionnaires relevant in one of the study’s two experiments, the Freewill and Determinism scale (FWD) and the Positive and Negative Affectivity Schedule (PANAS), were applied only after the participants had read the texts corresponding to the control and treatment manipulations (treatment = text affirming that free will is an illusion; control = neutral text about consciousness). However, if the control and treatment groups differed in regard to the baseline condition (belief in free will), then this should have been accounted for in the analysis of the results (see Bonate, 2000). This brings a dilemma: if participants had responded to the FWD questionnaire before the experimental manipulation was applied, this would likely affect the outcome of the experiment. This problem arises when the baseline variable is inside the heads of the subjects participating in an experiment. In clinical trials, baseline levels correspond to physiological factors that can be precisely measured; in such cases, the pretest does not influence the outcome of the posttest. However, in the behavioral sciences, the elaboration of experimental designs often involves difficult choices. All things considered, Public Administration scholars should be aware that, depending on the investigation at hand, the pretest-posttest experimental design may yield more reliable results than the posttest-only design. This brings us to the sixth recommendation: whenever possible, adopt a pretest-posttest design.

Given that Social Psychology experiments are difficult to reproduce, it is reasonable to assume that experimental research should not stand as the sole source of knowledge in the field of Public Administration. Fortunately, the field does not have a predominant research method, at least not yet, and has relied on many methodological procedures to construct its scholarship. A thoroughgoing “experimental turn” may be a bad idea for individual researchers and especially for the field as a whole.

Let’s look at two studies investigating a recurrent theme in public management: the issue of whether management tools yield the same results in the public and private sectors. Robertson and Seneviratne (1995) found that, overall, management change interventions were as effective in the public sector as in the private sector, but this result depended on the outcome variables considered. Banerjee and Solomon (2003, p. 119), in a study about the effectiveness of ecological labeling, argued that “[g]overnment run programs have been far more successful than the private ones. Government support to a labeling program not only increases its credibility and recognition, but also improves financial stability, legal protection and long-term viability”. The findings from these two studies would have been unlikely to emerge if the method of choice had been an experiment. To begin with, the concepts of private and public sectors imply so many different aspects that it would be almost impossible to select independent variables to test the effect on the dependent variable. Regime types; intergovernmental relations; central, regional or local government; organizational structure and culture; management and leadership characteristics; motivation and behavior of administrators and staff: these are only a handful of variables that may affect how well public servants respond to incentives, job pressures, tasks, and responsibilities. One may argue that the methods used in these studies are less precise than experiments. However, as argued above, using more than one method and different sources of data can render more reliable results and have more practical relevance. Thus, our seventh recommendation is that Public Administration researchers should avoid relying exclusively on the experimental method.

Finally, it is necessary to mention that the Reproducibility Project: Psychology results have generated some controversy. Gilbert, King, Pettigrew and Wilson (2016) argued that the project failed to account for random error, maintaining that, in addition to the 5% error expected due to the 95% confidence interval, the replications used samples from different populations and, in some cases, the replication procedures did not strictly follow the original studies in other aspects as well. On the other hand, Gelman (as cited in Baker, 2016) argued that replications are more reliable guides than original studies because the latter are more likely to be statistical flukes as compared to replications, particularly due to publication bias. Replications are more thoughtful and planned endeavors, argues Gelman, while original studies showing strong effect sizes tend to find their way to publication too quickly.

4. CONCLUSIONS

This paper addressed the promises of what has been called the “experimental turn” in Public Administration. The argument for the experimental method is that, since Public Administration and Psychology share several theoretical affinities, research in the former would benefit from making more use of the experimental method. Experiments have the advantage of controlling and isolating variables as well as allowing within-subjects and between-subjects analyses, effectively controlling for confounding variables, and assuring a high degree of internal validity. Nevertheless, only several experiments yielding the same outcome can provide any confidence in the external validity.

As the “Open Science Reproducibility Project: Psychology” revealed, while experimental studies in Cognitive Psychology are not easily reproduced, Social Psychology studies are even less so (Open Science Collaboration, 2015). As more experimental studies in Public Administration are performed and failures to reproduce their results emerge, the initial enthusiasm for the experimental method among scholars may begin to wither away. This does not imply that scholars should discard the experimental method altogether but that they should take the necessary precautions when adopting it.

Our examination of the outcomes of the Reproducibility Project led to seven recommendations to Public Administration researchers, which are summarized in Box 1.

BOX 1
Recommendations drawn from psychology studies to public administration scholars

The single most important recommendation to Public Administration scholars is that replications should become an inherent component of the experimental design. The experimental method should not be an isolated endeavor but a collaborative research project in which the replication of experiments in different cultures, regimes, and organizational settings is part and parcel of the research design. The advantage is that collaborative projects, regardless of the results of the experiments, can avoid publication bias, since leading academic journals are certainly interested in the results of research with this characteristic. As more experiments in the field of Public Administration become available, meta-analytical studies also become more feasible. This is of paramount importance, given that even underpowered experiments, when combined, can increase the power of the analysis (Cooper, 2017).
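How combining underpowered experiments can increase power may be sketched with inverse-variance (fixed-effect) pooling, one common meta-analytic estimator. The sketch assumes that all studies estimate the same true effect and that a z approximation is adequate; the four estimates and standard errors are hypothetical numbers, and the function names are illustrative rather than drawn from Cooper (2017).

```python
import math
from statistics import NormalDist


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of independent effect estimates."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


def two_sided_p(estimate, se):
    """Two-sided p-value from a large-sample z approximation."""
    return 2 * (1 - NormalDist().cdf(abs(estimate / se)))


if __name__ == "__main__":
    # Hypothetical data: four small studies, none significant on its own.
    estimates = [0.30, 0.25, 0.35, 0.28]
    std_errors = [0.20, 0.18, 0.22, 0.19]
    for e, se in zip(estimates, std_errors):
        print(f"single study: effect = {e:.2f}, p = {two_sided_p(e, se):.3f}")
    pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
    print(f"pooled: effect = {pooled:.2f}, SE = {pooled_se:.3f}, "
          f"p = {two_sided_p(pooled, pooled_se):.4f}")
```

The pooled standard error is smaller than that of any single study, so an effect that no individual experiment could detect may become detectable. When studies differ across cultures or organizational settings, a random-effects model would usually be a more cautious choice than the fixed-effect assumption used here.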

Public Administration scholars should learn from the pitfalls of experiments in Psychology in regard to some specific technical issues as well, such as focusing on the effect size instead of the P value. There is a reason why NHST has become so controversial over the years, as we have discussed above. Moreover, while detecting an effect may not be easy, this should not be an incentive to engage in the “proof of the null hypothesis”. As discussed above, it is not possible to prove that the null hypothesis is true even when there is enough power in an experiment to detect an effect. This is because the statistical technique used to calculate powered sample sizes needs to assume an effect size larger than zero (see Howell, 2012). Equivalence and noninferiority testing may be an option when a researcher wants to demonstrate, for example, that one training program is not better than another (Streiner, 2003). Pretest-posttest experimental designs should be used more often, although measuring variables that are “inside the heads” of subjects represents a challenge for this research design.
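The equivalence-testing option can be sketched with the two one-sided tests (TOST) procedure. This is a minimal illustration under a large-sample z approximation; the equivalence margins, the observed difference, and the standard errors are hypothetical, and in practice the margins must be justified on substantive grounds before the data are collected.

```python
from statistics import NormalDist


def tost_p_value(diff, se, low, high):
    """Two one-sided tests (TOST) for equivalence, z approximation.

    Null hypothesis: the true difference lies outside the interval (low, high).
    Returns the larger of the two one-sided p-values; equivalence is
    supported only when this value falls below the chosen alpha.
    """
    if not low < high:
        raise ValueError("low must be smaller than high")
    norm = NormalDist()
    p_lower = 1 - norm.cdf((diff - low) / se)  # tests H0: true diff <= low
    p_upper = norm.cdf((diff - high) / se)     # tests H0: true diff >= high
    return max(p_lower, p_upper)


if __name__ == "__main__":
    # Hypothetical comparison of two training programs, margins of +/- 0.2.
    print(tost_p_value(diff=0.02, se=0.05, low=-0.2, high=0.2))  # precise estimate
    print(tost_p_value(diff=0.02, se=0.15, low=-0.2, high=0.2))  # imprecise estimate
```

In both calls the observed difference is far from significant in a conventional test, yet only the precise estimate supports equivalence. A non-significant result from an imprecise study therefore says little about the absence of an effect.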

Not all research topics are suitable to be investigated through experiments. The experimental method requires variables that are already firmly established and measurement instruments that have been properly validated. Public Administration is a discipline that emerged from a specific context: governmental organizations and their interactions with the public. Thus, it is also recommended that experiments testing hypotheses related to the field’s problems use public servants as subjects whenever possible. The outcomes of experiments using undergraduate students enrolled in Western universities as subjects are unlikely to serve as a solid base for practical interventions.

Public Administration scholars not only ask about the “what” and “why” of matters but also about the “how”. The knowledge generated through research in our field needs to be applied to practical problems. It may be reasonable to argue that replications fail because experiments have not followed exactly the same procedures as the original studies. However, in Public Administration, experiments have to be robust enough to withstand some degree of contextual differences. Public Administration investigators deal with a large array of factors and the interrelatedness between them. Therefore, focusing on a handful of variables may not be sufficient. Experiments may be useful to address issues that are relevant for practitioners and decision-makers, but other methods would still be necessary to investigate knotty social issues. As to experimental designs, complex factorial designs may be necessary, but this does not mean that a straightforward approach can be overlooked.

Finally, our implicit argument throughout this article is that Public Administration scholars need to engage in meta-science, i.e., use scientific tools to reflect on research itself. Understandably, meta-science has gained renewed relevance in the behavioral sciences after the reproducibility crisis in Psychology (see Passmore & Chae, 2019).

All in all, the field of Public Administration can benefit from the experimental method by being aware of its difficulties. If the field is moving towards an “experimental turn”, may this turn be performed with caution and a sense of direction.

REFERENCES

  • American Psychological Association. (n.d.). APA Dictionary of Psychology. Retrieved from https://dictionary.apa.org/pretest-posttest-design
  • Amrhein, V., & Greenland, S. (2018). Remove, rather than redefine, statistical significance (correspondence). Nature Human Behaviour, 2(1), 4.
  • Baker, M. (2016, March 03). Psychology’s reproducibility problem is exaggerated – say psychologists. Nature News. Retrieved from http://www.nature.com/news/psychology-s-reproducibility-problem-is-exaggerated-say-psychologists-1.19498#/b4
  • Baldassarri, D., & Abascal, M. (2017). Field experiments across the social sciences. Annual Review of Sociology, 43, 41-73.
  • Banerjee, A., & Solomon, B. D. (2003). Eco-labeling for energy efficiency and sustainability: a meta-evaluation of US programs. Energy Policy, 31, 109-123.
  • Baranski, E. (2015). Replication of “On the relative independence of thinking biases and cognitive ability” by KE Stanovich, RF West (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/3gz2/
  • Barber, T. X., Calverley, D. S., Forgione, A., McPeake, J. D., Chaves, J. F., … Bowen, B. (1969). Five attempts to replicate the experimenter bias effect. Journal of Consulting and Clinical Psychology, 33(1), 1-6.
  • Bellé, N. (2013). Experimental evidence on the relationship between public service motivation and job performance. Public Administration Review, 73(1), 143-153.
  • Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., ... Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10.
  • Bonate, P. L. (2000). Analysis of pretest-posttest designs. Boca Raton, FL: Chapman & Hall/CRC Press.
  • Bouwman, R., & Grimmelikhuijsen, S. (2016). Experimental public administration from 1992 to 2014: a systematic literature review and ways forward. International Journal of Public Sector Management, 29(2), 110-131.
  • Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., ... Van’t Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217-224.
  • Brown, B., Brown, K., Attridge, P., DeGaetano, M., Hicks, G., Humphries, D., ... Mainard, H. (2013). Replication of Study 5 by Centerbar, Schnall, Clore, & Gavin (2008, JPSP). Retrieved from https://osf.io/wcgx5/
  • Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., ... Altmejd, A. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637-644.
  • Carpenter, S. (2012). Psychology’s bold initiative. Science, 335(6076), 1558-1561.
  • Cooper, H. M. (2017). Research synthesis and meta-analysis: A step-by-step approach (5th ed.). London, UK: Sage.
  • Embley, J., Johnson, L. G., & Giner-Sorolla, R. (2015). Replication of Study 1 by Vohs & Schooler (2008, Psychological Science). Retrieved from https://osf.io/2nf3u/
  • Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science” (Technical Comments). Science, 351(6277), 1037b.
  • Goodman, S. (2008, July). A dirty dozen: twelve p-value misconceptions. Seminars in Hematology, 45(3), 135-140.
  • Grimmelikhuijsen, S., Jilke, S., Olsen, A. L., & Tummers, L. (2017). Behavioral public administration: Combining insights from public administration and psychology. Public Administration Review, 77(1), 45-56.
  • Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (2016). What if there were no significance tests? (Original work published 1997). London, UK: Routledge.
  • Holubar, T. (2015). Replication of “The rejection of moral rebels,” Study 4, by Monin, Sawyer, & Marquez (2008, JPSP). Retrieved from https://osf.io/a4fmg/
  • Howell, D. C. (2012). Statistical methods for psychology. Belmont, CA: Wadsworth/Cengage Learning.
  • Ioannidis, J. P., Munafo, M. R., Fusar-Poli, P., Nosek, B. A., & David, S. P. (2014). Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in Cognitive Sciences, 18(5), 235-241.
  • Jacob, T. (1968). The experimenter bias effect: a failure to replicate. Psychonomic Science, 13(4), 239-240.
  • James, O., Jilke, S. R., & Van Ryzin, G. G. (2017). Behavioural and experimental public administration: Emerging contributions and new directions. Public Administration, 95(4), 865-873.
  • Jilke, S., Van de Walle, S., & Kim, S. (2016). Generating usable knowledge through an experimental approach to public administration. Public Administration Review, 76(1), 69-72.
  • Johnson, K. M., Hayes, T., & Graham, J. (2015). Replication of Study 2 by Amodio, Devine, & Harmon-Jones (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/ysxmf/
  • Jost, J. T., & Kruglanski, A. W. (2002). The estrangement of social constructionism and experimental social psychology: History of the rift and prospects for reconciliation. Personality and Social Psychology Review, 6(3), 168-187.
  • Kelson, K. (2015). Replication of “The space between us: Stereotype threat and distance in interracial contexts” by P.A. Goff, C.M. Steele, and P.G. Davies (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/7q5us/
  • Lane, K., & Gazarian, D. (2015). Replication of “The effects of an Implemental mind-set on attitude strength” by Henderson, de Liver, & Gollwitzer (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/xqjf4/
  • Lecoutre, M. P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests. International Journal of Psychology, 38(1), 37-45.
  • Lehmann, E. L. (2011). Fisher, Neyman, and the creation of classical statistics. New York, NY: Springer.
  • Lemn, K. (2013). Replication of Blankenship and Wegener (2008, JPSP, Study 5A). Retrieved from https://osf.io/v3e2z/
  • Lewis, M., & Pitts, M. (2015). Replication of “Errors are Aversive” by Greg Hajcak & Dan Foti (2008, Psychological Science). Retrieved from https://osf.io/tkq9n/
  • Lin, S. (2013). Replication of Study 7 by Exline, Baumeister, Zell, Kraft & Witvliet (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/svz7w/
  • Lindsay, D. S. (2020). Seven steps toward transparency and replicability in psychological science. Canadian Psychology/Psychologie canadienne, 61(4), 310-317. Retrieved from https://doi.org/10.1037/cap0000222
  • Marigold, D. C., Forest, A. L., & Anderson, J. E. (2015). Replication of “How the head liberates the heart: Projection of communal responsiveness guides relationship promotion” by EP Lemay Jr and MS Clark (2008, JPSP). Retrieved from https://osf.io/mv3i7/
  • Mechin, N., & Gable, P. (2015). Replication of “Left frontal cortical activation and spreading of alternatives: Test of the action-based model of dissonance” by E Harmon-Jones, C Harmon-Jones, M Fearn, JD Sigelman, P Johnson (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/zwne/
  • Moynihan, D. P. (2013). Does ublic service motivation lead to budget maximization? Evidence from an exeriment. International Public Manpagement Journal, 16(2), 179-196.
  • Niskanen, W. A. (1968). The eculiar economics of bureaucracy. The American Economic Review, 58(2), 293-305.
  • Nosek, B. A; Sies, J. R; & Motyl, M. (2012). Scientific utoia II. Restructuring incentives and ractices to romote truth over ublishability. Persectives on Psychological Science, 7(6), 615-631.
  • Oen Science Collaboration. (2015). Estimating the reroducibility of sychological science. Science, 349(6251), aac4716-1-aac4716-8.
  • Passmore, D. L; & Chae, C. (2019). Potential for meta-scientific inquiry to imrove the usefulness of HRD research outcomes for ractice. Advances in Develoing Human Resources, 21(4), 409-420.
  • Perry, J. L; & Wise, L. R. 1990. The Motivational Bases of Public Service. Public Administration Review, 50(3), 367-73
  • Reinhard, D. (2014). Relication of Förster, J; Liberman, N; & Kuschel, S . ( 2008 The effect of global versus local rocessing styles on assimilation versus contrast in social judgment, Journal of Personality and Social Psychology, 94, 579-599 Retrieved from https://osf.io/mxryb/
    » https://osf.io/mxryb/
  • Robertson, P. J., & Seneviratne, S. J. (1995). Outcomes of planned organizational change in the public sector: a meta-analytic comparison to the private sector. Public Administration Review, 55(6), 547-558.
  • Rosenthal, R., & Fode, K. L. (1963). The effect of experimenter bias on the performance of the albino rat. Systems Research and Behavioral Science, 8(3), 183-189.
  • Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416-428.
  • Shadish, W., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin Company.
  • Sobis, I., & De Vries, M. S. (2014). The social psychology perspective on values and virtue. In M. S. De Vries, & P. S. Kim (Eds.), Value and virtue in Public Administration: a comparative perspective (IIAS Series: Governance and Public Management). London, UK: McMillan-Palgrave.
  • Streiner, D. L. (2003). Unicorns do exist: A tutorial on “proving” the null hypothesis. The Canadian Journal of Psychiatry, 48(11), 756-761.
  • Talhelm, T., Lee, M., & Eggleston, C. (2015). Replication of Poignancy: Mixed Emotional Experience in the Face of Meaningful Endings by Ersner-Hershfield, Mikels, Sullivan, & Carstensen (2008, Journal of Personality and Social Psychology). Retrieved from https://osf.io/fw6hv/
    » https://osf.io/fw6hv/
  • Tepe, M., & Prokop, C. (2017). Laboratory experiments: their potential for public management research. In O. James, S. R. Jilke, & G. G. Van Ryzin (Eds.), Experiments in public management research: challenges and contributions (1st ed., pp. 139-164). Cambridge, UK: Cambridge University Press.
  • Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36(1), 1-2. Retrieved from https://doi.org/10.1080/01973533.2014.865505
    » https://doi.org/10.1080/01973533.2014.865505
  • Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1-2. Retrieved from http://dx.doi.org/10.1080/01973533.2015.1012991
    » https://doi.org/10.1080/01973533.2015.1012991
  • Van de Walle, S. (2016). The experimental turn in public management: How methodological preferences drive substantive choices. In O. James, S. R. Jilke, & G. G. Van Ryzin (Eds.), Experiments in public management research. Cambridge, UK: Cambridge University Press.
  • Van Kolfschooten, F. (2014). Fresh misconduct charges hit Dutch social psychology. Science Magazine, 344(6184), 566-567.
  • Vohs, K. D., & Schooler, J. W. (2008). The value of believing in free will: encouraging a belief in determinism increases cheating. Psychological Science, 19(1), 49-54.
  • Wasserstein, R., & Lazar, N. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.

  • 1
    In this article, we consider only experimental methods that involve a control group and the manipulation of one or more variables, including laboratory and survey experiments. Although field experiments also involve the manipulation of one or more variables, they are not discussed here because of the specific features of the method and its relatively limited use vis-à-vis other experimental methods. For a discussion of field experiments, see Baldassarri and Abascal (2017).
  • 2
    The reports for all replications included in the Reproducibility Project are available at the Open Science Framework website (Retrieved from https://osf.io/ezcuj/).
  • 3
    More details on the statistical methods used to evaluate the results of the replication effort can be found in the “statistical analysis” section (Open Science Collaboration, 2015, pp. aac4716-2-aac4716-4).
  • 4
    A more recent study, which replicated 21 social science experimental studies previously published in Nature and Science, found that 62% of the replications were in the same direction as the original studies and that the average effect sizes were 50% of those in the original studies (Camerer et al., 2018).
  • 5
    In posttest-only experimental designs, the variable of interest is measured in the control group and in the experimental group after the experimental manipulation takes place. In the pretest-posttest experimental design, the variable of interest is measured in the control group and in the experimental group both before and after the manipulation in the experimental group takes place. This allows the experimenter to assess the baseline measurement of the experimental group, as well as to identify any influence of the experiment on the data, increasing the reliability of the results. See American Psychological Association (APA, n.d.).


Publication Dates

  • Publication in this collection
    21 Jan 2022
  • Date of issue
    Sep-Oct 2021

History

  • Received
    28 Sept 2020
  • Accepted
    08 Mar 2021
Fundação Getulio Vargas, Rua Jornalista Orlando Dantas, 30, CEP: 22231-010, Rio de Janeiro - RJ - Brazil, Tel.: +55 (21) 3083-2731
E-mail: rap@fgv.br