Determining sexual dimorphism in frog measurement data: integration of statistical significance, measurement error, effect size and biological significance

Several analytic techniques have been used to determine sexual dimorphism in vertebrate morphological measurement data with no emergent consensus on which technique is superior. A further confounding problem for frog data is the existence of considerable measurement error. To determine dimorphism, we examine a single hypothesis (H0: equal means) for two groups (females and males). We demonstrate that frog measurement data meet assumptions for clearly defined statistical hypothesis testing with statistical linear models rather than those of exploratory multivariate techniques such as principal components, correlation or correspondence analysis. In order to distinguish biological from statistical significance of hypotheses, we propose a new protocol that incorporates measurement error and effect size. Measurement error is evaluated with a novel measurement error index. Effect size, widely used in the behavioral sciences and in meta-analysis studies in biology, proves to be the most useful single metric to evaluate whether statistically significant results are biologically meaningful. Definitions for a range of small, medium, and large effect sizes specifically for frog measurement data are provided. Examples with measurement data for species of the frog genus Leptodactylus are presented. The new protocol is recommended not only to evaluate sexual dimorphism for frog data but also for any animal measurement data for which the measurement error index and observed or a priori effect sizes can be calculated.


INTRODUCTION
Study of animal sexual dimorphism can lead to important biological insights. For example, in a seminal frog paper, Shine (1979) convincingly demonstrated that for species in which male combat occurs, the males are often larger than females. There are two outstanding problems when evaluating sexual dimorphism in measurement variables in frogs: (1) large measurement error, and (2) statistical versus biological significance.
Measurement error in frogs is large and impacts both statistical and biological results (Hayek et al. 2001). As part of a recent study, WRH detected an apparent conflict between statistical and biological significance for several morphological variables in a group of large species of the frog genus Leptodactylus (Heyer 2005). WRH brought the problem to LCH, who proposed a study on appropriate statistical methodology for evaluating sexual dimorphism for measurement data in frogs. LCH suggested that WRH select a limited number of data sets that would allow for evaluation of problems associated with sample sizes and geographic variation and that would likely exhibit a range of variation in sexual dimorphism. LCH would then use these data to examine appropriate statistical procedures for evaluating sexual dimorphism in the variables measured.
Through review of the literature and analyses of our data we find a new approach to the problem is superior to other methods in use. Our protocol consists of the following sequential steps:
1) Analyze the overall size measurement data (in our case snout-vent lengths [SVL]) with ANOVA and the other measurement variables with ANCOVA (using SVL as the independent variable) to determine whether the results are statistically significant. If the results are statistically significant, proceed to the next step.
2) Evaluate the statistically significant results from Step 1 with the measurement error index, developed herein, to screen out statistically significant results that are compromised by measurement error. For results that are not compromised by measurement error, proceed to the final step.
3) Calculate and use effect size (ES) coefficients to evaluate the biological significance of the statistically supported results. Effect size values are standardized scores that can be compared across studies irrespective of sample sizes. We find that small effect size values are not biologically meaningful in our data, but that medium and large effect size values do have biological meaning.
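The three sequential steps can be sketched as a simple screening function. This is a minimal illustration, not the authors' implementation: the names `Comparison` and `evaluate_dimorphism` are hypothetical, and the three cut-offs (`alpha`, `min_error_index`, `min_effect_size`) are left as caller-supplied assumptions because the appropriate values depend on the variable being tested.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """Summary of one male-female comparison (field names are illustrative)."""
    p_value: float       # from ANOVA (SVL) or ANCOVA (other variables)
    error_index: float   # mean sex difference / predicted measurement error
    effect_size: float   # standardized effect size


def evaluate_dimorphism(result, alpha, min_error_index, min_effect_size):
    """Apply the three screening steps in order and report where a result stops."""
    # Step 1: statistical significance.
    if result.p_value >= alpha:
        return "not statistically significant"
    # Step 2: screen out results compromised by measurement error.
    if result.error_index < min_error_index:
        return "compromised by measurement error"
    # Step 3: judge biological significance from the effect size.
    if abs(result.effect_size) < min_effect_size:
        return "statistically significant but effect size too small"
    return "biologically meaningful dimorphism"
```

A result reaches the final label only if it passes every earlier step, which mirrors the "proceed to the next step" wording of the protocol.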
We lay out arguments for the appropriateness of this 3-step protocol for frog measurement data; discuss this protocol in terms of other approaches used in the literature to study sexual dimorphism in measurement data; define small, medium, and large effect sizes for frog measurement data; and show examples of the application of the new protocol with frog data.
We propose that the procedure described in this paper should be adopted in future studies when evaluating sexual dimorphism of measurement data in animals in general.

Materials
Almost all of the data used in this study come from years of study of the variation in members of the frog genus Leptodactylus by WRH. The variables are: snout-vent length (SVL), a measure of overall size; head length; head width; head area; eye-nostril distance; tympanum diameter; thigh length; shank length; and foot length. Not all of these variables were examined in earlier studies, so in some cases there are no data, or smaller sample sizes, for eye-nostril distance and tympanum diameter. Methods for taking the measurements are those found in Heyer (2005). Head area is calculated as one-half an ellipsoidal conic section fit to the triangular area determined from measured head length and head width of each frog in the study.
The data were selected to answer a variety of questions. One problem of concern was whether characterizations of sexual dimorphism based on specimens throughout the geographic range differed from characterizations based on single locality samples. Specifically, should sexual dimorphism always be studied at a local level? Two data sets address this problem: (1) a substantial sample available for Leptodactylus fuscus throughout its geographic range (Panama to Argentina) and a single large sample of L. fuscus SVL data from Porto Velho, Brasil; and (2) a substantial sample of the widely distributed Leptodactylus podicipinus (southern Amazonia, central and eastern Brasil to northern Argentina) and four reasonably sized samples from single localities (Alejandra, Bolivia; Curuçá, Brasil; Porto Velho, Brasil; Rurrenabaque, Bolivia).
A second problem involved sexual dimorphism of similar species within a single genus. Two sets of data analyzed in previous studies demonstrated different statistically significant results for measurements made between morphologically similar species (Heyer 1978): (1) the species pair Leptodactylus bufonius and L. troglodytes; and (2) the species pair Leptodactylus furnarius and L. gracilis.
In addition, questions regarding biological versus statistical significance were raised for the species Leptodactylus knudseni, L. pentadactylus, and an undescribed species, referred to herein as Middle American pentadactylus (Heyer 2005).
Finally, the importance of determining effect sizes (ES) to define sexual size dimorphism in frogs became clear during the course of our study. A previously assembled data set for Eleutherodactylus fenestratus (Heyer and Muñoz 1999) was included because the effect size values of this species would be at the large end of a possible range of values. The difference in male and female size in E. fenestratus is obvious by visual inspection.
To evaluate measurement error, individual specimens were measured 20 times. Maximum and minimum values were obtained from these measurements. Three individuals of about the same SVL were selected for measurement at more-or-less regular intervals spanning the adult size ranges of species of Adenomera and Leptodactylus, with the exception that only one specimen was available for the largest size category. Previous data were available for one individual of Vanzolinius discodactylus (Hayek et al. 2001) and a male and female of an undescribed species from Pará, Brasil (Heyer 2005). In addition to the previously available data, the following specimens were re-measured: Adenomera marmorata (USNM 209101, 209110, 209112); Leptodactylus knudseni (USNM 216785, 531513); L. labyrinthicus (USNM 121284, 303175, 370593, 507904); L. leptodactyloides (USNM 202519, 202522, 321214); L. myersi (USNM 302191); L. podicipinus (USNM 148685, 148686); L. rhodomystax (USNM 343256, 343257, 531559); and L. vastus (USNM 109144, 109148). These specimens not only span the size range for leptodactylid frogs in general, but also include examples of well and poorly preserved individuals (Fig. 1). Twenty data forms were produced on which to record the measurement data for these 20 frogs. Only one data sheet was filled out on any given day. The 20 specimens were placed in three containers, one containing the smallest individuals, one the medium-sized, and one the largest. The order of container examination was indicated on the top of each data form so that each container was examined almost equally often first, second, or third in the study. Individuals were haphazardly selected from each container each session. The date and time were also recorded on each form as it was filled out. All measurements were taken by WRH to avoid inter-observer error.

Evaluation of Statistical Methods
Statistical significance for a hypothesis of sexual dimorphism of frog body parts using measurement data indicates whether the study results are due to chance or to sampling variability. Total reliance upon statistical tests for amphibian hypotheses leads to the anomalous results that prompted the present study. Previous studies of sexual size dimorphism in frogs seem to have been reduced to the selection of a fixed level of significance and a desire for a dichotomous reject/do not reject decision, regardless of sample size, to test merely whether there is or is not a difference of 0. By default, the alternative hypothesis is that any statistically significant sexual size difference other than 0, however unspecified, is equated with biological importance. There is little if any emphasis upon the actual or expected size of that difference in nature or whether such a difference has biological meaning. Therefore, contradictory results occur when the p-value is the focal point. Two tests on the same species can lead to results that on the one hand imply sexual dimorphism and on the other hand deny that any differences exist. For example, when a total of 35 L. furnarius were examined (Heyer 1978), male head length was larger than female. With a sample size of 74 specimens in the present study, the opposite was found. We conclude that emphasis on p-value statistical test results alone is not what the researcher should be seeking. Understanding sexual size dimorphism in frogs requires answers to questions of existence, magnitude, and strength of any association or inter-relationship.
Statistical significance, or p-value, actually provides little insight about frog size dimorphism. The result of the statistical test depends upon sample size, test level, and power at which the test was performed, as well as the difference between the quantities being tested. To reject a null hypothesis of no sexual dimorphism is to reject that the size difference between the sexes is really 0. Since all nature varies, it follows that before any statistical test is even performed, such a strict null hypothesis has to be false (given a large enough sample size). If we reject a 0 difference between the sexes, what is the alternative? Failing to demonstrate an effect is quite distinct from either implicitly or explicitly concluding that no difference exists at all. Is dimorphism then any male-female difference on average? Clearly, hypothesis test results provide no indication of the magnitude of the difference between the sexes, or of the actual effect of the dimorphism being observed. For example, based upon hypothesis test results for SVL of p < 0.05, both L. pentadactylus and L. troglodytes exhibit sexual dimorphism. However, for the former species the mean difference between the sexes is 13.7 mm (maximums: 195 mm males; 174 mm females), whereas for the latter it is 1.3 mm (maximums: 52.8 mm males; 52.7 mm females). Not only are these values highly discrepant, but with a two-sided test "significance" indicates only "not equal". The test result cannot provide inference on whether males or females are significantly larger. We require that the maximum values exhibit a reasonably large difference in the same direction indicated by the statistical test results. Classical statistical significance tests are not independent of sample size: the larger the sample size, the more likely is rejection (and in practice power is higher). Thus, there is always a sample size that will allow for the rejection of any non-zero difference; with enough specimens the sexes will be called dimorphic. Because this dependence is so often ignored in amphibian research, we propose supplementing significance test results with two factors: a measurement error index; and a standardized, biologically meaningful effect size defined herein specifically for frogs. These two quantities are used in tandem to determine the existence of consequential size dimorphism.
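The dependence of the p-value on sample size can be illustrated with a short sketch. It assumes equal sample sizes per sex and uses a normal (z) approximation to the two-sample test rather than the ANOVA models used in this study; the function name `two_sided_p` and the example effect size of 0.1 are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_sex):
    """Approximate two-sided p-value for a standardized mean difference.

    Assumes equal sample sizes per sex and a normal (z) approximation
    to the two-sample test; this is a simplification for illustration.
    """
    z = abs(effect_size) * sqrt(n_per_sex / 2)
    return 2 * (1 - NormalDist().cdf(z))


# The same small standardized difference at increasing sample sizes.
for n in (20, 200, 2000):
    print(n, round(two_sided_p(0.1, n), 4))
```

With the standardized difference held fixed, the printed p-value shrinks as the number of specimens grows, which is the behavior the paragraph above describes.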

Statistical Methods
Descriptive and inferential statistical analyses were computed for each measurement variable for all adult individuals of each species and by sex. Locality analyses were performed on the L. fuscus and L. podicipinus data. All assumptions, hypothesis tests, and analyses under a general linear model were performed using SPSS (SPSS for Windows, version 11.0, 2001, SPSS Inc., Chicago). Power and effect size calculations were programmed into Mathcad Professional 2000 (Mathsoft Inc., Cambridge, Massachusetts). We used a modification (see Appendix I) of Cohen's (1977) d as our effect size measure, $d = (m_f - m_m)/\sigma$, where $d$ = effect size index, $m_f$, $m_m$ = population means expressed in the original measurement unit, and $\sigma$ = the standard deviation of either population (assuming they are equal). Cohen's d was selected because means are the focus of any study of sexual dimorphism. For our study we defined d as the standardized mean difference of female versus male measurements and $\sigma$ as the pooled standard deviation of the two groups (Appendix I). Under an ANOVA model the numerator is the difference between female and male means. Under an ANCOVA model the numerator uses the difference between covariance-adjusted means. This measure, d, can also be computed from regression calculations that give correlation coefficients. That is, d is defined as twice a correlation coefficient divided by the square root of one minus the coefficient squared. The two calculation methods yield equivalent values for d. Computations of effect size as either average percentile or percentage non-overlap were programmed in Mathcad following Cohen (1977). Reliability calculations for indices and regressions were modeled with SYSTAT (Wilkinson and Coward 2000).
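As a rough illustration of the effect size calculation described above, the following sketch computes d as the female-minus-male mean difference over a pooled standard deviation, together with the conversion from a correlation coefficient. The pooled formula used here is the standard textbook one and is an assumption; the modification described in Appendix I is not reproduced, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(females, males):
    """Standardized mean difference (female minus male) using a pooled SD."""
    n_f, n_m = len(females), len(males)
    # Textbook pooled variance; assumed here, not taken from Appendix I.
    pooled_var = ((n_f - 1) * variance(females)
                  + (n_m - 1) * variance(males)) / (n_f + n_m - 2)
    return (mean(females) - mean(males)) / sqrt(pooled_var)


def d_from_r(r):
    """Convert a correlation coefficient to d: twice r over sqrt(1 - r^2)."""
    return 2 * r / sqrt(1 - r ** 2)
```

A positive value indicates that females are larger on average and a negative value that males are larger. The correlation-based conversion is usually stated for groups of roughly equal size.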

Determination of a New Measurement Error Index
Regression analyses were performed for each measurement variable with mean SVL as the independent variable and the range of each variable as the dependent variable. The data used were the individuals measured 20 times each. The mean SVL is the mean of the 20 measurements of each individual. The range of each variable was the maximum measurement minus the minimum measurement of the 20 measurements of each individual.
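The data preparation described above can be sketched as follows: each repeatedly measured specimen contributes one point (mean SVL, range of the variable), and a regression is fit through those points. Only the straight-line case is shown; the helper names are hypothetical and no values from Table I are reproduced.

```python
from statistics import mean


def summarize_specimen(svl_repeats, variable_repeats):
    """Return (mean SVL, range of the variable) for one specimen."""
    return mean(svl_repeats), max(variable_repeats) - min(variable_repeats)


def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b
```

In use, `summarize_specimen` would be called once per specimen with its 20 repeated measurements, and the resulting pairs passed to `fit_line` as the x and y values.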
For some variables, linear regression was the most appropriate analysis (e.g., head width, Fig. 2); for others, quadratic regression was more appropriate (e.g., SVL, Fig. 3). Table I gives the regression formula suitable for each variable. [...] of specimens is required the larger they are to position them for measurement of each variable. In the case of SVL, such error is particularly pronounced for large, poorly preserved specimens, where the specimen must be flattened out to take the measurement. In some cases, measurement error was greater for the smallest specimens relative to moderate-sized specimens (e.g., those variables for which the quadratic regression is most appropriate, such as SVL, Fig. 3). For example, to measure head length, the proximal point of the needle-nosed caliper is "hooked" behind the jawbones. For the small specimens, the needle-nosed point is large relative to the structures involved, making it difficult to distinguish the posterior angle of the jawbones from the overlying skin and associated tissues. Some variables were measured more accurately than others. For example, shank length was measured more accurately than either thigh length or foot length (Fig. 4).
The regression formulae determined for mean measurement errors (Table I) are appropriate to evaluate measurement error in this study, since WRH took all of the measurement data. However, we propose that these regression formulae are also appropriate to evaluate measurement error in any study involving frogs with similar overall body shapes, such as members of the families Leptodactylidae, Myobatrachidae, and Ranidae. Tree-frogs (Hylidae, Rhacophoridae) and hopping toads (Bufonidae) should be at least spot-checked for some variables to determine whether measurement error results are comparable to those established herein. We used the above results to define a measurement error index as the mean sexual difference divided by the measurement error regression quantity for the variable of interest (solving for y in the equations of Table I; see also Measurement Error Screening of Statistically Significant Results, below).
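The measurement error index defined above can be sketched in two small functions. The regression coefficients in Table I are not reproduced here, so `coefficients` is a placeholder argument; taking the absolute value of the mean difference and evaluating the regression at a mean SVL are assumptions made for this illustration.

```python
def predicted_error(mean_svl, coefficients):
    """Evaluate a fitted error regression y = c0 + c1*x + c2*x^2 + ... at mean SVL.

    `coefficients` is a hypothetical stand-in for the Table I formulae:
    two values for a linear fit, three for a quadratic fit.
    """
    return sum(c * mean_svl ** power for power, c in enumerate(coefficients))


def measurement_error_index(mean_difference, error):
    """Mean sex difference divided by the predicted measurement error."""
    # Absolute value is an assumption: only the magnitude is compared.
    return abs(mean_difference) / error


# Values quoted in the text for L. pentadactylus SVL:
# mean sex difference 13.7 mm, mean measurement error 14.9 mm.
print(round(measurement_error_index(13.7, 14.9), 2))
```

An index near 1 means the observed sex difference is about the same size as the expected measurement error, so the example above falls just below that level.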

Statistical Tables
Tables II-VI provide comparisons of mean difference values and conventional hypothesis tests of means for males and females across species, by sex and locality. The statistical significance of results is designated by p-value, where p < 0.05 indicates rejection of the null hypothesis, or the inferred existence of possible sexual dimorphism.
The information in these tables provides the data for the interpretations and further analyses relating to determination of sexual dimorphism in this paper.

Should Raw, Transformed, or Covariate-adjusted Data be Used?
In previous studies of size dimorphism for variables other than SVL, differing types of data have been used: (1) raw versus transformed data; and, (2) raw measures versus ratios of the measures.
In general, morphological measurements on frogs are ratio-scaled (i.e., there is an absolute zero value) and continuous, so that results of tests for normality and variance homogeneity in the population show that the raw, untransformed measurement data can be used for general linear modeling. Although tests that reject these assumptions can be found in the literature, they are sample-based. It is actually not appropriate to base tests only on small field samples, especially samples that are unrepresentative of the population. The assumptions concern characteristics of the populations from which the samples are taken.
It is quite usual in studies of sexual dimorphism that the raw measurements are divided by a measure of overall body size before beginning hypothesis testing. For amphibian research the usual denominator of such a ratio is SVL. In turn, such ratios are either used as the variable of interest or are transformed. Across research areas the most commonly applied transformation is the logarithm (Sokal and Rohlf 1969 p. 382). The arcsine transformation has also been used for ratio transformation (e.g., Heyer 1994).
In the present study, with its emphasis on reliability of body part measurements and determination of actual magnitude of sexual differences, we also compared results of covariate-adjusted data, with SVL as the covariate. In statistical application, ANOVA treats sex as a grouping factor, whereas regression models treat sex as the variable being predicted. In this regard, ANCOVA represents a link between the two models. The ANCOVA technique allows the researcher to adjust for body size after the field sampling has been completed and the measurements made. A ratio is used for the purported aim of adjusting these same data, and therefore ANCOVA can be seen as an alternative. The statistical assumption of ANCOVA that the regression data be linear is not violated by frog data, because the adults we measure do not exhibit allometry in size as static individuals. In fact, there is no indication of allometry for juveniles and adults in L. knudseni, a species for which allometry in head width was anticipated by WRH (Fig. 5). Rather than simple division to form a ratio (a quantity with known properties that disallow the use of parametric linear models in general), ANCOVA provides statistical control whereby the influence of the covariate is removed from the comparison on the measurement of interest.
Table VI, which provides results for an example species, L. podicipinus, illustrates that regardless of transformation, or of body measurement considered, when we compare the ratio results we find that the effect sizes and power of the tests are virtually identical. From the standpoint of detectable male-female difference these results are equivalent as well. When results on the raw data are compared with ratio results, it is clear that in general the division by body size changes, and often greatly reduces, the observed effect size from that seen with the raw measure. We therefore present our results for both ANOVA and ANCOVA on the raw measures only.
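The idea of removing the influence of SVL from a male-female comparison can be sketched with the classical one-covariate adjustment, which uses a pooled within-sex slope. This is a simplified illustration assuming a common slope in both sexes; it is not the SPSS general linear model used in this study, and the function name is hypothetical.

```python
from statistics import mean


def adjusted_mean_difference(svl_f, y_f, svl_m, y_m):
    """Female-minus-male difference in covariate-adjusted means.

    Uses the pooled within-sex regression slope of the variable on SVL,
    i.e. the classical one-covariate ANCOVA adjustment with a common slope.
    """
    def sums(xs, ys):
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        return x_bar, y_bar, sxx, sxy

    xf, yf, sxx_f, sxy_f = sums(svl_f, y_f)
    xm, ym, sxx_m, sxy_m = sums(svl_m, y_m)
    slope = (sxy_f + sxy_m) / (sxx_f + sxx_m)   # pooled within-group slope
    # Raw difference minus the part explained by the SVL difference.
    return (yf - ym) - slope * (xf - xm)
```

When one sex is larger overall, the raw and adjusted differences can even differ in sign, which is why the raw difference alone can mislead for variables other than SVL.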

Measurement Error Screening of Statistically Significant Results
For SVL, there are three additional aspects of the data that address the stability and reliability of statistically significant results for a test of sexual dimorphism: (1) measurement error, (2) maximum specimen sizes, and (3) corresponding size differences in the other variables. To evaluate the influence of measurement error on dimorphism test results, we use a simple index of measurement error calculated as the mean difference determined between males and females for the variable involved, divided by the mean measurement error as determined by the regression formulae in Table I. A value of 1 indicates that the degree of the measurement error is of the same magnitude as the observed mean differences. Values in the range of 0.7 or less indicate that the measurement error is much larger than the observed measurement differences between males and females [... Statistical Methods]. When SVLs differ between sexes, one would expect that overall size difference would be evidenced in ANOVA results with most or all of the other variables. When these three criteria are used to supplement and assess the robustness of the statistically significant results for SVL differences between males and females, the following species are considered not to demonstrate meaningful differences in SVL with the available data: L. troglodytes (measurement error index is equivocal; the maximum sizes of males and females are virtually identical; and 6 of 8 other variables do not differ in the ANOVA analyses, Table III) and Middle American pentadactylus (the measurement error index is very small; the maximum female size is larger than the maximum male size, but not impressively so; and 6 of 8 other variables do not differ in the ANOVA analyses, Table IV).
The results for the following two species are equivocal concerning whether the statistically significant differences in SVL are meaningful: L. bufonius (the measurement error is moderate; maximum size differences between males and females are small relative to the mean differences; and the ANOVA results for the other variables support the statistical results, Table III) and L. fuscus from Porto Velho (the measurement error index is borderline; the maximum size differences between males and females are small relative to the mean differences; [no data available for other variables], Table II).
To assess the biological implications of the ANCOVA results that are statistically significant, only one of the three criteria described above can be applied, namely the measurement error index. Using the measurement error index criterion, the following statistically significant results are not considered to be meaningful with the available data: L. bufonius foot length (Table III); L. fuscus thigh length and foot length (Table II); Middle American pentadactylus head length and head width (Table IV); and L. troglodytes head length, shank length, and foot length (Table III).

Effect Size as a Conveyor of Biological Meaning
Testing for statistical rejection of the null hypothesis is a necessary first step for scientific investigation, even though it provides little practical biological information about the parameters that demonstrate statistical significance. There are two additional problems to consider when testing for sexual dimorphism: (1) the magnitude of the difference that we are trying to detect (or define), and (2) the size of the sample. Clearly the column of raw mean differences contains values that are both sample and sample size dependent as well as being variable and noncomparable across species, localities or subgroups. Therefore, we require a method for comparing sexual differences that is "dimensionless" in the sense [...] The columns labeled "effect size" in Tables II-V contain values that are standardized and tell the researcher how much sexual difference actually exists. This measure quantifies the magnitude of the difference between the sexes. The division of the mean difference by the standard deviation standardizes the difference between the male and female means and puts the difference on a scale that is adjusted for the standard deviation of the measure. This produces the same result as when raw scores are converted to standard-normal or z-scores. Therefore, effect size can be used to compare results from studies on different species or genera, even when unequal sample sizes are involved.
Comparing columns for p-value and effect size in Tables II-V clarifies that a statistically significant test result can be obtained either (a) when sample size is excessive and effect size small, or (b) when sample size is small and effect size large. Thus, tests of sexual dimorphism with very large sample sizes can demonstrate statistical significance, yet the raw differences involved may be biologically trivial or meaningless. The ANCOVA results for head length (with SVL as the covariate) in the total sample of L. podicipinus provide a good example. The test result is statistically significant at the observed 0.001 level of probability (power of 0.93), yet the effect size for this variable is only 0.012 (Table II, Fig. 7A), a value so small that there is likely to be negligible biologically meaningful information in the population differences of head length between males and females. The second sample size problem is at the other end of the spectrum. When available sample sizes are small or not representative of the population as a whole, there can be interpretation problems. An example from our data is the difference in SVL of male and female L. pentadactylus. The ANOVA results for SVL are significant, with the mean size of females being 13.7 mm larger than the mean size for males. However, there are two other features of the data that militate against this result being considered biologically meaningful. First, the mean measurement error is large for L. pentadactylus, 14.9 mm, just exceeding the mean size difference between the sexes. Second, the largest male in the sample is 195.0 mm SVL, whereas the largest female is only 174.2 mm. A plausible explanation [...] It is reasonable to assume that younger males that are unable to oust resident males are more likely to be collected because they spend all their time on the forest floor. Thus, it is important to examine each statistically significant result to evaluate whether the results are biologically meaningful.
An effect size has other interpretations that make it superior to a p-value as an aid in evaluating sexual dimorphism.
First, effect size is the extent to which the populations of the two sexes do not overlap (Table VII). That is, if there were no overlap at all (or 100% non-overlap), then every single female would be larger than every single male, or vice versa. Surely we would agree to a conclusion of dimorphism at this level. The largest non-overlap value we observed [...] were large and the overlap wider than the difference between average SVL values, then the effect observed would not seem to be biologically important. An ES = 0 means that the male and female distributions completely overlap, indicating there is 0% non-overlap. With zero observed non-overlap (or 100% overlap), clearly the sexes could not be dimorphic. In Tables II and VII we find that with an observed ES = 0.017 there is less than 1% non-overlap of the populations of tympanum diameters of L. fuscus males and females. This result obviously speaks more directly to our question of sexual dimorphism than the p = 0.000 that resulted, which was based to a great extent upon the large sample sizes for male and female L. fuscus. Note here that, despite a widely held belief among many practitioners, it is clearly not true that the smaller the observed p-value the more dimorphism exists. Consideration of effect size illuminates this issue.
A second interpretation is that of an average percentile. For example, when ES = 0.2 (Table VII), this indicates that the mean of the males (females) is at the 15th percentile of the distribution of the females (males).
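The two interpretations above correspond to overlap measures described by Cohen (1977), which can be sketched under the assumption of two normal populations with equal standard deviations. The function names are hypothetical, and because Table VII is not reproduced in this section, these textbook formulas are offered only as an approximation and may not match its columns exactly.

```python
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal cumulative distribution function


def percent_non_overlap(d):
    """Cohen's U1: percent of the combined area not shared by the two distributions."""
    u2 = _PHI(abs(d) / 2)
    return 100 * (2 * u2 - 1) / u2


def average_percentile(d):
    """Cohen's U3: percent of the lower-mean group lying below the other group's mean."""
    return 100 * _PHI(abs(d))
```

At d = 0 the first function returns 0% non-overlap and the second returns the 50th percentile, consistent with the statement that an effect size of 0 means complete overlap of the male and female distributions.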
Finally, we can use an observed effect size value from our study as a comparative value with any range of effect size values defined specifically for frog species. That is, we can use frog-specific effect size values or ranges as a starting point for [...]
Cohen (1977) is the authority for the rationale underlying effect size usage in the behavioral sciences. In his seminal work, Cohen (1977:12) proposed: "...as a convention, ES [effect size] values to serve as operational definitions of the qualitative adjectives 'small', 'medium', and 'large'." He went on to clarify (p. 13): "Although arbitrary, the proposed conventions will be found to be reasonable by reasonable people. An effort was made in selecting these operational criteria to use levels of ES which (sic) accord with a subjective average of effect sizes such as are encountered in behavioral science. 'Small' effect sizes must not be so small that seeking them amidst the inevitable operation of measurement and experimental bias and lack of fidelity is a bootless task, yet not so large as to make them fairly perceptible to the naked observational eye. Many effects... are likely to be small effects as here defined, both because of the attenuation in validity of the measures employed and the subtlety of the issues frequently involved. In contrast, large effects must not be defined as so large that their quest by statistical methods is wholly a labor of supererogation, or to use Tukey's delightful term, 'statistical sanctification'. That is, the difference in size between apples and pineapples is of an order that hardly requires an approach via statistical analysis. On the other side, it cannot be defined so as to encroach on a reasonable range of values called medium."
Cohen's (1977) characterizations of small, medium, and large effect sizes have become the standards used subsequently in the behavioral and most other sciences. However, as early as 1982 Cohen and colleagues (Welkowitz et al. 1982:220) explicitly stated that their values defining small, medium, and large not be used as conventions "if you can specify [effect size] values that are appropriate to the specific problem or field of research." To our knowledge, conventions for small, medium, and large effect sizes have not been established for measurement data used to evaluate sexual dimorphism in frogs.
For ANOVA and ANCOVA, Cohen (1977:285-287) defined a small effect size as 0.10, a medium effect size as 0.25, and a large effect size as 0.40. The nature of our data indicates that it is inappropriate to use a single definition of effect size for all frog body measurement variables.
Sexual dimorphism of overall size, as reflected by SVL in our data, can fit into the category of "statistical sanctification" cited above. That is, in some species of frogs, the males are very much smaller than the females; no statistical analyses are necessary to demonstrate what is obvious from visual inspection. In order to know what such large effect size values would be, we included the data on Eleutherodactylus fenestratus, in which there is a gap, or no evidence of overlap, in the SVL measurements between the males and females. To be useful, an effect size should represent the smallest effect that would be of substantive (biological) significance to the researcher. That is, not every possible non-zero difference is important. For example, the mean raw or unstandardized difference for L. knudseni between the sexes' SVLs is only about 0.7 mm and the test result (at negligible 0.055 power, ES = 0.000) was not significant (Table IV). If one were to decide that a difference of about this magnitude could be important, then with the same means (132.05 and 131.37) and standard deviations (11.00 and 17.57), it would take about 18,800 specimens of L. knudseni to attain statistical significance. This sample size would provide about 80% power to detect such an 'important' difference. The selection of critical differences must have some better and more realistic basis than merely selecting any non-zero value that arises. Based on the range of standardized effect size values in our data (Fig. 6A), we propose that appropriate effect size conventions for evaluating hypotheses with SVL data are small = 0.20, medium = 0.45, and large = 0.70.
Effect size values for the ANCOVA results for the variables other than SVL extend over a much smaller range and would be expected to do so, since means are adjusted. In no case is a statistically significant ANCOVA effect obvious to the eye in the specimens themselves. To interpret effect size values for ANCOVA analyses, it is useful to examine the data visually over the range of values obtained in this study (Fig. 6B). Figure 7A shows an example for which large sample size induces a statistically significant result for a very small effect size, which can readily be interpreted as not having biological significance. The graphs for the largest ANCOVA effect sizes for our data (Fig. 7E, F) demonstrate differences that probably do have biological meaning. Given the small number of ANCOVA significant effect size results we have in our study, as a first approximation, we propose adopting Cohen's conventions, namely small = 0.10, medium = 0.25, large = 0.40 for the ANCOVA-based effect sizes.
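The sample-size reasoning in the L. knudseni example can be sketched with a normal-approximation power calculation. The sketch below assumes equal numbers of females and males, a root-mean-square pooled standard deviation, a two-sided test at alpha = 0.05, and 80% power. These choices and the function names are illustrative assumptions, so the result indicates the order of magnitude (thousands of specimens per sex) rather than reproducing the figure quoted above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def pooled_sd(sd_a: float, sd_b: float) -> float:
    """Root-mean-square of two standard deviations (assumes equal group sizes)."""
    return sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)


def n_per_group(mean_a: float, mean_b: float, sd_a: float, sd_b: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate specimens needed per sex for a two-sided two-sample z-test."""
    # Standardized mean difference between the sexes.
    d = abs(mean_a - mean_b) / pooled_sd(sd_a, sd_b)
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_beta = z.inv_cdf(power)               # quantile for the desired power
    return ceil(2.0 * ((z_alpha + z_beta) / d) ** 2)


# Means and standard deviations quoted in the text for L. knudseni SVL.
n = n_per_group(132.05, 131.37, 11.00, 17.57)
print("specimens per sex:", n, "total:", 2 * n)
```

Unequal numbers of females and males, or a t-based rather than z-based calculation, would change the answer, which is one reason the total printed here need not match the published estimate.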
We emphasize that the effect size characterizations we propose are just that: proposals. The actual characterizations should come from testing our proposals against multiple frog measurement data sets before being adopted as conventions.
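One way to apply the proposed conventions consistently is to label an observed effect size by the nearest proposed value. The anchor values below are those given in the text. The nearest-anchor rule, the dictionary, and the function name are hypothetical choices made only for illustration; the text does not prescribe a binning rule.

```python
# Anchor values proposed in the text. The nearest-anchor rule is an
# assumption made for this sketch, not a procedure stated by the authors.
CONVENTIONS = {
    "SVL": {"small": 0.20, "medium": 0.45, "large": 0.70},
    "ANCOVA": {"small": 0.10, "medium": 0.25, "large": 0.40},
}


def nearest_convention(effect_size: float, kind: str = "SVL") -> str:
    """Return the label whose proposed anchor value is closest to effect_size."""
    anchors = CONVENTIONS[kind]
    return min(anchors, key=lambda label: abs(anchors[label] - effect_size))


print(nearest_convention(0.385, "SVL"))     # SVL effect size reported for L. furnarius
print(nearest_convention(0.260, "ANCOVA"))  # eye-nostril distance in L. troglodytes
```

A label from this rule is only a starting point: a value well below the smallest anchor is still assigned "small", so it should be read together with the measurement error index and the plotted data.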

Comparison with Previously Published Results
Previous analyses of sexual dimorphism in the morphologically similar species L. bufonius and L. troglodytes indicated differences in sexual dimorphism in the majority of the measurement variables analyzed. Both species are stocky and short legged. In the previous study (Heyer 1978), L. bufonius demonstrated sexual dimorphism in SVL (females larger), head length (male heads longer), and head width (male heads wider), whereas L. troglodytes demonstrated sexual dimorphism in head length (male heads longer), shank length (male shanks longer), and foot length (male feet longer). In this study, both L. bufonius and L. troglodytes demonstrate statistically significant differences in female-male SVL, but SVL differences are considered not meaningful and cannot be demonstrated to be dimorphic for L. troglodytes with the available data.
The effect size for SVL in L. bufonius is 0.172, a small effect size as defined herein. Head length is not sexually dimorphic for L. bufonius as analyzed herein. Although head length is statistically significant for L. troglodytes in our results, it is considered to be not meaningful due to the large measurement error relative to the actual measurement differences between the sexes (measurement error index = 0.1). Head width is not statistically different between males and females in our results for L. bufonius. For both shank and foot lengths, the ANCOVA results are statistically significant for L. troglodytes but are considered not meaningful due to the large measurement errors relative to actual measurement differences in the available data (measurement error index = 0.0, 0.1 respectively). Foot length dimorphism in L. bufonius is statistically significant but is considered not meaningful, also due to large measurement errors relative to actual measurement differences (measurement error index = 0.6).
Leptodactylus furnarius and L. gracilis are both gracile, long-legged species, but are readily morphologically distinguishable from each other whereas L. bufonius and L. troglodytes are difficult at best to tell apart morphologically. In a previous study (Heyer 1978), L. furnarius (as L. laurae) demonstrated sexual dimorphism in SVL (females larger) and no dimorphism in thigh, shank, or foot length; L. gracilis demonstrated dimorphism only for head width (male heads longer) for the variables analyzed. Our results demonstrate statistically significant results solely for SVL in L. furnarius (females larger), with an effect size of 0.385, a medium effect size as defined herein.
There are two differences between the previous and current studies involving L. bufonius, L. furnarius, L. gracilis, and L. troglodytes. First, the earlier study employed t-tests and the analysis of ratio data for all variables other than SVL, while ANOVA (equivalent to t-test) for SVL and ANCOVA for untransformed variable data were used in this study. Second, the data sets analyzed herein are larger because measurement data were added for each species over the years between the studies. Given these differences, one would not expect the results to be the same between the studies. Overall, the statistically significant results between the studies are quite similar. The major differences between the studies lie in the variables considered to be biologically meaningful based on effect sizes and measurement error relative to the magnitude of the mean differences in the variables between females and males; in these terms, the results of the two studies are quite different.
Data for SVL, head length, head width, eye-nostril distance, tympanum diameter, thigh length, shank length, and foot length were analyzed previously for L. knudseni, L. pentadactylus, and Middle American pentadactylus (Heyer 2005). As for the Heyer (1994) study, the data were analyzed using t-tests and for all variables other than SVL, arcsine transformed ratio data were used. Although the arcsine is not the most appropriate transformation, almost identical effect size results obtain if the more appropriate log transformations or the untransformed ratios are used (Table VI) as we have mentioned. Sample sizes are identical for the previous and current analyses for these three species. In the previous study (Heyer 2005), L. knudseni demonstrated statistically significant differences only in head width (male heads wider); L. pentadactylus for SVL (females larger) and eye-nostril distance (male distances longer); and Middle American pentadactylus for SVL (females larger), head length (male heads longer), and head width (male heads wider). The results from this study are exactly the same for L. knudseni. The statistical results are the same for Middle American pentadactylus for all variables except head length and head width for which the opposite sex demonstrated the larger variable values (females with longer and wider heads in the results in this study); effect size values for both head length (0.005) and width (0.001) are negligible. Both sets of statistical results are the same for SVL in L. pentadactylus, but in this study there is no dimorphism for eye-nostril distance (ES = 0.002), while there is statistical support for shank length (ES = 0.224; female shank longer). The effect size for SVL dimorphism in L. pentadactylus is 0.160, a small effect size as defined herein, hence not biologically meaningful (also see discussion in Effect Size as a Conveyor of Biological Meaning). The statistically significant results for Middle American pentadactylus SVL (ES = 0.031), eye-nostril (ES = 0.028), and foot (ES = 0.030) in this study are considered to be biologically insignificant. As for the above previous study comparisons, the overall statistically significant results are again more similar between the studies than are the biologically significant results.

Interpretations of Effect Size values. Percent non-overlap: the percentage of the two groups' distributions that does not overlap, i.e., an Effect Size = 0.0 indicates that the distribution of the female measurement data totally overlaps that for the males (100% overlap, or 0% non-overlap). Percentile standing: the percentage of the female population data that the upper half of the male population data exceeds, i.e., an Effect Size = 0.0 indicates that the mean of the female data is at the 50th percentile of the male data and an Effect Size = 0.8 indicates that the mean of the females is at the 79th percentile of the male distribution.
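The two interpretations of effect size given in the table note (percent non-overlap and percentile standing) can be computed directly for a standardized mean difference. The sketch below assumes two normal distributions with equal variances and uses Cohen's U1 and U3 measures; the function names are illustrative, and the formulas apply to a d-type effect size rather than to a variance-explained measure.

```python
from statistics import NormalDist


def percentile_standing(d: float) -> float:
    """Cohen's U3: percentile of one group at which the other group's mean falls."""
    return 100.0 * NormalDist().cdf(d)


def percent_non_overlap(d: float) -> float:
    """Cohen's U1: percentage of the combined area not shared by the two groups."""
    p = NormalDist().cdf(abs(d) / 2.0)
    return 100.0 * (2.0 * p - 1.0) / p


for d in (0.0, 0.2, 0.5, 0.8):
    print(d, round(percentile_standing(d), 1), round(percent_non_overlap(d), 1))
```

With d = 0.0 the functions return the 50th percentile and 0% non-overlap, and with d = 0.8 the percentile standing rounds to the 79th percentile, in agreement with the values stated in the note.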

Biological Implications
As indicated in the introduction, one of the main interests in analyzing sexual dimorphism of measurement data in frogs is to gain insights into their biology. From the results discussed above, L. furnarius demonstrates sexual dimorphism in size (females larger) whereas L. gracilis does not demonstrate size dimorphism. Based upon our tests and supplemental methodology, these results are robust and most likely have an as yet undetermined biological explanation.
The lack of sexual dimorphism for SVL in L. knudseni is robust, whereas the results for dimorphism in SVL for Middle American pentadactylus and L. pentadactylus require further investigation. The lack of sexual dimorphism in size may relate to territorial defense and fighting as indicated by Shine (1979), since males are typically smaller than females in most species of frogs.
Some samples were included in this study to assess whether geographic variation may have a confounding effect when trying to understand sexual dimorphism for the species. Only SVL data were available for this aspect in L. fuscus (Table II). The sample size for Porto Velho is large enough that it almost certainly characterizes the range of SVL values for the species at that locality. The range of SVL values at Porto Velho is less than half the range for the species as a whole (Porto Velho male SVL range 34.2-43.7 mm, female range 34.3-44.2 mm; for entire species sample male SVL range 32.4-55.3 mm, female range 32.2-56.3 mm). Thus, there is meaningful geographic variation in SVL that exceeds the range of intra-population variation. For the entire species sample, SVL sexual dimorphism is not statistically significant. For the Porto Velho sample, sexual dimorphism in SVL is statistically significant but the effect size is so small (ES = 0.047) that it is biologically meaningless.
More data are available to address the problem of species versus population variation in sexual dimorphism for L. podicipinus. The sample from Porto Velho is large enough to characterize the range of measurement variables for the frogs at that site. In all cases, the sample for the entire species exceeds the ranges of values for the Porto Velho sample, but there is variation in the magnitude of the differences (Table VIII). The variation in SVL is meaningfully greater than the intra-population variation at Porto Velho, whereas the variation in tympanum diameter at Porto Velho is approximately equivalent to that observed in the data for the entire sample. Thus, the magnitude of geographic variation varies depending on the variable involved. Sexual dimorphism in SVL demonstrates similar results for the entire species sample and for each of the four individual locality samples: for the entire species sample the females are significantly larger with an effect size between small and medium (ES = 0.324); the four locality samples demonstrate that the females are also significantly larger with medium to large effect sizes (ES values range 0.403-0.697). Head length and width are both statistically greater in females in the entire species sample, with medium effect size (ES = 0.309, 0.277) and in each of the localities with medium to large effects (ES range 0.328-0.734). Despite the small sample size, eye-nostril distance is significantly different for the entire species sample with a medium-large effect (ES = 0.565). Data for eye-nostril distance are only available for the Curuçá specimens and achieve statistical significance with an ES = 0.408. Dimorphism in tympanum diameter is statistically significant for the entire species sample, but the effect size is small (ES = 0.174) and the individual localities mirror this result. The very limited data analyzed herein suggest that although there is geographic variation in the variables, the variables that show biologically meaningful sexual dimorphism show it not only for the individual locality samples but also for the entire species sample.
For variables other than SVL, those having at least medium effect sizes appear to be biologically meaningful (Fig. 7). The differences between male and female eye-nostril distances in L. troglodytes (Table III, Fig. 7F) are of a magnitude (ES = 0.260) that indicates some as yet unknown biological significance. The effect size differences between male and female shank lengths in Eleutherodactylus fenestratus (ES = 0.228) also appear to be biologically meaningful (Table V, Fig. 7E). In this case, a plausible biological explanation is that because the mass of females is much greater than the mass of males, females would require longer legs to jump distances similar to those of males.

Null Hypothesis Testing Versus Scientific Inference
The use of null hypothesis testing in the ecological literature is well established but the limitations of this approach are less well recognized. Emphases are placed on rejecting the null hypothesis and the size of the p-value rather than on the data and whether or not they support the scientific contention. Null hypothesis testing is not solely a dichotomous decision on whether to reject or not. It is also a procedure that gives the researcher a method for determining whether present sample sizes are adequate or need to be increased to demonstrate meaningful statistical results. Thus, this approach should only be used in circumstances where additional data can be obtained. Indeed, there are popular and widespread misinterpretations of these distinctions concerning statistical and scientific results. The two errors most commonly seen in the ecological and herpetological literature are: (1) believing that the p-value is the probability that the null hypothesis is actually true; and, (2) interpreting the p-value for hypothesis rejection as the probability that a substantive effect exists in the population (i.e., the smaller the p-value, the larger the biological effect).
An error of major import in herpetological studies is the equating of very small p-values with the existence of meaningful differences between the groups or species being compared (e.g., p = 0.0001 is a much more meaningful result than p = 0.0499). Although it is a necessary part of the quantitative evaluation of field results, the p-value alone cannot provide the researcher with this information. Even though the testing of the null hypothesis of no effect and the estimation of the size of an effect are closely related, there has been total reliance upon the former and lack of interest in the latter in the ecological literature (e.g., Mapstone 1995).
All basic statistical texts contain a section on the inter-relationships involved in a statistical test of hypothesis. We learn that type 1 error, type 2 error, power, and sample size are all related. This information, if considered at all for analysis of field research data, urges researchers to consider the power of the test (Hayes and Steidl 1997, Peterman 1990, Reed and Blaustein 1995, Taylor and Gerrodette 1993, Toft and Shea 1983, Yezerinac et al. 1992, Zielinski and Stauffer 1996). However, in the published literature dealing with sexual dimorphism, a null hypothesis of no difference between the sexes has been set up based upon the data obtained from those specimens that were observed or captured. The refrain is commonly heard that a predetermined sample size is useless in field research because we "obtain what we can" given time, funding and behavioral characteristics of the species of interest. There are two facets of power: (1) prospective power, which can be used to determine what samples should be used or if sample sizes should be increased, and (2) retrospective observed power, determined after data collection, which actually can confuse without providing insight. Because retrospective or observed power is clearly a decreasing function of the observed p-value, we need only one of these quantities. The p-value is clearly the better known and more easily calculated.
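The statement that observed power is a decreasing function of the observed p-value can be checked numerically. The sketch below assumes a two-sided z-test at alpha = 0.05 and treats the observed test statistic as if it were the true standardized effect; the function name is illustrative.

```python
from statistics import NormalDist


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Retrospective power implied by an observed two-sided p-value (z-test)."""
    z = NormalDist()
    z_obs = z.inv_cdf(1.0 - p_value / 2.0)   # |z| that yields this p-value
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)    # rejection threshold
    # Probability of rejecting if the true effect equalled the observed one.
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))
```

Under these assumptions a result that sits exactly at p = alpha has observed power of about one half, and larger p-values always map to lower observed power, so reporting both quantities adds no information.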

Null Hypothesis Testing Versus Data Exploration
Investigation into sexual dimorphism and its correlates is a common theme in amphibian and other literatures. Such research usually focuses on size differences, but distinctions of shape and patterns of adaptation, evolutionary, or ecological influences can be of interest as well. The statistical approach is determined by the specific question being framed within the research regardless of the number of foci being considered. In the present study our interest lies in the single unambiguous question "Is there a difference in the size of the chosen measure between males and females of a given species?" We therefore rely upon a test of the null hypothesis of no difference within a univariate model framework. Our investigations examine a single alternative hypothesis and its relationship to a standardized measure of the definable difference between the sexes for the given variable. Alternatively, many researchers desire to incorporate related influences into their study and thus advocate multivariate methods (e.g., Butler and Losos 2002). Such an approach requires consideration of measurement data that may need to be adjusted for morphological, phylogenetic or other concerns. That is, these adjustments are beyond the usual statistical methods. Also, many researchers employ exploratory data analytic techniques; that is, techniques that explore the data rather than test a hypothesis about the data (e.g., Butler and Losos 2002, multivariate general linear models). The present study is not intended as a generalized treatment of all of the possible questions involving dimorphism. Rather, it is a unified treatment of the most effective methods that amphibian workers may use to obtain a substantively or operationally significant answer to the single question of the existence of sexual dimorphism in size variables.

Conclusions and Recommendations
We consider the concept of "effect size" to be an instrument for the incorporation of biological meaning into the testing methodology as well as an interrelated factor in statistical hypothesis testing. A statistical significance test merges information on size of an effect observable in the data with information on the sample size. For this reason the p-value is not the correct device for evaluating the magnitude of frog population differences. Effect size is a scale-free and standardized measure of the relative magnitude of the effect of interest, in our case sexual dimorphism. Effect size and the ability to detect it are directly related. The larger the effect size, the easier it is to detect, as demonstrated by the Eleutherodactylus fenestratus data (Table V). Conversely, the smaller the dimorphism effect, the more difficult it is to demonstrate, as shown with the L. knudseni example. A larger sample size generally leads to parameter estimates with smaller variances resulting in a statistically significant difference for small effect sizes.
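The way a significance test merges effect size with sample size can be illustrated by holding a standardized difference fixed and varying the number of specimens. The sketch below is a normal approximation for two groups of equal size; the effect size of 0.05 and the sample sizes are hypothetical values chosen for illustration, not data from this study.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for a standardized difference d."""
    # With equal group sizes the test statistic grows as sqrt(n / 2).
    z = abs(d) * sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(z))


# The same very small (hypothetical) effect at increasing sample sizes.
for n in (100, 1000, 10000):
    print(n, round(two_sided_p(0.05, n), 4))
```

The effect is identical in every row; only the sample size changes whether the result crosses the conventional 0.05 threshold.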
Any frog study must be of an adequate sample size relative to the study's goals. Thus, the age-old problem arises of "what should be the sample size?" A statistically significant result can occur if either the effect size is very "big" (despite having a small sample), or, if the sample size is very "big" (despite a very small effect size). A review of Tables II-V shows that the sample should be big enough so that an effect size that is biologically interesting or important will be recognized as statistically significant. Consequently, we developed a range of ES values that have biological meaning for the species included in this study, and for other frog species as well. Use of these conventions will allow the researcher to evaluate power in past or future studies as well as to determine when the sample size for the study should be enlarged in a future project, rather than merely ignoring the results as "non-significant", or worse, accepting statistically significant results that are biologically trivial as biologically meaningful, and not pursuing the research any further. Sample size is important. An undersized study wastes valuable resources because it is not capable of producing useful results. A study that is too large requires greater resources and the cost benefit ratio is excessive.
In our quest for biological meaning or importance, we incorporate the concept of measurement error. It is well recognized that for many frog morphometric variables, measurement error is high (e.g., Hayek et al. 2001). Our measurement error index provides insight into how measurement error for each variable affects the ability to detect meaningful sexual dimorphism in the data.

Recommendations.
1. Studies of sexual dimorphism based on measurement data should rely on more than the dichotomous decision to either accept or reject the null hypothesis of no dimorphism made on an arbitrary number of specimens for which sample sizes cannot be increased (most museum-based specimen data).
2. The results of any hypothesis test for sexual or geographic dimorphism should be supplemented with information on measurement error for the morphological variable of interest. Interpretation of effect size and results for the entire variable may be problematic if measurement error is high. Results can be evaluated by use of our measurement error index.
3. We recommend the use of effect size as the primary statistic to evaluate sexual dimorphism in measurement data. Power has been suggested as the primary statistic to evaluate biological magnitude of statistical analyses. However, others say it is not worthwhile, or it is too complicated a factor to consider and report on without a pilot study. We avoid the argument by noting that observed power is a decreasing function of the observed p-value (the smaller the p-value, the greater the observed power). Thus, p-values can be used as a proxy for power for researchers who wish to compare power among studies.
4. Adequate sample size, relative to study goals, can be determined by use of effect size. The range of effect size values provided in this study will enable the researcher to determine a sample size large enough to garner statistically significant and biologically meaningful results. Alternatively, the effect size values will help the researcher identify sample sizes so large that a statistically detectable result is of no scientific importance.
5. Effect size information can be used for planning as well as synthesizing studies and their results. Use of an effect size with its confidence interval conveys the same information as the usual hypothesis test of significance, but the emphasis is on the significance of the effect or actual difference between the sexes rather than on the arbitrary sample size. Reporting and interpretation of effect sizes in addition to statistical test results is simple and more effective than other statistical approaches currently in use, particularly for field-based research that cannot be controlled experimentally.

APPENDIX I CALCULATION OF EFFECT SIZE VALUES
Cohen's original work was in the areas of psychology and education, which quite commonly deal with relationships between independent and dependent variables. In such cases an effect size is a standardized measure of the change in the dependent variable as explained by or as a result of change in the independent variable. Thus, standardization was first accomplished by dividing by σ = the standard deviation of the control or independent group. This allowed for the measurement of the effectiveness of the treatment with reference to the group not affected by that treatment.
We present results based upon general linear modeling with ANOVA and ANCOVA. Field and museum specimens used for sexual or geographic dimorphism study do not involve "control" and "treatment" or "experimental" groups. Therefore, we adjust the data used for the standard deviation in order to standardize our effect size. We desire to examine the difference between male and female specimens and relate this difference to, or standardize by, the within group dispersion. Selecting one of the two standard deviations would make an appreciable difference in our value of d, so we established a pooled estimate of the standard deviation. The formulae used were:
The use of this pooled estimate of the standard deviation depends on the assumption that the two calculated standard deviations are estimates of the same population value, or differ only with sampling variability. This of course is the null hypothesis.
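For reference, the usual textbook definitions that match this description of a standardized difference and a pooled standard deviation are given below. They are stated here as an assumption about the intended formulae; the subscripts F and M (females, males) and the symbol s_p are notation introduced only for this sketch.

```latex
\[
d = \frac{\bar{x}_F - \bar{x}_M}{s_p},
\qquad
s_p = \sqrt{\frac{(n_F - 1)\,s_F^{2} + (n_M - 1)\,s_M^{2}}{n_F + n_M - 2}}
\]
```

Here the numerator of d is the difference between the female and male sample means, and s_p weights each sex's sample variance by its degrees of freedom, which is appropriate only if the two variances estimate the same population value.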
It is advantageous to use the pooled standard deviation because there is an alternative method for calculation of effect size, easily computed from computerized printouts. For ANOVA and ANCOVA: SS_effect / (SS_effect + SS_error). Also called partial eta squared, this is the proportion of the effect plus error variance that is attributable to the effect of dimorphism. This is the quantity that we report in this paper.
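The ratio SS_effect / (SS_effect + SS_error) can be computed by hand for the simplest case, a one-way ANOVA with sex as the only factor, where partial eta squared coincides with eta squared. The sketch below uses a small set of made-up measurements purely for illustration; the function name and the data are hypothetical, and the ANCOVA case (with a covariate such as SVL) would need adjusted sums of squares from a linear model.

```python
def partial_eta_squared(females, males):
    """SS_effect / (SS_effect + SS_error) for a one-way, two-group ANOVA."""
    groups = (list(females), list(males))
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-sex sum of squares: group means around the grand mean.
    ss_effect = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-sex sum of squares: observations around their own group mean.
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return ss_effect / (ss_effect + ss_error)


# Hypothetical measurements in mm, not data from this study.
print(round(partial_eta_squared([10.0, 12.0, 14.0], [8.0, 10.0, 12.0]), 3))
```

Because the result is a proportion of variance, it runs from 0 to 1 and is on a different scale from the standardized difference d.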

Fig. 1 - Smallest and largest individuals used to determine measurement error, showing both the size differences involved and variation in quality of preservation. Above - Leptodactylus vastus, USNM 109144; below - Adenomera marmorata, USNM 209112.

Fig. 2 - Measurement error data for head width with linear smoother. All values in mm.

Fig. 3 - Measurement error data for SVL with quadratic smoother. All values in mm.

Fig. 4 - Comparison of magnitude of measurement errors for thigh length (left) and shank length (right). All values in mm.

Fig. 6 - Distribution of effect size values for SVL (A) and other variables combined (head length, head width, etc.) (B). Solid bars are biologically significant values. Open bars are biologically insignificant values. Open bar on left of B has been truncated to 40 occurrences for purposes of display; the actual value is 76.

ACKNOWLEDGMENTS
Drs. M. H. Heyer, P. E. Vanzolini, and G. R. Zug provided comments that were useful in clarifying our presentation. P. E. Vanzolini kindly prepared the Resumo. The research for this paper received support from the Smithsonian Institution, National Museum of Natural History, USA (Lee-Ann C. Hayek and W. Ronald Heyer) and the National Science Foundation, USA (award 9815787 to Rafael O. de Sá and W. Ronald Heyer).

TABLE II (continuation) L. podicipinus - ANOVA for Alejandra, Bolivia sample
L. podicipinus - ANCOVA for Alejandra, Bolivia sample
and females and that any statistically significant results are probably spurious. Values in the range of 2 or greater indicate that measurement error is not influencing variability of the effect size. Maximum size data have been addressed in preceding examples (Materials and Methods: Evaluation of Statis-

TABLE III (continuation) Leptodactylus furnarius - ANOVA
DISCUSSION
Small, Medium, and Large Effect Sizes for Sexual Dimorphism in Frogs

TABLE V Sexual dimorphism statistics for Eleutherodactylus fenestratus measurement data. Meas. = measurement. All mean difference values are positive; female values are greater than male values.
An Acad Bras Cienc (2005) 77 (1)

TABLE VI Comparison of ANOVA and ANCOVA analyses on raw and transformed data for Leptodactylus podicipinus, N = 528 females, 419 males for all variables except EN, N = 21 females and males. ANOVA - Raw data
df = degrees of freedom; EN = eye-nostril distance; H area = head area; HL = head length; HW = head width; sd = standard deviation; SVL = snout-vent length; TD = tympanum diameter.