Revista da Sociedade Brasileira de Medicina Tropical

Print version ISSN 0037-8682; On-line version ISSN 1678-9849

Rev. Soc. Bras. Med. Trop. vol.48 no.4 Uberaba July/Aug. 2015  Epub June 26, 2015

http://dx.doi.org/10.1590/0037-8682-0013-2015 

Short Communications

Description of continuous data using bar graphs: a misleading approach

Edson Zangiacomi Martinez 1  

1 Departamento de Medicina Social, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, São Paulo, Brasil.


ABSTRACT

INTRODUCTION:

With the ease provided by current computational programs, medical and scientific journals use bar graphs to describe continuous data.

METHODS:

This manuscript discusses the inadequacy of bar graphs for presenting continuous data.

RESULTS:

Simulated data show that box plots and dot plots are more suitable tools for describing continuous data.

CONCLUSIONS:

These plots are preferred to represent continuous variables since they effectively describe the range, shape, and variability of observations and clearly identify outliers. By contrast, bar graphs address only measures of central tendency. Bar graphs should be used only to describe qualitative data.

Key words: Biostatistics; Descriptive statistics; Medical research

Over the decades, many authors have used bar graphs to describe continuous data(1). The height of the bars in these graphs indicates a measure of central tendency (mean or median) of the data, while error bars describe a measure of dispersion (standard deviation) or precision (standard error). These graphs have become quite popular because current computer programs make them easy to produce. Despite their wide use, bar graphs are a misleading way to describe continuous data, and traditional tools such as box plots and dot plots are more suitable for this purpose. Bar graphs do not provide useful information about the behavior of data, such as skewness, range, and presence of atypical values (outliers). They only describe the position of the mean (or median) and the dispersion around this measure.

The box plot, also called box-and-whiskers plot, was introduced by the American mathematician John Wilder Tukey (1915-2000) as a practical method to describe groups of numerical data based on their quartiles and extreme values(2). When represented vertically, the box plot displays a rectangle (the box) whose base and top represent the position of the first (Q1) and third (Q3) quartiles, respectively. A band inside the rectangle describes the second quartile (the median). The height of the rectangle then represents the inter-quartile range (IQR), and can be interpreted as a measure of data spread. To complete the graph, two vertical lines connect the third quartile to the highest value and the first quartile to the lowest value. A practical method to detect potential outliers is to identify values above Q3 + 1.5 IQR or below Q1 - 1.5 IQR in the plot. In this case, outliers are represented by points (or other symbols), and the vertical lines connect the third quartile to the highest value below Q3 + 1.5 IQR and the first quartile to the lowest value above Q1 - 1.5 IQR. This is the standard form of a box plot, but since its introduction by Tukey, many alternative forms have been proposed(3) (4). The dot plot can be used to describe small samples, since the box plot requires a sample size of at least 5 to be adequate.
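The quantities described above can be computed directly. The following is a minimal sketch in Python (standard library only), not the code used for the figures in this article. The function name `box_plot_summary`, the sample values, and the choice of the "inclusive" quartile convention are assumptions made for illustration; statistical packages differ slightly in how they compute quartiles.

```python
import statistics


def box_plot_summary(values):
    """Return the quantities displayed by a standard box plot."""
    data = sorted(values)
    # Quartiles; "inclusive" is one of several common conventions (assumed here).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations that lie within the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    # Values beyond the fences are flagged as potential outliers.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical sample with one atypical value (40).
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 40]
summary = box_plot_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this hypothetical sample, the value 40 lies above Q3 + 1.5 IQR, so it is drawn as a separate point and the upper whisker stops at 21.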

For example, we simulated data on a continuous variable with different means, dispersion, and skewness among three groups. For each group, we simulated samples of size n = 30. In the first and second groups, the variable follows a normal distribution with population means 40 and 50, respectively, and standard deviations 8 and 6, respectively. In the third group, the variable follows an asymmetric gamma distribution with population mean 12.5. Figure 1A shows a bar graph with standard deviation bars describing these data (the height of the bars indicates the means), while Figure 1B and Figure 1C show box plots and dot plots, respectively, where lines overlapping the points represent the sample means. We note that the bar graph (Figure 1A) provides no information about the range of observed data (minimum and maximum values) or the presence of outliers. In addition, the bar graph cannot describe the shape of the data distribution. Evidence of data symmetry or non-symmetry and information about the presence of outliers are crucial to the choice of an appropriate statistical method of analysis. For example, analysis of variance (ANOVA) and t-tests involve statistics whose asymptotic distributions are well approximated by known probability density functions (such as Student's t or Snedecor's F). However, these approximations cannot be satisfactorily achieved when the data distribution is skewed(5), and the results obtained from these analyses can consequently be spurious. Outlier values can strongly influence the results of the analysis, given that they may have a drastic effect on the sample mean, especially when the sample size is small. In contrast, box plots (Figure 1B) and dot plots (Figure 1C) adequately describe the range of observations, satisfactorily present the shape of the data distribution, and clearly demonstrate the presence of outliers. The layout of Figure 1A is not the same as that of Figure 1B or Figure 1C.

Box plots and dot plots can easily be obtained with the aid of packages such as R, Stata, SPSS, or SAS. SAS and R require some knowledge of a programming language, whereas Stata and SPSS are user-friendly packages that allow a beginner to create these plots with relative ease. The figures in this article were prepared using R, software available free of charge at http://www.r-project.org/. The R code for drawing the graphs is omitted here, but is available from the author.

Figure 1: Data are shown for three simulated samples (n = 30) from normal (Groups 1 and 2) and gamma (Group 3) distributions. A: Bar graphs with standard deviation bars are inadequate. B: Box plots adequately describe the data distribution and highlight an outlier. C: Dot plots are also adequate to describe data. The horizontal lines in this graph represent the means. 
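A simulation of this kind can be sketched as follows. This is an illustrative Python sketch (standard library only), not the R code used to prepare Figure 1. The seed, the helper name `describe`, and the gamma parameters (shape 2, scale 6.25, which give a mean of 12.5) are assumptions, since the text states only the mean of the gamma distribution. The sketch prints the two numbers a bar graph with standard deviation bars conveys, followed by the additional values a box plot displays.

```python
import random
import statistics

rng = random.Random(2015)  # fixed seed (assumed) so the sketch is reproducible
n = 30

groups = {
    "Group 1": [rng.gauss(40, 8) for _ in range(n)],
    "Group 2": [rng.gauss(50, 6) for _ in range(n)],
    # Assumed shape = 2 and scale = 6.25, so that mean = shape * scale = 12.5.
    "Group 3": [rng.gammavariate(2, 6.25) for _ in range(n)],
}


def describe(values):
    """Summaries shown by a bar graph (mean, SD) and by a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


summaries = {name: describe(values) for name, values in groups.items()}
for name, s in summaries.items():
    print(name)
    print(f"  bar graph: mean = {s['mean']:.1f}, SD = {s['sd']:.1f}")
    print(f"  box plot : min = {s['min']:.1f}, Q1 = {s['q1']:.1f}, "
          f"median = {s['median']:.1f}, Q3 = {s['q3']:.1f}, max = {s['max']:.1f}")
```

Comparing the mean with the median, and the distances from the median to each extreme, gives a rough numerical indication of the skewness that the bar graph hides.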

A disadvantage of the box plot is that it cannot clearly describe the distribution of data with more than one mode. This situation is quite common when dealing with mixtures of two or more different populations. For example, the distribution of anthropometric data from a sample of both genders usually presents different shapes for men and women. Figures 2A, 2B, and 2C show, respectively, a box plot, a histogram, and a dot plot for simulated data from a variable that follows a mixture of two normal distributions with means 20 and 40 and standard deviations 3 and 5. The sample size was fixed at 30 and 20 for the first and second components, respectively. We note that the box plot fails to describe the bimodal distribution of data, while the histogram and dot plot are more suitable options to highlight the shape of the data. Dot plots also allow two or more groups to be compared in a single figure, which can be difficult when using a histogram. The display of two clouds of points (Figure 2C) exemplifies how this figure is capable of describing the behavior of data.

Figure 2: Box plots cannot clearly describe multimodal distributions. A: Box plot for a sample from a random variable that follows a mixture of two normal distributions. The bimodality is not visible in this graph. B: A histogram for these data. The bimodality is now visible in this graph. C: A dot plot for these data. The display of two clouds of points in this figure suggests a bimodal distribution. 
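The same point can be illustrated numerically. The sketch below (Python, standard library only) is an assumption-laden illustration rather than the code behind Figure 2: the seed and the bin width of 5 are arbitrary choices. It draws 30 values from a normal distribution with mean 20 and standard deviation 3 and 20 values from a normal distribution with mean 40 and standard deviation 5, then prints the quartiles used by a box plot followed by a text histogram.

```python
import random
import statistics
from collections import Counter

rng = random.Random(48)  # fixed seed (assumed) for reproducibility

# Mixture of two normal components: 30 values and 20 values.
mixture = ([rng.gauss(20, 3) for _ in range(30)]
           + [rng.gauss(40, 5) for _ in range(20)])

# The quartiles alone give no hint of two separate clusters.
q1, median, q3 = statistics.quantiles(mixture, n=4, method="inclusive")
print(f"box plot summary: Q1 = {q1:.1f}, median = {median:.1f}, Q3 = {q3:.1f}")

# Text histogram: count observations per bin of width 5 (assumed width).
bin_width = 5
counts = Counter(int(x // bin_width) * bin_width for x in mixture)
for start in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{start:3d} to {start + bin_width:3d} | {'*' * counts[start]}")
```

With components this far apart, the printed histogram will typically show two groups of bins separated by sparsely populated ones, a pattern that cannot be read from the three quartiles.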

Criticism of the use of bar graphs to describe continuous data can also be found in an article by Krzywinski and Altman (1), who argue that box plots are a rather more communicative way to show sample data. Strongly discouraging the use of bar plots with error bars, the authors state that this misleading visual approach has unfortunately been more widely used in the medical literature than have box plots. In addition, the bar itself reportedly encourages the visual perception that the respective mean is related to its height rather than the position of its top (1). Streit and Gehlenborg(6), too, provide useful commentaries on the use of bar graphs.

As discussed by Cumming et al.(7), some figures with error bars can, if used properly, give useful information about the data. These authors warn that it is necessary to distinguish between descriptive and inferential bars, given that bars may represent different quantities, such as confidence intervals, standard errors, standard deviations, or simply the spread between the extremes of the data. Descriptive bars address the variability of sample data, while inferential bars are related to the precision of the result. It is important to note that bars expressing standard deviations are not properly interpreted as error bars, since the standard deviation is a measure of the sample variability around the sample mean rather than a measure of precision in relation to the true value of the population mean. For these reasons, various authors(8) (9) have argued for the importance of including legends or subtitles on figures to describe the meaning of the bars. Other details about the adequate use of error bars are presented by Altman(10).
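The difference between descriptive and inferential bars can be made concrete with a short calculation. The following Python sketch (standard library only) is illustrative: the function name `bar_lengths` and the sample values are hypothetical, and the confidence interval uses a normal approximation as a simplifying assumption. For small samples, a quantile of Student's t distribution would give a somewhat wider interval.

```python
import math
import statistics


def bar_lengths(values, confidence=0.95):
    """Compute a descriptive bar (SD) and inferential bars (SE, CI) for a sample."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # descriptive: spread of the observations
    se = sd / math.sqrt(n)          # inferential: precision of the sample mean
    # Normal approximation (assumed); a t quantile is more appropriate for small n.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": se,
        "ci_low": mean - z * se,
        "ci_high": mean + z * se,
    }


# Hypothetical measurements.
sample = [38, 41, 44, 35, 47, 40, 39, 43, 36, 42]
bars = bar_lengths(sample)
print(f"mean = {bars['mean']:.2f}")
print(f"SD   = {bars['sd']:.2f}  (descriptive)")
print(f"SE   = {bars['se']:.2f}  (inferential)")
print(f"approx. 95% CI = ({bars['ci_low']:.2f}, {bars['ci_high']:.2f})")
```

Because the standard error equals the standard deviation divided by the square root of n, the two kinds of bars can differ greatly in length for the same data, which is why a figure legend should state which quantity is shown.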

In conclusion, the choice of an appropriate graphical tool for data description should not be made according to the convenience offered by computer programs or be influenced by the aesthetics of the figure. It is important that data visualization consider the accuracy of the information to be transmitted to the reader and provide an appropriate method to evaluate all important characteristics of the data distribution: range, shape, multimodality, variability, and presence of outliers. For these reasons, the use of bar plots with error bars to describe continuous data has no basis in medical studies and should be discouraged. Box plots and dot plots are still the best tools to present data of this type.

ACKNOWLEDGMENTS

I am grateful to the anonymous reviewers of this journal for their constructive comments and suggestions.

REFERENCES

1. Krzywinski M, Altman N. Visualizing samples with box plots. Nat Methods 2014; 11:119-120.

2. Tukey JW. Exploratory Data Analysis. Reading: Addison-Wesley Publishing Co; 1977.

3. McGill R, Tukey JW, Larsen WA. Variations of box plots. Am Stat 1978; 32:12-16.

4. Hintze JL, Nelson RD. Violin plots: a box plot-density trace synergism. Am Stat 1998; 52:181-184.

5. Bland JM, Altman DG. The use of transformation when comparing two means. BMJ 1996; 312:1153.

6. Streit M, Gehlenborg N. Bar charts and box plots. Nat Methods 2014; 11:117.

7. Cumming G, Fidler F, Vaux DL. Error bars in experimental biology. J Cell Biol 2007; 177:7-11.

8. Vaux DL. Error message. Nature 2004; 428:799.

9. Belia S, Fidler F, Williams J, Cumming G. Researchers misunderstand confidence intervals and standard error bars. Psychol Methods 2005; 10:389-396.

10. Altman DG. Statistics and ethics in medical research, VI. Presentation of results. Br Med J 1980; 281:1542-1544.

Received: January 13, 2015; Accepted: March 30, 2015

Corresponding author: Dr. Edson Zangiacomi Martinez. Deptº de Medicina Social/FMRP/USP. Av. Bandeirantes 3900, Monte Alegre, 14049-900 Ribeirão Preto, São Paulo, Brasil. Phone: 55 16 3602-2569 e-mail: edson@fmrp.usp.br

Conflict of interest

The author declares that there is no conflict of interest.

This is an open-access article distributed under the terms of the Creative Commons Attribution License.