## Overview #

A box and whisker plot (aka boxplot) is a way to show data distribution.

The dividing lines along the “box” part of a box and whisker plot typically represent the **median** (the middle observation in a sorted dataset), the **upper quartile** (the middle observation of the upper half of the dataset), and the **lower quartile** (the middle observation of the lower half of the dataset).

The “box” part spans what is known as the **interquartile range** of the dataset.

The “whiskers” usually extend out from the box by up to some multiple of the interquartile range, commonly 1.5x, and end at the most extreme observation within that distance.

**Outliers** beyond the extreme ends of the whiskers are typically represented as individual points.

For instance, in the sequence `1, 2, 5, 7, 9`:

- `5` is the median
- `7` is the upper quartile
- `2` is the lower quartile
- `1` is the end of one of the whiskers
- `9` is the end of the other whisker
- The interquartile range is the difference between 7 (the upper quartile) and 2 (the lower quartile), or 5 (7-2).
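The numbers above can be checked with a few base R functions. This is a minimal sketch: `fivenum()`, `IQR()`, and `boxplot.stats()` ship with base R, while the extra value `30` is made up here purely to show how a point beyond the whiskers gets flagged.

```r
x <- c(1, 2, 5, 7, 9)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five

# interquartile range: upper quartile minus lower quartile
iqr <- IQR(x)
iqr

# append a made-up extreme value and let boxplot.stats() apply the 1.5x rule
y <- c(x, 30)
stats_y <- boxplot.stats(y, coef = 1.5)
stats_y$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats_y$out    # observations that fall beyond the whiskers
```

With the extra value included, the upper whisker stops at the largest observation that is still within 1.5x the box height of the upper hinge, and `30` is reported separately as an outlier.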

Each group of data is shown within its own box and whisker block. On a single plot, there can be many groups of data shown.

### Advantages #

A box and whisker plot is extremely simple when compared to something like a histogram or a density plot.

In fact, the concept underlying a box and whisker plot lends itself well to simplification. Edward Tufte takes the simplification to an extreme by reducing the classic box and whisker plot further to a line-dot-line plot (*The Visual Display of Quantitative Information*, pp. 123-124), which he refers to as a **quartile plot**.

### Disadvantages #

Due to the simplification in representation of a box and whisker plot, a lot of the underlying detail is lost. This may be a bad thing depending on the context.
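As a small illustration of that loss of detail, the sketch below compares two made-up samples, `a` and `b`, that differ in size and in how often values repeat, yet share the same five-number summary. Their boxes and whiskers would therefore look identical.

```r
# two hypothetical samples with different sizes and shapes
a <- c(1, 2, 5, 7, 9)
b <- c(1, 2, 2, 2, 5, 7, 7, 7, 9)

fivenum(a)
fivenum(b)

# TRUE when both samples produce the same box and whiskers
same_summary <- isTRUE(all.equal(fivenum(a), fivenum(b)))
same_summary

# the repeated values in b are invisible in the summary
table(b)
```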

## Data #

At a minimum, a box and whisker plot requires one continuous numerical data field.

continuous |
---|
1 |
2 |
5 |
7 |
9 |

A discrete categorical variable can be added to enable the display of separate box and whisker plots for different groups of data.

continuous | group |
---|---|
1 | A |
2 | A |
5 | A |
7 | A |
9 | A |
2 | B |
4 | B |
5 | B |
8 | B |
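The grouped table above can be typed into R directly. This is a minimal sketch that assumes a plain `data.frame` is sufficient; the name `example_dat` matches the object used in the R section, although how that object was originally created is not shown in this article.

```r
# build the grouped example data as a base R data frame
example_dat <- data.frame(
  continuous = c(1, 2, 5, 7, 9, 2, 4, 5, 8),
  group = c(rep("A", 5), rep("B", 4))
)

# inspect the structure: one numeric column and one grouping column
str(example_dat)
```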

## R #

Box and whisker plots can be rendered in R using the base R language and with ggplot2.

### Base R #

In base R, a simple boxplot can be generated using the `boxplot(x, data)` command, where `x` refers to a formula that specifies what goes into the boxplot and is of the form `continuous~group`, and `data` refers to the source dataframe.

```
example_dat
```

```
## # A tibble: 9 × 2
## continuous group
## <dbl> <chr>
## 1 1 A
## 2 2 A
## 3 5 A
## 4 7 A
## 5 9 A
## 6 2 B
## 7 4 B
## 8 5 B
## 9 8 B
```

```
boxplot(continuous~group, data = example_dat)
```
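To see the numbers behind the picture, `boxplot()` can be called with `plot = FALSE`, in which case it returns the computed statistics instead of drawing. The sketch below rebuilds the example data as a plain data frame (an assumption, since the original object is a tibble) so that it runs on its own.

```r
# rebuild the example data so this snippet is self-contained
example_dat <- data.frame(
  continuous = c(1, 2, 5, 7, 9, 2, 4, 5, 8),
  group = c(rep("A", 5), rep("B", 4))
)

# compute the boxplot statistics without drawing anything
bp <- boxplot(continuous ~ group, data = example_dat, plot = FALSE)

bp$names  # group labels, one per box
bp$n      # number of observations in each group
bp$stats  # one column per group: whisker, hinge, median, hinge, whisker
bp$out    # any outliers (none in this small example)
```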

### ggplot2 #

The ggplot2 package can also be used to generate more refined box and whisker plots.

```
library(ggplot2)
```

A basic box and whisker plot using the synthetic data from above:

```
ggplot(example_dat) +
  geom_boxplot(
    aes(
      x = group,
      y = continuous
    )
  )
```
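The box edges in the ggplot2 version may not line up exactly with the base R version. The ggplot2 documentation notes that its hinges are the 25th and 75th percentiles, which differs slightly from the hinge method used by `boxplot()`. A minimal sketch of that difference, using only base R and the group B values from the example data:

```r
# group B values from the example data
b <- c(2, 4, 5, 8)

# hinges as used by base R's boxplot(), taken from fivenum()
hinges <- fivenum(b)[c(2, 4)]
hinges

# 25th and 75th percentiles from quantile() with its default settings
quartiles <- unname(quantile(b, c(0.25, 0.75)))
quartiles
```

With small or even-sized groups the two definitions can disagree, as they do here. With larger datasets the difference is usually negligible.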

Great! We now have a box and whisker plot in ggplot2, but that’s not really stretching the potential of the ggplot2 package. Let’s challenge ourselves a bit.

Let’s try making another, more sophisticated plot using the built-in sample `iris` dataset.

```
library(knitr)    # provides kable()
library(magrittr) # provides the %>% pipe
# generate a preview of the iris dataset, limited to 10 records
head(iris, 10) %>% kable()
```

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |

There’s a `Species` categorical field, and a few other continuous numerical fields. For simplicity, let’s pick one numerical field: `Sepal.Length`.

```
ggplot(data = iris) +
  geom_boxplot(
    aes(
      x = Species,
      y = Sepal.Length
    )
  )
```

I think we can do better.

```
ggplot(data = iris) +
  geom_boxplot(
    aes(
      x = Species,
      y = Sepal.Length,
      fill = Species # color the boxes by species
    )
  ) +
  coord_flip() + # turn it sideways
  labs( # give the plot some labels
    title = "Box and whisker plot of Iris Species",
    x = "Species",
    y = "Sepal Length"
  ) +
  theme(
    legend.position = "none" # remove the legend since it adds no useful information here
  )
```

Let’s enhance that even more by adding the individual data points. We’ll use the `geom_jitter()` function in ggplot2 for the points to give the point positions some random variation.

```
ggplot(
  data = iris,
  aes( # the aes() mappings moved from geom_boxplot() to ggplot() so that all layers, including geom_jitter(), share them
    x = Species,
    y = Sepal.Length
  )
) +
  geom_boxplot(
    aes(
      fill = Species # fill the boxes by species
    ),
    alpha = .5 # make the box and whisker plots semi-transparent
  ) +
  geom_jitter(
    aes(
      color = Species # color the points by species
    ),
    alpha = .9 # constant transparency is set outside aes() so it is not treated as a data mapping
  ) +
  coord_flip() + # turn it sideways
  labs( # give the plot some labels
    title = "Box and whisker plot of Iris Species",
    x = "Species",
    y = "Sepal Length"
  ) +
  theme(
    legend.position = "none" # remove the legend since it adds no useful information here
  )
```

This is still a fairly basic plot, but it’s much richer in detail than what we started with.
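One optional refinement to the jittered plot, sketched here under a few assumptions: by default `geom_jitter()` moves points along both axes, so the plotted `Sepal.Length` values are shifted slightly, and the layout changes on every render. Passing `position_jitter()` with `height = 0` and a `seed` keeps the measured values exact and makes the plot reproducible. The width `0.2`, the seed `42`, and the choice to hide the boxplot's own outlier markers (so those points are not drawn twice) are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(outlier.shape = NA) + # hide boxplot outlier markers; the jittered points already show them
  geom_jitter(
    # spread points only across the Species axis, and fix the random seed
    position = position_jitter(width = 0.2, height = 0, seed = 42),
    alpha = 0.6
  ) +
  coord_flip() # turn it sideways, as in the plots above

p
```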