(This article was originally published at Simply Statistics, also syndicated at StatsBlogs.)

I am co-teaching a data science course at Johns Hopkins with John Muschelli. I gave the lectures on EDA and he just gave a lecture on how to create an “expository graph”. When we teach the course, an exploratory graph is the kind of graph you make for yourself just to try to understand a data set. An expository graph is one where you are trying to communicate information to someone else.

When you are making an exploratory graph it’s usually simple, with no axes, legends, colors, or other attempts to make it clear, understandable, and pretty. John has a great blog post on how to build up an expository figure.

Recently I gave a talk at McGill University and had to create a plot for the talk. I figured one more example couldn’t hurt, so I thought I’d go through my process here.

I wanted to show the data in the tidypvals package. So first I loaded the data:

```
library(tidypvals)
library(ggridges)
```

`## Loading required package: ggplot2`

`library(dplyr)`

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
library(ggplot2)
library(forcats)
data(allp)
```

I knew I wanted to use the ggridges package, so I read the docs and began with the simplest version:

```
allp %>%
  ggplot(aes(x = pvalue, y = field)) +
  geom_density_ridges()
```

`## Picking joint bandwidth of 0.00413`

Right away I saw there were several problems here. First of all, a p-value greater than one should not be in there, so that was an error. I also don’t like that you can’t really see the p-values because most of the action is near zero.
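A quick range check makes that kind of problem easy to spot before plotting. This is a minimal sketch on a made-up vector standing in for `allp$pvalue`; the values are invented for illustration and the variable names are hypothetical.

```r
# Hypothetical stand-in for allp$pvalue (values invented for illustration).
pvalue <- c(0.001, 0.049, 0.05, 0.2, 1.3, NA)

# A valid p-value lies in [0, 1]. Comparisons with NA give NA,
# and which() drops those, so only real violations are returned.
bad <- which(pvalue < 0 | pvalue > 1)
n_bad <- length(bad)

# Inspect the offending values.
pvalue[bad]
```

On the real data the same check could be written against the `pvalue` column, assuming it is numeric.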

So let’s fix the x-axis a bit. I spent a few minutes fiddling and decided I only wanted to see the p-values between 0 and 0.25.

```
allp %>%
  ggplot(aes(x = pvalue, y = field)) +
  geom_density_ridges() +
  xlim(c(0,0.25))
```

`## Picking joint bandwidth of 0.00401`

```
## Warning: Removed 359521 rows containing non-finite values
## (stat_density_ridges).
```
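The warning appears because `xlim()` sets the scale limits, and ggplot2 treats values outside those limits as missing before the density is computed. A rough way to see how many rows a given limit would drop is to count them directly. The sketch below uses simulated values, not the real `allp` data, and all names in it are hypothetical.

```r
# Simulated p-values (not the real data): 900 inside the window, 100 outside.
set.seed(42)
pvalue <- c(runif(900, 0, 0.25), runif(100, 0.25, 1))

# Rows that fall outside the limits passed to xlim(c(0, 0.25)).
outside <- pvalue < 0 | pvalue > 0.25
n_removed <- sum(outside)
n_removed
```

If the goal is only to zoom in without discarding data, `coord_cartesian(xlim = c(0, 0.25))` is the usual ggplot2 alternative, though the density would then be estimated from the full range of values.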

Okay, that is better, but I don’t really like the grey background, so let’s pick a different theme:

```
allp %>%
  ggplot(aes(x = pvalue, y = field)) +
  geom_density_ridges() +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Picking joint bandwidth of 0.00401`

```
## Warning: Removed 359521 rows containing non-finite values
## (stat_density_ridges).
```

That is a bit prettier, but we see that the field is sometimes NA, so we need to remove those values.

```
allp %>%
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field)) +
  geom_density_ridges() +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Picking joint bandwidth of 0.00404`

```
## Warning: Removed 349629 rows containing non-finite values
## (stat_density_ridges).
```

And actually the density plots are a little weird for p-values. Let’s see if we can turn them into something a bit more like a histogram, which I think fits this type of data better. To do this we need to change the parameters of `geom_density_ridges`.

```
allp %>%
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field)) +
  geom_density_ridges(stat = "binline") +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## 'stat_binline()' using 'bins = 30'. Pick better value with 'binwidth'.`

`## Warning: Removed 349629 rows containing non-finite values (stat_binline).`

Okay, but I think it would look better with more bins, so let’s up the number of bins:

```
allp %>%
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field)) +
  geom_density_ridges(stat = "binline",bins=50) +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Warning: Removed 349629 rows containing non-finite values (stat_binline).`

Okay, but as folks have pointed out, the spike at 0.05 is due to censoring (p-values reported as \(P < 0.05\)). So let’s break it down by operator.

```
allp %>%
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
  geom_density_ridges(stat = "binline",bins=50) +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Warning: Removed 349629 rows containing non-finite values (stat_binline).`
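One way to check the censoring explanation without a plot is to tabulate the operator for the values sitting exactly at 0.05. This is only a sketch on a tiny invented data frame. The column names mirror the ones used in the plot, and the `"equals"` label is an assumption about how exactly-reported p-values are coded.

```r
# Invented records for illustration; not the real allp data.
pvals <- data.frame(
  pvalue   = c(0.05, 0.05, 0.05, 0.012, 0.05, 0.2),
  operator = c("lessthan", "lessthan", "equals",
               "equals", "lessthan", "greaterthan")
)

# Keep only the rows at the 0.05 cutoff and count them by operator.
at_cutoff <- pvals[pvals$pvalue == 0.05, ]
counts <- table(at_cutoff$operator)
counts
```

If most of the rows at the cutoff carry the less-than operator, the spike reflects how the values were reported rather than where the exact p-values fall.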

Okay, there aren’t that many greater-than p-values and they make the plot cluttered, so let’s drop those:

```
allp %>%
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
  geom_density_ridges(stat = "binline",bins=50) +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Warning: Removed 332965 rows containing non-finite values (stat_binline).`

The histograms overlap a bit so let us alpha blend the colours.

```
allp %>%
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
  geom_density_ridges(stat = "binline",
                      bins=50,alpha=0.25) +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Warning: Removed 332965 rows containing non-finite values (stat_binline).`

There’s some funkiness in how the histogram bins are calculated, so I went to the internet and figured out I had to set the boundary at 0 and make the bins closed on the right side.

```
allp %>%
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
  geom_density_ridges(stat = "binline",
                      bins=50,alpha=0.25,
                      boundary=0,closed="right") +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE)
```

`## Warning: Removed 332965 rows containing non-finite values (stat_binline).`
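What `boundary = 0` and `closed = "right"` change is which bin a value sitting exactly on a bin edge falls into. The base R sketch below mimics that with `cut()`, assuming bins of width 0.005 that start at 0. That width is an assumption for illustration, and the exact breaks ggplot2 picks may differ.

```r
# Assumed bin edges: 0, 0.005, ..., 0.25 (50 bins starting at a boundary of 0).
# Dividing integers keeps edges such as 0.05 exactly representable as written.
breaks <- (0:50) / 200

# A p-value reported at exactly 0.05 sits on a bin edge.
x <- 0.05

# Right-closed bins, (a, b]: 0.05 joins the bin that ends at 0.05.
right_closed <- cut(x, breaks = breaks, right = TRUE)

# Left-closed bins, [a, b): 0.05 joins the bin that starts at 0.05.
left_closed <- cut(x, breaks = breaks, right = FALSE)

as.integer(right_closed)
as.integer(left_closed)
```

With right-closed bins the values reported as 0.05 are counted together with the values just below the cutoff, which is the behavior the plot above asks for.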

We can use the expand argument to make sure there isn’t wasted space on the y-axis.

```
allp %>%
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
  geom_density_ridges(stat = "binline",
                      bins=50,alpha=0.25,
                      boundary=0,closed="right") +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE) +
  scale_y_discrete(expand=c(0,0))
```

`## Warning: Removed 332965 rows containing non-finite values (stat_binline).`

Remove the baseline from the plot for authentic ggridges coolness:

```
allp %>%
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
  geom_density_ridges(stat = "binline",
                      bins=50,alpha=0.25,
                      boundary=0,closed="right",
                      draw_baseline=FALSE) +
  xlim(c(0,0.25)) +
  theme_ridges(grid = FALSE) +
  scale_y_discrete(expand=c(0,0))
```

`## Warning: Removed 332965 rows containing non-finite values (stat_binline).`

That is definitely not a perfect plot, but it worked for the talk and was at least able to convey a few of the main points (variation by discipline, variation by operator, and spikes at critical values).

If I were going beyond the talk, I would increase the size of the plot or decrease the number of fields displayed. I would make the bin width smaller and I’d add a title. I would also probably tidy up “greaterthan” and “lessthan” to be “Greater than” and “Less than”.
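The label cleanup could be done before plotting by recoding the operator as a factor with readable labels. This is a minimal base R sketch on a made-up vector. The `"equals"` level and its “Equals” label are assumptions, and the variable names are hypothetical.

```r
# Made-up operator values standing in for the operator column.
operator <- c("lessthan", "equals", "greaterthan", "lessthan")

# Map the raw codes to display labels; the level order sets the legend order.
operator_label <- factor(
  operator,
  levels = c("lessthan", "equals", "greaterthan"),
  labels = c("Less than", "Equals", "Greater than")
)

as.character(operator_label)
```

In the plotting pipeline the recoded column would be mapped to `fill`, and a title could be added with `labs(title = ...)`.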

Regardless, I’m constantly surprised by how much work it takes to go from an exploratory plot I’m only looking at myself to one I’d show to other people.

The post Making an expository chart for a talk appeared first on All About Statistics.