Establishing an graph for a talk

I am co-teaching a data science course at Johns Hopkins using John Muschelli. I gave the lectures on EDA and that he simply gave a lecture on how best to make an “expository graph”. When we instruct the course a exploratory graph is the kind of chart you create for yourself only to try and understand a data collection. An expository chart is one where you are attempting to communicate data to someone else.

When you are creating an abysmal graph it’s generally easy, with no axes, legends, elaborate colours, or other effort to ensure it is fairly, understandable and transparent. John has a great blog post on how best to build up a figure that’s expository.

Lately I gave a discussion at McGill University and needed to make a plot for the talk. I figured one more example is always better for everything, so that I thought I would go through my process.

I wanted to show the p-value distribution in the tidy-pvals package. So I loaded with the data:

library(tidypvals)
Library(ggridges)
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:bottom':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
Library(forcats)
data(allp)

I knew I wanted to use the package so I read the docs and began using the easiest version:

Allp % >% 
Ggplot(aes(x = pvalue( y = field)) +
    geom_density_ridges()
## Deciding joint bandwidth 0.00413

Right off I saw there were some problems here. To start with, obviously a p-value greater than one should not be in there, so that was a error. I also don’t like you can’t really see the values because the majority of the action is near zero.

So let’s correct the x-axis a little. I spent a few minutes fiddling and decided that I only wanted to see the values between 0 and 0.25.

Allp % >% 
Ggplot(aes(x = pvalue( y = field)) +
    geom_density_ridges() + 
  xlim(c)(0,0.25))
## Selecting joint bandwidth 0.00401
## Warning: Removed 359521 rows comprising non-finite values
## (stat_density_ridges).  

Ok that’s better, however I do not really like the grey background so let’s select a different background color

Allp % >% 
Ggplot(aes(x = pvalue( y = field)) +
    geom_density_ridges() + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Selecting joint bandwidth 0.00401
## Warning: Removed 359521 rows comprising non-finite values
## (stat_density_ridges).  

That is a little prettier, but we see that field is occasionally NA so we will need to remove those values.

Allp % >% 
Filter(! Is.na(field)) %>%
  ggplot(aes(x = pvalue( y = field)) +
    geom_density_ridges() + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Selecting joint bandwidth 0.00404
## Warning: Removed 349629 rows comprising non-finite values
## (stat_density_ridges).  

And actually the density plots are a little weird for p-values, lets see if we can turn them in to something a bit more like a histogram, which I think fits this data type much better. To do that we have to alter the parameters at geom_density_ridges.

Allp % >% 
Filter(! Is.na(field))  g>%
  ggplot(aes(x = pvalue( y = field)) +
    geom_density_ridges(stat = "binline") + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## 'stat_binline()' using 'bins = 30'.  Pick at better value using 'binwidth'.  
## Warning: Removed 349629 rows comprising non-finite values (stat_binline).  

Ok but I think it might seem when it had been a little bit higher resolution, let’s up the Amount of bins

Allp % >% 
Filter(! Is.na(field))  g>%
  ggplot(aes(x = pvalue( y = field)) +
    geom_density_ridges(stat = "binline",bins=50) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Warning: Removed 349629 rows comprising non-finite values (stat_binline).  

Ok but as people have pointed out that the spike at 0.05 is because of censoring (p-values reported such as \(P < 0.05\)). So let’s break it down from operator.

Allp % >% 
Filter(! Is.na(field))  g>%
  ggplot(aes(x = pvalue( y = field(fill=operator)) +
    geom_density_ridges(stat = "binline",bins=50) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Warning: Removed 349629 rows comprising non-finite values (stat_binline).  

Ok there are not that many larger than p-values also it makes the plot messy so let’s drop those

Allp % >% 
Filter(! Is.na(field)) % >%
  blocker(operator !) = "greaterthan") %>%
  ggplot(aes(x = pvalue( y = field(fill=operator)) +
    geom_density_ridges(stat = "binline",bins=50) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Warning: Removed 332965 rows comprising non-finite values (stat_binline).  

The histograms overlap a little so let’s alpha mix the colours.

Allp % >% 
Filter(! Is.na(field)) % >%
  blocker(operator !) = "greaterthan") %>%
  ggplot(aes(x = pvalue( y = field(fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50(alpha=0.25) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Warning: Removed 332965 rows comprising non-finite values (stat_binline).  

There is some funkiness in the way in which the histogram bins have been computed so that I went to the internet and figured out I needed to set the boundary at 0 and also make the bins be shut on the best.

Allp % >% 
Filter(! Is.na(field)) % >%
  blocker(operator !) = "greaterthan") %>%
  ggplot(aes(x = pvalue( y = field(fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25,
                        boundary=0,shut="right") + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)
## Warning: Removed 332965 rows comprising non-finite values (stat_binline).  

Now we make certain that there isn’t wasted space on the y-axis by making use of the expand argument.

Allp % >% 
Filter(! Is.na(field)) % >%
  blocker(operator !) = "greaterthan") %>%
  ggplot(aes(x = pvalue( y = field(fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25,
                        boundary=0,shut="right") + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE) + 
  scale_y_discrete(expand=c(0,0))
## Warning: Removed 332965 rows comprising non-finite values (stat_binline).  

Eliminate the baseline out of the plot for true ggridges coolness

Allp % >% 
Filter(! Is.na(field)) % >%
  blocker(operator !) = "greaterthan") %>%
  ggplot(aes(x = pvalue( y = field(fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25,
                        boundary=0,shut="right",
                        draw_baseline=FALSE) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE) + 
  scale_y_discrete(expand=c(0,0))
## Warning: Removed 332965 rows comprising non-finite values (stat_binline).  

That is definitely not an ideal plot, however, it worked for my discussion and was able to convey a couple of the key things (about variation by discipline, variant by operator, and also spikes at significant values).

If I had been moving past the conversation I would likely lessen the number of fields displayed or actually increase the dimensions of the plot. I would probably create the bin width even smaller and I would add a title. I would also likely clean up the “greaterthan” and also “lessthan” to become “Greater than” and “Less than”.

Regardless, I am always surprised just how much work it requires to go in the exploratory plot I am only looking at myself to you I would show to other people.