across and summary tables
By Jen Richmond in dplyr across tables
February 26, 2021
Thanks to
Tom Mock for saving me from this blogpost on twitter. I was monster number 6… beginning to convince myself that there were gremlins in the across()
function.
In fact, I had forgotten the golden rule:
don’t make objects that have the same names a functions
That is for another post…
the across() function
I have been playing with the new across()
function in dplyr
which is a magical solution when you are looking to get several summary statistics across a range of variables. BUT…. working out how to get those summary stats nicely formatted in a table is a bit of a challenge.
load packages
library(tidyverse)
library(palmerpenguins)
library(skimr)
library(tableone)
library(gt)
get data
Here I am using data from the palmer penguins package.
penguins <- penguins
the goal
df <- tibble(penguin_bits = c("bill", "flipper"),
mean = c(43.92, 200.92),
sd = c(5.46, 14.06),
min = c(32.1, 172),
max = c(59.6, 231))
df %>%
gt()
penguin_bits | mean | sd | min | max |
---|---|---|---|---|
bill | 43.92 | 5.46 | 32.1 | 59.6 |
flipper | 200.92 | 14.06 | 172.0 | 231.0 |
I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. But this manual dataframe construction is a bit silly and not at all reproducable.
Is there an easy way to get summary stats in this format so they are compatible with the gt package?
create summary stats the old way
penguins %>%
summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
bill_sd = sd(bill_length_mm, na.rm = TRUE),
bill_min = min(bill_length_mm, na.rm = TRUE),
bill_max = max(bill_length_mm, na.rm = TRUE),
flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
flipper_min = min(flipper_length_mm, na.rm = TRUE),
flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
gt()
bill_mean | bill_sd | bill_min | bill_max | flipper_mean | flipper_sd | flipper_min | flipper_max |
---|---|---|---|---|---|---|---|
43.92193 | 5.459584 | 32.1 | 59.6 | 200.9152 | 14.06171 | 172 | 231 |
The summarise function spits out summary stats in a SUPER wide format.
create summary stats using across()
The new across()
function gets you the same thing in many fewer lines of code, but still, it is VERY wide.
penguins %>%
summarise(across(contains("length"),
list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
gt()
bill_length_mm_mean | bill_length_mm_sd | bill_length_mm_min | bill_length_mm_max | flipper_length_mm_mean | flipper_length_mm_sd | flipper_length_mm_min | flipper_length_mm_max |
---|---|---|---|---|---|---|---|
43.92193 | 5.459584 | 32.1 | 59.6 | 200.9152 | 14.06171 | 172 | 231 |
skimr
The skimr package is nice because it is tidyverse compatible (i.e. you can select with contains) AND the output can be a dataframe, which you can then edit.
skimtable <- penguins %>%
select(contains("length")) %>%
skim()
skimtable
Name | Piped data |
Number of rows | 344 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Table 1: Data summary
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
For my purpose it is more than I need though. There is some selecting and renaming to do after the fact.
skimtable_renamed <- skimtable %>%
select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100)
You do end up with a gt compatible dataframe.
skimtable_renamed %>%
gt()
penguin_bits | mean | sd | min | max |
---|---|---|---|---|
bill_length_mm | 43.92193 | 5.459584 | 32.1 | 59.6 |
flipper_length_mm | 200.91520 | 14.061714 | 172.0 | 231.0 |
tableone package
The TableOne package gives you mean and SD, is there a way to add other summary stats (like min & max to TableOne?) And make the formatted nicer?
variables <- c("bill_length_mm", "flipper_length_mm")
CreateTableOne(vars = variables, data = penguins)
##
## Overall
## n 344
## bill_length_mm (mean (SD)) 43.92 (5.46)
## flipper_length_mm (mean (SD)) 200.92 (14.06)
summarise + across + pivot ?
Ideally, I want to be able to use summarise()
and across()
and somehow make the wide output long. The problem is that pivot_longer()
will take more than 1 “names_to” argument, but not more than a single “values_to” argument.
I would like to be able to pivot wide summary stats long like this…
penguins %>%
summarise(across(contains("length"),
list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")
… but not sure if that is possible.
- Posted on:
- February 26, 2021
- Length:
- 18 minute read, 3720 words
- See Also:
- joining dataframes
- recoding variables