across and summary tables

February 26, 2021

Thanks to Tom Mock for saving me from this blogpost on twitter. I was monster number 6… beginning to convince myself that there were gremlins in the across() function.

In fact, I had forgotten the golden rule:

don’t make objects that have the same names a functions

That is for another post…

the across() function

I have been playing with the new across() function in dplyr which is a magical solution when you are looking to get several summary statistics across a range of variables. BUT…. working out how to get those summary stats nicely formatted in a table is a bit of a challenge.

load packages

library(tidyverse)
library(palmerpenguins)
library(skimr)
library(tableone)
library(gt)

get data

Here I am using data from the palmer penguins package.

penguins <- penguins

the goal

df <- tibble(penguin_bits = c("bill", "flipper"),
             mean = c(43.92, 200.92), 
             sd = c(5.46, 14.06), 
             min = c(32.1, 172), 
             max = c(59.6, 231))

df %>%
  gt()

penguin_bits	mean	sd	min	max
bill	43.92	5.46	32.1	59.6
flipper	200.92	14.06	172.0	231.0

I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. But this manual dataframe construction is a bit silly and not at all reproducable.

Is there an easy way to get summary stats in this format so they are compatible with the gt package?

create summary stats the old way

penguins %>%
  summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
            bill_sd = sd(bill_length_mm, na.rm = TRUE),
             bill_min = min(bill_length_mm, na.rm = TRUE),
             bill_max = max(bill_length_mm, na.rm = TRUE), 
             flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
            flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
             flipper_min = min(flipper_length_mm, na.rm = TRUE),
             flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
  gt()

bill_mean	bill_sd	bill_min	bill_max	flipper_mean	flipper_sd	flipper_min	flipper_max
43.92193	5.459584	32.1	59.6	200.9152	14.06171	172	231

The summarise function spits out summary stats in a SUPER wide format.

create summary stats using `across()`

The new across() function gets you the same thing in many fewer lines of code, but still, it is VERY wide.

penguins %>%
  summarise(across(contains("length"),
        list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
  gt()

bill_length_mm_mean	bill_length_mm_sd	bill_length_mm_min	bill_length_mm_max	flipper_length_mm_mean	flipper_length_mm_sd	flipper_length_mm_min	flipper_length_mm_max
43.92193	5.459584	32.1	59.6	200.9152	14.06171	172	231

skimr

The skimr package is nice because it is tidyverse compatible (i.e. you can select with contains) AND the output can be a dataframe, which you can then edit.

skimtable <- penguins %>%
  select(contains("length")) %>%
  skim()

skimtable


Name	Piped data
Number of rows	344
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Table 1: Data summary

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂

For my purpose it is more than I need though. There is some selecting and renaming to do after the fact.

skimtable_renamed <- skimtable %>%
  select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
  rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100)

You do end up with a gt compatible dataframe.

skimtable_renamed %>%
  gt()

penguin_bits	mean	sd	min	max
bill_length_mm	43.92193	5.459584	32.1	59.6
flipper_length_mm	200.91520	14.061714	172.0	231.0

tableone package

The TableOne package gives you mean and SD, is there a way to add other summary stats (like min & max to TableOne?) And make the formatted nicer?

variables <- c("bill_length_mm", "flipper_length_mm")

CreateTableOne(vars = variables, data = penguins)

##                                
##                                 Overall       
##   n                                344        
##   bill_length_mm (mean (SD))     43.92 (5.46) 
##   flipper_length_mm (mean (SD)) 200.92 (14.06)

summarise + across + pivot ?

Ideally, I want to be able to use summarise() and across() and somehow make the wide output long. The problem is that pivot_longer() will take more than 1 “names_to” argument, but not more than a single “values_to” argument.

I would like to be able to pivot wide summary stats long like this…

penguins %>%
  summarise(across(contains("length"),
        list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
      pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")

… but not sure if that is possible.

Posted on:: February 26, 2021

Length:: 18 minute read, 3720 words

Categories:: dplyr across tables

Tags:: dplyr across tables

See Also:: joining dataframes; recoding variables