across and summary tables

By Jen Richmond in dplyr across tables

February 26, 2021

Thanks to Tom Mock for saving me from this blogpost on twitter. I was monster number 6… beginning to convince myself that there were gremlins in the across() function.

In fact, I had forgotten the golden rule:

don’t make objects that have the same names a functions

That is for another post…

the across() function

I have been playing with the new across() function in dplyr which is a magical solution when you are looking to get several summary statistics across a range of variables. BUT…. working out how to get those summary stats nicely formatted in a table is a bit of a challenge.

load packages

library(tidyverse)
library(palmerpenguins)
library(skimr)
library(tableone)
library(gt)

get data

Here I am using data from the palmer penguins package.

penguins <- penguins

the goal

df <- tibble(penguin_bits = c("bill", "flipper"),
             mean = c(43.92, 200.92), 
             sd = c(5.46, 14.06), 
             min = c(32.1, 172), 
             max = c(59.6, 231))

df %>%
  gt()

penguin_bits mean sd min max
bill 43.92 5.46 32.1 59.6
flipper 200.92 14.06 172.0 231.0


I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. But this manual dataframe construction is a bit silly and not at all reproducable.

Is there an easy way to get summary stats in this format so they are compatible with the gt package?

create summary stats the old way

penguins %>%
  summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
            bill_sd = sd(bill_length_mm, na.rm = TRUE),
             bill_min = min(bill_length_mm, na.rm = TRUE),
             bill_max = max(bill_length_mm, na.rm = TRUE), 
             flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
            flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
             flipper_min = min(flipper_length_mm, na.rm = TRUE),
             flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
  gt()

bill_mean bill_sd bill_min bill_max flipper_mean flipper_sd flipper_min flipper_max
43.92193 5.459584 32.1 59.6 200.9152 14.06171 172 231


The summarise function spits out summary stats in a SUPER wide format.

create summary stats using across()

The new across() function gets you the same thing in many fewer lines of code, but still, it is VERY wide.

penguins %>%
  summarise(across(contains("length"),
        list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
  gt()

bill_length_mm_mean bill_length_mm_sd bill_length_mm_min bill_length_mm_max flipper_length_mm_mean flipper_length_mm_sd flipper_length_mm_min flipper_length_mm_max
43.92193 5.459584 32.1 59.6 200.9152 14.06171 172 231


skimr

The skimr package is nice because it is tidyverse compatible (i.e. you can select with contains) AND the output can be a dataframe, which you can then edit.

skimtable <- penguins %>%
  select(contains("length")) %>%
  skim()

skimtable
Name Piped data
Number of rows 344
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Table 1: Data summary

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂


For my purpose it is more than I need though. There is some selecting and renaming to do after the fact.

skimtable_renamed <- skimtable %>%
  select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
  rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100)

You do end up with a gt compatible dataframe.

skimtable_renamed %>%
  gt()

penguin_bits mean sd min max
bill_length_mm 43.92193 5.459584 32.1 59.6
flipper_length_mm 200.91520 14.061714 172.0 231.0


tableone package

The TableOne package gives you mean and SD, is there a way to add other summary stats (like min & max to TableOne?) And make the formatted nicer?

variables <- c("bill_length_mm", "flipper_length_mm")

CreateTableOne(vars = variables, data = penguins)
##                                
##                                 Overall       
##   n                                344        
##   bill_length_mm (mean (SD))     43.92 (5.46) 
##   flipper_length_mm (mean (SD)) 200.92 (14.06)


summarise + across + pivot ?

Ideally, I want to be able to use summarise() and across() and somehow make the wide output long. The problem is that pivot_longer() will take more than 1 “names_to” argument, but not more than a single “values_to” argument.

I would like to be able to pivot wide summary stats long like this…

penguins %>%
  summarise(across(contains("length"),
        list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
      pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")

… but not sure if that is possible.

Posted on:
February 26, 2021
Length:
18 minute read, 3720 words
Categories:
dplyr across tables
Tags:
dplyr across tables
See Also:
joining dataframes
recoding variables