my favourite things about R

By Jen Richmond in dplyr ggplot janitor datawrangling

January 17, 2022

I am prepping a talk for R-Ladies Sydney about my favourite R things, the packages and functions that end up in every script I write.

library(tidyverse)
library(here)
library(janitor)
library(lubridate)
library(ggeasy)
library(palmerpenguins)
library(naniar)
library(gt)

Avoid filepath drama

here::here()

The here package makes dealing with file paths and telling R where your work lives really easy. If you work within a R project (always recommended), here::here() defaults to the top level of your project folder. You can refer to everything relative to there and use quotes to specify folder levels. Today I am writing in my blogdown project, so here tells me that I am currently…

here::here()
## [1] "/Users/jennyrichmond/Documents/GitHub/apero-site"

When I want to read in some data, I can refer to the location of that data relative to this starting point. In my blogdown site, I’ve put the data in the folder for this particular post (within content/blog/). The nice thing about referring to the location of things relative to the top level of your project, is that it doesn’t matter if you are working in Rmd or R script, on the computer that you wrote the code on or another one, the path doesn’t change.

practice_penguins <- read_csv(here("content", "blog", "2022-01-17-my-favourite-things-about-r", "practice_penguins.csv"))

Fix variable names

janitor::clean_names()

I messed up the penguin data to make the variable names a bit ugly so I could demo my favourite function, clean_names(). So often little thought is put into naming conventions at the time of data entry and it is really common to be given a dataset that has really longwinded and inconsistently formatted variable names.

names(practice_penguins)
## [1] "Species"        "island"         "Bill length"    "bill_depth"    
## [5] "flipper length" "Body_Mass"      "Sex"            "year"

In this case there is a mix of upper and lower case, some gaps between words, some underscores. When you are coding, you need to type the names of variables a lot, so it can save you lots of time to make the variable names consistent… enter clean_names()

clean_penguins <- practice_penguins %>%
  clean_names()

names(clean_penguins)
## [1] "species"        "island"         "bill_length"    "bill_depth"    
## [5] "flipper_length" "body_mass"      "sex"            "year"

In one line of code, everything is lower case with underscores in the gaps (aka snake case).

Count things

janitor::tabyl()

Often the first thing you want to do in R is count how many observations you have of different type. The tabyl() function from janitor works much like the count() function, but the output is more concise and user friendly and includes percentages automatically.

clean_penguins %>%
  tabyl(species) 
##    species   n   percent
##     Adelie 152 0.4418605
##  Chinstrap  68 0.1976744
##     Gentoo 124 0.3604651

You can count just one variable, or get something a bit like a cross tab with two. There are a series of adorn_ functions that also allow you to add totals.

clean_penguins %>%
  tabyl(species, sex) %>%
  adorn_totals() 
##    species female male NA_
##     Adelie     73   73   6
##  Chinstrap     34   34   0
##     Gentoo     58   61   5
##      Total    165  168  11

You can assign the output to a dataframe or pipe into gt() to get a nice looking rendered output.

clean_penguins %>%
  tabyl(species, sex) %>%
  adorn_totals() %>%
  gt()

species female male NA_
Adelie 73 73 6
Chinstrap 34 34 0
Gentoo 58 61 5
Total 165 168 11

Find missing values

naniar::vis_miss

Sometimes you know there is missing data but it can be difficult to know where it is or what to do about it. The vis_miss() function from the naniar` package helps you see where the missing values are so you can better decide what to do with them.

naniar::vis_miss(clean_penguins)

Make wide data long

tidyr::pivot_longer()

When we enter data it is usually in wide format. This is problematic when you want to use ggplot, which expects your data to be long. The new pivot functions from tidyr make it really easy to switch your data from wide to long (and back again if you need). Here I am selecting just species and the two variables that start with “bill” to make a smaller demo dataset.

penguin_bill <- clean_penguins %>%
  select(species, starts_with("bill"))

glimpse(penguin_bill)
## Rows: 344
## Columns: 3
## $ species     <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie"…
## $ bill_length <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
## $ bill_depth  <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …

Technically this bill data is in wide format (it is not the best example but lets run with it). The two columns contain bill measurements, about two different parts of the penguin bill. We could represent this data in long format by making a new column that contained info about which part of the bill we were measuring, and another column with the measurement value.

The pivot_longer() function asks you to specify what you want to call the column that will contain what is currently in the variable names (i.e. names_to), what you want to call the column that will contain the values (i.e. values to) and the range of columns that are currently wide that you want to be long.

long_bill <- penguin_bill %>%
  pivot_longer(names_to = "bill_part", 
               values_to = "measurement",   bill_length:bill_depth)

head(long_bill)
## # A tibble: 6 × 3
##   species bill_part   measurement
##   <chr>   <chr>             <dbl>
## 1 Adelie  bill_length        39.1
## 2 Adelie  bill_depth         18.7
## 3 Adelie  bill_length        39.5
## 4 Adelie  bill_depth         17.4
## 5 Adelie  bill_length        40.3
## 6 Adelie  bill_depth         18

Make ggplot easy

ggeasy::easy_remove_legend()

Once you have your head around how to construct figures in ggplot, you can spend a lot of time googling how to customise it. The ggeasy package contains a whole lot of easy to use wrappers for really common ggplot adjustments. Like removing the legend…the code to remove the legend is p + theme(legend.position = “none”) … or you can use ggeasy::easy_remove_legend()

https://www.datanovia.com/en/blog/ggplot-legend-title-position-and-labels/

long_bill %>% 
  ggplot(aes(x = species, y = measurement, colour = species)) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  facet_wrap(~ bill_part) +
  easy_remove_legend()
## Warning: Removed 4 rows containing missing values (geom_point).

Make new conditional variables

dplyr::case_when()

Sometimes you need to compute a new variable based on values in other variables, case_when() is your friend. Lets say we were interested in which penguins have extremely long or short bills. Here I am filtering for just the Gentoo penguins, and calculating the mean and sd for bill length. Then I am using mutate() to make a new variable and case_when() to flag values of bill length than are more than 2sd greater than the mean as “long” and values of bill length that are more than 2sd below the mean as short. The TRUE ~ “ordinary,” puts ordinary in the cells that don’t meet those criteria.

Then we can use tabyl() to count how many penguins have extraordinarily long or short bills.

gentoo <- clean_penguins %>%
  filter(species == "Gentoo") %>%
  select(species, bill_length, sex)

mean_length <- mean(gentoo$bill_length, na.rm = TRUE)
sd_length <- sd(gentoo$bill_length, na.rm = TRUE)

gentoo <- gentoo %>%
  mutate(long_short = case_when(bill_length > mean_length + 2*sd_length ~ "long", 
                         bill_length < mean_length - 2*sd_length ~ "short",
                         TRUE ~ "ordinary"))

gentoo %>% tabyl(long_short)
##  long_short   n     percent
##        long   4 0.032258065
##    ordinary 119 0.959677419
##       short   1 0.008064516

Move new variables

dplyr::relocate()

When using mutate() to make a new variable, the default is to add it to the right side of the dataframe. With small datasets that is ok, but when you have lots of variables and you want to check whether the mutate has done what you want, it can be annoying. There is a relatively new function in dplyr that allows you to relocate a variable. Here I am moving the long_short variable we just made to the position after bill_length.

gentoo <- gentoo %>%
  relocate(long_short, .after = bill_length)

glimpse(gentoo)
## Rows: 124
## Columns: 4
## $ species     <chr> "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo"…
## $ bill_length <dbl> 46.1, 50.0, 48.7, 50.0, 47.6, 46.5, 45.4, 46.7, 43.3, 46.8…
## $ long_short  <chr> "ordinary", "ordinary", "ordinary", "ordinary", "ordinary"…
## $ sex         <chr> "female", "male", "female", "male", "male", "female", "fem…

Lets make a plot to illustrate the variability in Gentoo penguins bill length.

gentoo %>%
  ggplot(aes(x = sex, y = bill_length, colour = long_short)) + 
  geom_jitter(width = 0.2)
## Warning: Removed 1 rows containing missing values (geom_point).

Save all Rmd figures to folder

knitr options fig.path =

You might have noticed that in the default Rmd template there is a chunk at the top that controls how your document knits. The default knit settings have echo = TRUE which makes your code appear in your knitted document along with your output. But you can add other knit settings.

You can add fig.width, fig.height, and fig.path to control how big your plots appear in your knitted document. You can also add fig.path to have your plots be rendered in png format to a folder within your project. And if you want all the ggplots in your document to be the same theme, you can add that as a default.

Write new data to csv

readr::write_csv()

My data analysis process often involves reading in raw data, cleaning it up, and then writing it out to csv so that you can read the clean data in to another process (visualisation, modelling). I use write_csv() and here::here() to write out a csv that can then be used in a different script.

gentoo %>%
  write_csv("clean_gentoo.csv")
Posted on:
January 17, 2022
Length:
11 minute read, 2173 words
Categories:
dplyr ggplot janitor datawrangling
See Also: