more wrangling tips

in datawrangling dplyr tidyr

September 11, 2018

It really does take much longer to get your data ready for analysis than it does to actually analyse it. Apparently up to 80% of data analysis time is spent wrangling data (and cursing and swearing).

Here is another great wrangling resource, this time by Bradley Boehmke.

And if you need a rationale for why it is a good idea to acquire some wrangling skills, here is a quote from Jenny Bryan:

“Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth”

A few things I didn’t already know about tidyr and dplyr
  1. In addition to gather() and spread(), the tidyr package can also separate(), i.e. pull parts of a single variable apart into separate columns, and unite(), i.e. combine several columns into one.

  2. When using filter() from dplyr, specify group membership using %in%. Also, distinct() removes duplicate rows and slice(3:5) selects rows by position (here, rows 3 to 5).

  3. When using dplyr summarise(), sometimes you want to count the number of participants, but n() gives you the number of observations (rows). n_distinct() counts unique values instead, so applying it to the participant ID column gives the number of participants.
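To make those three points concrete, here is a rough sketch on a tiny made-up data set. The tibble `scores` and its columns (`id`, `cond_time`, `score`) are invented purely for illustration and are not from any real study; the functions themselves are the dplyr and tidyr ones named above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three participants, repeated observations.
# Row 5 is an exact duplicate of row 4.
scores <- tibble(
  id        = c("p1", "p1", "p2", "p2", "p2", "p3"),
  cond_time = c("ctrl_pre", "ctrl_post", "drug_pre",
                "drug_post", "drug_post", "ctrl_pre"),
  score     = c(10, 12, 9, 15, 15, 11)
)

# 1. separate() pulls one column apart; unite() glues columns back together
split_up <- separate(scores, cond_time,
                     into = c("condition", "time"), sep = "_")
rejoined <- unite(split_up, "cond_time", condition, time, sep = "_")

# 2. filter() with %in% keeps rows whose id is in the given set
some_ids <- filter(scores, id %in% c("p1", "p3"))

#    distinct() drops duplicate rows; slice() picks rows by position
no_dupes   <- distinct(scores)
rows_3_to_5 <- slice(scores, 3:5)

# 3. n() counts rows (observations); n_distinct(id) counts participants
counts <- summarise(scores,
                    n_obs          = n(),
                    n_participants = n_distinct(id))
```

With this toy data, `counts` holds one row per summary: `n_obs` is the number of rows in `scores`, while `n_participants` is the number of unique `id` values, which is usually the number you actually want to report.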
