analysing smartwatch data

By Jen Richmond

July 13, 2022

Sometimes trying to replicate what someone else has done in a blog post you find on Twitter is a great way to learn something new. I am half-heartedly thinking about learning Python, so when I saw this post about analysing smartwatch data on Twitter I thought the data looked interesting, and that reproducing their analysis in R might be a useful way of starting to translate my R knowledge into Python… maybe.

So here we go….

load packages

library(tidyverse)
library(here)
library(naniar)
library(lubridate)
library(skimr)
library(ggeasy)
library(gt)
library(janitor)

read in the data

df <- read_csv(here("content/blog/2022-07-13-analysing-smartwatch-data/dailyActivity_merged.csv"))

look at the first few rows

head(df)
## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

check if there are NAs

There are a few ways to check for NAs. The easiest uses naniar to visualise them with vis_miss().

vis_miss(df)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Alternatively you can use dplyr to summarise across the whole dataframe…

# using dplyr
df %>%
  summarise(missing = sum(is.na(.)))
## # A tibble: 1 × 1
##   missing
##     <int>
## 1       0
# or more simply w n_miss() from naniar
n_miss(df)
## [1] 0

… or separately for each variable

# using dplyr (across() replaces the deprecated summarise_all(funs(...)) pattern)
df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

## # A tibble: 1 × 15
##      Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesD…
##   <int>        <int>      <int>         <int>           <int>              <int>
## 1     0            0          0             0               0                  0
## # … with 9 more variables: VeryActiveDistance <int>,
## #   ModeratelyActiveDistance <int>, LightActiveDistance <int>,
## #   SedentaryActiveDistance <int>, VeryActiveMinutes <int>,
## #   FairlyActiveMinutes <int>, LightlyActiveMinutes <int>,
## #   SedentaryMinutes <int>, Calories <int>
# or more simply w miss_var_summary() from naniar

miss_var_summary(df)
## # A tibble: 15 × 3
##    variable                 n_miss pct_miss
##    <chr>                     <int>    <dbl>
##  1 Id                            0        0
##  2 ActivityDate                  0        0
##  3 TotalSteps                    0        0
##  4 TotalDistance                 0        0
##  5 TrackerDistance               0        0
##  6 LoggedActivitiesDistance      0        0
##  7 VeryActiveDistance            0        0
##  8 ModeratelyActiveDistance      0        0
##  9 LightActiveDistance           0        0
## 10 SedentaryActiveDistance       0        0
## 11 VeryActiveMinutes             0        0
## 12 FairlyActiveMinutes           0        0
## 13 LightlyActiveMinutes          0        0
## 14 SedentaryMinutes              0        0
## 15 Calories                      0        0

Take home message: there are no missing values in this dataset.
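
For comparison, here is a base R sketch of the same two checks, run on a made-up toy data frame (the toy columns and values are hypothetical, not the Fitbit data). is.na() returns a logical matrix, so sum() gives the overall count and colSums() gives the count per column.

```r
# Toy data frame standing in for df (hypothetical values, for illustration only)
toy <- data.frame(
  steps    = c(13162, NA, 10460),
  calories = c(1985, 1797, NA),
  id       = c(1, 1, 1)
)

# One number for the whole data frame
total_missing <- sum(is.na(toy))

# Named vector with one count per column
missing_by_col <- colSums(is.na(toy))

total_missing
missing_by_col
```

With no packages loaded this is handy for a quick check, although the naniar summaries are nicer to read.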

look at data types

glimpse(df)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

The ActivityDate variable is characters so we need to convert that to date format

df <- df %>%
  mutate(ActivityDate = mdy(ActivityDate))

class(df$ActivityDate)
## [1] "Date"
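
mdy() assumes the strings are in month/day/year order. A rough base R equivalent, sketched here on two of the date strings from the data, spells that order out explicitly with a format string.

```r
# Two of the raw date strings from ActivityDate
raw_dates <- c("4/12/2016", "4/13/2016")

# %m = month, %d = day, %Y = four-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

class(parsed)   # "Date"
format(parsed)  # ISO formatted strings, year first
```

If the file had used day/month/year instead, the format string (or dmy() in lubridate) would need to change, otherwise dates like 4/12/2016 would be silently misread.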

make a new total minutes column

Let's mutate a new column that sums the activity minutes. Strictly speaking, + is already vectorised, so this sum would work row by row without rowwise(). rowwise() only becomes necessary if you swap in sum(), but I am using it here to make the row-by-row intent explicit.

df <- df %>%
  rowwise() %>%
  mutate(TotalMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes) %>%
  ungroup()  # remember to ungroup to make sure the next operation is not rowwise
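
To see when row-by-row handling actually matters, here is a small base R sketch using the minutes from the first two rows of the data (the short column names are made up for the example). Adding columns with + works element by element, whereas sum() collapses everything into a single number unless something like rowwise() or rowSums() tells R to work within rows.

```r
# First two rows of the four minutes columns (short names are hypothetical)
toy <- data.frame(
  very      = c(25, 21),
  fairly    = c(13, 19),
  lightly   = c(328, 217),
  sedentary = c(728, 776)
)

# `+` is vectorised: one total per row
by_plus <- toy$very + toy$fairly + toy$lightly + toy$sedentary

# rowSums() gives the same per-row totals
by_rowsums <- unname(rowSums(toy))

# sum() without row-wise handling collapses all values into one number
collapsed <- sum(toy$very, toy$fairly, toy$lightly, toy$sedentary)

by_plus
collapsed
```

The per-row totals match the first two TotalMinutes values in the data, while the collapsed version is the kind of result rowwise() protects against.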

descriptives

options(scipen = 99) # avoid scientific notation

descriptives <- df %>%
  select(TotalSteps:TotalMinutes) %>%
  skim()

gt(descriptives)

skim_type  skim_variable             n_missing  complete_rate  numeric.mean    numeric.sd      numeric.p0  numeric.p25  numeric.p50  numeric.p75  numeric.p100  numeric.hist
numeric    TotalSteps                0          1              7637.910638298  5087.150741753  0           3789.750     7405.500     10727.0000   36019.000000  ▇▇▁▁▁
numeric    TotalDistance             0          1              5.489702122     3.924605909     0           2.620        5.245        7.7125       28.030001     ▇▆▁▁▁
numeric    TrackerDistance           0          1              5.475351058     3.907275943     0           2.620        5.245        7.7100       28.030001     ▇▆▁▁▁
numeric    LoggedActivitiesDistance  0          1              0.108170940     0.619896518     0           0.000        0.000        0.0000       4.942142      ▇▁▁▁▁
numeric    VeryActiveDistance        0          1              1.502680851     2.658941165     0           0.000        0.210        2.0525       21.920000     ▇▁▁▁▁
numeric    ModeratelyActiveDistance  0          1              0.567542551     0.883580319     0           0.000        0.240        0.8000       6.480000      ▇▁▁▁▁
numeric    LightActiveDistance       0          1              3.340819149     2.040655388     0           1.945        3.365        4.7825       10.710000     ▆▇▆▁▁
numeric    SedentaryActiveDistance   0          1              0.001606383     0.007346176     0           0.000        0.000        0.0000       0.110000      ▇▁▁▁▁
numeric    VeryActiveMinutes         0          1              21.164893617    32.844803057    0           0.000        4.000        32.0000      210.000000    ▇▁▁▁▁
numeric    FairlyActiveMinutes       0          1              13.564893617    19.987403954    0           0.000        6.000        19.0000      143.000000    ▇▁▁▁▁
numeric    LightlyActiveMinutes      0          1              192.812765957   109.174699751   0           127.000      199.000      264.0000     518.000000    ▅▇▇▃▁
numeric    SedentaryMinutes          0          1              991.210638298   301.267436790   0           729.750      1057.500     1229.5000    1440.000000   ▁▁▇▅▇
numeric    Calories                  0          1              2303.609574468  718.166862134   0           1828.500     2134.000     2793.2500    4900.000000   ▁▆▇▃▁
numeric    TotalMinutes              0          1              1218.753191489  265.931767055   2           989.750      1440.000     1440.0000    1440.000000   ▁▁▁▅▇

plot TotalSteps and calories burned

the goal

In the Python plot they use an “ols” trendline. OLS stands for ordinary least squares, which is what a linear model fits, so method = "lm" in geom_smooth() is the equivalent here. The graph in the post maps the size of the points to very active minutes but there isn’t a legend on the plot, so I am using a function from ggeasy to remove the legend. I also worked out how to label the y axis 0 to 40k, rather than 0 to 40000, using the labels argument in scale_y_continuous().

df %>%
  ggplot(aes(x = Calories, y = TotalSteps, size = VeryActiveMinutes)) +
  geom_point(colour = "blue", alpha = 0.5) + 
  geom_smooth(method = "lm", se = FALSE) +
  easy_remove_legend() +
  scale_y_continuous(limits = c(0,40000), labels = c("0", "10k", "20k", "30k", "40k"))
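
Typing the labels by hand only works while there happen to be exactly five breaks on the axis. A slightly more robust sketch is to pass a labelling function instead; label_k below is a hypothetical helper name of my own, not something from ggplot2 or ggeasy.

```r
# Hypothetical helper: turn 10000 into "10k", leaving zero as plain "0"
label_k <- function(x) {
  ifelse(x == 0, "0", paste0(x / 1000, "k"))
}

# The breaks ggplot picks for a 0 to 40000 axis
label_k(c(0, 10000, 20000, 30000, 40000))

# In the plot this would be used as:
#   scale_y_continuous(limits = c(0, 40000), labels = label_k)
```

Because the function is applied to whatever breaks end up on the axis, the labels stay correct if the limits change.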

pie chart

the goal

The next graph in the blog post is a pie chart plotting the total active minutes in the 4 categories (inactive, lightly active, very active and fairly active). First I need to replicate these values. Luckily they are in the descriptives, so I am just going to select and filter everything else out of that dataframe.

tam <- descriptives %>%
  select(skim_variable, numeric.mean) %>%
  filter(skim_variable %in% c( "SedentaryMinutes", "LightlyActiveMinutes" , "FairlyActiveMinutes", "VeryActiveMinutes")) 

gt(tam)

skim_variable         numeric.mean
VeryActiveMinutes     21.16489
FairlyActiveMinutes   13.56489
LightlyActiveMinutes  192.81277
SedentaryMinutes      991.21064

OK, the first things to “fix” are the labels on these categories. Inactive seems like a better label than Sedentary. Make the skim variable a factor first. Then use levels() to check that there are now levels. Then use fct_recode() to change the labels on the factor levels manually.

glimpse(tam)
## Rows: 4
## Columns: 2
## $ skim_variable <chr> "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActi…
## $ numeric.mean  <dbl> 21.16489, 13.56489, 192.81277, 991.21064
tam <- tam %>%
  mutate(skim_variable = as_factor(skim_variable))

levels(tam$skim_variable)
## [1] "VeryActiveMinutes"    "FairlyActiveMinutes"  "LightlyActiveMinutes"
## [4] "SedentaryMinutes"
tam <- tam %>%
  mutate(skim_variable = fct_recode(skim_variable, 
                                    "Very Active Minutes" =  "VeryActiveMinutes", 
                                   "Fairly Active Minutes" = "FairlyActiveMinutes", 
                                   "Lightly Active Minutes" = "LightlyActiveMinutes", 
                                    "Inactive Minutes" = "SedentaryMinutes"))

levels(tam$skim_variable)
## [1] "Very Active Minutes"    "Fairly Active Minutes"  "Lightly Active Minutes"
## [4] "Inactive Minutes"
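
For anyone without forcats to hand, roughly the same relabelling can be sketched in base R using the levels and labels arguments of factor(). This is an alternative to the fct_recode() call above, not what that function does internally.

```r
# The four original variable names, in the order they appear in tam
vars <- c("VeryActiveMinutes", "FairlyActiveMinutes",
          "LightlyActiveMinutes", "SedentaryMinutes")

# levels fixes the order; labels supplies the new display names in that order
relabelled <- factor(
  vars,
  levels = vars,
  labels = c("Very Active Minutes", "Fairly Active Minutes",
             "Lightly Active Minutes", "Inactive Minutes")
)

levels(relabelled)
```

The catch is that labels are matched by position, so fct_recode() with its explicit new = old pairs is harder to get wrong.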

There isn’t a geom_pie() in ggplot, probably because pie charts are the worst visualisation, but you can make one by first making a stacked bar chart using geom_bar() and then adding coord_polar().

Good instructions available here https://r-graph-gallery.com/piechart-ggplot2.html

Bar graph version…

tam %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable)) +
  geom_bar(stat="identity") 

… add coord_polar()

tam %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable)) +
  geom_bar(stat="identity", color="white") +
  coord_polar("y", start = 0) 

OK the bones are there but I really don’t want the axis labels or the grey background. Add theme_void() to get rid of those.

tam %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable)) +
  geom_bar(stat="identity", color="white") +
  coord_polar("y", start = 0) +
  theme_void()

Awesome, now in the post they have the legend ordered by the mean (with Inactive at the top). I think you can do that within a mutate, right before your data hits ggplot (see this post: https://r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html).

tam %>%
  mutate(skim_variable = fct_reorder(skim_variable, desc(numeric.mean))) %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable)) +
  geom_bar(stat="identity", color="white") +
  coord_polar("y", start = 0) +
  theme_void()
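
What the reordering is doing underneath is changing the order of the factor levels, which is what ggplot uses for the legend. A base R sketch of the same idea, using the four means from the table above (tam_toy and its column names are made up for the example):

```r
# Hypothetical stand-in for tam, with the means taken from the table above
tam_toy <- data.frame(
  category = c("Very Active Minutes", "Fairly Active Minutes",
               "Lightly Active Minutes", "Inactive Minutes"),
  mean_minutes = c(21.16489, 13.56489, 192.81277, 991.21064),
  stringsAsFactors = FALSE
)

# Sort the category names from largest to smallest mean...
ordered_levels <- tam_toy$category[order(tam_toy$mean_minutes, decreasing = TRUE)]

# ...and use that as the level order of the factor
tam_toy$category <- factor(tam_toy$category, levels = ordered_levels)

levels(tam_toy$category)
```

The rows of the data frame stay where they are; only the level order changes, which is why the legend order changes without any sorting of the data.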

And they have ridiculous number labels… in the spirit of reproducibility, let’s do that too!

tam %>%
  mutate(skim_variable = fct_reorder(skim_variable, desc(numeric.mean))) %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable, label = numeric.mean)) +
  geom_bar(stat="identity", color="white") +
  coord_polar("y", start = 0) +
  theme_void() +
  geom_text(angle = 45)

Hmmmm I have overlapping numbers! It would be great to have more control over where the numbers go… I thought maybe ggannotate would help but it doesn’t work with polar coordinates. So I am stuck with position dodge. Adding a mutate to round the numbers also helps…

tam %>%
  mutate(skim_variable = fct_reorder(skim_variable, desc(numeric.mean))) %>%
  mutate(numeric.mean =  round(numeric.mean, 4)) %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable, label = numeric.mean)) +
  geom_bar(stat="identity", color="white") +
  coord_polar("y", start = 0) +
  theme_void() +
  geom_text(angle = 45, position = position_dodge(0.5))

Not terrible, what about colours?? The original post has blue, pink, yellow and green.

  • blue: RGB 1, 0, 245 (#0100F5)
  • pink: RGB 246, 194, 203 (#F6C2CB)
  • yellow: RGB 249, 217, 73 (#F9D949)
  • green: RGB 166, 236, 153 (#A6EC99)

I worked out how to use the Digital Colour Meter from Utilities on my Mac to get the exact RGB codes for the colours in the graph using this resource.

Then I used this RGB-Hex converter. I wonder if this step is necessary?? Does ggplot know RGB codes??

Ahhh maybe not… but there is a rgb function, check this out!

rgb(1,0,245, maxColorValue = 255)
## [1] "#0100F5"
rgb(246,194,203, maxColorValue = 255)
## [1] "#F6C2CB"
rgb(249,217,73, maxColorValue = 255)
## [1] "#F9D949"
rgb(166,236,153, maxColorValue = 255)
## [1] "#A6EC99"
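
rgb() is also vectorised, so all four colours can be converted in one call, and col2rgb() goes the other way if you ever need to check a hex code. A small sketch using the same four colours:

```r
# Red, green and blue components of the four colours, in the order listed above
red   <- c(1, 246, 249, 166)
green <- c(0, 194, 217, 236)
blue  <- c(245, 203, 73, 153)

# One call converts all four to hex codes
hex <- rgb(red, green, blue, maxColorValue = 255)
hex

# And back again: a matrix with one column per colour
back <- col2rgb(hex)
back
```

The hex vector can be passed straight to the values argument of scale_fill_manual().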

Adding in colours using scale_fill_manual() and removing the legend title with ggeasy.

tam %>%
  mutate(skim_variable = fct_reorder(skim_variable, desc(numeric.mean))) %>%
  mutate(numeric.mean =  round(numeric.mean, 4)) %>%
  ggplot(aes(x="", y=numeric.mean, fill=skim_variable, label = numeric.mean)) +
  geom_bar(stat="identity", color="white") +
  scale_fill_manual(values = c("#0100F5","#F6C2CB","#F9D949","#A6EC99")) +
  coord_polar("y", start = 0) +
  theme_void() +
  geom_text(angle = 45, position = position_dodge(0.5)) +
  easy_remove_legend_title()

Under the pie chart there are some summary stats… let’s see if we can get those using inline code.

tam_wide <- tam %>%
  pivot_wider(names_from = skim_variable, values_from = numeric.mean) %>%
  clean_names() %>%
  rowwise() %>%
  mutate(total = very_active_minutes + fairly_active_minutes + lightly_active_minutes + inactive_minutes) %>%
  pivot_longer(names_to = "category", values_to = "minutes", very_active_minutes:inactive_minutes) %>%
  relocate(total, .after = minutes) %>%
  mutate(percent = (minutes/total)*100) %>%
  mutate(percent = round(percent, 1)) %>%
  mutate(minutes = round(minutes, 0))

Observations

  1. 81.3% of the total minutes in a day were inactive
  2. 15.8% were lightly active minutes
  3. On average, only 21 minutes (1.7%) were very active
  4. and 14 minutes (1.1%) were fairly active
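
The percentages in these observations can be double-checked with a few lines of base R. This is a sketch using the four means from the table earlier in the post, not the pipeline used above: each mean is divided by the sum of all four.

```r
# Mean minutes per category, copied from the descriptives
minutes <- c(
  very_active    = 21.16489,
  fairly_active  = 13.56489,
  lightly_active = 192.81277,
  inactive       = 991.21064
)

# Share of the day in each category, as a percentage to one decimal place
percent <- round(100 * minutes / sum(minutes), 1)

percent
round(minutes, 0)
```

Dividing by sum(minutes) does the job of the pivot_wider(), rowwise() and pivot_longer() round trip in a single step.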

activity by day

the goal

Next up there is a column plot that looks at activity by day of the week. The lubridate package makes it easy to pull the day out of a date. I am going back to the original data frame and making a new one that includes just the id and activity date and the activity in minutes.
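
As a rough idea of what pulling the day out of a date involves, here is a base R sketch on three dates from the data. It assumes a Sunday-first week, and the day_names vector is my own, chosen so the result does not depend on the computer's locale.

```r
# Three dates from ActivityDate
dates <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-17"))

# POSIXlt stores the weekday as a number: 0 = Sunday, ..., 6 = Saturday
day_index <- as.POSIXlt(dates)$wday

# Map the numbers to labels and keep them in week order as factor levels
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day <- factor(day_names[day_index + 1], levels = day_names)

day
```

Keeping the days as a factor with levels in week order is what stops a plot from sorting them alphabetically.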

glimpse(df)
## Rows: 940
## Columns: 16
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
## $ TotalMinutes             <dbl> 1094, 1033, 1440, 998, 1040, 761, 1440, 1120,…
day <- df %>%
  clean_names() %>%
  select(id:activity_date, very_active_minutes:calories) %>%
  mutate(day = wday(activity_date, label = TRUE)) %>%
  rename(inactive_minutes = sedentary_minutes) %>%
  pivot_longer(names_to = "category", values_to = "minutes", very_active_minutes:inactive_minutes) %>%
  mutate(category = str_sub(category, end = -9)) %>%
  mutate(category = fct_relevel(category, c("very_active", "fairly_active", "lightly_active", "inactive")))
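
The end = -9 in str_sub() counts back from the end of the string, which chops off the 8 characters of "_minutes". A base R sketch of the same trimming, done both by position and by pattern:

```r
category <- c("very_active_minutes", "fairly_active_minutes",
              "lightly_active_minutes", "inactive_minutes")

# By position: drop the last 8 characters ("_minutes")
by_position <- substr(category, 1, nchar(category) - 8)

# By pattern: remove "_minutes" only when it appears at the end
by_pattern <- sub("_minutes$", "", category)

by_position
```

The pattern version is a little safer, because it leaves alone any value that does not end in "_minutes".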

day %>%
  filter(category != "inactive") %>%
  group_by(day, category) %>%
  summarise(activity = sum(minutes)) %>%
  ggplot(aes(x = day, y = activity, fill = category)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("purple", "darkgreen", "pink")) +
  scale_y_continuous(limits = c(0,40000), labels = c("0", "10k", "20k", "30k", "40k")) +
  easy_remove_legend_title()
## `summarise()` has grouped output by 'day'. You can override using the `.groups`
## argument.
