Disecting David Robinson's Day 4 Solution

Tis the season…for coding in the tidyverse

Spoilers follow for the Advent of Code 2020 Day 4 solution…thought you ought to know.

If you haven’t heard of Advent of Code, I highly recommend checking it out. It’s a fun coding challenge held every year since 2015, which provides a problem for every day of Christmas advent. A lovely project put together by Eric Wastl, you can solve them in any way you choose. It actually gets quite competitive, with top solvers completing each day usually in the first ten minutes. I’m certainly not at that level, but I have had fun giving them a go in the past.

This year, I thought I’d try to go through them as it unfolds. I have been doing things in R recently, so thought I’d try to tackle the Day 4 problem in R as well. I was having trouble getting it to work and I ended up switching to Python instead. Afterwards though, I saw David Robinson’s tidyverse solution on Twitter and thought I’d make sure to understand how it works.

David Robinson is a super smart dude and has done a lot of work with R and for it’s community. He’s actually the way I found out about TidyTuesday, checking out his explorations on YouTube.

In an ongoing quest to understand what the heck I’m doing, I thought it would be beneficial for me to take apart his concise answer and understand how it works. Hope you find it beneficial as well.

Let’s go process some passports!

Overview

What follows is code taken from David Robinson’s tweet on his approach to Day 4 of the 2020 Advent of Code, specifically using the tidyverse to solve the problems. When I saw it, I thought I’d take some time to go through it since I had trouble implementing this in R when I gave it a try (I ended up using Python instead). David’s version is super clean and illustrates some good concepts that I will certainly be using in the future. Here we take apart his code and figure out how it works.

Let’s get into it!

Setup

Read in Data

We start reading in the file as a tibble. When I tried this, I was attempting to read it in and coerce to dictionary objects…not the way to go in R (my Python was showing).

input = tibble(x = read_lines("inputs/04-input.txt"))

Wrangling

Here’s how we clean the input to get it ready to answer the questions:

required = c("byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid") 

fields = input %>% 
  mutate(passport = cumsum(x == "")) %>% 
  mutate(m = str_match_all(x, "(...)\\:([^ ]+)")) %>% 
  mutate(f = map(m, ~ .[, 2]),
         v = map(m, ~ .[, 3])) %>%
  unnest(c(f, v)) %>% 
  filter(f %in% required)

Simply Gorgeous 🤩

But…it is kinda complex so let’s break it down.

Breakdown

  1. First, we need a way to distinguish the passports from each other. cumsum (cumulative sum) can keep a running tally of how many blank lines are seen. Since we know each entry is split with a blank line, this means we can create an id for every passport.
  • This came into my head, but I didn’t know how to accomplish it. cumsum keeping a running tally is super helpful.
wrangle1 = input %>% 
  mutate(passport = cumsum(x == ""))

Now every line is linked to every passport. But we need access to all the fields.

  1. Next, we need to break up each line to get at the values. This part is the genius: str_match_all splits things up into capture groups and then we grab them with map.
wrangle2 = input %>% 
  mutate(passport = cumsum(x == "")) %>% 
  mutate(m = str_match_all(x, "(...)\\:([^ ]+)")) %>% 
  mutate(f = map(m, ~ .[, 2]),
         v = map(m, ~ .[, 3]))
  • What the regex is:
    • Two capture groups, (...) and ([^ ]+), separated by a colon
    • “Three characters” & “Anything that isn’t a space”
    • allows us to grab every key:value pair

As an example, here is one of the lines:

entry = "pid:796082981 cid:129 eyr:2030"
m = str_match_all(entry, "(...)\\:([^ ]+)")
m
## [[1]]
##      [,1]            [,2]  [,3]       
## [1,] "pid:796082981" "pid" "796082981"
## [2,] "cid:129"       "cid" "129"      
## [3,] "eyr:2030"      "eyr" "2030"

Returns a list of character matrices (m for matrix, probably). From ??str_match_all, the first column is the full match, and the other columns are the capture groups, which explains the map formulas for the second and third columns of every matrix (f for field, v for value…again, probably).

f = map(m, ~ .[, 2])
f
## [[1]]
## [1] "pid" "cid" "eyr"
v = map(m, ~ .[, 3])
v
## [[1]]
## [1] "796082981" "129"       "2030"

Unnesting leaves a tibble we can work with: every row is a value, specified by its field and passport.

  1. The final move is to just leave fields we consider “required”. The instructions say we can ignore the cid field in determining valid passports, so we filter it out.

We are left with a super clean dataset. Let’s start checking passports!

Part 1

The first part asks how many valid passports there are. Since we have each passport broken down by field:value, we can just check each one has the correct number of fields, which is told to be seven (“cid” is optional).

fields %>% 
  count(passport) %>% 
  summarise(answer = sum(n == 7))
## # A tibble: 1 x 1
##   answer
##    <int>
## 1    239

Here count groups every value in the passport column, creating a new column with a default name of n. We then use that in a summarise call, looking just at rows where n == 7. This actually gives a column of bools, which we can sum up to give the answer, which we name answer to make it nice and clean.

Part 2

The second part adds some data validation, which is a little fiddly, but since our data is so crispy clean, we can make easy work of it.

fields %>% 
  extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE) %>% 
  mutate(valid = case_when(f == "byr" ~ between(as.integer(v), 1920, 2002),
                           f == "iyr" ~ between(as.integer(v), 2010, 2020),
                           f == "eyr" ~ between(as.integer(v), 2020, 2030),
                           f == "hgt" ~ ifelse(unit == "cm", between(height, 150, 193),
                                              between(height, 59, 76)),
                           f == "hcl" ~ str_detect(v, "^#[0-9a-f]{6}$"),
                           f == "ecl" ~ v %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth"),
                           f == "pid" ~ str_detect(v, "^[0-9]{9}$")
                          )) %>% 
  filter(valid) %>%
  count(passport) %>%
  summarise(answer = sum(n == 7))
## # A tibble: 1 x 1
##   answer
##    <int>
## 1    188

Again…lovely code, David 🤩

Let’s understand what’s going on.

Breakdown

We are pretty much ready to handle each validation separately, except for the hgt field. Here we need to split up the value we have to get at each part.

wrangle3 = fields %>% 
  extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE)

Using extract, we can turn capture groups into new columns. Here we split our value into two columns, height and unit. The regex looks for multiple digits followed by either “cm” or “in”. Finally, we convert to make the digit portion into numbers and we don’t remove so that we still have the original v column.

At this point, we are ready to break down each case. So case_when sounds like a good choice.

wrangle4 = fields %>% 
  extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE) %>% 
  mutate(valid = case_when(f == "byr" ~ between(as.integer(v), 1920, 2002),
                           f == "iyr" ~ between(as.integer(v), 2010, 2020),
                           f == "eyr" ~ between(as.integer(v), 2020, 2030),
                           f == "hgt" ~ ifelse(unit == "cm", between(height, 150, 193),
                                              between(height, 59, 76)),
                           f == "hcl" ~ str_detect(v, "^#[0-9a-f]{6}$"),
                           f == "ecl" ~ v %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth"),
                           f == "pid" ~ str_detect(v, "^[0-9]{9}$")
                          ))

Here we take on each field separately and define the valid condition. So, for every case, we check what field we are dealing with and act accordingly.

  • byr, iyr, eyr: The first three items are all restricted by a year range. Here we can use between to ensure the value is between the two extremes, inclusive (convert to an integer since everything is a string except the new height columns we previously made).
  • hgt: For height, we have a conditional version of using between since the ranges are different depending on the units. So, we can check the unit is “cm” and specify that range, otherwise use the range for inches (we are told those are the only options)
  • hcl, pid: These are regex calls that check we see the desired behavior (was happy to see my python code lined up here!). For hcl we make sure the input is simply a number sign followed by six hexadecimal values. For pid we ensure we only see nine digits. When I did this I ran into a few problems since I didn’t specify the bounds of the regex call with ^$. These ensure the values we consider valid just contain the items we are looking for with nothing else (beginning of string, thing we are looking for, end of string).
  • ecl: We are given a list of valid eye colors. So we simply check our value is in those valid choices. %in is a super helpful way to accomplish this in an understandable way.

Okay great, we have validated every field in every passport. But if we run this code, we notice we get a bunch of warnings:

## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 1920, 2002): NAs introduced by coercion
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion to integer range
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 1920, 2002): NAs introduced by coercion to
## integer range
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2010, 2020): NAs introduced by coercion
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion to integer range
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2010, 2020): NAs introduced by coercion to
## integer range
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2020, 2030): NAs introduced by coercion
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion to integer range
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2020, 2030): NAs introduced by coercion to
## integer range

What about those NA warnings? Should we be worried?

case_when will introduce NAs when a case fails or none of the cases match. In this situation, we have defined all possible cases, so that isn’t the culprit (although it’s important to consider). Let’s look at the cases where we now have NAs in our new valid column:

wrangle4 %>% 
  filter(is.na(valid)) %>% 
  select(f:valid)
## # A tibble: 15 x 5
##    f     v     height unit  valid
##    <chr> <chr>  <int> <chr> <lgl>
##  1 hgt   68        NA <NA>  NA   
##  2 hgt   83        NA <NA>  NA   
##  3 hgt   95        NA <NA>  NA   
##  4 hgt   174       NA <NA>  NA   
##  5 hgt   188       NA <NA>  NA   
##  6 hgt   145       NA <NA>  NA   
##  7 hgt   184       NA <NA>  NA   
##  8 hgt   103       NA <NA>  NA   
##  9 hgt   84        NA <NA>  NA   
## 10 hgt   137       NA <NA>  NA   
## 11 hgt   71        NA <NA>  NA   
## 12 hgt   141       NA <NA>  NA   
## 13 hgt   101       NA <NA>  NA   
## 14 hgt   109       NA <NA>  NA   
## 15 hgt   154       NA <NA>  NA

We see that all of the NAs are introduced for the hgt field. Specifically, there seems to be a bunch of entries in the data where the units of the height entry are not specified. However, this isn’t a problem since we know these must be invalid heights. Once we filter down to the valid entries only, the NAs will be ignored. We could fill the NAs in the valid column with FALSE, but they will get dealt with either way. Let’s finish up!

We have now checked every field for every passport. Now we just need to figure out how many are valid again.

wrangle5 = fields %>% 
  extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE) %>% 
  mutate(valid = case_when(f == "byr" ~ between(as.integer(v), 1920, 2002),
                           f == "iyr" ~ between(as.integer(v), 2010, 2020),
                           f == "eyr" ~ between(as.integer(v), 2020, 2030),
                           f == "hgt" ~ ifelse(unit == "cm", between(height, 150, 193),
                                              between(height, 59, 76)),
                           f == "hcl" ~ str_detect(v, "^#[0-9a-f]{6}$"),
                           f == "ecl" ~ v %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth"),
                           f == "pid" ~ str_detect(v, "^[0-9]{9}$")
                          )) %>% 
  filter(valid) %>%
  count(passport) %>%
  summarise(answer = sum(n == 7))

We do a similar trick to part 1. First filter on the valid entries only. Then use count and summarise to get the number of passports with the correct number of (valid) entries. Again, n is the default name of count and we can look for seven since cid was optional and we left it out.

Tidyverse for the win!

And there we have it! We processed some passports in the tidyverse!

Hope you found this educational. I know I did.

Till next time!

Learning

  • The genius part for me (and basically the whole reason for this post) is the combination of using str_match_all and map, which can allow for complex parsing of capture groups. I basically did this by hand in Python and it was way messier. Good to know!
  • case_when makes it easy to write understandable code about multiple conditions

References

Image Credit

Tree by Zach Bogart from the Noun Project

Zach Bogart
Zach Bogart
Data Explorer

Science, Design, & Data. I’ll know it when I see it.

Related