Disecting David Robinson's Day 4 Solution
Tis the season…for coding in the tidyverse
Spoilers follow for the Advent of Code 2020 Day 4 solution…thought you ought to know.
If you haven’t heard of Advent of Code, I highly recommend checking it out. It’s a fun coding challenge held every year since 2015, which provides a problem for every day of Christmas advent. A lovely project put together by Eric Wastl, you can solve them in any way you choose. It actually gets quite competitive, with top solvers completing each day usually in the first ten minutes. I’m certainly not at that level, but I have had fun giving them a go in the past.
This year, I thought I’d try to go through them as it unfolds. I have been doing things in R recently, so thought I’d try to tackle the Day 4 problem in R as well. I was having trouble getting it to work and I ended up switching to Python instead. Afterwards though, I saw David Robinson’s tidyverse solution on Twitter and thought I’d make sure to understand how it works.
David Robinson is a super smart dude and has done a lot of work with R and for it’s community. He’s actually the way I found out about TidyTuesday, checking out his explorations on YouTube.
In an ongoing quest to understand what the heck I’m doing, I thought it would be beneficial for me to take apart his concise answer and understand how it works. Hope you find it beneficial as well.
Let’s go process some passports!
Overview
What follows is code taken from David Robinson’s tweet on his approach to Day 4 of the 2020 Advent of Code, specifically using the tidyverse to solve the problems. When I saw it, I thought I’d take some time to go through it since I had trouble implementing this in R when I gave it a try (I ended up using Python instead). David’s version is super clean and illustrates some good concepts that I will certainly be using in the future. Here we take apart his code and figure out how it works.
Let’s get into it!
Setup
Read in Data
We start reading in the file as a tibble. When I tried this, I was attempting to read it in and coerce to dictionary objects…not the way to go in R (my Python was showing).
input = tibble(x = read_lines("inputs/04-input.txt"))
Wrangling
Here’s how we clean the input to get it ready to answer the questions:
required = c("byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid")
fields = input %>%
mutate(passport = cumsum(x == "")) %>%
mutate(m = str_match_all(x, "(...)\\:([^ ]+)")) %>%
mutate(f = map(m, ~ .[, 2]),
v = map(m, ~ .[, 3])) %>%
unnest(c(f, v)) %>%
filter(f %in% required)
Simply Gorgeous 🤩
But…it is kinda complex so let’s break it down.
Breakdown
- First, we need a way to distinguish the passports from each other.
cumsum
(cumulative sum) can keep a running tally of how many blank lines are seen. Since we know each entry is split with a blank line, this means we can create an id for every passport.
- This came into my head, but I didn’t know how to accomplish it.
cumsum
keeping a running tally is super helpful.
wrangle1 = input %>%
mutate(passport = cumsum(x == ""))
Now every line is linked to every passport. But we need access to all the fields.
- Next, we need to break up each line to get at the values. This part is the genius:
str_match_all
splits things up into capture groups and then we grab them withmap
.
wrangle2 = input %>%
mutate(passport = cumsum(x == "")) %>%
mutate(m = str_match_all(x, "(...)\\:([^ ]+)")) %>%
mutate(f = map(m, ~ .[, 2]),
v = map(m, ~ .[, 3]))
- What the regex is:
- Two capture groups,
(...)
and([^ ]+)
, separated by a colon - “Three characters” & “Anything that isn’t a space”
- allows us to grab every key:value pair
- Two capture groups,
As an example, here is one of the lines:
entry = "pid:796082981 cid:129 eyr:2030"
m = str_match_all(entry, "(...)\\:([^ ]+)")
m
## [[1]]
## [,1] [,2] [,3]
## [1,] "pid:796082981" "pid" "796082981"
## [2,] "cid:129" "cid" "129"
## [3,] "eyr:2030" "eyr" "2030"
Returns a list of character matrices (m for matrix, probably). From ??str_match_all
, the first column is the full match, and the other columns are the capture groups, which explains the map
formulas for the second and third columns of every matrix (f for field, v for value…again, probably).
f = map(m, ~ .[, 2])
f
## [[1]]
## [1] "pid" "cid" "eyr"
v = map(m, ~ .[, 3])
v
## [[1]]
## [1] "796082981" "129" "2030"
Unnesting leaves a tibble we can work with: every row is a value, specified by its field and passport.
- The final move is to just leave fields we consider “required”. The instructions say we can ignore the
cid
field in determining valid passports, so we filter it out.
We are left with a super clean dataset. Let’s start checking passports!
Part 1
The first part asks how many valid passports there are. Since we have each passport broken down by field:value
, we can just check each one has the correct number of fields, which is told to be seven (“cid” is optional).
fields %>%
count(passport) %>%
summarise(answer = sum(n == 7))
## # A tibble: 1 x 1
## answer
## <int>
## 1 239
Here count
groups every value in the passport column, creating a new column with a default name of n
. We then use that in a summarise
call, looking just at rows where n == 7
. This actually gives a column of bools, which we can sum up to give the answer, which we name answer
to make it nice and clean.
Part 2
The second part adds some data validation, which is a little fiddly, but since our data is so crispy clean, we can make easy work of it.
fields %>%
extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE) %>%
mutate(valid = case_when(f == "byr" ~ between(as.integer(v), 1920, 2002),
f == "iyr" ~ between(as.integer(v), 2010, 2020),
f == "eyr" ~ between(as.integer(v), 2020, 2030),
f == "hgt" ~ ifelse(unit == "cm", between(height, 150, 193),
between(height, 59, 76)),
f == "hcl" ~ str_detect(v, "^#[0-9a-f]{6}$"),
f == "ecl" ~ v %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth"),
f == "pid" ~ str_detect(v, "^[0-9]{9}$")
)) %>%
filter(valid) %>%
count(passport) %>%
summarise(answer = sum(n == 7))
## # A tibble: 1 x 1
## answer
## <int>
## 1 188
Again…lovely code, David 🤩
Let’s understand what’s going on.
Breakdown
We are pretty much ready to handle each validation separately, except for the hgt
field. Here we need to split up the value we have to get at each part.
wrangle3 = fields %>%
extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE)
Using extract
, we can turn capture groups into new columns. Here we split our value into two columns, height
and unit
. The regex looks for multiple digits followed by either “cm” or “in”. Finally, we convert
to make the digit portion into numbers and we don’t remove
so that we still have the original v
column.
At this point, we are ready to break down each case. So case_when
sounds like a good choice.
wrangle4 = fields %>%
extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE) %>%
mutate(valid = case_when(f == "byr" ~ between(as.integer(v), 1920, 2002),
f == "iyr" ~ between(as.integer(v), 2010, 2020),
f == "eyr" ~ between(as.integer(v), 2020, 2030),
f == "hgt" ~ ifelse(unit == "cm", between(height, 150, 193),
between(height, 59, 76)),
f == "hcl" ~ str_detect(v, "^#[0-9a-f]{6}$"),
f == "ecl" ~ v %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth"),
f == "pid" ~ str_detect(v, "^[0-9]{9}$")
))
Here we take on each field separately and define the valid condition. So, for every case, we check what field we are dealing with and act accordingly.
- byr, iyr, eyr: The first three items are all restricted by a year range. Here we can use
between
to ensure the value is between the two extremes, inclusive (convert to an integer since everything is a string except the new height columns we previously made). - hgt: For height, we have a conditional version of using
between
since the ranges are different depending on the units. So, we can check the unit is “cm” and specify that range, otherwise use the range for inches (we are told those are the only options) - hcl, pid: These are regex calls that check we see the desired behavior (was happy to see my python code lined up here!). For
hcl
we make sure the input is simply a number sign followed by six hexadecimal values. Forpid
we ensure we only see nine digits. When I did this I ran into a few problems since I didn’t specify the bounds of the regex call with^$
. These ensure the values we consider valid just contain the items we are looking for with nothing else (beginning of string, thing we are looking for, end of string). - ecl: We are given a list of valid eye colors. So we simply check our value is in those valid choices.
%in
is a super helpful way to accomplish this in an understandable way.
Okay great, we have validated every field in every passport. But if we run this code, we notice we get a bunch of warnings:
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 1920, 2002): NAs introduced by coercion
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion to integer range
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 1920, 2002): NAs introduced by coercion to
## integer range
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2010, 2020): NAs introduced by coercion
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion to integer range
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2010, 2020): NAs introduced by coercion to
## integer range
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2020, 2030): NAs introduced by coercion
## Warning: Problem with `mutate()` input `valid`.
## ℹ NAs introduced by coercion to integer range
## ℹ Input `valid` is `case_when(...)`.
## Warning in between(as.integer(v), 2020, 2030): NAs introduced by coercion to
## integer range
What about those NA warnings? Should we be worried?
case_when
will introduce NAs when a case fails or none of the cases match. In this situation, we have defined all possible cases, so that isn’t the culprit (although it’s important to consider). Let’s look at the cases where we now have NAs in our new valid
column:
wrangle4 %>%
filter(is.na(valid)) %>%
select(f:valid)
## # A tibble: 15 x 5
## f v height unit valid
## <chr> <chr> <int> <chr> <lgl>
## 1 hgt 68 NA <NA> NA
## 2 hgt 83 NA <NA> NA
## 3 hgt 95 NA <NA> NA
## 4 hgt 174 NA <NA> NA
## 5 hgt 188 NA <NA> NA
## 6 hgt 145 NA <NA> NA
## 7 hgt 184 NA <NA> NA
## 8 hgt 103 NA <NA> NA
## 9 hgt 84 NA <NA> NA
## 10 hgt 137 NA <NA> NA
## 11 hgt 71 NA <NA> NA
## 12 hgt 141 NA <NA> NA
## 13 hgt 101 NA <NA> NA
## 14 hgt 109 NA <NA> NA
## 15 hgt 154 NA <NA> NA
We see that all of the NAs are introduced for the hgt
field. Specifically, there seems to be a bunch of entries in the data where the units of the height entry are not specified. However, this isn’t a problem since we know these must be invalid heights. Once we filter down to the valid entries only, the NAs will be ignored. We could fill the NAs in the valid
column with FALSE, but they will get dealt with either way. Let’s finish up!
We have now checked every field for every passport. Now we just need to figure out how many are valid again.
wrangle5 = fields %>%
extract(v, c("height", "unit"), "(\\d+)(cm|in)", convert = TRUE, remove = FALSE) %>%
mutate(valid = case_when(f == "byr" ~ between(as.integer(v), 1920, 2002),
f == "iyr" ~ between(as.integer(v), 2010, 2020),
f == "eyr" ~ between(as.integer(v), 2020, 2030),
f == "hgt" ~ ifelse(unit == "cm", between(height, 150, 193),
between(height, 59, 76)),
f == "hcl" ~ str_detect(v, "^#[0-9a-f]{6}$"),
f == "ecl" ~ v %in% c("amb", "blu", "brn", "gry", "grn", "hzl", "oth"),
f == "pid" ~ str_detect(v, "^[0-9]{9}$")
)) %>%
filter(valid) %>%
count(passport) %>%
summarise(answer = sum(n == 7))
We do a similar trick to part 1. First filter
on the valid entries only. Then use count
and summarise
to get the number of passports with the correct number of (valid) entries. Again, n
is the default name of count
and we can look for seven since cid
was optional and we left it out.
Tidyverse for the win!
And there we have it! We processed some passports in the tidyverse!
Hope you found this educational. I know I did.
Till next time!
Learning
- The genius part for me (and basically the whole reason for this post) is the combination of using
str_match_all
andmap
, which can allow for complex parsing of capture groups. I basically did this by hand in Python and it was way messier. Good to know! case_when
makes it easy to write understandable code about multiple conditions
References
- David Robinson’s tidyverse solution on Twitter
Image Credit
Tree by Zach Bogart from the Noun Project