Doing a quick pass at another TidyTuesday dataset, this time about the NFL. Let’s play ball!
- Source: TidyTuesday
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')
Ask Some Questions of the Data
How Many Teams?
# how many teams?
attendance %>%
count(team, sort=TRUE)
## # A tibble: 32 x 2
## team n
## <chr> <int>
## 1 New York 680
## 2 Arizona 340
## 3 Atlanta 340
## 4 Baltimore 340
## 5 Buffalo 340
## 6 Carolina 340
## 7 Chicago 340
## 8 Cincinnati 340
## 9 Cleveland 340
## 10 Dallas 340
## # … with 22 more rows
# --> 32 teams, each with 340 rows (gotta love consistency!)
Year Range?
# what is the year range?
attendance %>%
count(year)
## # A tibble: 20 x 2
## year n
## <dbl> <int>
## 1 2000 527
## 2 2001 527
## 3 2002 544
## 4 2003 544
## 5 2004 544
## 6 2005 544
## 7 2006 544
## 8 2007 544
## 9 2008 544
## 10 2009 544
## 11 2010 544
## 12 2011 544
## 13 2012 544
## 14 2013 544
## 15 2014 544
## 16 2015 544
## 17 2016 544
## 18 2017 544
## 19 2018 544
## 20 2019 544
# --> spans 2000 to 2019
- So we have the attendance data for 32 teams for 2000 to 2019
- The README mentions that there is a week each year where no game is played by each team
# extent of each team's year span
attendance %>%
group_by(team) %>%
summarise(max_year = max(year),
min_year = min(year))
## # A tibble: 32 x 3
## team max_year min_year
## <chr> <dbl> <dbl>
## 1 Arizona 2019 2000
## 2 Atlanta 2019 2000
## 3 Baltimore 2019 2000
## 4 Buffalo 2019 2000
## 5 Carolina 2019 2000
## 6 Chicago 2019 2000
## 7 Cincinnati 2019 2000
## 8 Cleveland 2019 2000
## 9 Dallas 2019 2000
## 10 Denver 2019 2000
## # … with 22 more rows
# --> most teams seem to have a span from 2000 to 2019, with a few being shorter (2016-2019, for example for LA)
Home vs Away
Looking at the difference between home and away attendance.
home_and_away = attendance %>%
# filter(team == "Arizona") %>%
pivot_longer(cols = home:away,
names_to = "location",
values_to = "attendance")
attendance %>%
select(!week:weekly_attendance) %>%
distinct() %>%
ggplot(aes(year)) +
geom_line(aes(y = home)) +
geom_line(aes(y = away), linetype = "dotted") +
scale_y_continuous(labels = scales::comma) +
facet_wrap(~team) +
labs(
title = "Home and Away NFL Attendance",
subtitle = "Attendance for NFL games by team, Home (solid) and Away (dotted)",
x = "",
y = "Number of People")
Went down a big rabbit hole of trying to plot the difference between the two lines and conditionally color the area. But it’s a challenge to get the complete area to shade in when the crossings don’t happen at data points. Some things tried:
# Big rabbit hole of not being able to fill between two curves
test = attendance %>%
select(!week:weekly_attendance) %>%
distinct() %>%
filter(team %in% c("Dallas", "Miami"))
test %>%
ggplot(aes(year)) +
geom_ribbon(aes(ymin = home, ymax = away, fill = home > away)) +
geom_line(data = test, aes(year, home)) +
geom_line(data = test, aes(year, away)) +
facet_wrap(~team) +
labs(
title = "Example of Trying to Shade Between Lines",
subtitle = "Not completely shaded in. Bad artifacts for multiple crossings"
)
Someone did it online, but I don’t fully understand how it works. I’m surprised how challenging it is to find the difference between two geom_line()
’s. Will have to try it out some other time. Would be interested in getting more comfortable at quickly making a conditional line chart with the areas shaded.
What day of the week do games happen?
games %>%
ggplot(aes(factor(day, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
fill = day)) +
geom_bar() +
scale_fill_brewer(palette = "Set3") +
labs(title = "Sunday, Sunday, SUNDAY!",
subtitle = "NFL Games by Date Played",
x = "",
y = "# Games") +
guides(fill = FALSE)
What are the scores?
games %>%
select(year, pts_win, pts_loss) %>%
pivot_longer(cols = pts_win:pts_loss,
names_to = "win_lose",
values_to = "points") %>%
ggplot(aes(points, color = win_lose)) +
geom_freqpoly(bins = 20) +
labs(title = "Winning and Losing Scores",
subtitle = "Distribution of points scored for winning/losing teams",
color = "",
x = "Points Scored",
y = "# Games")
How are teams doing over time?
standings %>%
ggplot(aes(x = year)) +
geom_line(aes(y = wins), color = "seagreen") +
geom_line(aes(y = loss), color = "firebrick") +
facet_wrap(~team) +
labs(title = "You Win Some, You Lose Some",
subtitle = "Win/Loss for each NFL team over time",
x = "",
y = "Games")
Till next time!
Image Credit
Football by Zach Bogart from the Noun Project