Tidytuesday #5: NFL Attendance

Win one for the home team

Doing a quick pass at another TidyTuesday dataset, this time about the NFL. Let’s play ball!

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

Ask Some Questions of the Data

How Many Teams?

# how many teams?
attendance %>% 
  count(team, sort=TRUE)
## # A tibble: 32 x 2
##    team           n
##    <chr>      <int>
##  1 New York     680
##  2 Arizona      340
##  3 Atlanta      340
##  4 Baltimore    340
##  5 Buffalo      340
##  6 Carolina     340
##  7 Chicago      340
##  8 Cincinnati   340
##  9 Cleveland    340
## 10 Dallas       340
## # … with 22 more rows
# --> 32 teams, each with 340 rows (gotta love consistency!)

Year Range?

# what is the year range?
attendance %>% 
  count(year)
## # A tibble: 20 x 2
##     year     n
##    <dbl> <int>
##  1  2000   527
##  2  2001   527
##  3  2002   544
##  4  2003   544
##  5  2004   544
##  6  2005   544
##  7  2006   544
##  8  2007   544
##  9  2008   544
## 10  2009   544
## 11  2010   544
## 12  2011   544
## 13  2012   544
## 14  2013   544
## 15  2014   544
## 16  2015   544
## 17  2016   544
## 18  2017   544
## 19  2018   544
## 20  2019   544
# --> spans 2000 to 2019
  • So we have the attendance data for 32 teams for 2000 to 2019
  • The README mentions that there is a week each year where no game is played by each team
# extent of each team's year span
attendance %>% 
  group_by(team) %>% 
  summarise(max_year = max(year),
            min_year = min(year))
## # A tibble: 32 x 3
##    team       max_year min_year
##    <chr>         <dbl>    <dbl>
##  1 Arizona        2019     2000
##  2 Atlanta        2019     2000
##  3 Baltimore      2019     2000
##  4 Buffalo        2019     2000
##  5 Carolina       2019     2000
##  6 Chicago        2019     2000
##  7 Cincinnati     2019     2000
##  8 Cleveland      2019     2000
##  9 Dallas         2019     2000
## 10 Denver         2019     2000
## # … with 22 more rows
# --> most teams seem to have a span from 2000 to 2019, with a few being shorter (2016-2019, for example for LA)

Home vs Away

Looking at the difference between home and away attendance.

home_and_away = attendance %>% 
  # filter(team == "Arizona") %>% 
  
  pivot_longer(cols = home:away,
               names_to = "location",
               values_to = "attendance")

attendance %>% 
  select(!week:weekly_attendance) %>% 
  distinct() %>% 
  
  ggplot(aes(year)) +
    geom_line(aes(y = home)) +
    geom_line(aes(y = away), linetype = "dotted") +
    scale_y_continuous(labels = scales::comma) +
    facet_wrap(~team) +
    labs(
      title = "Home and Away NFL Attendance",
      subtitle = "Attendance for NFL games by team, Home (solid) and Away (dotted)",
      x = "",
      y = "Number of People")

Went down a big rabbit hole of trying to plot the difference between the two lines and conditionally color the area. But it’s a challenge to get the complete area to shade in when the crossings don’t happen at data points. Some things tried:

# Big rabbit hole of not being able to fill between two curves

test = attendance %>% 
  select(!week:weekly_attendance) %>% 
  distinct() %>% 
  filter(team %in% c("Dallas", "Miami")) 

test %>%   
  ggplot(aes(year)) +
    geom_ribbon(aes(ymin = home, ymax = away, fill = home > away)) +
    geom_line(data = test, aes(year, home)) +
    geom_line(data = test, aes(year, away)) +
    facet_wrap(~team) +
    labs(
      title = "Example of Trying to Shade Between Lines",
      subtitle = "Not completely shaded in. Bad artifacts for multiple crossings"
    )

Someone did it online, but I don’t fully understand how it works. I’m surprised how challenging it is to find the difference between two geom_line()’s. Will have to try it out some other time. Would be interested in getting more comfortable at quickly making a conditional line chart with the areas shaded.

What day of the week do games happen?

games %>% 
  ggplot(aes(factor(day, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
             fill = day)) +
    geom_bar() +
    scale_fill_brewer(palette = "Set3") +
    labs(title = "Sunday, Sunday, SUNDAY!",
         subtitle = "NFL Games by Date Played",
         x = "",
         y = "# Games") +
    guides(fill = FALSE)

What are the scores?

games %>% 
  select(year, pts_win, pts_loss) %>% 
  pivot_longer(cols = pts_win:pts_loss,
               names_to = "win_lose",
               values_to = "points") %>% 
  
  ggplot(aes(points, color = win_lose)) +
    geom_freqpoly(bins = 20) +
    labs(title = "Winning and Losing Scores",
         subtitle = "Distribution of points scored for winning/losing teams",
         color = "",
         x = "Points Scored",
         y = "# Games")

How are teams doing over time?

standings %>% 
  ggplot(aes(x = year)) +
    geom_line(aes(y = wins), color = "seagreen") +
    geom_line(aes(y = loss), color = "firebrick") +
    facet_wrap(~team) +
    labs(title = "You Win Some, You Lose Some",
         subtitle = "Win/Loss for each NFL team over time",
         x = "",
         y = "Games")

Till next time!

Image Credit

Football by Zach Bogart from the Noun Project

Zach Bogart
Zach Bogart
Data Explorer

Science, Design, & Data. I’ll know it when I see it.

Related