Tidytuesday #1: Australia Fires

Looking at temperature data in Australia from tidytuesday.

Goal: to get some plots up and running and learn aspects of tidyverse in action.

Gave myself an hour; hit several roadblocks, but general flow of filtering, grouping, etc. was easy going. Roadblocks generally regarding ability to minutely change visuals and other fiddly things. Turns out this exploration was less about fires and more about temperature data for a given region.

Look at Temperature Data

temperature <- tuesdata$temperature

# how many cities are there in the data?
temperature %>% 
  count(city_name)
## # A tibble: 7 x 2
##   city_name     n
##   <chr>     <int>
## 1 BRISBANE  51128
## 2 CANBERRA  77526
## 3 KENT      79924
## 4 MELBOURNE 79926
## 5 PERTH     79926
## 6 PORT      79924
## 7 SYDNEY    79924
# some prep
temperature <- temperature %>% 
  mutate(year = as.numeric(format(date, "%Y",)),
         month = as.numeric(format(date, "%m")))

How Do Different Cities in Australia Compare Over Time?

Spent a while making plots that had huge chunks missing. Realized I wasn’t removing NAs when calculating the mean. Can see a slight increase in temperature over time, but on the axis scale I chose, it’s hard to see. Would probably want to isolate each temperature type and see if the increase in temperature is more apparent.

# look at max and min temps for each city over time
temperature %>% 
  group_by(city_name, year, temp_type) %>% 
  summarise(avg_temp = mean(temperature, na.rm = TRUE)) %>% 
  
  ggplot(aes(year, avg_temp)) +
    geom_line(aes(color = temp_type)) +
    facet_wrap(~ city_name) +
    labs(x = "",
         y = "Average Temperature (Celsius)",
         color = "Temp Reading",
         title = "Australia Temps, 1910-2020",
         caption = "Source: tidytuesday") +
    guides(linetype = FALSE)

Max Temperatures By City

Looking at the distribution of temperatures by city. Would be interesting to see if these peak max temperatures correlate with the location of the city.

# max temperatures for each region over time
temperature %>% 
  filter(temp_type == "max") %>% 
  group_by(city_name, year) %>% 
  summarise(avg_max_temp = mean(temperature, na.rm = TRUE)) %>% 
  
  ggplot(aes(avg_max_temp, color = city_name)) +
    geom_freqpoly(bins = 20) +
    labs(x = "Average Temperature (Celsius)",
         title = "# Times a Temperature Occurs by City",
         color = "City",
         caption = "Source: tidytuesday")
## `summarise()` regrouping output by 'city_name' (override with `.groups` argument)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

How Have Temperatures Changed in a Century?

Wanted to see how the temperature, both max and min, have changed over a long period. Filtered just for two years spanning a century and plotted the results. There are some concerns since Brisbane doesn’t seem to have data going back that far, but it does show an increase in both readings for all cities.

# then and now
temperature %>% 
  filter(year %in% c(1919, 2019)) %>% 
  group_by(city_name, year, temp_type) %>% 
  summarise(avg_temp = mean(temperature, na.rm = TRUE)) %>% 
  
  ggplot(aes(year, avg_temp, color = city_name)) +
    geom_point() +
    geom_line(aes(group = interaction(temp_type, city_name), linetype = temp_type)) +
    scale_x_continuous(breaks = c(1919, 2019)) +
    labs(x = "",
         y = "Average Temperature (Celsius)",
         linetype = "Temp Reading",
         color = "City",
         title = "Australia Temps, Then & Now",
         subtitle = "Average Max/Min Temps by City in 1919 and 2019",
         caption = "Source: tidytuesday")
## `summarise()` regrouping output by 'city_name', 'year' (override with `.groups` argument)

Learning

  • Without as.numeric, years on the x-axis got crowded since they were being plotted as distinct values. Moved this out into early data prep when I realized I was repeating myself for different plots.
  • Was a challenge to get geom_line() to work as I wanted for last plot spanning a century. Used interaction after a Stack Overflow search, but was certainly a case of grabbing a tool without knowing how it works.
  • Setting guides to FALSE for a given aesthetic is useful to suppress undesirable legends.
  • Would be interested in exploring a more pleasing default color palette to reach for.
  • I like adding a blank line between the data wrangling and the ggplot code to see the separation between the steps in the pipeline.
  • I’m not sure how to remedy the warning/message about regrouping the output of summarise() and overwritting with .groups argument. It seems to happen since I group the data then also define a group in ggplot to draw the right number of lines. Would need further investigation.
  • I hit a GitHub rate limit by rendering a blogdown post that was pinging the tidytuesday data too many times. Was initally confused since I thought blogdown was rate limited somehow! Adding cache = FALSE to the code chunk downloading the data helped to reduce this.

Image Credit

Campfire by Zach Bogart from the Noun Project

Zach Bogart
Zach Bogart
Data Explorer

Science, Design, & Data. I’ll know it when I see it.