Tidytuesday #1: Australia Fires
Looking at temperature data in Australia from tidytuesday.
Goal: to get some plots up and running and learn aspects of tidyverse in action.
Gave myself an hour; hit several roadblocks, but general flow of filtering, grouping, etc. was easy going. Roadblocks generally regarding ability to minutely change visuals and other fiddly things. Turns out this exploration was less about fires and more about temperature data for a given region.
Look at Temperature Data
temperature <- tuesdata$temperature # how many cities are there in the data? temperature %>% count(city_name)
## # A tibble: 7 x 2 ## city_name n ## <chr> <int> ## 1 BRISBANE 51128 ## 2 CANBERRA 77526 ## 3 KENT 79924 ## 4 MELBOURNE 79926 ## 5 PERTH 79926 ## 6 PORT 79924 ## 7 SYDNEY 79924
# some prep temperature <- temperature %>% mutate(year = as.numeric(format(date, "%Y",)), month = as.numeric(format(date, "%m")))
How Do Different Cities in Australia Compare Over Time?
Spent a while making plots that had huge chunks missing. Realized I wasn’t removing
NAs when calculating the mean. Can see a slight increase in temperature over time, but on the axis scale I chose, it’s hard to see. Would probably want to isolate each temperature type and see if the increase in temperature is more apparent.
# look at max and min temps for each city over time temperature %>% group_by(city_name, year, temp_type) %>% summarise(avg_temp = mean(temperature, na.rm = TRUE)) %>% ggplot(aes(year, avg_temp)) + geom_line(aes(color = temp_type)) + facet_wrap(~ city_name) + labs(x = "", y = "Average Temperature (Celsius)", color = "Temp Reading", title = "Australia Temps, 1910-2020", caption = "Source: tidytuesday") + guides(linetype = FALSE)
Max Temperatures By City
Looking at the distribution of temperatures by city. Would be interesting to see if these peak max temperatures correlate with the location of the city.
# max temperatures for each region over time temperature %>% filter(temp_type == "max") %>% group_by(city_name, year) %>% summarise(avg_max_temp = mean(temperature, na.rm = TRUE)) %>% ggplot(aes(avg_max_temp, color = city_name)) + geom_freqpoly(bins = 20) + labs(x = "Average Temperature (Celsius)", title = "# Times a Temperature Occurs by City", color = "City", caption = "Source: tidytuesday")
## `summarise()` regrouping output by 'city_name' (override with `.groups` argument)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
How Have Temperatures Changed in a Century?
Wanted to see how the temperature, both max and min, have changed over a long period. Filtered just for two years spanning a century and plotted the results. There are some concerns since Brisbane doesn’t seem to have data going back that far, but it does show an increase in both readings for all cities.
# then and now temperature %>% filter(year %in% c(1919, 2019)) %>% group_by(city_name, year, temp_type) %>% summarise(avg_temp = mean(temperature, na.rm = TRUE)) %>% ggplot(aes(year, avg_temp, color = city_name)) + geom_point() + geom_line(aes(group = interaction(temp_type, city_name), linetype = temp_type)) + scale_x_continuous(breaks = c(1919, 2019)) + labs(x = "", y = "Average Temperature (Celsius)", linetype = "Temp Reading", color = "City", title = "Australia Temps, Then & Now", subtitle = "Average Max/Min Temps by City in 1919 and 2019", caption = "Source: tidytuesday")
## `summarise()` regrouping output by 'city_name', 'year' (override with `.groups` argument)
as.numeric, years on the x-axis got crowded since they were being plotted as distinct values. Moved this out into early data prep when I realized I was repeating myself for different plots.
- Was a challenge to get
geom_line()to work as I wanted for last plot spanning a century. Used
interactionafter a Stack Overflow search, but was certainly a case of grabbing a tool without knowing how it works.
FALSEfor a given aesthetic is useful to suppress undesirable legends.
- Would be interested in exploring a more pleasing default color palette to reach for.
- I like adding a blank line between the data wrangling and the ggplot code to see the separation between the steps in the pipeline.
- I’m not sure how to remedy the warning/message about regrouping the output of
summarise()and overwritting with
.groupsargument. It seems to happen since I group the data then also define a group in ggplot to draw the right number of lines. Would need further investigation.
- I hit a GitHub rate limit by rendering a blogdown post that was pinging the tidytuesday data too many times. Was initally confused since I thought blogdown was rate limited somehow! Adding
cache = FALSEto the code chunk downloading the data helped to reduce this.