Tidytuesday #6: Food Consumption and CO2 Emissions

Simple Linear Regression with ggplot small multiples

An example of using simple linear regression on some TidyTuesday data. You can check out the original source too.

Setup

This week, we’re looking at “Food Consumption and CO2 Emissions”. Let’s dig in…get it? Anyone? Just me? Fine.
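
The post doesn’t show its setup chunk, so here is a minimal sketch of what the later chunks assume. The exact library calls and the plot_caption value are my inference; plot_caption is taken from the caption string used further down.

# packages the later chunks rely on
library(tidyverse)  # ggplot2, dplyr, readr, forcats
library(plyr)       # dlply(), ldply(); attached after dplyr, so it masks a few verbs
library(janitor)    # clean_names()

# shared plot caption (assumed; matches the literal caption used below)
plot_caption <- "Source: TidyTuesday\nzachbogart.com"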

# read in raw data
food_consumption <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv')

head(food_consumption)
## # A tibble: 6 x 4
##   country   food_category consumption co2_emmission
##   <chr>     <chr>               <dbl>         <dbl>
## 1 Argentina Pork                10.5          37.2 
## 2 Argentina Poultry             38.7          41.5 
## 3 Argentina Beef                55.5        1712   
## 4 Argentina Lamb & Goat          1.56         54.6 
## 5 Argentina Fish                 4.36          6.96
## 6 Argentina Eggs                11.4          10.5

So from the raw data we see four columns: the country, the category of food consumed, how much of it is consumed, and how much CO2 is emitted. Some important units are provided in the README:

  • consumption: kg/person/year (kilograms of food consumed per person per year)
  • co2_emission: kg CO2/person/year (kilograms of CO2 emitted per person per year)

Note: the people themselves aren’t emitting the CO2; the whole food system emits it in the process of producing each food product.

So the data is provided by weight and is broken down to one person’s share.
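
To make the units concrete, take the Argentina beef row above: 1712 kg of CO2 per person per year from 55.5 kg of beef per person per year works out to roughly 1712 / 55.5 ≈ 31 kg of CO2 for every kilogram of beef consumed. Keep that ratio in mind; we’ll see it again below.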

Let’s get a better sense of the extent of each of these variables.

Variable Extent

# how many countries?
food_consumption %>% 
  dplyr::count(country, sort=TRUE)
## # A tibble: 130 x 2
##    country        n
##    <chr>      <int>
##  1 Albania       11
##  2 Algeria       11
##  3 Angola        11
##  4 Argentina     11
##  5 Armenia       11
##  6 Australia     11
##  7 Austria       11
##  8 Bahamas       11
##  9 Bangladesh    11
## 10 Barbados      11
## # … with 120 more rows
# --> 130 countries, 11 food categories each

# extent of continuous values
# --> under thousands for most values
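
That “extent” comment is shorthand for a quick range check; a minimal sketch of what that could look like (this chunk isn’t in the original post):

# range of the continuous columns
summary(dplyr::select(food_consumption, consumption, co2_emmission))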

130 countries with 11 food categories. No missing data for any categories!

Plotting Stuff

Let’s look at the relationship between consuming food and emitting CO2.

# consumption versus co2 emission
food_consumption %>% 
  ggplot(aes(consumption, co2_emmission)) +
    geom_point(color = "seagreen", alpha = 0.5) +
    labs(title = "CO2 vs Food Consumption",
         x = "Food Consumption (kg/person/year)",
         y = "CO2 Emission (kg/person/year)",
         caption = plot_caption)

There’s a nice linear relationship between food consumption and CO2 emissions, and it seems to vary based on the type of food being consumed. Let’s split it up by food category to see the differences.

# consumption versus co2 emission
food_consumption %>% 
  ggplot(aes(consumption, co2_emmission)) +
    geom_point(color = "seagreen", alpha = 0.5) +
    # geom_smooth(method = "lm") +
    labs(title = "CO2 vs Food Consumption",
         x = "Food Consumption (kg/person/year)",
         y = "CO2 Emission (kg/person/year)",
         caption = "Source: TidyTuesday\nzachbogart.com") +
    facet_wrap(~food_category)

The small multiples show drastically different slopes for different types of food, with beef, lamb, and goat being the most severe. We can also see that different types of food are consumed at different rates: wheat and milk products, for example, have a much wider distribution of consumption rates than eggs or poultry. Now let’s see if we can put some numbers to these lines.

Trying just one category manually

As a test, let’s make a model for just the beef category. What is its relationship?

beef = food_consumption %>% 
  filter(food_category == "Beef")

regressor = lm(formula = co2_emmission ~ consumption, data = beef)

summary(regressor)
## 
## Call:
## lm(formula = co2_emmission ~ consumption, data = beef)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0050706 -0.0024859  0.0003595  0.0022968  0.0053201 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -4.549e-04  4.087e-04 -1.113e+00    0.268    
## consumption  3.086e+01  2.641e-05  1.168e+06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002896 on 128 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.365e+12 on 1 and 128 DF,  p-value: < 2.2e-16

There’s a bunch of gobbledy-gook, but the part of interest above is the coefficients. The “3.086e+01” is the slope of the linear model (it also happens to have a really low p-value, which people tell me is good to have). So, for beef, we have a slope of about thirty: for every extra kilogram of beef consumed per person per year, about thirty extra kilograms of CO2 are emitted (per person per year). So that’s neat.
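
As an aside (not in the original post), you can pull those numbers out programmatically instead of reading them off the printed summary:

# extract the fitted intercept and slope directly
coef(regressor)

# or as a tidy one-row-per-term data frame
broom::tidy(regressor)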

# consumption versus co2 emission
food_consumption %>% 
  filter(food_category == "Beef") %>% 
  ggplot(aes(consumption, co2_emmission)) +
    geom_point(color = "firebrick", alpha = 0.5, size = 5) +
    geom_abline(slope = 3.086e+01, intercept = -4.549e-04) +
    labs(title = "CO2 vs Food Consumption for Beef",
         x = "Food Consumption (kg/person/year)",
         y = "CO2 Emission (kg/person/year)") 

Let’s do this for all of the categories.

Loop through Each Small Multiple

Here I want to apply the linear model to every food category. For this, I gotta thank Hadley for this SO post. Super helpful. I’ll have to find ways to work with dlply and ldply more often.

# make a list of models from a dataframe
models = dlply(food_consumption, "food_category", function(df) {
  lm(co2_emmission ~ consumption, data = df)
})

# make a dataframe of the coefficients
regression_lines = ldply(models, coef) %>% 
  clean_names() %>% 
  dplyr::rename(slope = consumption)

head(regression_lines)
##             food_category     intercept      slope
## 1                    Beef -4.548720e-04 30.8579218
## 2                    Eggs  3.190021e-04  0.9186071
## 3                    Fish -2.303501e-04  1.5966796
## 4             Lamb & Goat -5.193769e-04 35.0199619
## 5      Milk - inc. cheese  8.287065e-04  1.4243964
## 6 Nuts inc. Peanut Butter  8.976932e-05  1.7700028
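
For reference, roughly the same intercept/slope table can be built without plyr, using dplyr, broom, and tidyr; this is just a sketch of an alternative, not what the post uses:

# one model per food category via group_modify() + broom::tidy()
regression_lines_tidy <- food_consumption %>%
  dplyr::group_by(food_category) %>%
  dplyr::group_modify(~ broom::tidy(lm(co2_emmission ~ consumption, data = .x))) %>%
  dplyr::ungroup() %>%
  dplyr::select(food_category, term, estimate) %>%
  tidyr::pivot_wider(names_from = term, values_from = estimate) %>%
  dplyr::rename(intercept = `(Intercept)`, slope = consumption)
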
# merge together
food_relationships = left_join(food_consumption, regression_lines, 
                               by = "food_category")

# consumption versus co2 emission
food_relationships %>% 
  ggplot(aes(consumption, co2_emmission)) +
    geom_abline(data = regression_lines, 
                aes(slope = slope, intercept = intercept),
                alpha = 0.8, linetype = "dotted", size = 0.5) +
    geom_point(color = "seagreen", alpha = 0.5, size = 1) +
    geom_text(data = regression_lines, 
              aes(x = 450, y = 1500, label = round(slope, digits = 2)),
              size = 6, hjust = 1, color = "seagreen") +
    labs(title = "CO2 vs Food Consumption",
         subtitle = "Includes kilograms CO2 emitted per kilogram of food consumed (per person per year)",
         x = "Food Consumption (kg/person/year)",
         y = "CO2 Emission (kg/person/year)",
         caption = plot_caption) +
    facet_wrap(~fct_reorder(food_category, desc(slope)))

Schweeet. We can see that food consumption and CO2 emission are linearly related, and each type of food has a different slope showing the kilograms of CO2 emitted for every kilogram of food consumed (per person per year). Beef and other red meats like lamb and goat are off the charts, with huge amounts emitted per kilogram consumed (~30 kg). It’s interesting that pork is not as severe, at only about a tenth of the other red meats (~3 kg). Fish is on par with nuts and dairy (~1.5 kg). It’s also interesting that rice has such a large CO2 emission compared to other plant-based categories like soybeans and wheat.

Learning

  • Using plyr was helpful, but it led to some problems with dplyr functions being masked; I had to call dplyr::count explicitly, for example (see the short sketch after this list).
  • I accomplished the small-multiple reordering by merging the data, but I wonder if there is a way to do this while keeping the dataframes separate.
  • Running a linear model is really approachable, but like a lot of data science, a lot is packed into a few lines of code. Helpful, but it requires both knowledge and understanding.
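
A minimal illustration of that masking issue from the first bullet (the attach order shown here is an assumption):

# plyr also has a count(), so attaching it after dplyr masks dplyr::count()
library(dplyr)
library(plyr)

count(food_consumption, "country")       # resolves to plyr::count()
dplyr::count(food_consumption, country)  # explicit namespacing avoids the surprise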

Till next time!

Image Credit

large pot by Zach Bogart from the Noun Project
