A Python Blog Post I Often Reach For

I do a lot of grouping in pandas, and I reach for this blog post by Shane Lynn all the time to remind me how to get it done. Grouping in pandas can be tricky, usually when I want to add a column to the original dataframe that holds a grouped value. The technique in that post solves exactly this, and I’m super grateful for it.

So Group, Already

Learning works best when you try it out yourself, so let’s give it a go!

Say I have some county-level census data and I want to group it by state to get the total population of each state.

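import pandas as pd

# Load the county-level census data; adjust the path to wherever your copy lives
df = pd.read_csv("/Users/bogart/Downloads/7001_312628_bundle_archive/acs2017_county_data.csv")
# Keep only the columns needed for grouping
df = df[['State', 'County', 'TotalPop']]
df.head()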

     State          County  TotalPop
0  Alabama  Autauga County     55036
1  Alabama  Baldwin County    203360
2  Alabama  Barbour County     26201
3  Alabama     Bibb County     22580
4  Alabama   Blount County     57667

Using the method from the blog post, I can make a grouped dataframe that sums up the county populations for each state:

df.groupby('State').agg(
    state_pop = pd.NamedAgg(column='TotalPop', aggfunc='sum')
).head(7)

             state_pop
State
Alabama        4850771
Alaska          738565
Arizona        6809946
Arkansas       2977944
California    38982847
Colorado       5436519
Connecticut    3594478
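
If you want to try this without downloading the census file, here is a minimal sketch on a tiny made-up dataframe. The `toy` data and the extra output columns `biggest_county` and `n_counties` are my own illustrative assumptions, not from Shane Lynn's post; the point is that each keyword passed to .agg() becomes one named output column.

```python
import pandas as pd

# Tiny made-up stand-in for the census data (values are illustrative, not real)
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B", "B"],
    "County": ["a1", "a2", "b1", "b2", "b3"],
    "TotalPop": [10, 20, 5, 15, 30],
})

# Each keyword names an output column; NamedAgg says which input column
# to aggregate and which function to apply to it
summary = toy.groupby("State").agg(
    state_pop=pd.NamedAgg(column="TotalPop", aggfunc="sum"),
    biggest_county=pd.NamedAgg(column="TotalPop", aggfunc="max"),
    n_counties=pd.NamedAgg(column="County", aggfunc="count"),
)
print(summary)
```

The result is indexed by State, which is what makes it easy to join back onto the original rows.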

If I pass that grouped dataframe to .join() with on='State', I can add the state totals back to the original dataframe to use later:

df.join(df.groupby('State').agg(
    state_pop = pd.NamedAgg(column='TotalPop', aggfunc='sum')
), on='State').head(7)

     State          County  TotalPop  state_pop
0  Alabama  Autauga County     55036    4850771
1  Alabama  Baldwin County    203360    4850771
2  Alabama  Barbour County     26201    4850771
3  Alabama     Bibb County     22580    4850771
4  Alabama   Blount County     57667    4850771
5  Alabama  Bullock County     10478    4850771
6  Alabama   Butler County     20126    4850771
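
For a single grouped column, pandas also has groupby(...).transform(), which returns a value for every original row and so skips the join. This is a different technique from the one in the blog post; the sketch below uses a made-up `toy` dataframe and a hypothetical `share_of_state` column to check that both routes give the same numbers.

```python
import pandas as pd

# Tiny made-up stand-in for the census data (values are illustrative, not real)
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B", "B"],
    "County": ["a1", "a2", "b1", "b2", "b3"],
    "TotalPop": [10, 20, 5, 15, 30],
})

# Route 1: the join approach from the post
joined = toy.join(
    toy.groupby("State").agg(
        state_pop=pd.NamedAgg(column="TotalPop", aggfunc="sum")
    ),
    on="State",
)

# Route 2: transform broadcasts each group's sum back to every row in that group
toy["state_pop"] = toy.groupby("State")["TotalPop"].transform("sum")

# With the total on every row, per-row calculations are one line
toy["share_of_state"] = toy["TotalPop"] / toy["state_pop"]
print(toy)
```

The join version still earns its keep when you want several named aggregations at once, since one .agg() call can build them all before joining.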

Overall

Grouping data comes up all the time, but the mechanics can be hard to remember. It can sometimes feel like blog posts are shouts into a void, but they can be some of the best teaching tools out there! Thanks again to Shane Lynn; I review the screenshot at the top of your blog post often.

Image Credit

merged dataframes by Zach Bogart from the Noun Project

Zach Bogart