Need to calculate stats for different categories in your data?
People new to Python often write for loops.
A faster and cleaner way is to use .groupby().
Here’s an example: calculate the average and total sales for each product category.
In this case, we’ll just set up some sample data in a DataFrame, but you could read it from a CSV as well.
import pandas as pd
# Our Sample Data - you can also load a CSV file into a dataframe
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
        'Sales': [100, 150, 200, 50, 300, 400]}
df = pd.DataFrame(data)
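If your data lives in a CSV instead, loading it could look roughly like the sketch below. To keep it runnable without a file on disk, it reads the same sample rows from an in-memory string; the variable names and the "sales.csv" filename in the comment are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a real file: the same sample data written as CSV text.
# With a file on disk you would pass its path instead,
# e.g. pd.read_csv("sales.csv") ("sales.csv" is a hypothetical filename).
csv_text = """Category,Sales
A,100
B,150
A,200
B,50
A,300
C,400
"""

df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```

Everything that follows works the same way on a DataFrame loaded like this.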
For Loop Version (the “newbie” version):
# 1. Get unique categories
unique_categories = df['Category'].unique()
# 2. Create a dictionary to store results
results = {}
# 3. Loop through each category
for category in unique_categories:
    subset = df[df['Category'] == category]
    total_sales = subset['Sales'].sum()
    mean_sales = subset['Sales'].mean()
    results[category] = {'sum': total_sales, 'mean': mean_sales}
# 4. Convert back to a DataFrame
final_df = pd.DataFrame.from_dict(results, orient='index')
print(final_df)
Pro Version:
# The PRO version: same results, less code
category_stats = df.groupby('Category')['Sales'].agg(['sum', 'mean'])
print(category_stats)
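If you want to convince yourself that the two versions really agree, one way is to build both and compare them. This is a minimal sketch on the same sample data; `loop_df` and `grouped_df` are names chosen here, and the comparison ignores the index name and dtypes, since only the numbers matter.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
                   'Sales': [100, 150, 200, 50, 300, 400]})

# Loop version, condensed
results = {}
for category in df['Category'].unique():
    subset = df[df['Category'] == category]
    results[category] = {'sum': subset['Sales'].sum(),
                         'mean': subset['Sales'].mean()}
loop_df = pd.DataFrame.from_dict(results, orient='index')

# groupby version
grouped_df = df.groupby('Category')['Sales'].agg(['sum', 'mean'])

# Align row and column order before comparing, in case they differ
loop_df = loop_df.loc[grouped_df.index, grouped_df.columns]

# Raises an AssertionError if any value differs
pd.testing.assert_frame_equal(loop_df, grouped_df,
                              check_names=False, check_dtype=False)
print("Both versions produce the same numbers")
```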
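groupby() is usually much faster than looping on large datasets: the loop re-filters the whole DataFrame once per category, while groupby() splits the data and aggregates it in pandas' optimized internal code.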
Writing efficient code means you deliver insights faster, freeing you up for more complex analysis instead of waiting for scripts to run or wasting resources.
You won’t notice it on a dataset with just a few rows, but the gap widens as the number of rows and categories grows.
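To see the gap on your own machine, you could time both approaches on a larger synthetic dataset. The sketch below is one way to do that; the row count, number of categories, and function names are arbitrary choices for illustration, and the timings will vary with your hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: sizes are arbitrary, picked only to make the gap visible
rng = np.random.default_rng(0)
n_rows, n_categories = 200_000, 200
big = pd.DataFrame({
    'Category': rng.integers(0, n_categories, size=n_rows),
    'Sales': rng.integers(1, 500, size=n_rows),
})

def with_loop(frame):
    """Filter the frame once per category, then aggregate each subset."""
    results = {}
    for category in frame['Category'].unique():
        subset = frame[frame['Category'] == category]
        results[category] = {'sum': subset['Sales'].sum(),
                             'mean': subset['Sales'].mean()}
    return pd.DataFrame.from_dict(results, orient='index')

def with_groupby(frame):
    """Let pandas split and aggregate in one call."""
    return frame.groupby('Category')['Sales'].agg(['sum', 'mean'])

# Run each version 3 times and report the total time
loop_seconds = timeit.timeit(lambda: with_loop(big), number=3)
groupby_seconds = timeit.timeit(lambda: with_groupby(big), number=3)
print(f"loop:    {loop_seconds:.3f} s for 3 runs")
print(f"groupby: {groupby_seconds:.3f} s for 3 runs")
```

Try raising n_categories: the loop does one full pass over the data per category, so its run time climbs with the category count.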
Same output, more optimized code.