🔗 GroupBy Operations
GroupBy is one of pandas' most powerful features! It lets you split your data into groups based on column values, then apply functions to each group. This is perfect for answering questions like "What's the average sales by department?" or "How many orders per customer?"
Think of GroupBy like organizing your music collection by artist, then asking questions about each artist's songs.
import pandas as pd
# Simple sales data
sales = pd.DataFrame({
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'revenue': [50000, 75000, 45000, 30000, 80000]
})
print("Sales data:")
print(sales)
print()
# Group by department and sum revenue
dept_total = sales.groupby('department')['revenue'].sum()
print("Total revenue by department:")
print(dept_total)
🎯 Understanding GroupBy
GroupBy works in three steps: Split, Apply, Combine.
📊 Basic GroupBy Syntax
The basic syntax is simple: df.groupby('column')['target_column'].function()
import pandas as pd
# Customer orders
orders = pd.DataFrame({
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop'],
'amount': [999, 25, 75, 300, 999]
})
print("Customer orders:")
print(orders)
print()
# Different GroupBy operations
print("Total spent per customer:")
print(orders.groupby('customer')['amount'].sum())
print()
print("Average order per customer:")
print(orders.groupby('customer')['amount'].mean())
print()
print("Number of orders per customer:")
print(orders.groupby('customer').size())
🔢 Common Aggregation Functions
Here are the most useful functions for GroupBy operations:
Mathematical Aggregations
import pandas as pd
# Employee data
employees = pd.DataFrame({
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'HR'],
'salary': [50000, 75000, 52000, 48000, 80000, 45000],
'experience': [2, 5, 3, 1, 7, 2]
})
print("Employee data:")
print(employees)
print()
# Mathematical aggregations by department
print("Average salary by department:")
print(employees.groupby('department')['salary'].mean())
print()
print("Salary range by department:")
salary_range = employees.groupby('department')['salary'].agg(['min', 'max'])
print(salary_range)
Counting Operations
import pandas as pd
# Product reviews
reviews = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Laptop'],
'rating': [5, 4, 4, 5, 3, 5],
'reviewer': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank']
})
print("Product reviews:")
print(reviews)
print()
# Count operations
print("Number of reviews per product:")
print(reviews.groupby('product').size())
print()
print("Average rating per product:")
print(reviews.groupby('product')['rating'].mean())
print()
print("Number of unique reviewers per product:")
print(reviews.groupby('product')['reviewer'].nunique())
🎨 Grouping by Multiple Columns
You can group by multiple columns to create more detailed analysis:
import pandas as pd
# Sales by region and quarter
sales = pd.DataFrame({
'region': ['North', 'South', 'North', 'South', 'North', 'South'],
'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q2'],
'sales': [100000, 85000, 110000, 90000, 95000, 88000]
})
print("Sales data:")
print(sales)
print()
# Group by multiple columns
multi_group = sales.groupby(['region', 'quarter'])['sales'].sum()
print("Sales by region and quarter:")
print(multi_group)
🔧 Multiple Aggregations with agg()
The .agg()
method lets you apply multiple functions at once:
import pandas as pd
# Store performance data
stores = pd.DataFrame({
'region': ['East', 'West', 'East', 'West', 'East'],
'revenue': [50000, 75000, 45000, 80000, 55000],
'customers': [150, 200, 120, 220, 160]
})
print("Store data:")
print(stores)
print()
# Multiple aggregations
summary = stores.groupby('region').agg({
'revenue': ['sum', 'mean'],
'customers': ['sum', 'mean']
})
print("Store summary by region:")
print(summary)
📋 Working with GroupBy Results
GroupBy results can be converted back to DataFrames for further analysis:
import pandas as pd
# Product categories
products = pd.DataFrame({
'category': ['Electronics', 'Books', 'Electronics', 'Clothing', 'Books'],
'price': [299, 15, 199, 45, 25],
'rating': [4.5, 4.0, 4.2, 3.8, 4.1]
})
print("Product data:")
print(products)
print()
# GroupBy with reset_index to get DataFrame
category_stats = products.groupby('category').agg({
'price': 'mean',
'rating': 'mean'
}).reset_index()
print("Category statistics (as DataFrame):")
print(category_stats)
print()
# Add descriptive column names
category_stats.columns = ['Category', 'Average Price', 'Average Rating']
print("With better column names:")
print(category_stats)
🎯 Filtering Groups
You can filter groups based on their characteristics:
import pandas as pd
# Customer transactions
transactions = pd.DataFrame({
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Diana'],
'amount': [100, 50, 150, 75, 200, 80, 300]
})
print("Transaction data:")
print(transactions)
print()
# Filter groups with more than 2 transactions
frequent_customers = transactions.groupby('customer').filter(lambda x: len(x) > 2)
print("Customers with more than 2 transactions:")
print(frequent_customers)
print()
# Get customers who spent more than $200 total
big_spenders = transactions.groupby('customer')['amount'].sum()
big_spenders = big_spenders[big_spenders > 200]
print("Customers who spent more than $200:")
print(big_spenders)
📊 GroupBy with Different Data Types
GroupBy works with different types of data:
🎯 Key Takeaways
🚀 What's Next?
Excellent! You now understand the fundamentals of GroupBy operations. Next, let's explore more advanced aggregation functions and custom operations.
Continue to: Aggregation Functions
Keep grouping! 🔗📊
Was this helpful?
Track Your Learning Progress
Sign in to bookmark tutorials and keep track of your learning journey.
Your progress is saved automatically as you read.