📈 Aggregation Functions
Aggregation functions are the tools that turn groups of data into meaningful summaries! Once you've grouped your data, these functions help you calculate totals, averages, counts, and much more. Mastering aggregation functions will unlock powerful insights in your data.
Think of aggregation functions like different ways to summarize a book - you could count the pages, find the average word length, or identify the most common theme.
import pandas as pd
# Sales team performance
team = pd.DataFrame({
'region': ['North', 'South', 'North', 'South', 'North'],
'salesperson': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'sales': [85000, 92000, 78000, 88000, 95000],
'calls': [150, 180, 120, 160, 200]
})
print("Sales team data:")
print(team)
print()
# Multiple aggregation functions
summary = team.groupby('region').agg({
'sales': ['sum', 'mean', 'max'],
'calls': ['sum', 'mean']
})
print("Region summary:")
print(summary)
🎯 Types of Aggregation Functions
Pandas provides many built-in aggregation functions for different purposes:
📊 Statistical Aggregations
Statistical functions help you understand the distribution and central tendencies of your data:
import pandas as pd
# Student test scores
scores = pd.DataFrame({
'subject': ['Math', 'Science', 'Math', 'Science', 'Math', 'Science'],
'student': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie'],
'score': [85, 92, 78, 88, 95, 82]
})
print("Student scores:")
print(scores)
print()
# Statistical aggregations by subject
stats = scores.groupby('subject')['score'].agg([
'mean', # Average score
'median', # Middle score
'std', # Standard deviation
'min', # Lowest score
'max' # Highest score
])
print("Score statistics by subject:")
print(stats.round(2))
🔢 Counting and Size Functions
Understanding the size and frequency of your groups is crucial:
import pandas as pd
import numpy as np
# Customer orders with some missing data
orders = pd.DataFrame({
'customer': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie'],
'product': ['Laptop', 'Mouse', 'Laptop', np.nan, 'Keyboard', 'Monitor'],
'amount': [999, 25, 999, np.nan, 75, 300]
})
print("Orders (with missing data):")
print(orders)
print()
# Different counting methods
print("Count comparison:")
print("Non-null products per customer:", orders.groupby('customer')['product'].count())
print("Total orders per customer:", orders.groupby('customer').size())
print("Unique products per customer:", orders.groupby('customer')['product'].nunique())
🎨 Custom Aggregation Functions
You can create your own aggregation functions for specific business logic:
import pandas as pd
def price_range(prices):
"""Calculate the range between max and min prices"""
return prices.max() - prices.min()
def top_product(products):
"""Find the most frequent product"""
return products.mode().iloc[0] if len(products.mode()) > 0 else 'None'
# Product sales data
sales = pd.DataFrame({
'category': ['Electronics', 'Books', 'Electronics', 'Books', 'Electronics'],
'product': ['Laptop', 'Novel', 'Mouse', 'Textbook', 'Laptop'],
'price': [999, 15, 25, 85, 899]
})
print("Sales data:")
print(sales)
print()
# Apply custom functions
custom_agg = sales.groupby('category').agg({
'price': ['mean', price_range],
'product': [top_product, 'count']
})
print("Custom aggregations:")
print(custom_agg)
📋 Multiple Aggregations with Different Names
You can rename aggregation results for clarity:
import pandas as pd
# Employee performance data
performance = pd.DataFrame({
'department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'],
'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'rating': [4.5, 4.8, 4.2, 4.6, 4.0, 4.3],
'projects': [8, 12, 6, 15, 4, 5]
})
print("Performance data:")
print(performance)
print()
# Named aggregations for clarity
summary = performance.groupby('department').agg(
avg_rating=('rating', 'mean'),
max_rating=('rating', 'max'),
total_projects=('projects', 'sum'),
team_size=('employee', 'count')
).round(2)
print("Department summary:")
print(summary)
🔍 Conditional Aggregations
You can apply conditions within aggregations:
import pandas as pd
# Sales with performance targets
sales = pd.DataFrame({
'region': ['North', 'South', 'North', 'South', 'North', 'South'],
'salesperson': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'sales': [85000, 120000, 78000, 95000, 110000, 88000],
'target': [80000, 100000, 75000, 90000, 100000, 85000]
})
print("Sales vs targets:")
print(sales)
print()
# Calculate performance metrics
def above_target_count(group):
"""Count how many salespeople exceeded their target"""
return (group['sales'] > group['target']).sum()
def avg_performance(group):
"""Calculate average performance vs target"""
return (group['sales'] / group['target']).mean()
performance = sales.groupby('region').apply(lambda x: pd.Series({
'total_sales': x['sales'].sum(),
'avg_target_achievement': avg_performance(x),
'people_above_target': above_target_count(x),
'team_size': len(x)
}))
print("Regional performance:")
print(performance.round(2))
📊 Percentile and Quantile Analysis
Percentiles help you understand data distribution:
import pandas as pd
# Customer spending data
spending = pd.DataFrame({
'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Premium', 'Standard'],
'monthly_spend': [250, 45, 380, 62, 420, 38],
'orders': [8, 3, 12, 4, 15, 2]
})
print("Customer spending:")
print(spending)
print()
# Percentile analysis
percentiles = spending.groupby('segment')['monthly_spend'].agg([
('25th_percentile', lambda x: x.quantile(0.25)),
('median', 'median'),
('75th_percentile', lambda x: x.quantile(0.75)),
('max', 'max')
])
print("Spending percentiles by segment:")
print(percentiles)
🔧 Advanced Aggregation Techniques
Transform vs Aggregation
import pandas as pd
# Sales team data
team = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'salesperson': ['Alice', 'Bob', 'Charlie', 'Diana'],
'sales': [85000, 78000, 92000, 88000]
})
print("Original data:")
print(team)
print()
# Aggregation: Summarize by region
region_summary = team.groupby('region')['sales'].agg(['sum', 'mean'])
print("Aggregation (summary):")
print(region_summary)
print()
# Transform: Add regional average to each row
team['region_avg'] = team.groupby('region')['sales'].transform('mean')
print("Transform (added column):")
print(team)
🎯 Key Takeaways
🚀 What's Next?
Perfect! You now have a comprehensive toolkit of aggregation functions. Next, let's learn about pivot tables and cross-tabulation - powerful ways to create summary tables and analyze relationships.
Continue to: Pivot Tables and Cross Tabulation
Keep aggregating! 📈🔢
Was this helpful?
Track Your Learning Progress
Sign in to bookmark tutorials and keep track of your learning journey.
Your progress is saved automatically as you read.