🔗 GroupBy Operations

GroupBy is one of pandas' most powerful features! It lets you split your data into groups based on column values, then apply functions to each group. This is perfect for answering questions like "What's the average sales by department?" or "How many orders per customer?"

Think of GroupBy like organizing your music collection by artist, then asking questions about each artist's songs.

import pandas as pd

# Simple sales data
sales = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'revenue': [50000, 75000, 45000, 30000, 80000]
})

print("Sales data:")
print(sales)
print()

# Group by department and sum revenue
dept_total = sales.groupby('department')['revenue'].sum()
print("Total revenue by department:")
print(dept_total)

🎯 Understanding GroupBy

GroupBy works in three steps: Split, Apply, Combine.

📊 Basic GroupBy Syntax

The basic syntax is simple: df.groupby('column')['target_column'].function()

import pandas as pd

# Customer orders
orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop'],
    'amount': [999, 25, 75, 300, 999]
})

print("Customer orders:")
print(orders)
print()

# Different GroupBy operations
print("Total spent per customer:")
print(orders.groupby('customer')['amount'].sum())
print()

print("Average order per customer:")
print(orders.groupby('customer')['amount'].mean())
print()

print("Number of orders per customer:")
print(orders.groupby('customer').size())

🔢 Common Aggregation Functions

Here are the most useful functions for GroupBy operations:

Mathematical Aggregations

import pandas as pd

# Employee data
employees = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'HR'],
    'salary': [50000, 75000, 52000, 48000, 80000, 45000],
    'experience': [2, 5, 3, 1, 7, 2]
})

print("Employee data:")
print(employees)
print()

# Mathematical aggregations by department
print("Average salary by department:")
print(employees.groupby('department')['salary'].mean())
print()

print("Salary range by department:")
salary_range = employees.groupby('department')['salary'].agg(['min', 'max'])
print(salary_range)

Counting Operations

import pandas as pd

# Product reviews
reviews = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Laptop'],
    'rating': [5, 4, 4, 5, 3, 5],
    'reviewer': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank']
})

print("Product reviews:")
print(reviews)
print()

# Count operations
print("Number of reviews per product:")
print(reviews.groupby('product').size())
print()

print("Average rating per product:")
print(reviews.groupby('product')['rating'].mean())
print()

print("Number of unique reviewers per product:")
print(reviews.groupby('product')['reviewer'].nunique())

🎨 Grouping by Multiple Columns

You can group by multiple columns to create more detailed analysis:

import pandas as pd

# Sales by region and quarter
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q2'],
    'sales': [100000, 85000, 110000, 90000, 95000, 88000]
})

print("Sales data:")
print(sales)
print()

# Group by multiple columns
multi_group = sales.groupby(['region', 'quarter'])['sales'].sum()
print("Sales by region and quarter:")
print(multi_group)

🔧 Multiple Aggregations with agg()

The .agg() method lets you apply multiple functions at once:

import pandas as pd

# Store performance data
stores = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East'],
    'revenue': [50000, 75000, 45000, 80000, 55000],
    'customers': [150, 200, 120, 220, 160]
})

print("Store data:")
print(stores)
print()

# Multiple aggregations
summary = stores.groupby('region').agg({
    'revenue': ['sum', 'mean'],
    'customers': ['sum', 'mean']
})
print("Store summary by region:")
print(summary)

📋 Working with GroupBy Results

GroupBy results can be converted back to DataFrames for further analysis:

import pandas as pd

# Product categories
products = pd.DataFrame({
    'category': ['Electronics', 'Books', 'Electronics', 'Clothing', 'Books'],
    'price': [299, 15, 199, 45, 25],
    'rating': [4.5, 4.0, 4.2, 3.8, 4.1]
})

print("Product data:")
print(products)
print()

# GroupBy with reset_index to get DataFrame
category_stats = products.groupby('category').agg({
    'price': 'mean',
    'rating': 'mean'
}).reset_index()

print("Category statistics (as DataFrame):")
print(category_stats)
print()

# Add descriptive column names
category_stats.columns = ['Category', 'Average Price', 'Average Rating']
print("With better column names:")
print(category_stats)

🎯 Filtering Groups

You can filter groups based on their characteristics:

import pandas as pd

# Customer transactions
transactions = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Diana'],
    'amount': [100, 50, 150, 75, 200, 80, 300]
})

print("Transaction data:")
print(transactions)
print()

# Filter groups with more than 2 transactions
frequent_customers = transactions.groupby('customer').filter(lambda x: len(x) > 2)
print("Customers with more than 2 transactions:")
print(frequent_customers)
print()

# Get customers who spent more than $200 total
big_spenders = transactions.groupby('customer')['amount'].sum()
big_spenders = big_spenders[big_spenders > 200]
print("Customers who spent more than $200:")
print(big_spenders)

📊 GroupBy with Different Data Types

GroupBy works with different types of data:

🎯 Key Takeaways

🚀 What's Next?

Excellent! You now understand the fundamentals of GroupBy operations. Next, let's explore more advanced aggregation functions and custom operations.

Continue to: Aggregation Functions

Keep grouping! 🔗📊

Online Python

🔗 GroupBy Operations

Track Your Learning Progress