📋 Pivot Tables and Cross Tabulation

Pivot tables and cross-tabulation are incredibly powerful tools for creating summary tables! They help you reorganize and summarize data in ways that reveal patterns and relationships. If you've used Excel pivot tables, pandas makes them even more powerful and flexible.

Think of pivot tables like reorganizing your data to answer specific questions - instead of looking at individual rows, you create a summary table that shows the big picture.

import pandas as pd

# Sales data
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Mouse', 'Mouse', 'Laptop', 'Mouse'],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'revenue': [50000, 75000, 5000, 6000, 60000, 5500]
})

print("Sales data:")
print(sales)
print()

# Create a pivot table
pivot = sales.pivot_table(
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum'
)
print("Revenue by region and product:")
print(pivot)

🎯 Understanding Pivot Tables

Pivot tables reorganize your data into a cross-tabular format, making comparisons easy:

📊 Basic Pivot Tables

Let's start with simple pivot table examples:

import pandas as pd

# Employee data
employees = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'],
    'level': ['Junior', 'Senior', 'Senior', 'Junior', 'Junior', 'Senior'],
    'salary': [45000, 80000, 65000, 70000, 40000, 55000]
})

print("Employee data:")
print(employees)
print()

# Simple pivot: Average salary by department and level
salary_pivot = employees.pivot_table(
    values='salary',
    index='department',
    columns='level',
    aggfunc='mean'
)
print("Average salary by department and level:")
print(salary_pivot)

Different Aggregation Functions

import pandas as pd

# Product reviews
reviews = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'rating': [5, 4, 4, 5, 3],
    'price_range': ['High', 'Low', 'High', 'Medium', 'Low']
})

print("Product reviews:")
print(reviews)
print()

# Count of reviews
review_count = reviews.pivot_table(
    values='rating',
    index='product',
    columns='price_range',
    aggfunc='count',
    fill_value=0  # Replace NaN with 0
)
print("Number of reviews by product and price range:")
print(review_count)
print()

# Average rating
avg_rating = reviews.pivot_table(
    values='rating',
    index='product',
    columns='price_range',
    aggfunc='mean'
)
print("Average rating by product and price range:")
print(avg_rating)

🔄 Cross Tabulation with pd.crosstab()

Cross-tabulation is perfect for analyzing relationships between categorical variables:

import pandas as pd

# Customer survey data
survey = pd.DataFrame({
    'age_group': ['18-25', '26-35', '18-25', '36-45', '26-35', '18-25'],
    'satisfaction': ['High', 'High', 'Medium', 'High', 'Medium', 'Low'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Boston']
})

print("Survey data:")
print(survey)
print()

# Cross-tabulation: Count by age group and satisfaction
crosstab = pd.crosstab(
    survey['age_group'],
    survey['satisfaction']
)
print("Count by age group and satisfaction:")
print(crosstab)
print()

# Add row and column totals
crosstab_with_totals = pd.crosstab(
    survey['age_group'],
    survey['satisfaction'],
    margins=True,
    margins_name='Total'
)
print("With totals:")
print(crosstab_with_totals)

Percentage Analysis

import pandas as pd

# Customer preferences
preferences = pd.DataFrame({
    'age_group': ['Young', 'Middle', 'Young', 'Senior', 'Middle', 'Young'],
    'product_type': ['Tech', 'Books', 'Books', 'Tech', 'Tech', 'Tech'],
    'purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes']
})

print("Customer preferences:")
print(preferences)
print()

# Percentage by row (what % of each age group purchased)
pct_by_row = pd.crosstab(
    preferences['age_group'],
    preferences['purchase'],
    normalize='index'  # Normalize by row
) * 100

print("Purchase rate by age group (%):")
print(pct_by_row.round(1))
print()

# Percentage by column (what % of purchasers are in each age group)
pct_by_col = pd.crosstab(
    preferences['age_group'],
    preferences['purchase'],
    normalize='columns'  # Normalize by column
) * 100

print("Age group distribution of purchasers (%):")
print(pct_by_col.round(1))

📈 Multi-Level Pivot Tables

You can create more complex pivot tables with multiple dimensions:

import pandas as pd

# Sales data with multiple dimensions
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'salesperson': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie', 'Diana'],
    'revenue': [10000, 15000, 8000, 12000, 11000, 9000]
})

print("Sales data:")
print(sales)
print()

# Multi-level pivot table
multi_pivot = sales.pivot_table(
    values='revenue',
    index=['region', 'quarter'],  # Multiple row levels
    columns='product',            # Single column level
    aggfunc='sum',
    fill_value=0
)
print("Revenue by region, quarter, and product:")
print(multi_pivot)

🔢 Multiple Value Columns

You can analyze multiple metrics in one pivot table:

import pandas as pd

# Store performance data
stores = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East'],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1'],
    'revenue': [50000, 75000, 55000, 80000, 48000],
    'customers': [500, 750, 520, 800, 480]
})

print("Store performance:")
print(stores)
print()

# Multiple value analysis
multi_value = stores.pivot_table(
    values=['revenue', 'customers'],
    index='region',
    columns='quarter',
    aggfunc='sum',
    fill_value=0
)
print("Revenue and customers by region and quarter:")
print(multi_value)

📊 Pivot Table Styling and Formatting

Make your pivot tables more readable:

import pandas as pd

# Sales performance data
performance = pd.DataFrame({
    'team': ['Alpha', 'Beta', 'Alpha', 'Beta', 'Alpha'],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1'],
    'revenue': [120000, 95000, 135000, 110000, 115000],
    'target': [100000, 90000, 130000, 105000, 110000]
})

print("Performance data:")
print(performance)
print()

# Formatted pivot table
formatted_pivot = performance.pivot_table(
    values=['revenue', 'target'],
    index='team',
    columns='quarter',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

# Round to thousands for readability
formatted_pivot = (formatted_pivot / 1000).round(0).astype(int)

print("Revenue and targets by team and quarter (in thousands):")
print(formatted_pivot)

🎯 Real-World Applications

📝 Practice: Customer Analysis

Let's practice with a realistic customer analysis:

import pandas as pd

# Customer purchase data
purchases = pd.DataFrame({
    'customer_segment': ['Premium', 'Standard', 'Premium', 'Basic', 'Standard'],
    'product_category': ['Electronics', 'Books', 'Clothing', 'Books', 'Electronics'],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1'],
    'amount': [1200, 45, 300, 25, 150],
    'satisfaction': [5, 4, 4, 3, 4]
})

print("Customer purchases:")
print(purchases)
print()

# Analysis 1: Total spending by segment and category
spending_analysis = purchases.pivot_table(
    values='amount',
    index='customer_segment',
    columns='product_category',
    aggfunc='sum',
    fill_value=0
)
print("Total spending by segment and category:")
print(spending_analysis)
print()

# Analysis 2: Average satisfaction
satisfaction_analysis = purchases.pivot_table(
    values='satisfaction',
    index='customer_segment',
    columns='quarter',
    aggfunc='mean'
)
print("Average satisfaction by segment and quarter:")
print(satisfaction_analysis.round(1))

🎯 Key Takeaways

🚀 What's Next?

Excellent! You now know how to create powerful summary tables with pivot tables and cross-tabulation. Next, let's learn about combining data from different sources with concatenation and merging.

Continue to: Concatenating DataFrames

Keep pivoting! 📋📊

Online Python

📋 Pivot Tables and Cross Tabulation

Track Your Learning Progress