🔍 Handling Missing Data

Missing data is like having blank spots in a puzzle - you need to decide whether to fill them in or work around them. Almost every real dataset has missing values, so learning to handle them properly is essential for accurate analysis.

🔍 Identifying Missing Data

First, you need to find where data is missing:

import pandas as pd
import numpy as np

# Customer survey with missing data
survey = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, None, 35, 28, 32],
    'satisfaction': [4, 5, None, 3, 4],
    'email': ['alice@email.com', 'bob@email.com', '', 'diana@email.com', None]
})

print("📊 Survey Data:")
print(survey)
print()

print("🔍 Finding Missing Data:")
print()

print("1️⃣ Check for missing values:")
print(survey.isnull())
print()

print("2️⃣ Count missing values per column:")
print(survey.isnull().sum())
print()

print("3️⃣ Total missing values:")
print(f"Total missing: {survey.isnull().sum().sum()}")
print()

print("4️⃣ Percentage missing per column:")
missing_percent = (survey.isnull().sum() / len(survey)) * 100
print(missing_percent.round(1))

📋 Missing Data Strategies

Different approaches for different situations:

StrategyWhen to UsePandas Method
Remove RowsFew missing values, lots of data.dropna()
Remove ColumnsColumn mostly missing, not importantdf.drop(columns=['col'])
Fill with ValueClear default makes sense.fillna(value)
Fill Forward/BackwardTime series or sequential data.fillna(method='ffill')
Fill with Mean/MedianNumeric data, want average value.fillna(df['col'].mean())

🗑️ Removing Missing Data

Sometimes the simplest solution is to remove missing data:

import pandas as pd
import numpy as np

# Product reviews with missing data
reviews = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet'],
    'rating': [4.5, None, 4.8, 4.2, None],
    'price': [999, 25, 75, None, 450],
    'review_count': [150, 80, 200, 95, 30]
})

print("📊 Product Reviews:")
print(reviews)
print()

print("🗑️ Removing Missing Data:")
print()

print("1️⃣ Remove rows with ANY missing values:")
complete_rows = reviews.dropna()
print(complete_rows)
print(f"Rows: {len(reviews)}{len(complete_rows)}")
print()

print("2️⃣ Remove rows where 'rating' is missing:")
has_rating = reviews.dropna(subset=['rating'])
print(has_rating)
print(f"Rows: {len(reviews)}{len(has_rating)}")
print()

print("3️⃣ Keep rows with at least 3 non-missing values:")
mostly_complete = reviews.dropna(thresh=3)
print(mostly_complete)
print(f"Rows: {len(reviews)}{len(mostly_complete)}")

🔄 Filling Missing Data

Replace missing values with meaningful substitutes:

import pandas as pd
import numpy as np

# Sales data with missing values
sales = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'revenue': [10000, None, 15000, 12000, None],
    'region': ['North', 'North', None, 'South', 'South'],
    'sales_rep': ['Alice', 'Bob', 'Alice', None, 'Charlie']
})

print("📊 Sales Data:")
print(sales)
print()

print("🔄 Filling Missing Values:")
print()

print("1️⃣ Fill revenue with 0 (no sales):")
filled_zero = sales.copy()
filled_zero['revenue'] = filled_zero['revenue'].fillna(0)
print(filled_zero)
print()

print("2️⃣ Fill revenue with average:")
filled_mean = sales.copy()
avg_revenue = sales['revenue'].mean()
filled_mean['revenue'] = filled_mean['revenue'].fillna(avg_revenue)
print(filled_mean)
print(f"Average revenue used: ${avg_revenue:,.0f}")
print()

print("3️⃣ Fill region with 'Unknown':")
filled_text = sales.copy()
filled_text['region'] = filled_text['region'].fillna('Unknown')
print(filled_text[['month', 'region']])

📈 Smart Filling Strategies

Context-aware ways to fill missing data:

import pandas as pd
import numpy as np

# Student grades with strategic missing data
students = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'math_score': [85, None, 78, 92, None],
    'science_score': [90, 88, None, 89, 85],
    'attendance': [95, 88, 92, None, 90],
    'grade_level': ['A', 'B', 'B', 'A', None]
})

print("📊 Student Data:")
print(students)
print()

print("📈 Smart Filling Strategies:")
print()

print("1️⃣ Fill math scores with class average:")
smart_fill = students.copy()
class_avg_math = students['math_score'].mean()
smart_fill['math_score'] = smart_fill['math_score'].fillna(class_avg_math)
print("Math scores filled with class average:")
print(smart_fill[['student', 'math_score']])
print(f"Class average: {class_avg_math:.1f}")
print()

print("2️⃣ Fill attendance with median (typical attendance):")
median_attendance = students['attendance'].median()
smart_fill['attendance'] = smart_fill['attendance'].fillna(median_attendance)
print("Attendance filled with median:")
print(smart_fill[['student', 'attendance']])
print(f"Median attendance: {median_attendance}%")
print()

print("3️⃣ Fill grade level based on average scores:")
# For Eve, calculate average and assign grade
eve_avg = (smart_fill.loc[4, 'science_score']) # Only science score available
if eve_avg >= 90:
    grade = 'A'
elif eve_avg >= 80:
    grade = 'B' 
else:
    grade = 'C'
smart_fill.loc[4, 'grade_level'] = grade
print("Grade level filled based on performance:")
print(smart_fill[['student', 'science_score', 'grade_level']])

🎯 Practical Missing Data Scenarios

Real-world examples of handling missing data:

import pandas as pd
import numpy as np

# E-commerce order data
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet'],
    'quantity': [1, 2, 1, None, 1],
    'price': [999, 25, 75, 300, None],
    'shipping_address': ['123 Main St', None, '789 Oak Ave', '456 Pine St', '321 Elm St']
})

print("📦 E-commerce Orders:")
print(orders)
print()

print("🎯 Business Logic for Missing Data:")
print()

# Strategy 1: Quantity defaults to 1
print("1️⃣ Missing quantity = 1 (standard order):")
business_logic = orders.copy()
business_logic['quantity'] = business_logic['quantity'].fillna(1)
print(business_logic[['order_id', 'quantity']])
print()

# Strategy 2: Price from product catalog (simplified)
print("2️⃣ Fill missing price from product catalog:")
# Simulate looking up tablet price
tablet_price = 450  # From product catalog
business_logic.loc[business_logic['product'] == 'Tablet', 'price'] = business_logic.loc[business_logic['product'] == 'Tablet', 'price'].fillna(tablet_price)
print(f"Tablet price filled: ${tablet_price}")
print(business_logic[['product', 'price']])
print()

# Strategy 3: Handle critical missing data
print("3️⃣ Orders missing customer name (needs attention):")
missing_customer = business_logic[business_logic['customer_name'].isnull()]
print("Orders needing customer follow-up:")
print(missing_customer[['order_id', 'product']])

⚠️ Missing Data Pitfalls

Avoid these common mistakes that can compromise your analysis:

import pandas as pd
import numpy as np

# Survey responses demonstrating good vs bad practices
responses = pd.DataFrame({
    'respondent': [1, 2, 3, 4, 5],
    'age': [25, 30, None, 35, 28],
    'income': [50000, None, 75000, 60000, 45000],
    'satisfaction': [4, 5, None, 3, 4]
})

print("📊 Original Data:")
print(responses)
print()

print("✅ Good Practice: Context-Aware Handling")
print()

# Good practice: Fill income with median (reasonable assumption)
good_fill = responses.copy()
median_income = responses['income'].median()
good_fill['income'] = good_fill['income'].fillna(median_income)
print(f"Income filled with median (${median_income:,}):")
print(good_fill[['respondent', 'income']])
print()

# Good practice: Only drop if satisfaction missing (main metric)
critical_complete = responses.dropna(subset=['satisfaction'])
print("Kept data where satisfaction is available:")
print(f"Rows: {len(responses)}{len(critical_complete)}")
print(critical_complete)

🛠️ Missing Data Workflow

A systematic approach to handling missing data:

import pandas as pd
import numpy as np

# Employee dataset for workflow demo
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve', 'Frank'],
    'department': ['Sales', 'IT', 'Sales', None, 'HR', 'IT'],
    'salary': [50000, 75000, 55000, None, 48000, 70000],
    'start_date': ['2020-01', None, '2019-03', '2021-06', '2020-08', '2018-12'],
    'performance': [4.2, None, 4.5, 3.8, 4.0, None]
})

print("🛠️ Missing Data Workflow:")
print()
print("Step 1: Assess the situation")
print(employees)
print()

print("Step 2: Understand missing patterns")
print("Missing values by column:")
print(employees.isnull().sum())
print()

print("Step 3: Apply appropriate strategies")
cleaned = employees.copy()

# Strategy for each column
print("Name: Remove rows (critical field)")
cleaned = cleaned.dropna(subset=['name'])
print(f"Rows after name cleaning: {len(cleaned)}")

print("Department: Fill with 'TBD' (to be determined)")
cleaned['department'] = cleaned['department'].fillna('TBD')

print("Salary: Fill with department average")
dept_avg = cleaned.groupby('department')['salary'].mean()
for dept in cleaned['department'].unique():
    if dept in dept_avg:
        mask = (cleaned['department'] == dept) & (cleaned['salary'].isnull())
        cleaned.loc[mask, 'salary'] = dept_avg[dept]

print("Performance: Fill with median")
cleaned['performance'] = cleaned['performance'].fillna(cleaned['performance'].median())

print()
print("Final cleaned dataset:")
print(cleaned)
print()
print("✅ All missing values handled appropriately!")

🎯 Key Takeaways

🎮 Practice Challenge

Apply missing data techniques to a realistic scenario:

import pandas as pd
import numpy as np

# Customer feedback dataset
feedback = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105, 106],
    'purchase_date': ['2023-01-15', '2023-02-10', None, '2023-01-25', '2023-03-05', '2023-02-20'],
    'product_rating': [5, None, 4, 3, None, 4],
    'service_rating': [4, 5, None, 2, 4, None],
    'total_spent': [150, 89, 200, None, 75, 120],
    'would_recommend': [True, None, True, False, True, None]
})

print("🎮 Customer Feedback Challenge:")
print(feedback)
print()

print("Your mission: Clean this data for analysis")
print()

# Solution approach
cleaned_feedback = feedback.copy()

print("Solution approach:")
print("1️⃣ Remove customers without purchase date (invalid records)")
cleaned_feedback = cleaned_feedback.dropna(subset=['purchase_date'])
print(f"Rows: {len(feedback)}{len(cleaned_feedback)}")

print("2️⃣ Fill missing ratings with average ratings")
avg_product = cleaned_feedback['product_rating'].mean()
avg_service = cleaned_feedback['service_rating'].mean()
cleaned_feedback['product_rating'] = cleaned_feedback['product_rating'].fillna(avg_product)
cleaned_feedback['service_rating'] = cleaned_feedback['service_rating'].fillna(avg_service)

print("3️⃣ Fill missing spend with median")
median_spend = cleaned_feedback['total_spent'].median()
cleaned_feedback['total_spent'] = cleaned_feedback['total_spent'].fillna(median_spend)

print("4️⃣ Fill missing recommendations based on ratings")
# If average rating >= 4, assume would recommend
for idx in cleaned_feedback[cleaned_feedback['would_recommend'].isnull()].index:
    avg_rating = (cleaned_feedback.loc[idx, 'product_rating'] + 
                  cleaned_feedback.loc[idx, 'service_rating']) / 2
    cleaned_feedback.loc[idx, 'would_recommend'] = avg_rating >= 4

print()
print("Cleaned dataset:")
print(cleaned_feedback)
print("✅ Ready for customer satisfaction analysis!")

🚀 What's Next?

Excellent! You now know how to handle missing data effectively. Next, let's tackle another common data quality issue: duplicate records.

Continue to: Removing Duplicates

You're building strong data cleaning skills! 🔍✨

Was this helpful?

😔Poor
🙁Fair
😊Good
😄Great
🤩Excellent