🧹 Data Cleaning

Data cleaning is like organizing a messy room before you can find anything useful! Real-world data is rarely perfect - it has missing values, duplicates, wrong types, and messy text. Learning to clean data is essential because good analysis starts with clean data.

🎯 Why Data Cleaning Matters

Real data is messy. Here's what you typically encounter:

import pandas as pd
import numpy as np

# Typical messy dataset
messy_data = pd.DataFrame({
    'name': ['Alice', '  bob  ', 'CHARLIE', 'Alice', 'diana'],
    'age': [25, None, 30, 25, '28'],
    'email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com', 'alice@email.com', ''],
    'salary': [50000, 60000, None, 50000, 45000]
})

print("📊 Messy Data (Typical Real-World Dataset):")
print(messy_data)
print()
print("Problems:")
print("- Extra spaces in names")
print("- Missing age and salary values")
print("- Inconsistent name capitalization")
print("- Age stored as text ('28')")
print("- Duplicate rows (Alice)")
print("- Empty email field")
print("- Inconsistent email case")

After cleaning, your data should look organized and consistent:

import pandas as pd

# Clean version of the same data
clean_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 30, 28],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com'],
    'salary': [50000, 60000, 55000, 45000]
})

print("✨ Clean Data (After Cleaning):")
print(clean_data)
print()
print("Fixed:")
print("✅ Consistent name formatting")
print("✅ No missing values")
print("✅ Proper data types")
print("✅ No duplicates")
print("✅ Standardized email format")
print("✅ Ready for analysis!")

🛠️ Data Cleaning Checklist

Problem	Solution	Pandas Method
Missing Values	Fill or remove	`.fillna()`, `.dropna()`
Duplicates	Remove duplicates	`.drop_duplicates()`
Wrong Types	Convert types	`.astype()`, `pd.to_numeric()`
Messy Text	Clean strings	`.str.strip()`, `.str.lower()`
Inconsistent Format	Standardize	`.str.replace()`, `.str.title()`

👀 Quick Cleaning Preview

Here's what data cleaning looks like in action:

import pandas as pd
import numpy as np

# Sample messy customer data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 2, 4],
    'name': [' Alice ', 'bob', 'CHARLIE', 'bob', '  diana  '],
    'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', '(555) 123-4567', ''],
    'purchase_amount': ['100.50', '250', None, '250', '75.25']
})

print("Messy Customer Data:")
print(customers)
print()

# Quick cleaning demo
print("🧹 Cleaning Steps:")
print()

# Step 1: Remove duplicates
print("1️⃣ Remove duplicates:")
cleaned = customers.drop_duplicates()
print(f"Rows: {len(customers)} → {len(cleaned)}")
print()

# Step 2: Clean names
print("2️⃣ Clean names:")
cleaned['name'] = cleaned['name'].str.strip().str.title()
print(cleaned[['customer_id', 'name']].to_string(index=False))
print()

# Step 3: Handle missing values
print("3️⃣ Handle missing purchase amounts:")
cleaned['purchase_amount'] = cleaned['purchase_amount'].fillna('0')
print("Missing values filled with 0")
print()

print("🎯 Result: Clean, consistent data ready for analysis!")

📊 What You'll Learn in This Section

Master the essential data cleaning techniques:

🔍 Handling Missing Data Learn to identify, fill, or remove missing values effectively.
🧹 Removing Duplicates Find and eliminate duplicate rows from your datasets.
🔄 Data Type Conversion Convert data to the right types for proper analysis.
📝 String Cleaning Operations Clean and standardize text data for consistency.

🎯 Common Data Quality Issues

Real datasets have predictable problems:

import pandas as pd
import numpy as np

# Survey data with common issues
survey = pd.DataFrame({
    'response_id': [1, 2, 3, 4, 5, 3],  # Duplicate ID
    'age': [25, None, 35, '30', 45, 35],  # Missing and wrong type
    'city': ['  NYC  ', 'los angeles', 'CHICAGO', '  NYC  ', '', 'CHICAGO'],  # Inconsistent format
    'rating': [5, 4, None, 3, 2, None],  # Missing ratings
    'feedback': ['Great!', 'good', 'EXCELLENT', 'okay', '', 'EXCELLENT']  # Inconsistent case
})

print("📋 Survey Data - Common Issues:")
print(survey)
print()

print("🔍 Data Quality Check:")
print(f"Total responses: {len(survey)}")
print(f"Duplicate IDs: {survey['response_id'].duplicated().sum()}")
print(f"Missing ages: {survey['age'].isna().sum()}")
print(f"Missing ratings: {survey['rating'].isna().sum()}")
print(f"Empty cities: {(survey['city'] == '').sum()}")
print(f"Empty feedback: {(survey['feedback'] == '').sum()}")
print()

print("🎯 This section will teach you to fix all these issues!")

🧪 Before vs After Cleaning

See the transformation power of good data cleaning:

import pandas as pd
import numpy as np

# Product data - before cleaning
products_messy = pd.DataFrame({
    'product_name': ['  laptop  ', 'MOUSE', 'keyboard', '  laptop  ', 'monitor'],
    'price': ['999.99', '25', None, '999.99', '300'],
    'category': ['computer', 'ACCESSORY', 'accessory', 'computer', ''],
    'in_stock': ['yes', 'YES', 'no', 'yes', 'No']
})

print("❌ BEFORE Cleaning:")
print(products_messy)
print()

# After cleaning (preview of what you'll learn)
products_clean = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 25.0, 50.0, 300.0],  # Converted to numbers, filled missing
    'category': ['Computer', 'Accessory', 'Accessory', 'Computer'],
    'in_stock': [True, True, False, False]  # Converted to boolean
})

print("✅ AFTER Cleaning:")
print(products_clean)
print()
print("🎯 Improvements:")
print("✅ Consistent text formatting")
print("✅ Proper numeric data types")
print("✅ No missing values")
print("✅ No duplicates")
print("✅ Boolean values for yes/no")
print("✅ Ready for analysis and calculations!")

🎯 Data Cleaning Best Practices

🔍 Identifying Data Problems

Before cleaning, you need to spot the issues:

import pandas as pd
import numpy as np

# Employee data with various issues
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 2],
    'name': ['alice smith', 'BOB JONES', '  charlie brown  ', 'diana prince', 'BOB JONES'],
    'department': ['sales', 'IT', None, 'marketing', 'IT'],
    'salary': [50000, '75000', 60000, None, '75000'],
    'start_date': ['2020-01-15', '2019-03-01', '', '2021-06-01', '2019-03-01']
})

print("🔍 Employee Data Analysis:")
print(employees)
print()

print("📊 Data Quality Report:")
print(f"Shape: {employees.shape}")
print(f"Duplicates: {employees.duplicated().sum()}")
print()

print("Missing Values by Column:")
print(employees.isnull().sum())
print()

print("Data Types:")
print(employees.dtypes)
print()

print("🎯 Issues Found:")
print("- Duplicate employee (ID 2)")
print("- Inconsistent name formatting")
print("- Missing department and salary")
print("- Salary stored as text")
print("- Empty start date")

🚀 What's Next?

Ready to transform messy data into analysis-ready datasets? Let's start with one of the most common issues: missing data.

Start with: Handling Missing Data

Time to become a data cleaning expert! 🧹✨

Online Python

🧹 Data Cleaning

Track Your Learning Progress