🧹 Data Cleaning
Data cleaning is like tidying a messy room so you can actually find what you need. Real-world data is rarely perfect: it has missing values, duplicates, wrong types, and messy text. Reliable analysis depends on clean data, so learning to clean it is essential.
🎯 Why Data Cleaning Matters
Real data is messy. Here's what you typically encounter:
import pandas as pd
import numpy as np
# Typical messy dataset
messy_data = pd.DataFrame({
'name': ['Alice', ' bob ', 'CHARLIE', 'Alice', 'diana'],
'age': [25, None, 30, 25, '28'],
'email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com', 'alice@email.com', ''],
'salary': [50000, 60000, None, 50000, 45000]
})
print("📊 Messy Data (Typical Real-World Dataset):")
print(messy_data)
print()
print("Problems:")
print("- Extra spaces in names")
print("- Missing age and salary values")
print("- Inconsistent name capitalization")
print("- Age stored as text ('28')")
print("- Duplicate rows (Alice)")
print("- Empty email field")
print("- Inconsistent email case")
After cleaning, your data should look organized and consistent:
import pandas as pd
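# Clean version of the same data, written by hand to show the target
# Bob's age, Charlie's salary, and Diana's email were missing in the messy data;
# the values used here are illustrative placeholders, not computed fills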
clean_data = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 30, 28],
'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com'],
'salary': [50000, 60000, 55000, 45000]
})
print("✨ Clean Data (After Cleaning):")
print(clean_data)
print()
print("Fixed:")
print("✅ Consistent name formatting")
print("✅ No missing values")
print("✅ Proper data types")
print("✅ No duplicates")
print("✅ Standardized email format")
print("✅ Ready for analysis!")
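The clean frame above is typed in by hand. As a rough sketch of how the same messy data could be cleaned with pandas itself, the steps might look like the following. It deliberately leaves the genuinely missing values as missing instead of inventing them, so the result is close to, but not identical to, the hand-written version.

```python
import numpy as np
import pandas as pd

# Same messy frame as in the example above
messy_data = pd.DataFrame({
    'name': ['Alice', ' bob ', 'CHARLIE', 'Alice', 'diana'],
    'age': [25, None, 30, 25, '28'],
    'email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com', 'alice@email.com', ''],
    'salary': [50000, 60000, None, 50000, 45000]
})

cleaned = messy_data.copy()

# Trim whitespace and standardize capitalization
cleaned['name'] = cleaned['name'].str.strip().str.title()
cleaned['email'] = cleaned['email'].str.strip().str.lower()

# Treat empty strings as missing so they show up in isna()
cleaned['email'] = cleaned['email'].replace('', np.nan)

# Convert age to numbers; '28' becomes 28.0 and None becomes NaN
cleaned['age'] = pd.to_numeric(cleaned['age'], errors='coerce')

# Drop exact duplicate rows (the repeated Alice row)
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
print(cleaned.isna().sum())
```

How to fill the remaining gaps (Bob's age, Charlie's salary, Diana's email) depends on the dataset, which is why this sketch stops short of filling them.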
🛠️ Data Cleaning Checklist
| Problem | Solution | Pandas Method |
|---|---|---|
| Missing Values | Fill or remove | .fillna(), .dropna() |
| Duplicates | Remove duplicates | .drop_duplicates() |
| Wrong Types | Convert types | .astype(), pd.to_numeric() |
| Messy Text | Clean strings | .str.strip(), .str.lower() |
| Inconsistent Format | Standardize | .str.replace(), .str.title() |
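Several rows in the checklist list two methods, and the choice between them matters. The sketch below uses a small made-up `scores` column (a hypothetical example, not one of the datasets in this tutorial) to show how the pairs differ.

```python
import pandas as pd

# Hypothetical column of scores with one gap and one unparseable entry
scores = pd.Series(['10', '20', None, 'n/a'])

# pd.to_numeric with errors='coerce' turns unparseable entries into NaN,
# whereas scores.astype(float) would raise a ValueError on 'n/a'
numeric = pd.to_numeric(scores, errors='coerce')

# .dropna() removes the missing entries entirely
dropped = numeric.dropna()

# .fillna() keeps every row and substitutes a value (here, the column mean)
filled = numeric.fillna(numeric.mean())

print(dropped.tolist())  # [10.0, 20.0]
print(filled.tolist())   # [10.0, 20.0, 15.0, 15.0]
```

As a rule of thumb, `.astype()` suits columns you already trust, while `pd.to_numeric(..., errors='coerce')` is the more forgiving choice for raw input.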
👀 Quick Cleaning Preview
Here's what data cleaning looks like in action:
import pandas as pd
import numpy as np
# Sample messy customer data
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 2, 4],
'name': [' Alice ', 'bob', 'CHARLIE', 'bob', ' diana '],
'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', '(555) 123-4567', ''],
'purchase_amount': ['100.50', '250', None, '250', '75.25']
})
print("Messy Customer Data:")
print(customers)
print()
# Quick cleaning demo
print("🧹 Cleaning Steps:")
print()
# Step 1: Remove duplicates
print("1️⃣ Remove duplicates:")
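# .copy() gives an independent frame, so the column assignments in the
# following steps do not trigger a SettingWithCopyWarning
cleaned = customers.drop_duplicates().copy()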
print(f"Rows: {len(customers)} → {len(cleaned)}")
print()
# Step 2: Clean names
print("2️⃣ Clean names:")
cleaned['name'] = cleaned['name'].str.strip().str.title()
print(cleaned[['customer_id', 'name']].to_string(index=False))
print()
# Step 3: Convert amounts from text to numbers, then fill missing values
print("3️⃣ Handle missing purchase amounts:")
cleaned['purchase_amount'] = pd.to_numeric(cleaned['purchase_amount']).fillna(0)
print("Amounts converted to numbers, missing values filled with 0")
print()
print("🎯 Result: Clean, consistent data ready for analysis!")
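The preview leaves the `phone` column untouched, even though it mixes three formats and an empty string. One possible way to standardize it is sketched below; the `XXX-XXX-XXXX` target format and the 10-digit rule are assumptions made for illustration, not requirements from the dataset.

```python
import pandas as pd

# Phone values taken from the customer example (duplicate row omitted)
phones = pd.Series(['123-456-7890', '(555) 123-4567', '555.123.4567', ''])

# Strip everything that is not a digit
digits = phones.str.replace(r'\D', '', regex=True)

# Keep only values with exactly 10 digits; everything else becomes missing
digits = digits.where(digits.str.len() == 10)

# Reformat as XXX-XXX-XXXX (assumed target format)
formatted = digits.str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)

print(formatted)
```

Marking the empty phone as missing, instead of keeping `''`, means it will be counted by `.isna()` in later quality checks.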
📊 What You'll Learn in This Section
Master the essential data cleaning techniques:
- 🔍 Handling Missing Data: Learn to identify, fill, or remove missing values effectively.
- 🧹 Removing Duplicates: Find and eliminate duplicate rows from your datasets.
- 🔄 Data Type Conversion: Convert data to the right types for proper analysis.
- 📝 String Cleaning Operations: Clean and standardize text data for consistency.
🎯 Common Data Quality Issues
Real datasets have predictable problems:
import pandas as pd
import numpy as np
# Survey data with common issues
survey = pd.DataFrame({
'response_id': [1, 2, 3, 4, 5, 3], # Duplicate ID
'age': [25, None, 35, '30', 45, 35], # Missing and wrong type
'city': [' NYC ', 'los angeles', 'CHICAGO', ' NYC ', '', 'CHICAGO'], # Inconsistent format
'rating': [5, 4, None, 3, 2, None], # Missing ratings
'feedback': ['Great!', 'good', 'EXCELLENT', 'okay', '', 'EXCELLENT'] # Inconsistent case
})
print("📋 Survey Data - Common Issues:")
print(survey)
print()
print("🔍 Data Quality Check:")
print(f"Total responses: {len(survey)}")
print(f"Duplicate IDs: {survey['response_id'].duplicated().sum()}")
print(f"Missing ages: {survey['age'].isna().sum()}")
print(f"Missing ratings: {survey['rating'].isna().sum()}")
print(f"Empty cities: {(survey['city'] == '').sum()}")
print(f"Empty feedback: {(survey['feedback'] == '').sum()}")
print()
print("🎯 This section will teach you to fix all these issues!")
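As a preview, here is a minimal sketch of how the survey issues counted above could be addressed. The choices made here (uppercase city names, lowercase feedback, keeping the first row per `response_id`) are assumptions for illustration; other conventions are equally valid.

```python
import numpy as np
import pandas as pd

# Same survey frame as in the example above
survey = pd.DataFrame({
    'response_id': [1, 2, 3, 4, 5, 3],
    'age': [25, None, 35, '30', 45, 35],
    'city': [' NYC ', 'los angeles', 'CHICAGO', ' NYC ', '', 'CHICAGO'],
    'rating': [5, 4, None, 3, 2, None],
    'feedback': ['Great!', 'good', 'EXCELLENT', 'okay', '', 'EXCELLENT']
})

fixed = survey.copy()

# Strip whitespace, then mark blank strings as missing
for col in ['city', 'feedback']:
    fixed[col] = fixed[col].str.strip().replace('', np.nan)

# Uppercase cities: .str.title() would turn 'NYC' into 'Nyc'
fixed['city'] = fixed['city'].str.upper()
fixed['feedback'] = fixed['feedback'].str.lower()

# Convert age to numbers ('30' becomes 30.0, None becomes NaN)
fixed['age'] = pd.to_numeric(fixed['age'], errors='coerce')

# Keep the first row for each response_id
fixed = fixed.drop_duplicates(subset='response_id', keep='first').reset_index(drop=True)

print(fixed)
print(fixed.isna().sum())
```

After this pass, the blank city and blank feedback are reported by `.isna()` alongside the missing age and rating, so a single check covers all of them.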
🧪 Before vs After Cleaning
See how much a dataset changes once it has been cleaned:
import pandas as pd
import numpy as np
# Product data - before cleaning
products_messy = pd.DataFrame({
'product_name': [' laptop ', 'MOUSE', 'keyboard', ' laptop ', 'monitor'],
'price': ['999.99', '25', None, '999.99', '300'],
'category': ['computer', 'ACCESSORY', 'accessory', 'computer', ''],
'in_stock': ['yes', 'YES', 'no', 'yes', 'No']
})
print("❌ BEFORE Cleaning:")
print(products_messy)
print()
# After cleaning (preview of what you'll learn)
products_clean = pd.DataFrame({
'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
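'price': [999.99, 25.0, 50.0, 300.0], # Converted to numbers; Keyboard's missing price filled with an illustrative 50.0
'category': ['Computer', 'Accessory', 'Accessory', 'Computer'], # Standardized case; Monitor's empty category filled in by hand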
'in_stock': [True, True, False, False] # Converted to boolean
})
print("✅ AFTER Cleaning:")
print(products_clean)
print()
print("🎯 Improvements:")
print("✅ Consistent text formatting")
print("✅ Proper numeric data types")
print("✅ No missing values")
print("✅ No duplicates")
print("✅ Boolean values for yes/no")
print("✅ Ready for analysis and calculations!")
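The "after" frame above is written out by hand. A sketch of the mechanical part of that transformation, under the assumption that only formatting, types, and duplicates are fixed automatically, could look like this. It leaves the keyboard's price and the monitor's category as missing, since filling them needs a rule that the data alone does not supply.

```python
import numpy as np
import pandas as pd

# Same messy product frame as in the example above
products_messy = pd.DataFrame({
    'product_name': [' laptop ', 'MOUSE', 'keyboard', ' laptop ', 'monitor'],
    'price': ['999.99', '25', None, '999.99', '300'],
    'category': ['computer', 'ACCESSORY', 'accessory', 'computer', ''],
    'in_stock': ['yes', 'YES', 'no', 'yes', 'No']
})

products = products_messy.copy()

# Trim, title-case, and mark blank strings as missing
for col in ['product_name', 'category']:
    products[col] = products[col].str.strip().str.title().replace('', np.nan)

# Convert price text to floats
products['price'] = pd.to_numeric(products['price'], errors='coerce')

# Lowercase first so 'YES', 'yes', and 'No' all match the mapping
products['in_stock'] = products['in_stock'].str.strip().str.lower().map({'yes': True, 'no': False})

# Remove the repeated laptop row
products = products.drop_duplicates().reset_index(drop=True)

print(products)
```

Any `in_stock` value outside the mapping would become missing instead of raising an error, so it is worth checking `.isna()` on that column afterwards.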
🎯 Data Cleaning Best Practices
🔍 Identifying Data Problems
Before cleaning, you need to spot the issues:
import pandas as pd
import numpy as np
# Employee data with various issues
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 2],
'name': ['alice smith', 'BOB JONES', ' charlie brown ', 'diana prince', 'BOB JONES'],
'department': ['sales', 'IT', None, 'marketing', 'IT'],
'salary': [50000, '75000', 60000, None, '75000'],
'start_date': ['2020-01-15', '2019-03-01', '', '2021-06-01', '2019-03-01']
})
print("🔍 Employee Data Analysis:")
print(employees)
print()
print("📊 Data Quality Report:")
print(f"Shape: {employees.shape}")
print(f"Duplicates: {employees.duplicated().sum()}")
print()
print("Missing Values by Column:")
print(employees.isnull().sum())
print()
print("Data Types:")
print(employees.dtypes)
print()
print("🎯 Issues Found:")
print("- Duplicate employee (ID 2)")
print("- Inconsistent name formatting")
print("- Missing department and salary")
print("- Salary stored as text")
print("- Empty start date")
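Two of the issues listed above do not show up in the standard report: `.isnull()` does not count the empty start date, and `.dtypes` only says `object` for salary without revealing that it mixes numbers and text. The sketch below adds two extra checks; `type_counts` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

# Same employee frame as in the example above
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 2],
    'name': ['alice smith', 'BOB JONES', ' charlie brown ', 'diana prince', 'BOB JONES'],
    'department': ['sales', 'IT', None, 'marketing', 'IT'],
    'salary': [50000, '75000', 60000, None, '75000'],
    'start_date': ['2020-01-15', '2019-03-01', '', '2021-06-01', '2019-03-01']
})


def type_counts(column: pd.Series) -> dict:
    """Return {type name: count} for the non-missing values in a column."""
    return column.dropna().map(lambda v: type(v).__name__).value_counts().to_dict()


# More than one type in a column usually means a conversion is needed
print("Salary types:", type_counts(employees['salary']))

# Count empty strings per column, which .isnull() misses
blank_counts = employees.eq('').sum()
print("Empty strings by column:")
print(blank_counts)
```

Running both checks next to `.isnull().sum()` and `.dtypes` gives a fuller picture before any cleaning starts.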
🚀 What's Next?
Ready to transform messy data into analysis-ready datasets? Let's start with one of the most common issues: missing data.
Start with: Handling Missing Data
Time to become a data cleaning expert! 🧹✨