🧹 Removing Duplicates
Duplicate data is like having identical photos in your album: they take up space and can skew your analysis. Whether they come from data entry errors, system glitches, or merged datasets, duplicates are common and need to be handled properly.
🔍 Identifying Duplicates
First, find where duplicates exist in your data:
import pandas as pd
# Customer data with duplicates
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 2, 4, 3],
    'name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Charlie'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com',
              'bob@email.com', 'diana@email.com', 'charlie@email.com'],
    'purchase_amount': [100, 250, 150, 250, 75, 150]
})
print("📊 Customer Data with Duplicates:")
print(customers)
print()
print("🔍 Finding Duplicates:")
print()
print("1️⃣ Check which rows are duplicates:")
print(customers.duplicated())
print()
print("2️⃣ Count total duplicates:")
print(f"Number of duplicate rows: {customers.duplicated().sum()}")
print()
print("3️⃣ See the actual duplicate rows:")
duplicates = customers[customers.duplicated()]
print("Duplicate rows:")
print(duplicates)
print()
print("4️⃣ See all rows that have duplicates (including originals):")
has_duplicates = customers[customers.duplicated(keep=False)]
print("All rows involved in duplication:")
print(has_duplicates)
🗑️ Removing All Duplicates
Remove duplicate rows to clean your dataset:
import pandas as pd
# Order data with exact duplicates
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1002, 1004, 1003],
    'customer': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Charlie'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Mouse', 'Monitor', 'Keyboard'],
    'amount': [999, 25, 75, 25, 300, 75]
})
print("📦 Orders with Duplicates:")
print(orders)
print(f"Total orders: {len(orders)}")
print()
print("🗑️ Removing Duplicates:")
print()
print("1️⃣ Remove duplicates (keep first occurrence):")
clean_orders = orders.drop_duplicates()
print(clean_orders)
print(f"Clean orders: {len(clean_orders)} (removed {len(orders) - len(clean_orders)} duplicates)")
print()
print("2️⃣ Remove duplicates (keep last occurrence):")
clean_orders_last = orders.drop_duplicates(keep='last')
print(clean_orders_last)
print()
print("3️⃣ Remove ALL duplicates (keep none):")
clean_orders_none = orders.drop_duplicates(keep=False)
print(clean_orders_none)
print(f"Remaining after removing all duplicates: {len(clean_orders_none)}")
🎯 Partial Duplicate Detection
Sometimes only certain columns matter for identifying duplicates:
import pandas as pd
# Employee data with partial duplicates
employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104, 102],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Bob'],
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'salary': [50000, 75000, 60000, 52000, 78000],  # Bob got a raise!
    'hire_date': ['2020-01', '2019-03', '2021-06', '2020-08', '2019-03']
})
print("👥 Employee Data:")
print(employees)
print()
print("🎯 Partial Duplicate Detection:")
print()
print("1️⃣ Duplicates based on employee ID only:")
id_duplicates = employees.duplicated(subset=['emp_id'])
print("Duplicate employee IDs:")
print(employees[id_duplicates])
print()
print("2️⃣ Duplicates based on name and department:")
name_dept_duplicates = employees.duplicated(subset=['name', 'department'])
print("Same name and department:")
print(employees[name_dept_duplicates])
print()
print("3️⃣ Remove duplicates keeping employee with higher salary:")
# For business logic: keep the record with higher salary
unique_employees = employees.sort_values('salary').drop_duplicates(
    subset=['emp_id'], keep='last'
)
print("Unique employees (keeping higher salary):")
print(unique_employees)
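An alternative to the sort-then-drop pattern is `groupby` with `idxmax`, which picks out the index of the highest-salary row per ID directly. A minimal sketch using a trimmed version of the employee data above:

```python
import pandas as pd

# Trimmed-down employee data with a duplicated emp_id (102)
employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104, 102],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Bob'],
    'salary': [50000, 75000, 60000, 52000, 78000]
})

# idxmax returns the row label of the maximum salary within each emp_id group
best_rows = employees.groupby('emp_id')['salary'].idxmax()
unique_employees = employees.loc[best_rows].sort_index()
print(unique_employees)
```

This avoids mutating the row order of the whole frame just to resolve duplicates; the trade-off is that it only handles one "keep the max" column at a time.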
📊 Real-World Duplicate Scenarios
Practical examples of handling duplicates in business contexts:
import pandas as pd
# E-commerce transaction data
transactions = pd.DataFrame({
    'transaction_id': ['T001', 'T002', 'T003', 'T002', 'T004', 'T005'],
    'customer_email': ['alice@email.com', 'bob@email.com', 'charlie@email.com',
                       'bob@email.com', 'diana@email.com', 'eve@email.com'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Mouse', 'Monitor', 'Tablet'],
    'amount': [999, 25, 75, 25, 300, 450],
    'timestamp': ['2023-01-15 10:30', '2023-01-15 11:15', '2023-01-15 14:20',
                  '2023-01-15 11:15', '2023-01-16 09:45', '2023-01-16 16:30']
})
print("💳 Transaction Data with Potential Duplicates:")
print(transactions)
print()
print("📊 Business Duplicate Analysis:")
print()
print("1️⃣ Exact duplicate transactions (system error):")
exact_dups = transactions.duplicated()
if exact_dups.any():
    print("Exact duplicates found:")
    print(transactions[exact_dups])
else:
    print("No exact duplicates")
print()
print("2️⃣ Same transaction ID (should be unique):")
id_dups = transactions.duplicated(subset=['transaction_id'])
print("Duplicate transaction IDs:")
print(transactions[id_dups])
print()
print("3️⃣ Same customer, product, amount (possible double-click):")
business_dups = transactions.duplicated(subset=['customer_email', 'product', 'amount'])
print("Potential duplicate purchases:")
print(transactions[business_dups])
print()
print("4️⃣ Clean data for analysis:")
# Remove exact duplicates and transaction ID duplicates
clean_transactions = transactions.drop_duplicates()
clean_transactions = clean_transactions.drop_duplicates(subset=['transaction_id'])
print("Clean transaction data:")
print(clean_transactions)
print(f"Transactions: {len(transactions)} → {len(clean_transactions)}")
🔧 Advanced Duplicate Handling
Sophisticated techniques for complex duplicate scenarios:
import pandas as pd
# Customer database with merge conflicts
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 1, 4],
    'name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Alice Smith', 'Diana Lee'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com',
              'alice.smith@email.com', 'diana@email.com'],  # Alice changed email
    'phone': ['123-456-7890', '555-123-4567', '555-987-6543',
              '123-456-7890', '555-111-2222'],
    'last_purchase': ['2023-01-15', '2023-02-10', '2023-01-25',
                      '2023-03-05', '2023-02-20'],  # Alice made a recent purchase
    'total_spent': [500, 250, 150, 750, 300]  # Alice's total is higher
})
print("👤 Customer Database with Conflicts:")
print(customers)
print()
print("🔧 Advanced Duplicate Resolution:")
print()
print("1️⃣ Identify conflicts by customer ID:")
id_conflicts = customers[customers.duplicated(subset=['customer_id'], keep=False)]
print("Customer ID conflicts:")
print(id_conflicts)
print()
print("2️⃣ Resolve by keeping most recent purchase:")
recent_customers = customers.sort_values('last_purchase').drop_duplicates(
    subset=['customer_id'], keep='last'
)
print("Resolved by most recent purchase:")
print(recent_customers)
print()
print("3️⃣ Resolve by keeping highest total spent:")
high_value_customers = customers.sort_values('total_spent').drop_duplicates(
    subset=['customer_id'], keep='last'
)
print("Resolved by highest spending:")
print(high_value_customers)
print()
print("4️⃣ Manual resolution for critical conflicts:")
# For important data, you might need manual review
print("Manual review needed for:")
manual_review = customers[customers['customer_id'].duplicated(keep=False)]
print(manual_review[['customer_id', 'name', 'email', 'total_spent']])
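When conflicts follow a known pattern, a semi-automatic option is to merge the conflicting records per ID with `groupby` and named aggregation. A sketch under the assumptions that rows arrive oldest-to-newest, `total_spent` is a running total (so the maximum wins), and the latest email is the correct one:

```python
import pandas as pd

# Simplified conflict data: customer 1 appears twice with different values
customers = pd.DataFrame({
    'customer_id': [1, 2, 1],
    'email': ['alice@email.com', 'bob@email.com', 'alice.smith@email.com'],
    'last_purchase': ['2023-01-15', '2023-02-10', '2023-03-05'],
    'total_spent': [500, 250, 750]
})

# Merge conflicting records per ID: newest email (rows assumed in
# chronological order), latest purchase date, highest running total
resolved = customers.groupby('customer_id', as_index=False).agg(
    email=('email', 'last'),
    last_purchase=('last_purchase', 'max'),
    total_spent=('total_spent', 'max')
)
print(resolved)
```

Unlike `drop_duplicates`, this can take different fields from different rows of the same customer, which is often what a real merge requires.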
📋 Duplicate Prevention Strategies
Best practices to prevent duplicates from entering your data:
import pandas as pd
# Simulating data validation during import
new_records = pd.DataFrame({
    'user_id': [201, 202, 203, 201, 204],  # 201 is duplicate
    'username': ['user201', 'user202', 'user203', 'user201_new', 'user204'],
    'email': ['user201@email.com', 'user202@email.com', 'user203@email.com',
              'user201@email.com', 'user204@email.com'],
    'registration_date': ['2023-03-01', '2023-03-02', '2023-03-03',
                          '2023-03-04', '2023-03-05']
})
existing_users = pd.DataFrame({
    'user_id': [101, 102, 103],
    'username': ['user101', 'user102', 'user103'],
    'email': ['user101@email.com', 'user102@email.com', 'user103@email.com'],
    'registration_date': ['2023-01-01', '2023-01-02', '2023-01-03']
})
print("📋 Duplicate Prevention Example:")
print()
print("New records to import:")
print(new_records)
print()
print("Existing users:")
print(existing_users)
print()
print("🛡️ Validation Steps:")
print()
print("1️⃣ Check for duplicates within new records:")
internal_dups = new_records.duplicated(subset=['user_id'])
if internal_dups.any():
    print("Internal duplicates found:")
    print(new_records[internal_dups])
    print("Action: Remove internal duplicates before import")
    clean_new = new_records.drop_duplicates(subset=['user_id'])
else:
    clean_new = new_records
    print("No internal duplicates")
print()
print("2️⃣ Check for conflicts with existing data:")
# Simulate checking if user IDs already exist
existing_ids = existing_users['user_id'].tolist()
conflicts = clean_new[clean_new['user_id'].isin(existing_ids)]
if not conflicts.empty:
    print("Conflicts with existing data:")
    print(conflicts)
    print("Action: Skip conflicting records or assign new IDs")
else:
    print("No conflicts with existing data")
print()
print("3️⃣ Safe records to import:")
safe_records = clean_new[~clean_new['user_id'].isin(existing_ids)]
print(safe_records)
print(f"Ready to import: {len(safe_records)} records")
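To complete the import, the validated records can be appended to the existing table; `pd.concat` with `ignore_index=True` is one way to do it. A sketch with trimmed-down frames standing in for `existing_users` and `safe_records` above:

```python
import pandas as pd

# Stand-ins for the frames built earlier in this section
existing_users = pd.DataFrame({
    'user_id': [101, 102, 103],
    'username': ['user101', 'user102', 'user103']
})
safe_records = pd.DataFrame({
    'user_id': [201, 202, 203, 204],
    'username': ['user201', 'user202', 'user203', 'user204']
})

# Append only the validated records and rebuild a clean 0..n-1 index
updated_users = pd.concat([existing_users, safe_records], ignore_index=True)
print(updated_users)
print(f"User table grew from {len(existing_users)} to {len(updated_users)} rows")
```

Because every record was deduplicated and conflict-checked first, the combined table keeps `user_id` unique without a second deduplication pass.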
⚠️ Duplicate Removal Pitfalls
Understanding when data that looks like duplicates is actually legitimate:
import pandas as pd
# Sales data demonstrating legitimate vs duplicate scenarios
sales = pd.DataFrame({
    'date': ['2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16', '2023-01-17'],
    'customer': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Mouse', 'Monitor'],
    'amount': [999, 25, 75, 25, 300]
})
print("📊 Sales Data Analysis:")
print(sales)
print()
print("✅ Correct Approach: Context-Aware Duplicate Handling")
print()
# Correct: Only remove exact duplicates
correct_approach = sales.drop_duplicates()
print("Exact duplicates removed (none found):")
print(correct_approach)
print(f"All {len(correct_approach)} transactions preserved")
print()
print("🎯 Business Logic Applied:")
print("- Same customer + date + different product = Legitimate")
print("- Same customer + date + same product = Check carefully")
print("- Exact match on all fields = Likely duplicate")
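The "check carefully" rule above can be sketched in code: flag same-day repeat purchases of the same product for human review instead of deleting them outright (a minimal sketch using the `sales` data from this section):

```python
import pandas as pd

# Sales data from this section: no row is an exact duplicate
sales = pd.DataFrame({
    'date': ['2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16', '2023-01-17'],
    'customer': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Mouse', 'Monitor'],
    'amount': [999, 25, 75, 25, 300]
})

# Flag (don't drop) rows where the same customer bought the same product
# on the same day; keep=False marks every row in each suspicious group
suspicious = sales.duplicated(subset=['date', 'customer', 'product'], keep=False)
print(sales[suspicious])  # empty here: every purchase is distinct
print(f"Rows needing review: {suspicious.sum()}")
```

Flagging rather than dropping keeps the decision with a human for the ambiguous middle case, while exact matches can still be removed automatically.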
🎮 Practice Challenge
Apply duplicate removal to a realistic customer service scenario:
import pandas as pd
# Customer support tickets with various duplicate issues
tickets = pd.DataFrame({
    'ticket_id': ['T001', 'T002', 'T003', 'T002', 'T004', 'T005', 'T006'],
    'customer_email': ['alice@email.com', 'bob@email.com', 'charlie@email.com',
                       'bob@email.com', 'diana@email.com', 'alice@email.com', 'eve@email.com'],
    'issue_type': ['Login Problem', 'Billing', 'Feature Request', 'Billing',
                   'Bug Report', 'Login Problem', 'Billing'],
    'priority': ['High', 'Medium', 'Low', 'Medium', 'High', 'Medium', 'Low'],
    'created_date': ['2023-03-01', '2023-03-01', '2023-03-02', '2023-03-01',
                     '2023-03-03', '2023-03-04', '2023-03-05'],
    'status': ['Open', 'Resolved', 'Open', 'Resolved', 'Open', 'In Progress', 'Open']
})
print("🎮 Customer Support Duplicate Challenge:")
print(tickets)
print()
print("Mission: Clean this support ticket data")
print()
# Analysis and solution
print("🔍 Duplicate Analysis:")
print(f"1️⃣ Total tickets: {len(tickets)}")
print(f"2️⃣ Exact duplicates: {tickets.duplicated().sum()}")
print(f"3️⃣ Duplicate ticket IDs: {tickets.duplicated(subset=['ticket_id']).sum()}")
print()
# Solution steps
print("🧹 Cleaning Steps:")
cleaned_tickets = tickets.copy()
print("Step 1: Remove exact duplicates")
cleaned_tickets = cleaned_tickets.drop_duplicates()
print(f"After exact duplicate removal: {len(cleaned_tickets)} tickets")
print("Step 2: Handle duplicate ticket IDs (keep most recent)")
cleaned_tickets = cleaned_tickets.sort_values('created_date').drop_duplicates(
    subset=['ticket_id'], keep='last'
)
print(f"After ID deduplication: {len(cleaned_tickets)} tickets")
print("Step 3: Review result")
print(cleaned_tickets)
print()
print("✅ Clean dataset ready for support team analysis!")
🚀 What's Next?
Excellent work! You now know how to identify and remove duplicates effectively. Next, let's learn about converting data to the right types for proper analysis.
Continue to: Data Type Conversion
You're mastering data quality control! 🧹✨