📝 String Cleaning Operations
String cleaning is like editing a document - you fix typos, standardize formatting, and remove unnecessary content to make text data consistent and analysis-ready. Clean text data is crucial for accurate searches, grouping, and reporting.
🧹 Basic Text Cleaning
Start with the most common text cleaning operations:
import pandas as pd
# Customer data with messy text
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': [' Alice Johnson ', 'bob smith', 'CHARLIE BROWN', ' diana LEE', 'Eve Davis '],
'email': ['ALICE@EMAIL.COM', 'bob@email.com', 'Charlie@Email.Com', 'diana@EMAIL.com', 'eve@email.COM'],
'city': [' New York ', 'los angeles', 'CHICAGO', 'boston ', ' Miami']
})
print("📊 Messy Customer Data:")
print(customers)
print()
print("🧹 Basic Cleaning Operations:")
print()
print("1️⃣ Remove extra whitespace:")
customers['name_clean'] = customers['name'].str.strip()
print("Names after strip():")
print(customers[['name', 'name_clean']])
print()
print("2️⃣ Standardize case:")
customers['name_clean'] = customers['name_clean'].str.title()
customers['email_clean'] = customers['email'].str.lower()
customers['city_clean'] = customers['city'].str.strip().str.title()
print("After case standardization:")
print(customers[['name_clean', 'email_clean', 'city_clean']])
print()
print("3️⃣ Final clean data:")
clean_customers = customers[['customer_id', 'name_clean', 'email_clean', 'city_clean']]
clean_customers.columns = ['customer_id', 'name', 'email', 'city']
print(clean_customers)
🔧 Advanced Text Cleaning
Handle more complex text cleaning scenarios:
import pandas as pd
# Product data with various text issues
products = pd.DataFrame({
'product_name': [' iPhone-14-Pro ', 'Samsung Galaxy S23!', 'MacBook Air (2023)',
'iPad Pro 12.9"', 'Dell XPS-13 Laptop'],
'category': ['Electronics/Phones', 'Electronics / Phones', 'Computers/Laptops',
'Electronics / Tablets', 'Computers / Laptops'],
'brand': ['Apple Inc.', 'SAMSUNG CORP', 'apple inc.', 'Apple Inc.', 'Dell Technologies'],
'description': ['Latest iPhone model!!!', 'Amazing phone with great camera.',
'Lightweight laptop for work', 'Professional tablet device.',
'Business laptop with excellent battery life.']
})
print("📊 Product Data with Text Issues:")
print(products)
print()
print("🔧 Advanced Cleaning Operations:")
print()
print("1️⃣ Clean product names (remove special chars):")
products['product_name_clean'] = (products['product_name']
.str.strip()
.str.replace('[^a-zA-Z0-9\s]', '', regex=True)
.str.replace('\s+', ' ', regex=True))
print("Clean product names:")
print(products[['product_name', 'product_name_clean']])
print()
print("2️⃣ Standardize categories:")
products['category_clean'] = (products['category']
.str.replace('\s*/\s*', '/', regex=True)
.str.title())
print("Clean categories:")
print(products[['category', 'category_clean']])
print()
print("3️⃣ Standardize brand names:")
products['brand_clean'] = (products['brand']
.str.replace('Inc\.', '', regex=True)
.str.replace('Corp\.?', '', regex=True)
.str.replace('Technologies', '', regex=True)
.str.strip()
.str.title())
print("Clean brands:")
print(products[['brand', 'brand_clean']])
print()
print("4️⃣ Clean descriptions (remove extra punctuation):")
products['description_clean'] = (products['description']
.str.replace('[!]{2,}', '.', regex=True)
.str.strip())
print("Clean descriptions:")
print(products[['description', 'description_clean']])
📞 Standardizing Formats
Clean and standardize common data formats:
import pandas as pd
# Contact data with inconsistent formats
contacts = pd.DataFrame({
'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown'],
'phone': ['(555) 123-4567', '555.123.4567', '5551234567'],
'email': ['Alice@Company.COM', 'bob.smith@company.com', 'CHARLIE@COMPANY.COM'],
'address': ['123 Main St, New York, NY', '456 Oak Ave, Los Angeles, CA',
'789 Pine St, Chicago, IL']
})
print("📊 Contact Data with Format Issues:")
print(contacts)
print()
print("📞 Format Standardization:")
print()
print("1️⃣ Standardize phone numbers:")
contacts['phone_clean'] = (contacts['phone']
.str.replace('[^0-9]', '', regex=True) # Remove non-digits
.str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)) # Format
print("Clean phone numbers:")
print(contacts[['phone', 'phone_clean']])
print()
print("2️⃣ Standardize emails:")
contacts['email_clean'] = contacts['email'].str.lower()
print("Clean emails:")
print(contacts[['email', 'email_clean']])
print()
print("3️⃣ Extract state codes from addresses:")
contacts['state'] = contacts['address'].str.extract(r', ([A-Z]{2})$')
print("Extracted states:")
print(contacts[['address', 'state']])
🏷️ Text Categorization and Standardization
Group similar text variations into standard categories:
import pandas as pd
# Survey data with varied responses
survey = pd.DataFrame({
'respondent_id': [1, 2, 3, 4, 5, 6, 7, 8],
'satisfaction': ['Very Satisfied', 'satisfied', 'VERY SATISFIED', 'Neutral',
'dissatisfied', 'Very Dissatisfied', 'neutral', 'Satisfied'],
'preferred_contact': ['email', 'E-mail', 'phone', 'Phone Call', 'EMAIL',
'text message', 'SMS', 'phone call'],
'frequency': ['Daily', 'weekly', 'Weekly', 'Monthly', 'DAILY', 'monthly', 'Never', 'never']
})
print("📊 Survey Data with Varied Responses:")
print(survey)
print()
print("🏷️ Text Standardization:")
print()
print("1️⃣ Standardize satisfaction levels:")
satisfaction_mapping = {
'very satisfied': 'Very Satisfied',
'satisfied': 'Satisfied',
'neutral': 'Neutral',
'dissatisfied': 'Dissatisfied',
'very dissatisfied': 'Very Dissatisfied'
}
survey['satisfaction_clean'] = (survey['satisfaction']
.str.lower()
.map(satisfaction_mapping))
print("Clean satisfaction levels:")
print(survey[['satisfaction', 'satisfaction_clean']])
print()
print("2️⃣ Standardize contact preferences:")
contact_mapping = {
'email': 'Email',
'e-mail': 'Email',
'phone': 'Phone',
'phone call': 'Phone',
'text message': 'SMS',
'sms': 'SMS'
}
survey['contact_clean'] = (survey['preferred_contact']
.str.lower()
.map(contact_mapping))
print("Clean contact preferences:")
print(survey[['preferred_contact', 'contact_clean']])
print()
print("3️⃣ Standardize frequency:")
survey['frequency_clean'] = survey['frequency'].str.title()
print("Clean frequency:")
print(survey[['frequency', 'frequency_clean']])
🔍 Text Validation and Quality Checks
Identify and flag problematic text data:
import pandas as pd
import re
# User registration data with validation needs
users = pd.DataFrame({
'username': ['alice123', 'bob@user', 'charlie_2023', 'diana!', 'eve_smith'],
'email': ['alice@email.com', 'invalid-email', 'charlie@domain.co.uk',
'diana@', 'eve@company.com'],
'full_name': ['Alice Johnson', 'B', 'Charlie Brown-Smith',
'Diana Lee-Jones III', ''],
'postal_code': ['12345', '90210-1234', 'invalid', 'K1A 0A6', '00000']
})
print("📊 User Registration Data:")
print(users)
print()
print("🔍 Text Validation:")
print()
print("1️⃣ Validate usernames (alphanumeric + underscore only):")
users['username_valid'] = users['username'].str.match(r'^[a-zA-Z0-9_]+$')
print("Username validation:")
print(users[['username', 'username_valid']])
print()
print("2️⃣ Validate email format:")
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
users['email_valid'] = users['email'].str.match(email_pattern)
print("Email validation:")
print(users[['email', 'email_valid']])
print()
print("3️⃣ Check name length (reasonable range):")
users['name_length'] = users['full_name'].str.len()
users['name_valid'] = (users['name_length'] >= 2) & (users['name_length'] <= 50)
print("Name validation:")
print(users[['full_name', 'name_length', 'name_valid']])
print()
print("4️⃣ Postal code patterns:")
# US: 5 digits or 5-4 format, Canada: A1A 1A1 format
us_postal = users['postal_code'].str.match(r'^\d{5}(-\d{4})?$')
ca_postal = users['postal_code'].str.match(r'^[A-Z]\d[A-Z] \d[A-Z]\d$')
users['postal_valid'] = us_postal | ca_postal
print("Postal code validation:")
print(users[['postal_code', 'postal_valid']])
🧽 Complete Text Cleaning Workflow
A systematic approach to cleaning an entire text dataset:
import pandas as pd
# Customer feedback dataset (typical messy real-world data)
feedback = pd.DataFrame({
'customer_name': [' Alice Johnson ', 'bob smith jr.', 'CHARLIE O\'BRIEN',
'diana lee-jones', ' Eve Davis-Smith III '],
'company': ['ABC Corp.', 'xyz COMPANY inc.', 'Tech Solutions LLC',
'ABC CORP', 'Data Systems Inc.'],
'feedback_text': ['Great product!!!', 'Good service but could be better.',
'EXCELLENT CUSTOMER SUPPORT!!!', 'Average experience',
'Poor quality... very disappointed.'],
'rating_text': ['Excellent', 'good', 'EXCELLENT', 'Average', 'poor']
})
print("📊 Raw Customer Feedback:")
print(feedback)
print()
# Step 1: Create working copy
clean_feedback = feedback.copy()
# Step 2: Apply systematic cleaning
print("🧽 Applying Systematic Cleaning:")
print()
# Clean customer names (basic text cleaning)
clean_feedback['customer_name'] = (clean_feedback['customer_name']
.str.strip()
.str.title()
.str.replace(r'\s+', ' ', regex=True))
# Standardize company names
clean_feedback['company'] = (clean_feedback['company']
.str.strip()
.str.title()
.str.replace(r'\b(Inc|Corp|LLC)\.?\b', '', regex=True)
.str.replace(r'\s+', ' ', regex=True)
.str.strip())
# Clean feedback text
clean_feedback['feedback_text'] = (clean_feedback['feedback_text']
.str.replace(r'[!]{2,}', '!', regex=True) # Multiple ! to single
.str.replace(r'\.{2,}', '.', regex=True) # Multiple . to single
.str.strip())
# Standardize rating categories
rating_mapping = {
'excellent': 'Excellent',
'good': 'Good',
'average': 'Average',
'poor': 'Poor'
}
clean_feedback['rating_text'] = (clean_feedback['rating_text']
.str.lower()
.map(rating_mapping))
print("✅ Final cleaned dataset:")
print(clean_feedback)
print()
print("🎯 Cleaning accomplished:")
print("- Professional name formatting")
print("- Standardized company names")
print("- Consistent punctuation")
print("- Unified rating categories")
print("- Analysis-ready text data!")
🎯 Key Takeaways
🎮 Practice Challenge
Apply comprehensive text cleaning to a realistic business scenario:
import pandas as pd
# Employee directory export (typical HR system data)
employee_directory = pd.DataFrame({
'emp_id': [1001, 1002, 1003, 1004, 1005],
'full_name': [' ALICE M. JOHNSON ', 'bob smith jr', 'charlie o\'brien-brown',
'diana lee jones', ' EVE MARIE DAVIS-SMITH '],
'email': ['Alice.Johnson@COMPANY.COM', 'bob.smith@company.com',
'CHARLIE.OBRIEN@COMPANY.COM', 'diana.jones@Company.com',
'eve.davis@COMPANY.COM'],
'department': ['Human Resources', 'information technology', 'SALES & MARKETING',
'human resources', 'Information Technology'],
'phone': ['(555) 123-4567', '555.987.6543', '5551234567',
'(555) 999-8888', '555-444-3333'],
'job_title': ['HR Manager', 'sr. software engineer', 'SALES REPRESENTATIVE',
'hr specialist', 'Senior Software Engineer']
})
print("🎮 Employee Directory Cleaning Challenge:")
print(employee_directory)
print()
print("Mission: Clean and standardize all text fields for professional directory")
print()
# Solution
print("🧽 Comprehensive Cleaning Solution:")
cleaned_directory = employee_directory.copy()
# Clean names
cleaned_directory['full_name'] = (cleaned_directory['full_name']
.str.strip()
.str.title()
.str.replace(r'\s+', ' ', regex=True))
# Clean emails
cleaned_directory['email'] = cleaned_directory['email'].str.lower()
# Standardize departments
dept_mapping = {
'human resources': 'Human Resources',
'information technology': 'Information Technology',
'sales & marketing': 'Sales & Marketing'
}
cleaned_directory['department'] = (cleaned_directory['department']
.str.lower()
.map(dept_mapping))
# Standardize phone numbers
cleaned_directory['phone'] = (cleaned_directory['phone']
.str.replace('[^0-9]', '', regex=True)
.str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True))
# Clean job titles
cleaned_directory['job_title'] = (cleaned_directory['job_title']
.str.replace(r'\bsr\.\b', 'Senior', regex=True, flags=re.IGNORECASE)
.str.replace(r'\bhr\b', 'HR', regex=True, flags=re.IGNORECASE)
.str.title())
print("✅ Professional Employee Directory:")
print(cleaned_directory)
print()
print("🎯 Achievements:")
print("- Consistent name formatting")
print("- Standardized email format")
print("- Unified department names")
print("- Professional phone number format")
print("- Consistent job title formatting")
print("- Ready for company directory!")
🚀 What's Next?
Excellent work! You've mastered all the essential data cleaning techniques. With clean data in hand, you're ready to move on to transforming and analyzing your data.
Continue to: Data Transformation
You're now a data cleaning expert! 📝✨
Was this helpful?
Track Your Learning Progress
Sign in to bookmark tutorials and keep track of your learning journey.
Your progress is saved automatically as you read.