📝 String Cleaning Operations

String cleaning is like editing a document - you fix typos, standardize formatting, and remove unnecessary content to make text data consistent and analysis-ready. Clean text data is crucial for accurate searches, grouping, and reporting.

🧹 Basic Text Cleaning

Start with the most common text cleaning operations:

import pandas as pd

# Customer data with messy text
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['  Alice Johnson  ', 'bob smith', 'CHARLIE BROWN', '  diana LEE', 'Eve Davis  '],
    'email': ['ALICE@EMAIL.COM', 'bob@email.com', 'Charlie@Email.Com', 'diana@EMAIL.com', 'eve@email.COM'],
    'city': ['  New York  ', 'los angeles', 'CHICAGO', 'boston  ', '  Miami']
})

print("📊 Messy Customer Data:")
print(customers)
print()

print("🧹 Basic Cleaning Operations:")
print()

print("1️⃣ Remove extra whitespace:")
customers['name_clean'] = customers['name'].str.strip()
print("Names after strip():")
print(customers[['name', 'name_clean']])
print()

print("2️⃣ Standardize case:")
customers['name_clean'] = customers['name_clean'].str.title()
customers['email_clean'] = customers['email'].str.lower()
customers['city_clean'] = customers['city'].str.strip().str.title()
print("After case standardization:")
print(customers[['name_clean', 'email_clean', 'city_clean']])
print()

print("3️⃣ Final clean data:")
clean_customers = customers[['customer_id', 'name_clean', 'email_clean', 'city_clean']]
clean_customers.columns = ['customer_id', 'name', 'email', 'city']
print(clean_customers)

🔧 Advanced Text Cleaning

Handle more complex text cleaning scenarios:

import pandas as pd

# Product data with various text issues
products = pd.DataFrame({
    'product_name': ['  iPhone-14-Pro  ', 'Samsung Galaxy S23!', 'MacBook Air (2023)', 
                     'iPad Pro 12.9"', 'Dell XPS-13 Laptop'],
    'category': ['Electronics/Phones', 'Electronics / Phones', 'Computers/Laptops', 
                 'Electronics / Tablets', 'Computers / Laptops'],
    'brand': ['Apple Inc.', 'SAMSUNG CORP', 'apple inc.', 'Apple Inc.', 'Dell Technologies'],
    'description': ['Latest iPhone model!!!', 'Amazing phone with great camera.', 
                    'Lightweight laptop for work', 'Professional tablet device.', 
                    'Business laptop with excellent battery life.']
})

print("📊 Product Data with Text Issues:")
print(products)
print()

print("🔧 Advanced Cleaning Operations:")
print()

print("1️⃣ Clean product names (remove special chars):")
products['product_name_clean'] = (products['product_name']
                                  .str.strip()
                                  .str.replace('[^a-zA-Z0-9\s]', '', regex=True)
                                  .str.replace('\s+', ' ', regex=True))
print("Clean product names:")
print(products[['product_name', 'product_name_clean']])
print()

print("2️⃣ Standardize categories:")
products['category_clean'] = (products['category']
                              .str.replace('\s*/\s*', '/', regex=True)
                              .str.title())
print("Clean categories:")
print(products[['category', 'category_clean']])
print()

print("3️⃣ Standardize brand names:")
products['brand_clean'] = (products['brand']
                           .str.replace('Inc\.', '', regex=True)
                           .str.replace('Corp\.?', '', regex=True)
                           .str.replace('Technologies', '', regex=True)
                           .str.strip()
                           .str.title())
print("Clean brands:")
print(products[['brand', 'brand_clean']])
print()

print("4️⃣ Clean descriptions (remove extra punctuation):")
products['description_clean'] = (products['description']
                                 .str.replace('[!]{2,}', '.', regex=True)
                                 .str.strip())
print("Clean descriptions:")
print(products[['description', 'description_clean']])

📞 Standardizing Formats

Clean and standardize common data formats:

import pandas as pd

# Contact data with inconsistent formats
contacts = pd.DataFrame({
    'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown'],
    'phone': ['(555) 123-4567', '555.123.4567', '5551234567'],
    'email': ['Alice@Company.COM', 'bob.smith@company.com', 'CHARLIE@COMPANY.COM'],
    'address': ['123 Main St, New York, NY', '456 Oak Ave, Los Angeles, CA', 
                '789 Pine St, Chicago, IL']
})

print("📊 Contact Data with Format Issues:")
print(contacts)
print()

print("📞 Format Standardization:")
print()

print("1️⃣ Standardize phone numbers:")
contacts['phone_clean'] = (contacts['phone']
                           .str.replace('[^0-9]', '', regex=True)  # Remove non-digits
                           .str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True))  # Format
print("Clean phone numbers:")
print(contacts[['phone', 'phone_clean']])
print()

print("2️⃣ Standardize emails:")
contacts['email_clean'] = contacts['email'].str.lower()
print("Clean emails:")
print(contacts[['email', 'email_clean']])
print()

print("3️⃣ Extract state codes from addresses:")
contacts['state'] = contacts['address'].str.extract(r', ([A-Z]{2})$')
print("Extracted states:")
print(contacts[['address', 'state']])

🏷️ Text Categorization and Standardization

Group similar text variations into standard categories:

import pandas as pd

# Survey data with varied responses
survey = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'satisfaction': ['Very Satisfied', 'satisfied', 'VERY SATISFIED', 'Neutral', 
                     'dissatisfied', 'Very Dissatisfied', 'neutral', 'Satisfied'],
    'preferred_contact': ['email', 'E-mail', 'phone', 'Phone Call', 'EMAIL', 
                          'text message', 'SMS', 'phone call'],
    'frequency': ['Daily', 'weekly', 'Weekly', 'Monthly', 'DAILY', 'monthly', 'Never', 'never']
})

print("📊 Survey Data with Varied Responses:")
print(survey)
print()

print("🏷️ Text Standardization:")
print()

print("1️⃣ Standardize satisfaction levels:")
satisfaction_mapping = {
    'very satisfied': 'Very Satisfied',
    'satisfied': 'Satisfied', 
    'neutral': 'Neutral',
    'dissatisfied': 'Dissatisfied',
    'very dissatisfied': 'Very Dissatisfied'
}

survey['satisfaction_clean'] = (survey['satisfaction']
                                .str.lower()
                                .map(satisfaction_mapping))
print("Clean satisfaction levels:")
print(survey[['satisfaction', 'satisfaction_clean']])
print()

print("2️⃣ Standardize contact preferences:")
contact_mapping = {
    'email': 'Email',
    'e-mail': 'Email',
    'phone': 'Phone',
    'phone call': 'Phone',
    'text message': 'SMS',
    'sms': 'SMS'
}

survey['contact_clean'] = (survey['preferred_contact']
                           .str.lower()
                           .map(contact_mapping))
print("Clean contact preferences:")
print(survey[['preferred_contact', 'contact_clean']])
print()

print("3️⃣ Standardize frequency:")
survey['frequency_clean'] = survey['frequency'].str.title()
print("Clean frequency:")
print(survey[['frequency', 'frequency_clean']])

🔍 Text Validation and Quality Checks

Identify and flag problematic text data:

import pandas as pd
import re

# User registration data with validation needs
users = pd.DataFrame({
    'username': ['alice123', 'bob@user', 'charlie_2023', 'diana!', 'eve_smith'],
    'email': ['alice@email.com', 'invalid-email', 'charlie@domain.co.uk', 
              'diana@', 'eve@company.com'],
    'full_name': ['Alice Johnson', 'B', 'Charlie Brown-Smith', 
                  'Diana Lee-Jones III', ''],
    'postal_code': ['12345', '90210-1234', 'invalid', 'K1A 0A6', '00000']
})

print("📊 User Registration Data:")
print(users)
print()

print("🔍 Text Validation:")
print()

print("1️⃣ Validate usernames (alphanumeric + underscore only):")
users['username_valid'] = users['username'].str.match(r'^[a-zA-Z0-9_]+$')
print("Username validation:")
print(users[['username', 'username_valid']])
print()

print("2️⃣ Validate email format:")
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
users['email_valid'] = users['email'].str.match(email_pattern)
print("Email validation:")
print(users[['email', 'email_valid']])
print()

print("3️⃣ Check name length (reasonable range):")
users['name_length'] = users['full_name'].str.len()
users['name_valid'] = (users['name_length'] >= 2) & (users['name_length'] <= 50)
print("Name validation:")
print(users[['full_name', 'name_length', 'name_valid']])
print()

print("4️⃣ Postal code patterns:")
# US: 5 digits or 5-4 format, Canada: A1A 1A1 format
us_postal = users['postal_code'].str.match(r'^\d{5}(-\d{4})?$')
ca_postal = users['postal_code'].str.match(r'^[A-Z]\d[A-Z] \d[A-Z]\d$')
users['postal_valid'] = us_postal | ca_postal
print("Postal code validation:")
print(users[['postal_code', 'postal_valid']])

🧽 Complete Text Cleaning Workflow

A systematic approach to cleaning an entire text dataset:

import pandas as pd

# Customer feedback dataset (typical messy real-world data)
feedback = pd.DataFrame({
    'customer_name': ['  Alice Johnson  ', 'bob smith jr.', 'CHARLIE O\'BRIEN', 
                      'diana lee-jones', '  Eve Davis-Smith III  '],
    'company': ['ABC Corp.', 'xyz COMPANY inc.', 'Tech Solutions LLC', 
                'ABC CORP', 'Data Systems Inc.'],
    'feedback_text': ['Great product!!!', 'Good service but could be better.', 
                      'EXCELLENT CUSTOMER SUPPORT!!!', 'Average experience', 
                      'Poor quality... very disappointed.'],
    'rating_text': ['Excellent', 'good', 'EXCELLENT', 'Average', 'poor']
})

print("📊 Raw Customer Feedback:")
print(feedback)
print()

# Step 1: Create working copy
clean_feedback = feedback.copy()

# Step 2: Apply systematic cleaning
print("🧽 Applying Systematic Cleaning:")
print()

# Clean customer names (basic text cleaning)
clean_feedback['customer_name'] = (clean_feedback['customer_name']
                                   .str.strip()
                                   .str.title()
                                   .str.replace(r'\s+', ' ', regex=True))

# Standardize company names
clean_feedback['company'] = (clean_feedback['company']
                             .str.strip()
                             .str.title()
                             .str.replace(r'\b(Inc|Corp|LLC)\.?\b', '', regex=True)
                             .str.replace(r'\s+', ' ', regex=True)
                             .str.strip())

# Clean feedback text
clean_feedback['feedback_text'] = (clean_feedback['feedback_text']
                                   .str.replace(r'[!]{2,}', '!', regex=True)  # Multiple ! to single
                                   .str.replace(r'\.{2,}', '.', regex=True)   # Multiple . to single
                                   .str.strip())

# Standardize rating categories
rating_mapping = {
    'excellent': 'Excellent',
    'good': 'Good', 
    'average': 'Average',
    'poor': 'Poor'
}
clean_feedback['rating_text'] = (clean_feedback['rating_text']
                                 .str.lower()
                                 .map(rating_mapping))

print("✅ Final cleaned dataset:")
print(clean_feedback)
print()

print("🎯 Cleaning accomplished:")
print("- Professional name formatting")
print("- Standardized company names")
print("- Consistent punctuation")
print("- Unified rating categories")
print("- Analysis-ready text data!")

🎯 Key Takeaways

🎮 Practice Challenge

Apply comprehensive text cleaning to a realistic business scenario:

import pandas as pd

# Employee directory export (typical HR system data)
employee_directory = pd.DataFrame({
    'emp_id': [1001, 1002, 1003, 1004, 1005],
    'full_name': ['  ALICE M. JOHNSON  ', 'bob smith jr', 'charlie o\'brien-brown', 
                  'diana lee jones', '  EVE MARIE DAVIS-SMITH  '],
    'email': ['Alice.Johnson@COMPANY.COM', 'bob.smith@company.com', 
              'CHARLIE.OBRIEN@COMPANY.COM', 'diana.jones@Company.com', 
              'eve.davis@COMPANY.COM'],
    'department': ['Human Resources', 'information technology', 'SALES & MARKETING', 
                   'human resources', 'Information Technology'],
    'phone': ['(555) 123-4567', '555.987.6543', '5551234567', 
              '(555) 999-8888', '555-444-3333'],
    'job_title': ['HR Manager', 'sr. software engineer', 'SALES REPRESENTATIVE', 
                  'hr specialist', 'Senior Software Engineer']
})

print("🎮 Employee Directory Cleaning Challenge:")
print(employee_directory)
print()
print("Mission: Clean and standardize all text fields for professional directory")
print()

# Solution
print("🧽 Comprehensive Cleaning Solution:")
cleaned_directory = employee_directory.copy()

# Clean names
cleaned_directory['full_name'] = (cleaned_directory['full_name']
                                  .str.strip()
                                  .str.title()
                                  .str.replace(r'\s+', ' ', regex=True))

# Clean emails
cleaned_directory['email'] = cleaned_directory['email'].str.lower()

# Standardize departments
dept_mapping = {
    'human resources': 'Human Resources',
    'information technology': 'Information Technology', 
    'sales & marketing': 'Sales & Marketing'
}
cleaned_directory['department'] = (cleaned_directory['department']
                                   .str.lower()
                                   .map(dept_mapping))

# Standardize phone numbers
cleaned_directory['phone'] = (cleaned_directory['phone']
                              .str.replace('[^0-9]', '', regex=True)
                              .str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True))

# Clean job titles
cleaned_directory['job_title'] = (cleaned_directory['job_title']
                                  .str.replace(r'\bsr\.\b', 'Senior', regex=True, flags=re.IGNORECASE)
                                  .str.replace(r'\bhr\b', 'HR', regex=True, flags=re.IGNORECASE)
                                  .str.title())

print("✅ Professional Employee Directory:")
print(cleaned_directory)
print()
print("🎯 Achievements:")
print("- Consistent name formatting")
print("- Standardized email format")
print("- Unified department names")
print("- Professional phone number format")
print("- Consistent job title formatting")
print("- Ready for company directory!")

🚀 What's Next?

Excellent work! You've mastered all the essential data cleaning techniques. With clean data in hand, you're ready to move on to transforming and analyzing your data.

Continue to: Data Transformation

You're now a data cleaning expert! 📝✨

Was this helpful?

😔Poor
🙁Fair
😊Good
😄Great
🤩Excellent