🔍 Column Information and Data Types
Understanding your DataFrame's columns and data types is essential for effective data analysis. Each column has a specific data type that determines what operations you can perform. Let's explore how to examine columns and work with different data types.
📊 Exploring Column Information
Start by understanding what columns you have and their types:
import pandas as pd
# Sample dataset with mixed types
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'salary': [50000.0, 60000.5, 55000.0],
'active': [True, False, True],
'department': ['Sales', 'IT', 'Marketing']
})
print("📊 Employee Data:")
print(employees)
print()
print("🔍 Column Information:")
employees.info()  # info() prints its report directly and returns None, so no print() is needed
print()
print("🏷️ Data Types:")
print(employees.dtypes)
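Once you know the types, it can help to group columns by kind. The sketch below uses select_dtypes for that, reusing the employees data from above. The variable names numeric_cols, bool_cols, and other_cols are illustrative only.

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000.0, 60000.5, 55000.0],
    'active': [True, False, True],
    'department': ['Sales', 'IT', 'Marketing']
})

# 'number' matches integer and float columns, but not booleans
numeric_cols = employees.select_dtypes(include='number').columns.tolist()
# Boolean columns are selected separately
bool_cols = employees.select_dtypes(include='bool').columns.tolist()
# Everything else (here, the text columns)
other_cols = employees.select_dtypes(exclude=['number', 'bool']).columns.tolist()

print("Numeric:", numeric_cols)
print("Boolean:", bool_cols)
print("Other:", other_cols)
```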
🔢 Working with Numerical Columns
Numerical columns (int, float) support mathematical operations:
import pandas as pd
# Numerical data
sales = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.99, 25.50, 15.75],
'quantity': [5, 3, 8]
})
print("Sales Data:")
print(sales)
print()
print("💰 Price Column Analysis:")
print(f"Type: {sales['price'].dtype}")
print(f"Average: ${sales['price'].mean():.2f}")
print(f"Total: ${sales['price'].sum():.2f}")
print()
print("📦 Quantity Column Analysis:")
print(f"Type: {sales['quantity'].dtype}")
print(f"Total quantity: {sales['quantity'].sum()}")
print(f"Max quantity: {sales['quantity'].max()}")
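Mathematical operations also work between whole columns, element by element. As a small sketch, the code below multiplies price by quantity for each row. The revenue column is a hypothetical name added for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'price': [10.99, 25.50, 15.75],
    'quantity': [5, 3, 8]
})

# Multiply two numerical columns row by row (float * int gives float)
sales['revenue'] = sales['price'] * sales['quantity']
total_revenue = sales['revenue'].sum()

print(sales)
print(f"Total revenue: ${total_revenue:.2f}")
```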
📝 Working with Text Columns
Text columns (typically stored with the object dtype) come with string-oriented methods instead of arithmetic:
import pandas as pd
# Text data
customers = pd.DataFrame({
'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown'],
'city': ['New York', 'Los Angeles', 'Chicago'],
'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']
})
print("Customer Data:")
print(customers)
print()
print("📧 Email Column Analysis:")
print(f"Type: {customers['email'].dtype}")
print(f"Unique emails: {customers['email'].nunique()}")
print("Email domains:")
for email in customers['email']:
    domain = email.split('@')[1]
    print(f" {domain}")
print()
print("🏙️ City Column Analysis:")
print(customers['city'].value_counts())
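The loop over emails above works, but text columns also expose the .str accessor, which applies a string method to every value at once. Here is a minimal sketch of the same domain extraction done that way. The names domains and first_names are illustrative.

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']
})

# Split each email on '@' and keep the second piece (the domain)
domains = customers['email'].str.split('@').str[1]
# Split each full name on the space and keep the first piece
first_names = customers['name'].str.split(' ').str[0]

print(domains)
print(first_names)
print(customers['name'].str.upper())
```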
✅ Working with Boolean Columns
Boolean columns store True/False values:
import pandas as pd
# Boolean data
survey = pd.DataFrame({
'respondent': [1, 2, 3, 4],
'satisfied': [True, False, True, True],
'will_recommend': [True, False, False, True]
})
print("Survey Data:")
print(survey)
print()
print("😊 Satisfaction Analysis:")
print(f"Type: {survey['satisfied'].dtype}")
print(f"Satisfied customers: {survey['satisfied'].sum()}")
print(f"Satisfaction rate: {survey['satisfied'].mean():.2%}")
print()
print("👍 Recommendation Analysis:")
print(f"Will recommend: {survey['will_recommend'].sum()}")
print(f"Recommendation rate: {survey['will_recommend'].mean():.2%}")
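Boolean columns are also handy as filters. The sketch below combines the two survey columns with & (and) and ~ (not) to pick out rows. The names both, promoters, and unhappy are illustrative.

```python
import pandas as pd

survey = pd.DataFrame({
    'respondent': [1, 2, 3, 4],
    'satisfied': [True, False, True, True],
    'will_recommend': [True, False, False, True]
})

# True only where both columns are True
both = survey['satisfied'] & survey['will_recommend']
promoters = survey[both]

# ~ flips True/False, selecting the respondents who are not satisfied
unhappy = survey[~survey['satisfied']]

print("Satisfied and will recommend:")
print(promoters)
print("Not satisfied:")
print(unhappy)
```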
🔍 Detailed Column Exploration
Examine individual columns thoroughly:
import pandas as pd
# Mixed dataset
products = pd.DataFrame({
'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price': [999.99, 24.99, 74.99, 299.99],
'in_stock': [True, True, False, True],
'category': ['Computer', 'Accessory', 'Accessory', 'Computer']
})
print("Product Data:")
print(products)
print()
# Analyze each column
for column in products.columns:
    print(f"📊 {column.upper()} Column:")
    print(f" Type: {products[column].dtype}")
    print(f" Unique values: {products[column].nunique()}")
    print(f" Non-null count: {products[column].count()}")
    if products[column].dtype in ['int64', 'float64']:
        print(f" Range: {products[column].min()} to {products[column].max()}")
    elif products[column].dtype == 'bool':
        print(f" True count: {products[column].sum()}")
    else:
        print(f" Sample values: {products[column].unique()[:3]}")
    print()
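Comparing the dtype against 'int64' and 'float64' works for this dataset, but it would miss other numeric types such as int32. One alternative, sketched below, is to use the helper functions in pandas.api.types. The function column_kind is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

products = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 24.99, 74.99, 299.99],
    'in_stock': [True, True, False, True],
    'category': ['Computer', 'Accessory', 'Accessory', 'Computer']
})

def column_kind(series):
    """Classify a column as 'boolean', 'numeric', or 'other' (hypothetical helper)."""
    # Check for booleans first: is_numeric_dtype also returns True for bool columns
    if is_bool_dtype(series):
        return 'boolean'
    if is_numeric_dtype(series):
        return 'numeric'
    return 'other'

kinds = {col: column_kind(products[col]) for col in products.columns}
for col, kind in kinds.items():
    print(f"{col}: {kind}")
```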
🏷️ Common Data Type Issues
Watch out for these common problems:
import pandas as pd
# Data with type issues
messy_data = pd.DataFrame({
'age': ['25', '30', 'unknown', '35'], # Should be numbers
'price': ['$10.99', '$25.50', 'free', '$15.75'], # Has symbols
'active': ['yes', 'no', 'true', 'false'] # Should be boolean
})
print("❌ Messy Data:")
print(messy_data)
print()
print("🔍 Type Problems:")
print(messy_data.dtypes)
print()
print("🎯 What Should Be Fixed:")
print("- Age column: Contains 'unknown', should be numbers")
print("- Price column: Has '$' symbols and 'free'")
print("- Active column: Text instead of True/False")
print()
print("✅ After cleaning (example):")
# This is what you'd do after cleaning
clean_data = pd.DataFrame({
'age': [25, 30, None, 35], # None becomes NaN, so this column is float64
'price': [10.99, 25.50, 0.0, 15.75], # Clean numbers
'active': [True, False, True, False] # Proper booleans
})
print(clean_data)
print(clean_data.dtypes)
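The clean version above was typed in by hand. Here is one possible way to get there from messy_data in code, as a sketch rather than the only approach. It assumes 'free' should become 0.0 (matching the cleaned example above) and that 'yes'/'true' mean True while 'no'/'false' mean False.

```python
import pandas as pd

messy_data = pd.DataFrame({
    'age': ['25', '30', 'unknown', '35'],
    'price': ['$10.99', '$25.50', 'free', '$15.75'],
    'active': ['yes', 'no', 'true', 'false']
})

clean = pd.DataFrame()

# errors='coerce' turns values that cannot be parsed (like 'unknown') into NaN
clean['age'] = pd.to_numeric(messy_data['age'], errors='coerce')

# Strip the '$' symbol, treat 'free' as zero (an assumption), then convert
price_text = messy_data['price'].str.replace('$', '', regex=False).replace('free', '0')
clean['price'] = pd.to_numeric(price_text, errors='coerce')

# Map each known text label to a real boolean (the mapping is an assumption)
label_to_bool = {'yes': True, 'no': False, 'true': True, 'false': False}
clean['active'] = messy_data['active'].str.lower().map(label_to_bool)

print(clean)
print(clean.dtypes)
```

Any label missing from label_to_bool would come out as NaN, which makes unexpected values easy to spot afterwards.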
📋 Data Type Reference
| Type | Description | Examples | Common Operations |
|---|---|---|---|
| int64 | Whole numbers | 1, 100, -5 | .sum(), .mean(), math |
| float64 | Decimal numbers | 1.5, 99.99, -2.5 | .sum(), .mean(), math |
| object | Text/strings | "Alice", "NYC" | .str.upper(), .value_counts() |
| bool | True/False | True, False | .sum(), .mean() (as 1/0) |
| datetime64 | Dates/times | 2023-01-15 | .dt.year, .dt.month |
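The datetime64 row is the one type not covered by the examples so far. As a brief sketch, dates stored as text can be converted with pd.to_datetime, after which the .dt accessor becomes available. The orders DataFrame and its column names are made up for this illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical data: dates arrive as plain text
orders = pd.DataFrame({
    'order_date': ['2023-01-15', '2023-02-20', '2023-03-05'],
    'quantity': [5, 3, 8]
})

# Convert the text column to a datetime column
orders['order_date'] = pd.to_datetime(orders['order_date'])

# The .dt accessor exposes date parts for every row
orders['year'] = orders['order_date'].dt.year
orders['month'] = orders['order_date'].dt.month

print(orders)
print(orders.dtypes)
```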
🔧 Checking for Data Quality Issues
Use column exploration to spot problems:
import pandas as pd
# Dataset to check
data = pd.DataFrame({
'product': ['Laptop', 'Mouse', None, 'Monitor'],
'price': [999, 25, 75, None],
'rating': [4.5, 4.2, 5.0, 4.8]
})
print("Data Quality Check:")
print(data)
print()
print("🔍 Column-by-Column Analysis:")
for col in data.columns:
    print(f"\n{col.upper()}:")
    print(f" Type: {data[col].dtype}")
    print(f" Missing values: {data[col].isnull().sum()}")
    print(f" Unique values: {data[col].nunique()}")
    if data[col].dtype in ['int64', 'float64']:
        print(f" Min/Max: {data[col].min()} / {data[col].max()}")
    else:
        unique_vals = data[col].dropna().unique()
        print(f" Sample values: {unique_vals[:3]}")
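The same checks can be collected into a single summary table instead of printed one column at a time. This is a minimal sketch using the same data. The summary DataFrame and its column labels are illustrative choices.

```python
import pandas as pd

data = pd.DataFrame({
    'product': ['Laptop', 'Mouse', None, 'Monitor'],
    'price': [999, 25, 75, None],
    'rating': [4.5, 4.2, 5.0, 4.8]
})

# Each piece is a Series indexed by column name, so they line up as rows
summary = pd.DataFrame({
    'dtype': data.dtypes.astype(str),
    'missing': data.isna().sum(),
    'unique': data.nunique()  # nunique() ignores missing values by default
})

print(summary)
```

Note that price shows up as a float column even though its values look like whole numbers: the missing value forces pandas to store it as float64.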
🎯 Key Takeaways
🎮 Column Analysis Practice
Let's practice analyzing columns systematically:
import pandas as pd
# Practice dataset
practice = pd.DataFrame({
'customer_id': [1001, 1002, 1003, 1004],
'purchase_amount': [99.99, 149.50, 75.25, 200.00],
'is_member': [True, False, True, True],
'product_category': ['Electronics', 'Clothing', 'Electronics', 'Books']
})
print("🎯 Column Analysis Practice:")
print(practice)
print()
print("📊 Quick Column Summary:")
for col in practice.columns:
    dtype = practice[col].dtype
    unique = practice[col].nunique()
    missing = practice[col].isnull().sum()
    print(f"{col}: {dtype} | {unique} unique | {missing} missing")
print()
print("🔍 Detailed Analysis:")
print(f"💰 Purchase amounts: ${practice['purchase_amount'].min():.2f} - ${practice['purchase_amount'].max():.2f}")
print(f"👥 Member percentage: {practice['is_member'].mean():.1%}")
print(f"📦 Categories: {practice['product_category'].unique()}")
🚀 What's Next?
Perfect! You now understand how to explore columns and data types. Next, let's learn how to select and filter specific data from your DataFrames.
Continue to: Selecting Columns and Rows
You're becoming a data type expert! 🔍🏷️