📏 Data Shape and Statistics

Knowing your data's size and its basic statistics is the first step in any analysis. Let's learn how to quickly check your DataFrame's dimensions and calculate useful statistical summaries.

📐 Understanding Data Shape

The shape attribute is a (rows, columns) tuple that tells you exactly how big your DataFrame is, and size is the total number of cells (rows × columns):

import pandas as pd

# Sample dataset
sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999, 25, 75, 300],
    'quantity': [2, 10, 5, 1],
    'revenue': [1998, 250, 375, 300]
})

print("📊 Dataset:")
print(sales)
print()

print("📏 Shape Information:")
print(f"Shape: {sales.shape}")
print(f"Rows: {sales.shape[0]}")
print(f"Columns: {sales.shape[1]}")
print(f"Total cells: {sales.size}")
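
Because shape is a plain tuple, it can be unpacked and cross-checked against len() and size. Here is a minimal sketch of that relationship. The frame df and the names n_rows and n_cols are made up for illustration and are not part of the lesson's dataset:

```python
import pandas as pd

# Illustrative frame (hypothetical data): 3 rows, 2 columns
df = pd.DataFrame({'item': ['pen', 'book', 'bag'], 'price': [2, 12, 30]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape

print(f"Rows: {n_rows}, Columns: {n_cols}")
print(f"len(df) counts rows: {len(df)}")
print(f"len(df.columns) counts columns: {len(df.columns)}")
print(f"size equals rows x columns: {df.size == n_rows * n_cols}")
```

Unpacking is handy when you only need one of the two numbers later in your code.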

📊 Basic Statistics

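Get instant insights about your numerical data. describe() summarizes every numeric column at once, while methods like mean(), max(), and min() answer one question about a single column: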

import pandas as pd

# Student scores
students = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'math': [85, 92, 78, 88, 95],
    'english': [90, 87, 85, 92, 89]
})

print("Student Data:")
print(students)
print()

print("📈 Complete Statistics:")
print(students.describe())
print()

print("🎯 Quick Stats for Math:")
print(f"Average: {students['math'].mean():.1f}")
print(f"Highest: {students['math'].max()}")
print(f"Lowest: {students['math'].min()}")
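
One detail worth seeing in action: by default, describe() leaves out text columns such as 'student', and each statistic becomes a row label in the result. The sketch below reuses the student data from above. The variable name summary is just an illustrative choice:

```python
import pandas as pd

students = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'math': [85, 92, 78, 88, 95],
    'english': [90, 87, 85, 92, 89]
})

summary = students.describe()

# Only the numeric columns appear; 'student' is text and is skipped by default
print(list(summary.columns))

# Each statistic is a row label, so .loc selects it for every column at once
print(summary.loc['mean'])

# A single cell: the highest math score
print(summary.loc['max', 'math'])
```

Selecting from the describe() result this way saves you from calling several methods when you already have the summary.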

🔢 Individual Column Statistics

Explore specific columns in detail:

import pandas as pd

# Product ratings
products = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10, 25, 15, 40, 30],
    'rating': [4.2, 4.8, 3.9, 4.5, 4.1]
})

print("Product Data:")
print(products)
print()

print("💰 Price Analysis:")
print(f"Mean price: ${products['price'].mean():.2f}")
print(f"Median price: ${products['price'].median():.2f}")
print(f"Price range: ${products['price'].min()} - ${products['price'].max()}")
print()

print("⭐ Rating Analysis:")
print(f"Average rating: {products['rating'].mean():.2f}")
print(f"Best rating: {products['rating'].max()}")
print(f"Worst rating: {products['rating'].min()}")
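
If you want several of these numbers in one go, a column's agg() method accepts a list of statistic names. The following is a small sketch using the same product data. Looking up the top-rated product with idxmax() is an extra step added here for illustration, and the names price_summary and best_product are made up:

```python
import pandas as pd

products = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10, 25, 15, 40, 30],
    'rating': [4.2, 4.8, 3.9, 4.5, 4.1]
})

# Several statistics for one column in a single call
price_summary = products['price'].agg(['mean', 'median', 'min', 'max'])
print(price_summary)

# idxmax() returns the row label of the largest value,
# which .loc can use to look up the matching product name
best_product = products.loc[products['rating'].idxmax(), 'product']
print(f"Best-rated product: {best_product}")
```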

📊 Understanding describe()

For numeric data, the describe() method gives you 8 key statistics:

import pandas as pd

# Sample data
data = pd.DataFrame({
    'score': [65, 70, 85, 90, 75, 80, 95, 60, 88, 92]
})

print("Score Data:")
print(data)
print()

print("📊 Detailed Statistics:")
stats = data['score'].describe()
print(stats)
print()

print("🎯 What Each Statistic Means:")
print(f"Count: {stats['count']:.0f} (how many non-null values)")
print(f"Mean: {stats['mean']:.1f} (average)")
print(f"Std: {stats['std']:.1f} (how spread out)")
print(f"Min: {stats['min']} (lowest value)")
print(f"25%: {stats['25%']} (a quarter of the values fall at or below this)")
print(f"50%: {stats['50%']} (median: half of the values fall at or below this)")
print(f"75%: {stats['75%']} (three quarters of the values fall at or below this)")
print(f"Max: {stats['max']} (highest value)")
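
To convince yourself that describe() is just a bundle of methods you already know, you can compare its entries with the standalone calls. This is a rough sketch using the same scores stored in a Series. The percentiles argument at the end is optional, and the names scores and custom are illustrative:

```python
import math
import pandas as pd

scores = pd.Series([65, 70, 85, 90, 75, 80, 95, 60, 88, 92], name='score')
stats = scores.describe()

# Each describe() entry should agree with the matching standalone method
print(math.isclose(stats['mean'], scores.mean()))
print(math.isclose(stats['std'], scores.std()))        # sample std (ddof=1)
print(math.isclose(stats['50%'], scores.median()))
print(math.isclose(stats['25%'], scores.quantile(0.25)))

# Optional: ask for different percentiles (the median is always included)
custom = scores.describe(percentiles=[0.1, 0.9])
print(custom)
```

math.isclose() is used instead of == because comparing floating-point results for exact equality can be fragile.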

🔍 Non-Numerical Data Statistics

Text and categorical data have no mean or standard deviation. Instead, you count how often each value appears and how many distinct values there are:

import pandas as pd

# Survey data
survey = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Boston', 'LA'],
    'satisfaction': ['Good', 'Excellent', 'Good', 'Poor', 'Good', 'Excellent']
})

print("Survey Data:")
print(survey)
print()

print("🏙️ City Statistics:")
print(survey['city'].value_counts())
print()

print("😊 Satisfaction Statistics:")
print(survey['satisfaction'].value_counts())
print()

print("📊 Unique Values:")
print(f"Unique cities: {survey['city'].nunique()}")
print(f"Unique satisfaction levels: {survey['satisfaction'].nunique()}")
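
describe() also works on text columns, where it reports count, unique, top (the most frequent value), and freq (how often it appears). Here is a small sketch with the same survey data, plus value_counts(normalize=True) to turn counts into proportions. The variable names are illustrative:

```python
import pandas as pd

survey = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Boston', 'LA'],
    'satisfaction': ['Good', 'Excellent', 'Good', 'Poor', 'Good', 'Excellent']
})

# For a text column, describe() reports count, unique, top, and freq
satisfaction_summary = survey['satisfaction'].describe()
print(satisfaction_summary)

# normalize=True converts raw counts into fractions of the total
shares = survey['satisfaction'].value_counts(normalize=True)
print(shares)
```

When two values tie for most frequent (as NYC and LA do in the city column), which one shows up as top should not be relied on.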

📏 Size and Memory Information

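Understand how much space your data uses. memory_usage() reports bytes per column, and deep=True also measures the actual contents of text columns rather than just their references: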

import pandas as pd

# Create sample data
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 55000],
    'active': [True, False, True]
})

print("Employee Data:")
print(employees)
print()

print("📏 Size Information:")
print(f"Shape: {employees.shape}")
print(f"Size (total elements): {employees.size}")
print("Memory usage (bytes per column):")
print(employees.memory_usage(deep=True))
print()

print("📊 Data Types:")
print(employees.dtypes)
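
To get a single total instead of a per-column breakdown, you can sum the result of memory_usage(). The sketch below compares the default (shallow) measurement with deep=True. The exact byte counts depend on your pandas version and platform, so none are shown here, and the names shallow_total and deep_total are illustrative:

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 55000],
    'active': [True, False, True]
})

# Default measurement: may undercount text columns
shallow_total = employees.memory_usage().sum()

# deep=True inspects the actual string contents as well
deep_total = employees.memory_usage(deep=True).sum()

print(f"Shallow total: {shallow_total} bytes")
print(f"Deep total: {deep_total} bytes")

# The result includes an 'Index' entry alongside the columns
print(list(employees.memory_usage().index))
```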

🎯 Quick Statistical Comparisons

Compare statistics across different groups:

import pandas as pd

# Sales by department
sales = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales'],
    'revenue': [1000, 1500, 1200, 800, 1800, 1100]
})

print("Department Sales:")
print(sales)
print()

print("📊 Statistics by Department:")
dept_stats = sales.groupby('department')['revenue'].describe()
print(dept_stats)
print()

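print("💰 Average Revenue by Department:")
# groupby computes every department's mean in one pass;
# sort=False keeps departments in order of first appearance
dept_means = sales.groupby('department', sort=False)['revenue'].mean()
for dept, avg in dept_means.items():
    print(f"{dept}: ${avg:,.0f}")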
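
When describe() gives more columns than you need, agg() lets you pick just the statistics you care about for each group. This is a minimal sketch with the same department data. The choice of count, mean, and sum is only an example:

```python
import pandas as pd

sales = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales'],
    'revenue': [1000, 1500, 1200, 800, 1800, 1100]
})

# One row per department, one column per requested statistic
summary = sales.groupby('department')['revenue'].agg(['count', 'mean', 'sum'])
print(summary)

# Look up a single value by department and statistic name
print(f"IT total revenue: ${summary.loc['IT', 'sum']:,.0f}")
```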

📋 Essential Statistics Methods

| Method | What It Calculates | Example |
|---|---|---|
| .mean() | Average value | df['score'].mean() |
| .median() | Middle value | df['score'].median() |
| .min() | Smallest value | df['score'].min() |
| .max() | Largest value | df['score'].max() |
| .std() | Standard deviation | df['score'].std() |
| .count() | Number of non-null values | df['score'].count() |
| .sum() | Total | df['score'].sum() |
| .describe() | All key statistics | df.describe() |
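
The "non-null" part of .count() matters as soon as your data has gaps. Here is a small sketch with one missing score. The frame df and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
df = pd.DataFrame({'score': [80, 90, np.nan, 70]})

print(f"Rows (len): {len(df)}")                    # counts every row
print(f"Non-null (count): {df['score'].count()}")  # skips the missing value
print(f"Mean: {df['score'].mean():.1f}")           # missing values are ignored
print(f"Sum: {df['score'].sum():.0f}")
```

Comparing len() with .count() is a quick way to spot columns that contain missing values.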

🎯 Key Takeaways

🎮 Practice with Statistics

Let's practice calculating and interpreting statistics:

import pandas as pd

# Practice dataset
practice = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'sales': [100, 150, 200, 120, 180, 220],
    'rating': [4.1, 4.5, 4.8, 4.2, 4.6, 4.9]
})

print("📊 Practice Dataset:")
print(practice)
print()

print("1️⃣ Basic Shape:")
print(f"   Shape: {practice.shape}")
print()

print("2️⃣ Sales Statistics:")
print(f"   Average sales: {practice['sales'].mean():.0f}")
print(f"   Sales range: {practice['sales'].min()} - {practice['sales'].max()}")
print()

print("3️⃣ Rating Statistics:")
print(f"   Average rating: {practice['rating'].mean():.2f}")
print(f"   Best rating: {practice['rating'].max()}")
print()

print("4️⃣ Product Analysis:")
print(practice['product'].value_counts())

🚀 What's Next?

Excellent! You now understand data shape and basic statistics. Next, let's dive deeper into individual columns and understand different data types.

Continue to: Column Information and Data Types

You're mastering data exploration! 📏📊
