🗄️ Loading from Databases
Databases are incredibly powerful for data analysis! Instead of loading entire files into memory, you can query databases to get exactly the data you need. This makes working with large datasets much faster and more efficient.
🛠️ Basic Database Connection
First, let's create a simple SQLite database to practice with:
import pandas as pd
import sqlite3
# Create a simple SQLite database
conn = sqlite3.connect('sample.db')
# Create sample data
sample_data = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
'salary': [50000, 75000, 52000, 48000, 80000]
})
# Save DataFrame to database
sample_data.to_sql('employees', conn, if_exists='replace', index=False)
print("✅ Created sample database with employees table")
print(f"Saved {len(sample_data)} records")
# Close connection
conn.close()
📊 Loading Data from Database
Now let's load data from our database:
import pandas as pd
import sqlite3
# Connect to database
conn = sqlite3.connect('sample.db')
# Load all data
df = pd.read_sql('SELECT * FROM employees', conn)
print("All employees:")
print(df)
print()
# Load specific data with WHERE clause
sales_df = pd.read_sql('SELECT * FROM employees WHERE department = "Sales"', conn)
print("Sales department only:")
print(sales_df)
print()
# Load with calculations
summary = pd.read_sql("""
SELECT department,
COUNT(*) as employee_count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department
""", conn)
print("Department summary:")
print(summary)
conn.close()
🔍 Different Types of Queries
You can use SQL to filter and process data before loading:
import pandas as pd
import sqlite3
# Reconnect to our database
conn = sqlite3.connect('sample.db')
# Different query examples
queries = {
"High earners": "SELECT * FROM employees WHERE salary > 60000",
"IT department": "SELECT name, salary FROM employees WHERE department = 'IT'",
"Sorted by salary": "SELECT * FROM employees ORDER BY salary DESC",
"Name and department": "SELECT name, department FROM employees"
}
for description, query in queries.items():
print(f"📊 {description}:")
df = pd.read_sql(query, conn)
print(df)
print()
conn.close()
🌐 Connecting to Different Databases
Different databases need different connection methods:
SQLite (File-based)
import pandas as pd
import sqlite3
# SQLite - simple file database
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
conn.close()
print("✅ SQLite connection established and closed safely")
PostgreSQL / MySQL Examples
import pandas as pd
# Note: These require additional packages
# Example connection patterns (see InfoBox above for details)
print("Database connection patterns moved to InfoBox above")
print("Remember to install required packages and use environment variables for credentials")
🔧 Safe Database Connections
Always handle database connections safely:
import pandas as pd
import sqlite3
def safe_db_query(db_file, query):
"""Safely execute database query"""
try:
# Connect to database
conn = sqlite3.connect(db_file)
# Execute query
df = pd.read_sql(query, conn)
print(f"✅ Query successful: {len(df)} rows returned")
return df
except sqlite3.Error as e:
print(f"❌ Database error: {e}")
return None
except Exception as e:
print(f"❌ Error: {e}")
return None
finally:
# Always close connection
if 'conn' in locals():
conn.close()
# Test safe query
df = safe_db_query('sample.db', 'SELECT * FROM employees')
if df is not None:
print(df.head())
📋 Database Loading Best Practices
Practice | Why | Example |
---|---|---|
Use LIMIT | Don't load huge datasets | SELECT * FROM table LIMIT 1000 |
Filter with WHERE | Only get needed data | WHERE date > '2023-01-01' |
Close connections | Prevent database locks | conn.close() |
Handle errors | Graceful failure | try/except blocks |
Test queries first | Avoid long waits | Run in database tool first |
🎯 Common SQL Queries for Analysis
import pandas as pd
import sqlite3
# Useful SQL patterns for data analysis
conn = sqlite3.connect('sample.db')
queries = [
# Get recent data
"SELECT * FROM employees ORDER BY employee_id DESC LIMIT 5",
# Count by category
"SELECT department, COUNT(*) as count FROM employees GROUP BY department",
# Basic statistics
"SELECT AVG(salary) as avg_salary, MAX(salary) as max_salary FROM employees",
# Filter and sort
"SELECT name, salary FROM employees WHERE salary > 50000 ORDER BY salary"
]
for i, query in enumerate(queries, 1):
print(f"Query {i}:")
df = pd.read_sql(query, conn)
print(df)
print()
conn.close()
🎯 Key Takeaways
🚀 What's Next?
Great! You can now load data from databases. Next, let's learn how to handle common errors that occur when loading files and databases.
Continue to: Handling File Errors
You're mastering data loading! 🗄️💪
Was this helpful?
Track Your Learning Progress
Sign in to bookmark tutorials and keep track of your learning journey.
Your progress is saved automatically as you read.