Data Cleaning and Transformation with Pandas
Data cleaning is one of the most important and time-consuming tasks in data analysis. Real-world data is messy, incomplete, and often inconsistent. Pandas provides powerful tools to clean, transform, and prepare your data for analysis. This comprehensive guide covers essential data cleaning techniques that every data analyst should master.
Why Data Cleaning Matters
- Quality Analysis: Clean data leads to reliable insights and conclusions
- Accurate Results: Removes errors that could skew statistical analysis
- Efficiency: Well-structured data makes analysis faster and easier
- Reproducibility: Documented cleaning steps make analysis reproducible
- Machine Learning: Clean data is essential for effective model training
A widely cited rule of thumb is that data scientists spend about 80% of their time cleaning and preparing data, leaving only 20% for the actual analysis.
Common Data Quality Issues
Let's start by creating a messy dataset that demonstrates common problems:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a messy dataset
messy_data = {
    'customer_id': [1001, 1002, 1003, 1003, 1004, 1005, '', 1007, 1008, 1009],
    'name': ['Alice Johnson', 'bob smith', 'CHARLIE BROWN', 'Charlie Brown',
             'diana ross', None, 'Eve Wilson', 'Frank Miller', 'grace kelly', 'Henry Ford'],
    'age': [25, 35, 'forty-two', 42, 28, -5, 150, 30, 25, 65],
    'email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@gmail.com', 'charlie@gmail.com',
              'diana@', 'eve@email.com', 'invalid-email', 'frank@email.com',
              'grace@email.com', 'henry@email.com'],
    'phone': ['555-1234', '(555) 567-8900', '555.111.2222', '5551112222',
              '555-333-4444', '', '+1-555-555-5555', '555 666 7777',
              '5558889999', '555-000-0000'],
    'purchase_amount': [150.50, 250.75, None, 300.00, 75.25, 0, 999999,
                        125.00, 200.50, 50.00],
    'purchase_date': ['2023-01-15', '2023/02/20', '2023-03-25', '2023-03-25',
                      'invalid_date', '2023-05-10', '2023-06-15', '2023-07-20',
                      '2023-08-25', '2023-09-30'],
    'product_category': ['Electronics', 'electronics', 'CLOTHING', 'Clothing',
                         'Books', 'Electronics', '', 'books', 'Electronics', 'Clothing'],
    'rating': [5, 4, 3, 3, 5, 0, 10, 4, 5, 3]
}
df = pd.DataFrame(messy_data)
print("Original messy dataset:")
print(df)
print(f"\nDataset shape: {df.shape}")
print(f"Data types:\n{df.dtypes}")
This dataset contains multiple issues:
- Duplicate records
- Missing values (None, empty strings)
- Inconsistent formatting (names, emails)
- Invalid data (negative age, future dates)
- Inconsistent categories
- Data type issues
Step 1: Initial Data Assessment
def assess_data_quality(df):
    """
    Comprehensive data quality assessment
    """
    print("=== DATA QUALITY ASSESSMENT ===")

    # Basic information
    print(f"Dataset shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

    # Missing values analysis
    print("\n--- Missing Values ---")
    missing_counts = df.isnull().sum()
    missing_percentages = (missing_counts / len(df)) * 100
    missing_summary = pd.DataFrame({
        'Missing Count': missing_counts,
        'Missing %': missing_percentages
    })
    print(missing_summary[missing_summary['Missing Count'] > 0])

    # Data types
    print("\n--- Data Types ---")
    print(df.dtypes)

    # Unique values per column
    print("\n--- Unique Values ---")
    for col in df.columns:
        unique_count = df[col].nunique()
        total_count = len(df)
        print(f"{col}: {unique_count}/{total_count} unique values ({unique_count/total_count*100:.1f}%)")

    # Potential duplicates
    print("\n--- Duplicates ---")
    duplicate_rows = df.duplicated().sum()
    print(f"Duplicate rows: {duplicate_rows}")

    # Value ranges for numeric columns
    print("\n--- Numeric Ranges ---")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        print(f"{col}: min={df[col].min()}, max={df[col].max()}")

    return missing_summary
# Assess our messy data
quality_report = assess_data_quality(df)
Step 2: Handling Missing Values
print("=== HANDLING MISSING VALUES ===")
# Identify different types of missing values
def identify_missing_patterns(df):
    """
    Identify different patterns of missing data
    """
    print("--- Missing Value Patterns ---")

    # Standard missing values (NaN, None)
    standard_missing = df.isnull().sum()
    print("Standard missing values:")
    print(standard_missing[standard_missing > 0])

    # Empty strings
    empty_strings = df.eq('').sum()
    print("\nEmpty string values:")
    print(empty_strings[empty_strings > 0])

    # Whitespace-only strings
    whitespace_only = df.apply(lambda x: x.astype(str).str.strip().eq('').sum() if x.dtype == 'object' else 0)
    print("\nWhitespace-only values:")
    print(whitespace_only[whitespace_only > 0])
identify_missing_patterns(df)
# Clean the dataset step by step
df_clean = df.copy()
# Replace empty strings with NaN for consistent handling
df_clean = df_clean.replace('', np.nan)
print(f"\nAfter replacing empty strings with NaN:")
print(f"Missing values:\n{df_clean.isnull().sum()}")
# Handle missing values by column
print("\n--- Handling Missing Values by Column ---")
# Customer ID: Drop rows with missing customer_id (required field)
print(f"Rows before dropping missing customer_id: {len(df_clean)}")
df_clean = df_clean.dropna(subset=['customer_id'])
print(f"Rows after: {len(df_clean)}")
# Name: Fill missing names with placeholder
df_clean['name'] = df_clean['name'].fillna('Unknown Customer')
# Age: still contains mixed types (e.g. 'forty-two'), so coerce to numeric just to
# compute the median for reference; the actual cleaning and filling happen in Step 3
median_age = pd.to_numeric(df_clean['age'], errors='coerce').median()
print(f"Median age for filling missing values: {median_age}")
# Purchase amount: Fill with 0 (assuming no purchase)
df_clean['purchase_amount'] = df_clean['purchase_amount'].fillna(0)
# Product category: Fill with 'Uncategorized'
df_clean['product_category'] = df_clean['product_category'].fillna('Uncategorized')
print("\nAfter handling missing values:")
print(df_clean.isnull().sum())
Step 3: Data Type Conversion and Validation
print("=== DATA TYPE CONVERSION AND VALIDATION ===")
# Convert customer_id to proper integer
df_clean['customer_id'] = pd.to_numeric(df_clean['customer_id'], errors='coerce').astype('Int64')
# Handle age column with mixed data types
def clean_age_column(age_series):
    """
    Clean age column with mixed data types
    """
    import re

    def convert_age(age_val):
        if pd.isna(age_val):
            return np.nan
        age_str = str(age_val).strip().lower()
        # Handle text representations
        text_to_num = {
            'forty-two': 42, 'forty two': 42,
            'thirty': 30, 'twenty': 20,
            # Add more as needed
        }
        if age_str in text_to_num:
            return text_to_num[age_str]
        # Try to extract numbers from string
        numbers = re.findall(r'\d+', age_str)
        if numbers:
            return int(numbers[0])
        return np.nan

    return age_series.apply(convert_age)
# Apply age cleaning
df_clean['age'] = clean_age_column(df_clean['age'])
print("Age column after cleaning:")
print(df_clean['age'].value_counts().sort_index())
# Validate age ranges
def validate_age(age):
    """Validate age is within reasonable range"""
    if pd.isna(age):
        return np.nan
    age = float(age)
    if 0 <= age <= 120:
        return age
    else:
        return np.nan  # Invalid ages become missing
df_clean['age'] = df_clean['age'].apply(validate_age)
# Fill remaining missing ages with median
age_median = df_clean['age'].median()
df_clean['age'] = df_clean['age'].fillna(age_median)
print(f"Age statistics after validation:")
print(df_clean['age'].describe())
# Convert purchase_amount to float
df_clean['purchase_amount'] = pd.to_numeric(df_clean['purchase_amount'], errors='coerce')
# Validate purchase amounts (remove obvious outliers)
def validate_purchase_amount(amount):
    """Validate purchase amount is reasonable"""
    if pd.isna(amount) or amount < 0:
        return 0
    elif amount > 10000:  # Assuming $10k is our upper limit
        return np.nan  # Flag for review
    else:
        return amount
df_clean['purchase_amount'] = df_clean['purchase_amount'].apply(validate_purchase_amount)
print("Purchase amount after validation:")
print(df_clean['purchase_amount'].describe())
Step 4: String Data Cleaning
print("=== STRING DATA CLEANING ===")
# Clean name column
def clean_name(name):
    """
    Standardize name formatting
    """
    import re
    if pd.isna(name) or name == 'Unknown Customer':
        return name
    # Convert to title case and strip whitespace
    cleaned = str(name).strip().title()
    # Collapse multiple spaces into one
    cleaned = re.sub(r'\s+', ' ', cleaned)
    return cleaned
df_clean['name'] = df_clean['name'].apply(clean_name)
print("Names after cleaning:")
print(df_clean['name'].tolist())
# Clean and validate email addresses
def clean_email(email):
    """
    Clean and validate email addresses
    """
    import re
    if pd.isna(email):
        return np.nan
    email = str(email).strip().lower()
    # Basic email validation pattern
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    if re.match(email_pattern, email):
        return email
    else:
        return np.nan  # Invalid emails become missing
df_clean['email'] = df_clean['email'].apply(clean_email)
print("\nValid emails after cleaning:")
print(df_clean[df_clean['email'].notna()]['email'].tolist())
# Clean phone numbers
def clean_phone(phone):
    """
    Standardize phone number format
    """
    import re
    if pd.isna(phone):
        return np.nan
    # Remove all non-digit characters
    digits_only = re.sub(r'\D', '', str(phone))
    # Handle US phone numbers (10 digits, optionally with country code)
    if len(digits_only) == 10:
        return f"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}"
    elif len(digits_only) == 11 and digits_only.startswith('1'):
        digits_only = digits_only[1:]  # Remove country code
        return f"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}"
    else:
        return np.nan  # Invalid phone numbers
df_clean['phone'] = df_clean['phone'].apply(clean_phone)
print("\nPhone numbers after cleaning:")
print(df_clean[df_clean['phone'].notna()]['phone'].tolist())
# Clean product categories
def standardize_category(category):
    """
    Standardize product category names
    """
    if pd.isna(category):
        return 'Uncategorized'
    category = str(category).strip().lower()
    # Standardization mapping
    category_mapping = {
        'electronics': 'Electronics',
        'clothing': 'Clothing',
        'books': 'Books',
        'book': 'Books',
        'electronic': 'Electronics',
        'cloth': 'Clothing',
        'uncategorized': 'Uncategorized'
    }
    return category_mapping.get(category, category.title())
df_clean['product_category'] = df_clean['product_category'].apply(standardize_category)
print("\nProduct categories after standardization:")
print(df_clean['product_category'].value_counts())
Step 5: Date and Time Cleaning
print("=== DATE AND TIME CLEANING ===")
def clean_date_column(date_series):
    """
    Clean and standardize date column
    """
    def parse_date(date_val):
        if pd.isna(date_val):
            return pd.NaT
        date_str = str(date_val).strip()
        # Try different date formats
        date_formats = [
            '%Y-%m-%d',
            '%Y/%m/%d',
            '%m/%d/%Y',
            '%m-%d-%Y',
            '%d/%m/%Y',
            '%d-%m-%Y'
        ]
        for fmt in date_formats:
            try:
                return pd.to_datetime(date_str, format=fmt)
            except ValueError:
                continue
        # If none of the formats work, return NaT
        return pd.NaT

    return date_series.apply(parse_date)
# Clean purchase dates
df_clean['purchase_date'] = clean_date_column(df_clean['purchase_date'])
print("Purchase dates after cleaning:")
print(df_clean[df_clean['purchase_date'].notna()]['purchase_date'])
# Validate date ranges (shouldn't be in the future)
today = pd.Timestamp.now().normalize()
future_dates = df_clean['purchase_date'] > today
if future_dates.any():
    print(f"\nFound {future_dates.sum()} future dates - setting to NaT")
    df_clean.loc[future_dates, 'purchase_date'] = pd.NaT
# Add derived date features
df_clean['purchase_year'] = df_clean['purchase_date'].dt.year
df_clean['purchase_month'] = df_clean['purchase_date'].dt.month
df_clean['days_since_purchase'] = (today - df_clean['purchase_date']).dt.days
print("\nDate-derived features:")
print(df_clean[['purchase_date', 'purchase_year', 'purchase_month', 'days_since_purchase']].head())
Step 6: Handling Duplicates
print("=== HANDLING DUPLICATES ===")
# Identify duplicates
print("Checking for duplicates...")
print(f"Total rows: {len(df_clean)}")
print(f"Duplicate rows (all columns): {df_clean.duplicated().sum()}")
# Check duplicates based on key columns
key_columns = ['customer_id', 'email']
duplicate_customers = df_clean.duplicated(subset=key_columns, keep=False)
print(f"Duplicate customers (based on ID/email): {duplicate_customers.sum()}")
if duplicate_customers.any():
    print("\nDuplicate customer records:")
    duplicates_df = df_clean[duplicate_customers].sort_values(['customer_id', 'email'])
    print(duplicates_df[['customer_id', 'name', 'email', 'purchase_date', 'purchase_amount']])
# Strategy: Keep the most recent purchase for each customer
df_clean = df_clean.sort_values('purchase_date', ascending=False)
df_clean = df_clean.drop_duplicates(subset=key_columns, keep='first')
print(f"\nAfter removing duplicates: {len(df_clean)} rows")
# Reset index after dropping duplicates
df_clean = df_clean.reset_index(drop=True)
Step 7: Outlier Detection and Treatment
print("=== OUTLIER DETECTION AND TREATMENT ===")
def detect_outliers_iqr(series, k=1.5):
    """
    Detect outliers using the interquartile range (IQR) method
    """
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - k * IQR
    upper_bound = Q3 + k * IQR
    outliers = (series < lower_bound) | (series > upper_bound)
    return outliers, lower_bound, upper_bound
# Check for outliers in numerical columns
numerical_columns = ['age', 'purchase_amount', 'rating']
for col in numerical_columns:
    if col in df_clean.columns:
        outliers, lower_bound, upper_bound = detect_outliers_iqr(df_clean[col])
        print(f"\n--- {col.upper()} Outliers ---")
        print(f"Valid range: {lower_bound:.2f} to {upper_bound:.2f}")
        print(f"Outliers found: {outliers.sum()}")
        if outliers.any():
            print("Outlier values:")
            print(df_clean[outliers][col].tolist())

        # Treatment strategies
        if col == 'rating':
            # Rating should be 1-5, cap at boundaries
            df_clean.loc[df_clean[col] < 1, col] = 1
            df_clean.loc[df_clean[col] > 5, col] = 5
            print("Capped rating values to 1-5 range")
        elif col == 'age':
            # For age, we already handled this in validation
            pass
        elif col == 'purchase_amount':
            # For purchase amount, cap extreme values or mark for review
            extreme_purchases = df_clean[col] > 5000
            if extreme_purchases.any():
                print(f"Flagging {extreme_purchases.sum()} high-value purchases for review")
                df_clean['high_value_flag'] = extreme_purchases
print("\nFinal outlier summary:")
for col in numerical_columns:
    if col in df_clean.columns:
        print(f"{col}: min={df_clean[col].min():.2f}, max={df_clean[col].max():.2f}")
Step 8: Data Consistency Checks
print("=== DATA CONSISTENCY CHECKS ===")
def run_consistency_checks(df):
    """
    Run various consistency checks on the cleaned data
    """
    issues = []

    # Check 1: Customer ID uniqueness
    if df['customer_id'].duplicated().any():
        issues.append("Duplicate customer IDs found")

    # Check 2: Email format consistency
    invalid_emails = df[df['email'].notna()]['email'].apply(
        lambda x: '@' not in str(x) or '.' not in str(x).split('@')[-1]
    ).sum()
    if invalid_emails > 0:
        issues.append(f"{invalid_emails} emails with invalid format")

    # Check 3: Age reasonableness
    unreasonable_ages = ((df['age'] < 0) | (df['age'] > 120)).sum()
    if unreasonable_ages > 0:
        issues.append(f"{unreasonable_ages} unreasonable age values")

    # Check 4: Purchase amount reasonableness
    negative_purchases = (df['purchase_amount'] < 0).sum()
    if negative_purchases > 0:
        issues.append(f"{negative_purchases} negative purchase amounts")

    # Check 5: Rating range
    invalid_ratings = ((df['rating'] < 1) | (df['rating'] > 5)).sum()
    if invalid_ratings > 0:
        issues.append(f"{invalid_ratings} ratings outside 1-5 range")

    # Check 6: Date consistency
    future_dates = (df['purchase_date'] > pd.Timestamp.now()).sum()
    if future_dates > 0:
        issues.append(f"{future_dates} future purchase dates")

    return issues
# Run consistency checks
consistency_issues = run_consistency_checks(df_clean)
print("Consistency check results:")
if consistency_issues:
    for issue in consistency_issues:
        print(f"⚠️ {issue}")
else:
    print("✅ All consistency checks passed!")
# Data quality score
def calculate_data_quality_score(df):
    """
    Calculate overall data quality score
    """
    total_cells = df.shape[0] * df.shape[1]
    missing_cells = df.isnull().sum().sum()
    completeness = (total_cells - missing_cells) / total_cells

    # Additional quality factors
    duplicate_rate = df.duplicated().sum() / len(df)

    # Simple weighted quality score (0-100)
    quality_score = (completeness * 0.7 + (1 - duplicate_rate) * 0.3) * 100
    return quality_score, completeness, duplicate_rate
quality_score, completeness, duplicate_rate = calculate_data_quality_score(df_clean)
print(f"\n=== DATA QUALITY METRICS ===")
print(f"Overall Quality Score: {quality_score:.1f}/100")
print(f"Completeness: {completeness:.1%}")
print(f"Duplicate Rate: {duplicate_rate:.1%}")
Step 9: Final Data Validation and Export
print("=== FINAL DATA VALIDATION ===")
# Final dataset summary
def summarize_cleaned_data(df):
    """
    Provide a comprehensive summary of the cleaned data
    """
    print("--- CLEANED DATASET SUMMARY ---")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

    print("\n--- COLUMN SUMMARY ---")
    for col in df.columns:
        dtype = df[col].dtype
        non_null = df[col].count()
        null_pct = (df[col].isnull().sum() / len(df)) * 100
        unique = df[col].nunique()
        print(f"{col}: {dtype} | {non_null}/{len(df)} non-null ({100-null_pct:.1f}%) | {unique} unique")

    print("\n--- SAMPLE DATA ---")
    print(df.head())
    return df
# Summarize our cleaned data
df_final = summarize_cleaned_data(df_clean)
# Create data cleaning report
def create_cleaning_report(original_df, cleaned_df):
    """
    Create a comprehensive data cleaning report
    """
    report = {
        'original_shape': original_df.shape,
        'cleaned_shape': cleaned_df.shape,
        'rows_removed': original_df.shape[0] - cleaned_df.shape[0],
        'columns_added': cleaned_df.shape[1] - original_df.shape[1],
        'original_missing': original_df.isnull().sum().sum(),
        'cleaned_missing': cleaned_df.isnull().sum().sum(),
        'missing_reduction': original_df.isnull().sum().sum() - cleaned_df.isnull().sum().sum()
    }

    print("\n=== DATA CLEANING REPORT ===")
    print(f"Original dataset: {report['original_shape'][0]} rows × {report['original_shape'][1]} columns")
    print(f"Cleaned dataset: {report['cleaned_shape'][0]} rows × {report['cleaned_shape'][1]} columns")
    print(f"Rows removed: {report['rows_removed']}")
    print(f"Columns added: {report['columns_added']}")
    print(f"Missing values reduced: {report['missing_reduction']} ({report['missing_reduction']/report['original_missing']*100:.1f}%)")
    return report
# Generate cleaning report
cleaning_report = create_cleaning_report(df, df_clean)
# Save cleaned data
print("\n=== EXPORTING CLEANED DATA ===")
# Export to multiple formats
df_clean.to_csv('cleaned_customer_data.csv', index=False)
print("✅ Exported to CSV")

df_clean.to_excel('cleaned_customer_data.xlsx', index=False)
print("✅ Exported to Excel")
# Save cleaning metadata
metadata = {
    'cleaning_date': datetime.now().isoformat(),
    'original_rows': len(df),
    'cleaned_rows': len(df_clean),
    'quality_score': quality_score,
    'cleaning_steps': [
        'Replaced empty strings with NaN',
        'Converted data types',
        'Standardized name formatting',
        'Validated and cleaned email addresses',
        'Standardized phone number format',
        'Cleaned and validated dates',
        'Removed duplicates',
        'Handled outliers',
        'Added derived features'
    ]
}
import json
with open('cleaning_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print("✅ Saved cleaning metadata")
Advanced Data Cleaning Techniques
Custom Validation Rules
def create_validation_rules():
    """
    Define custom validation rules for different data types
    """
    rules = {
        'customer_id': {
            'required': True,
            'type': 'integer',
            'min': 1000,
            'max': 999999
        },
        'email': {
            'required': False,
            'type': 'string',
            'pattern': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        },
        'age': {
            'required': True,
            'type': 'integer',
            'min': 0,
            'max': 120
        },
        'purchase_amount': {
            'required': True,
            'type': 'float',
            'min': 0,
            'max': 10000
        },
        'rating': {
            'required': False,
            'type': 'integer',
            'min': 1,
            'max': 5
        }
    }
    return rules
def validate_against_rules(df, rules):
    """
    Validate DataFrame against custom rules
    """
    import re
    validation_results = {}

    for column, rule in rules.items():
        if column not in df.columns:
            validation_results[column] = {'error': 'Column not found'}
            continue

        col_results = {'passed': 0, 'failed': 0, 'issues': []}
        for idx, value in df[column].items():
            # Check if required
            if rule.get('required', False) and pd.isna(value):
                col_results['failed'] += 1
                col_results['issues'].append(f"Row {idx}: Required field is missing")
                continue
            # Skip validation for optional missing values
            if pd.isna(value) and not rule.get('required', False):
                col_results['passed'] += 1
                continue
            # Type and range validation
            try:
                if rule['type'] == 'integer':
                    val = int(float(value))
                    if val < rule.get('min', float('-inf')) or val > rule.get('max', float('inf')):
                        col_results['failed'] += 1
                        col_results['issues'].append(f"Row {idx}: Value {val} outside range [{rule.get('min')}, {rule.get('max')}]")
                    else:
                        col_results['passed'] += 1
                elif rule['type'] == 'float':
                    val = float(value)
                    if val < rule.get('min', float('-inf')) or val > rule.get('max', float('inf')):
                        col_results['failed'] += 1
                        col_results['issues'].append(f"Row {idx}: Value {val} outside range [{rule.get('min')}, {rule.get('max')}]")
                    else:
                        col_results['passed'] += 1
                elif rule['type'] == 'string':
                    if 'pattern' in rule:
                        if not re.match(rule['pattern'], str(value)):
                            col_results['failed'] += 1
                            col_results['issues'].append(f"Row {idx}: Value doesn't match pattern")
                        else:
                            col_results['passed'] += 1
                    else:
                        col_results['passed'] += 1
            except (ValueError, TypeError):
                col_results['failed'] += 1
                col_results['issues'].append(f"Row {idx}: Invalid data type")

        validation_results[column] = col_results

    return validation_results
# Apply validation rules
validation_rules = create_validation_rules()
validation_results = validate_against_rules(df_clean, validation_rules)
print("=== VALIDATION RESULTS ===")
for column, results in validation_results.items():
    if 'error' in results:
        print(f"{column}: {results['error']}")
    else:
        total = results['passed'] + results['failed']
        pass_rate = results['passed'] / total * 100 if total > 0 else 0
        print(f"{column}: {pass_rate:.1f}% pass rate ({results['passed']}/{total})")
        if results['issues']:
            print(f"  Issues: {len(results['issues'])}")
            for issue in results['issues'][:3]:  # Show first 3 issues
                print(f"  - {issue}")
            if len(results['issues']) > 3:
                print(f"  ... and {len(results['issues']) - 3} more")
Automated Data Profiling
def create_data_profile(df):
    """
    Create a comprehensive data profile
    """
    profile = {
        'dataset_info': {
            'name': 'Customer Purchase Data',
            'rows': len(df),
            'columns': len(df.columns),
            'memory_usage_mb': df.memory_usage(deep=True).sum() / (1024**2)
        },
        'columns': {}
    }

    for col in df.columns:
        col_profile = {
            'dtype': str(df[col].dtype),
            'non_null_count': df[col].count(),
            'null_count': df[col].isnull().sum(),
            'null_percentage': (df[col].isnull().sum() / len(df)) * 100,
            'unique_count': df[col].nunique(),
            'unique_percentage': (df[col].nunique() / len(df)) * 100
        }

        # Add type-specific statistics
        if df[col].dtype in ['int64', 'float64', 'Int64']:
            col_profile.update({
                'min': df[col].min(),
                'max': df[col].max(),
                'mean': df[col].mean(),
                'median': df[col].median(),
                'std': df[col].std(),
                'q25': df[col].quantile(0.25),
                'q75': df[col].quantile(0.75)
            })
        elif df[col].dtype == 'object':
            col_profile.update({
                'avg_length': df[col].astype(str).str.len().mean(),
                'max_length': df[col].astype(str).str.len().max(),
                'most_common': df[col].value_counts().head(3).to_dict()
            })
        elif df[col].dtype.name.startswith('datetime'):
            col_profile.update({
                'min_date': df[col].min(),
                'max_date': df[col].max(),
                'date_range_days': (df[col].max() - df[col].min()).days if df[col].count() > 0 else 0
            })

        profile['columns'][col] = col_profile

    return profile
# Create and display data profile
data_profile = create_data_profile(df_clean)
print("=== DATA PROFILE SUMMARY ===")
print(f"Dataset: {data_profile['dataset_info']['name']}")
print(f"Dimensions: {data_profile['dataset_info']['rows']} rows × {data_profile['dataset_info']['columns']} columns")
print(f"Memory usage: {data_profile['dataset_info']['memory_usage_mb']:.2f} MB")
print("\n--- COLUMN PROFILES ---")
for col, profile in data_profile['columns'].items():
    print(f"\n{col.upper()}:")
    print(f"  Type: {profile['dtype']}")
    print(f"  Completeness: {100 - profile['null_percentage']:.1f}%")
    print(f"  Unique values: {profile['unique_count']} ({profile['unique_percentage']:.1f}%)")
    if 'mean' in profile:
        print(f"  Range: {profile['min']:.2f} - {profile['max']:.2f}")
        print(f"  Mean: {profile['mean']:.2f}, Median: {profile['median']:.2f}")
    elif 'most_common' in profile:
        print(f"  Avg length: {profile['avg_length']:.1f} chars")
        print(f"  Most common: {list(profile['most_common'].keys())[:2]}")
    elif 'date_range_days' in profile:
        print(f"  Date range: {profile['date_range_days']} days")
# Save data profile
import json

# Convert numpy types to Python types for JSON serialization
def convert_numpy_types(obj):
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    return str(obj)  # fall back to strings for anything else (e.g. Timestamps)

with open('data_profile.json', 'w') as f:
    json.dump(data_profile, f, indent=2, default=convert_numpy_types)
print("\n✅ Data profile saved to data_profile.json")
Data Cleaning Best Practices
1. Document Everything
# Create a cleaning log
cleaning_log = [
    {
        'step': 1,
        'action': 'Replace empty strings with NaN',
        'columns_affected': ['customer_id', 'product_category'],
        'rows_affected': 2,
        'reason': 'Standardize missing value representation'
    },
    {
        'step': 2,
        'action': 'Remove duplicate records',
        'columns_affected': ['customer_id', 'email'],
        'rows_affected': 1,
        'reason': 'Prevent double counting in analysis'
    },
    {
        'step': 3,
        'action': 'Standardize name format',
        'columns_affected': ['name'],
        'rows_affected': 10,
        'reason': 'Ensure consistent capitalization'
    }
    # Add more steps as needed
]

# Save cleaning log
cleaning_df = pd.DataFrame(cleaning_log)
cleaning_df.to_csv('data_cleaning_log.csv', index=False)
print("✅ Cleaning log saved")
2. Create Reusable Cleaning Functions
class DataCleaner:
    """
    Reusable data cleaning class
    """
    def __init__(self):
        self.cleaning_log = []

    def log_action(self, action, columns_affected=None, rows_affected=0, reason=""):
        """Log a cleaning action"""
        self.cleaning_log.append({
            'timestamp': datetime.now().isoformat(),
            'action': action,
            'columns_affected': columns_affected or [],
            'rows_affected': rows_affected,
            'reason': reason
        })

    def standardize_names(self, df, name_column):
        """Standardize name formatting"""
        original_count = df[name_column].count()
        df[name_column] = df[name_column].apply(
            lambda x: str(x).strip().title() if pd.notna(x) else x
        )
        self.log_action(
            'Standardize name format',
            [name_column],
            original_count,
            'Ensure consistent capitalization'
        )
        return df

    def clean_email_addresses(self, df, email_column):
        """Clean and validate email addresses"""
        import re
        email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

        def clean_email(email):
            if pd.isna(email):
                return np.nan
            email = str(email).strip().lower()
            return email if re.match(email_pattern, email) else np.nan

        original_valid = df[email_column].notna().sum()
        df[email_column] = df[email_column].apply(clean_email)
        new_valid = df[email_column].notna().sum()
        self.log_action(
            'Clean email addresses',
            [email_column],
            original_valid - new_valid,
            'Remove invalid email formats'
        )
        return df

    def remove_outliers_iqr(self, df, column, k=1.5):
        """Remove outliers using the IQR method"""
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - k * IQR
        upper_bound = Q3 + k * IQR
        outliers = (df[column] < lower_bound) | (df[column] > upper_bound)
        outlier_count = outliers.sum()
        df = df[~outliers].reset_index(drop=True)
        self.log_action(
            'Remove outliers (IQR method)',
            [column],
            outlier_count,
            f'Values outside {lower_bound:.2f} - {upper_bound:.2f} range'
        )
        return df

    def get_cleaning_summary(self):
        """Get a summary of cleaning actions"""
        return pd.DataFrame(self.cleaning_log)
# Example usage
cleaner = DataCleaner()
df_example = df.copy()
# Apply cleaning functions
df_example = cleaner.standardize_names(df_example, 'name')
df_example = cleaner.clean_email_addresses(df_example, 'email')
# Get cleaning summary
cleaning_summary = cleaner.get_cleaning_summary()
print("Cleaning summary:")
print(cleaning_summary)
Key Takeaways
Essential Data Cleaning Steps:
- Assess data quality - Understand what you're working with
- Handle missing values - Choose appropriate strategies for each column
- Validate data types - Ensure columns have correct types
- Standardize formats - Make data consistent across records
- Remove duplicates - Prevent double counting and analysis errors
- Handle outliers - Decide whether to remove, cap, or flag extreme values
- Validate consistency - Ensure data makes logical sense
- Document everything - Keep track of all cleaning steps
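The essential steps above can be condensed into a single pipeline function. The sketch below is a deliberately minimal illustration, not a replacement for the full step-by-step cleaning in this guide; the column names (`customer_id`, `age`) mirror the tutorial's example and the simple `keep='first'` dedup is an assumption:

```python
import pandas as pd
import numpy as np

def clean_pipeline(df):
    """Minimal cleaning pipeline: missing values, types, ranges, duplicates."""
    df = df.copy()                                        # work on a copy
    df = df.replace('', np.nan)                           # standardize missing values
    df = df.dropna(subset=['customer_id'])                # required field
    df['age'] = pd.to_numeric(df['age'], errors='coerce')  # coerce bad values to NaN
    df.loc[~df['age'].between(0, 120), 'age'] = np.nan    # invalid ages become missing
    df = df.drop_duplicates(subset=['customer_id'], keep='first')
    return df.reset_index(drop=True)

raw = pd.DataFrame({
    'customer_id': [1, 2, 2, ''],
    'email': ['a@x.com', 'b@x.com', 'b@x.com', 'c@x.com'],
    'age': [25, 'forty', 30, 40],
})
clean = clean_pipeline(raw)
```

Running each dataset through one function like this keeps the cleaning order explicit and makes the whole process easy to rerun when new data arrives.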
Best Practices:
- Always work on a copy of the original data
- Document every cleaning decision and the reasoning behind it
- Validate results after each cleaning step
- Create reusable functions for common cleaning tasks
- Generate data quality reports to track improvements
- Save intermediate versions of cleaned data
- Test your cleaning pipeline on different datasets
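One of these practices, validating results after each cleaning step, can be made concrete with plain assertions between steps. A small illustrative sketch (the `amount` column and the expected counts are invented for the example):

```python
import pandas as pd

# Toy data: one value that will fail numeric parsing
df_step = pd.DataFrame({'amount': ['10', '20', 'bad', '30']})

# Step: convert to numeric, then assert the step behaved as expected
before = len(df_step)
df_step['amount'] = pd.to_numeric(df_step['amount'], errors='coerce')
assert len(df_step) == before, "conversion must not drop rows"
assert df_step['amount'].isna().sum() == 1, "exactly one value expected to fail parsing"

# Step: drop unparseable rows, then re-validate
df_step = df_step.dropna(subset=['amount'])
assert df_step['amount'].notna().all(), "no missing amounts should remain"
```

If an upstream change silently alters the data, one of these assertions fails immediately instead of corrupting every downstream result.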
Common Pitfalls to Avoid:
- Removing too much data without understanding the impact
- Making assumptions about data without validation
- Not documenting cleaning decisions
- Applying the same cleaning rules to different contexts
- Not preserving original data for reference
- Ignoring domain-specific validation rules
What's Next?
Now that you've mastered data cleaning, explore:
- Exploratory Data Analysis with Pandas
- Statistical Operations
- Data Visualization with Pandas
- Advanced Data Manipulation
Clean data is the foundation of reliable analysis. Invest time in thorough data cleaning - it will pay dividends in the quality and trustworthiness of your insights.
