Dev In The Mountain
A developer in the mountains having fun

Data Cleaning and Transformation with Pandas

Data cleaning is one of the most important and time-consuming tasks in data analysis. Real-world data is messy, incomplete, and often inconsistent. Pandas provides powerful tools to clean, transform, and prepare your data for analysis. This comprehensive guide covers essential data cleaning techniques that every data analyst should master.

Why Data Cleaning Matters

  • Quality Analysis: Clean data leads to reliable insights and conclusions
  • Accurate Results: Removes errors that could skew statistical analysis
  • Efficiency: Well-structured data makes analysis faster and easier
  • Reproducibility: Documented cleaning steps make analysis reproducible
  • Machine Learning: Clean data is essential for effective model training

A commonly cited estimate is that data scientists spend around 80% of their time cleaning and preparing data, leaving only 20% for the actual analysis.

Common Data Quality Issues

Let's start by creating a messy dataset that demonstrates common problems:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create a messy dataset
messy_data = {
    'customer_id': [1001, 1002, 1003, 1003, 1004, 1005, '', 1007, 1008, 1009],
    'name': ['Alice Johnson', 'bob smith', 'CHARLIE BROWN', 'Charlie Brown', 
             'diana ross', None, 'Eve Wilson', 'Frank Miller', 'grace kelly', 'Henry Ford'],
    'age': [25, 35, 'forty-two', 42, 28, -5, 150, 30, 25, 65],
    'email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@gmail.com', 'charlie@gmail.com',
              'diana@', 'eve@email.com', 'invalid-email', 'frank@email.com', 
              'grace@email.com', 'henry@email.com'],
    'phone': ['555-1234', '(555) 567-8900', '555.111.2222', '5551112222', 
              '555-333-4444', '', '+1-555-555-5555', '555 666 7777', 
              '5558889999', '555-000-0000'],
    'purchase_amount': [150.50, 250.75, None, 300.00, 75.25, 0, 999999, 
                       125.00, 200.50, 50.00],
    'purchase_date': ['2023-01-15', '2023/02/20', '2023-03-25', '2023-03-25',
                     'invalid_date', '2023-05-10', '2023-06-15', '2023-07-20',
                     '2023-08-25', '2023-09-30'],
    'product_category': ['Electronics', 'electronics', 'CLOTHING', 'Clothing',
                        'Books', 'Electronics', '', 'books', 'Electronics', 'Clothing'],
    'rating': [5, 4, 3, 3, 5, 0, 10, 4, 5, 3]
}

df = pd.DataFrame(messy_data)
print("Original messy dataset:")
print(df)
print(f"\nDataset shape: {df.shape}")
print(f"Data types:\n{df.dtypes}")

This dataset contains multiple issues:

  • Duplicate records
  • Missing values (None, empty strings)
  • Inconsistent formatting (names, emails)
  • Invalid values (negative and impossible ages, malformed dates, out-of-range amounts and ratings)
  • Inconsistent categories
  • Data type issues
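
Each of these problems can be surfaced with a line or two of pandas before any cleaning begins. A minimal, self-contained sketch on a toy frame (not the customer dataset above):

```python
import pandas as pd

# Toy frame with the same kinds of problems: a duplicate row,
# a missing id, and non-numeric values in a numeric column
toy = pd.DataFrame({
    'id': [1, 2, 2, None],
    'age': ['25', 'forty', 'forty', '28'],
})

print(toy.duplicated().sum())        # 1 — exact duplicate rows
print(toy['id'].isna().sum())        # 1 — missing values
# errors='coerce' turns unparseable entries into NaN, exposing type problems
print(pd.to_numeric(toy['age'], errors='coerce').isna().sum())  # 2
```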

Step 1: Initial Data Assessment

def assess_data_quality(df):
    """
    Comprehensive data quality assessment
    """
    print("=== DATA QUALITY ASSESSMENT ===")
    
    # Basic information
    print(f"Dataset shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
    
    # Missing values analysis
    print("\n--- Missing Values ---")
    missing_counts = df.isnull().sum()
    missing_percentages = (missing_counts / len(df)) * 100
    missing_summary = pd.DataFrame({
        'Missing Count': missing_counts,
        'Missing %': missing_percentages
    })
    print(missing_summary[missing_summary['Missing Count'] > 0])
    
    # Data types
    print("\n--- Data Types ---")
    print(df.dtypes)
    
    # Unique values per column
    print("\n--- Unique Values ---")
    for col in df.columns:
        unique_count = df[col].nunique()
        total_count = len(df)
        print(f"{col}: {unique_count}/{total_count} unique values ({unique_count/total_count*100:.1f}%)")
    
    # Potential duplicates
    print(f"\n--- Duplicates ---")
    duplicate_rows = df.duplicated().sum()
    print(f"Duplicate rows: {duplicate_rows}")
    
    # Value ranges for numeric columns
    print("\n--- Numeric Ranges ---")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        print(f"{col}: min={df[col].min()}, max={df[col].max()}")
    
    return missing_summary

# Assess our messy data
quality_report = assess_data_quality(df)

Step 2: Handling Missing Values

print("=== HANDLING MISSING VALUES ===")

# Identify different types of missing values
def identify_missing_patterns(df):
    """
    Identify different patterns of missing data
    """
    print("--- Missing Value Patterns ---")
    
    # Standard missing values (NaN, None)
    standard_missing = df.isnull().sum()
    print("Standard missing values:")
    print(standard_missing[standard_missing > 0])
    
    # Empty strings
    empty_strings = df.eq('').sum()
    print("\nEmpty string values:")
    print(empty_strings[empty_strings > 0])
    
    # Whitespace-only strings
    whitespace_only = df.apply(lambda x: x.astype(str).str.strip().eq('').sum() if x.dtype == 'object' else 0)
    print("\nWhitespace-only values:")
    print(whitespace_only[whitespace_only > 0])

identify_missing_patterns(df)

# Clean the dataset step by step
df_clean = df.copy()

# Replace empty strings with NaN for consistent handling
df_clean = df_clean.replace('', np.nan)
print(f"\nAfter replacing empty strings with NaN:")
print(f"Missing values:\n{df_clean.isnull().sum()}")

# Handle missing values by column
print("\n--- Handling Missing Values by Column ---")

# Customer ID: Drop rows with missing customer_id (required field)
print(f"Rows before dropping missing customer_id: {len(df_clean)}")
df_clean = df_clean.dropna(subset=['customer_id'])
print(f"Rows after: {len(df_clean)}")

# Name: Fill missing names with placeholder
df_clean['name'] = df_clean['name'].fillna('Unknown Customer')

# Age: compute the median now (more robust than the mean); the actual fill
# happens in Step 3, after the mixed-type age column has been converted
median_age = pd.to_numeric(df_clean['age'], errors='coerce').median()
print(f"Median age for filling missing values: {median_age}")

# Purchase amount: Fill with 0 (assuming no purchase)
df_clean['purchase_amount'] = df_clean['purchase_amount'].fillna(0)

# Product category: Fill with 'Uncategorized'
df_clean['product_category'] = df_clean['product_category'].fillna('Uncategorized')

print("\nAfter handling missing values:")
print(df_clean.isnull().sum())

Step 3: Data Type Conversion and Validation

print("=== DATA TYPE CONVERSION AND VALIDATION ===")

# Convert customer_id to proper integer
df_clean['customer_id'] = pd.to_numeric(df_clean['customer_id'], errors='coerce').astype('Int64')

# Handle age column with mixed data types
def clean_age_column(age_series):
    """
    Clean age column with mixed data types
    """
    def convert_age(age_val):
        if pd.isna(age_val):
            return np.nan
        
        age_str = str(age_val).strip().lower()
        
        # Handle text representations
        text_to_num = {
            'forty-two': 42, 'forty two': 42,
            'thirty': 30, 'twenty': 20,
            # Add more as needed
        }
        
        if age_str in text_to_num:
            return text_to_num[age_str]
        
        # Try to extract numbers from string
        import re
        numbers = re.findall(r'\d+', age_str)
        if numbers:
            return int(numbers[0])
        
        return np.nan
    
    return age_series.apply(convert_age)

# Apply age cleaning
df_clean['age'] = clean_age_column(df_clean['age'])
print("Age column after cleaning:")
print(df_clean['age'].value_counts().sort_index())

# Validate age ranges
def validate_age(age):
    """Validate age is within reasonable range"""
    if pd.isna(age):
        return np.nan
    age = float(age)
    if 0 <= age <= 120:
        return age
    else:
        return np.nan  # Invalid ages become missing

df_clean['age'] = df_clean['age'].apply(validate_age)

# Fill remaining missing ages with median
age_median = df_clean['age'].median()
df_clean['age'] = df_clean['age'].fillna(age_median)

print(f"Age statistics after validation:")
print(df_clean['age'].describe())

# Convert purchase_amount to float
df_clean['purchase_amount'] = pd.to_numeric(df_clean['purchase_amount'], errors='coerce')

# Validate purchase amounts (remove obvious outliers)
def validate_purchase_amount(amount):
    """Validate purchase amount is reasonable"""
    if pd.isna(amount) or amount < 0:
        return 0
    elif amount > 10000:  # Assuming $10k is our upper limit
        return np.nan  # Flag for review
    else:
        return amount

df_clean['purchase_amount'] = df_clean['purchase_amount'].apply(validate_purchase_amount)

print("Purchase amount after validation:")
print(df_clean['purchase_amount'].describe())

Step 4: String Data Cleaning

print("=== STRING DATA CLEANING ===")

# Clean name column
def clean_name(name):
    """
    Standardize name formatting
    """
    if pd.isna(name) or name == 'Unknown Customer':
        return name
    
    # Convert to title case and strip whitespace
    clean_name = str(name).strip().title()
    
    # Handle multiple spaces
    import re
    clean_name = re.sub(r'\s+', ' ', clean_name)
    
    return clean_name

df_clean['name'] = df_clean['name'].apply(clean_name)
print("Names after cleaning:")
print(df_clean['name'].tolist())

# Clean and validate email addresses
def clean_email(email):
    """
    Clean and validate email addresses
    """
    if pd.isna(email):
        return np.nan
    
    email = str(email).strip().lower()
    
    # Basic email validation pattern
    import re
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    
    if re.match(email_pattern, email):
        return email
    else:
        return np.nan  # Invalid emails become missing

df_clean['email'] = df_clean['email'].apply(clean_email)
print("\nValid emails after cleaning:")
print(df_clean[df_clean['email'].notna()]['email'].tolist())

# Clean phone numbers
def clean_phone(phone):
    """
    Standardize phone number format
    """
    if pd.isna(phone):
        return np.nan
    
    # Remove all non-digit characters
    import re
    digits_only = re.sub(r'\D', '', str(phone))
    
    # Handle US phone numbers (10 digits, optionally with country code)
    if len(digits_only) == 10:
        return f"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}"
    elif len(digits_only) == 11 and digits_only.startswith('1'):
        digits_only = digits_only[1:]  # Remove country code
        return f"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}"
    else:
        return np.nan  # Invalid phone numbers

df_clean['phone'] = df_clean['phone'].apply(clean_phone)
print("\nPhone numbers after cleaning:")
print(df_clean[df_clean['phone'].notna()]['phone'].tolist())

# Clean product categories
def standardize_category(category):
    """
    Standardize product category names
    """
    if pd.isna(category):
        return 'Uncategorized'
    
    category = str(category).strip().lower()
    
    # Standardization mapping
    category_mapping = {
        'electronics': 'Electronics',
        'clothing': 'Clothing',
        'books': 'Books',
        'book': 'Books',
        'electronic': 'Electronics',
        'cloth': 'Clothing',
        'uncategorized': 'Uncategorized'
    }
    
    return category_mapping.get(category, category.title())

df_clean['product_category'] = df_clean['product_category'].apply(standardize_category)
print("\nProduct categories after standardization:")
print(df_clean['product_category'].value_counts())

Step 5: Date and Time Cleaning

print("=== DATE AND TIME CLEANING ===")

def clean_date_column(date_series):
    """
    Clean and standardize date column
    """
    def parse_date(date_val):
        if pd.isna(date_val):
            return pd.NaT
        
        date_str = str(date_val).strip()
        
        # Try different date formats
        date_formats = [
            '%Y-%m-%d',
            '%Y/%m/%d',
            '%m/%d/%Y',
            '%m-%d-%Y',
            '%d/%m/%Y',
            '%d-%m-%Y'
        ]
        
        for fmt in date_formats:
            try:
                return pd.to_datetime(date_str, format=fmt)
            except ValueError:
                continue
        
        # If none of the formats work, return NaT
        return pd.NaT
    
    return date_series.apply(parse_date)

# Clean purchase dates
df_clean['purchase_date'] = clean_date_column(df_clean['purchase_date'])
print("Purchase dates after cleaning:")
print(df_clean[df_clean['purchase_date'].notna()]['purchase_date'])

# Validate date ranges (shouldn't be in the future)
today = pd.Timestamp.now().normalize()
future_dates = df_clean['purchase_date'] > today

if future_dates.any():
    print(f"\nFound {future_dates.sum()} future dates - setting to NaT")
    df_clean.loc[future_dates, 'purchase_date'] = pd.NaT

# Add derived date features
df_clean['purchase_year'] = df_clean['purchase_date'].dt.year
df_clean['purchase_month'] = df_clean['purchase_date'].dt.month
df_clean['days_since_purchase'] = (today - df_clean['purchase_date']).dt.days

print("\nDate-derived features:")
print(df_clean[['purchase_date', 'purchase_year', 'purchase_month', 'days_since_purchase']].head())
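
As a sketch of an alternative to the manual format loop above, `pd.to_datetime` with `errors='coerce'` silently converts anything unparseable to NaT (and in pandas ≥ 2.0, `format='mixed'` additionally lets each element be parsed independently):

```python
import pandas as pd

dates = pd.Series(['2023-01-15', 'invalid_date', '2023-03-25'])
# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().sum())  # 1 — only the malformed entry becomes NaT
```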

Step 6: Handling Duplicates

print("=== HANDLING DUPLICATES ===")

# Identify duplicates
print("Checking for duplicates...")
print(f"Total rows: {len(df_clean)}")
print(f"Duplicate rows (all columns): {df_clean.duplicated().sum()}")

# Check duplicates based on key columns
key_columns = ['customer_id', 'email']
duplicate_customers = df_clean.duplicated(subset=key_columns, keep=False)
print(f"Duplicate customers (based on ID/email): {duplicate_customers.sum()}")

if duplicate_customers.any():
    print("\nDuplicate customer records:")
    duplicates_df = df_clean[duplicate_customers].sort_values(['customer_id', 'email'])
    print(duplicates_df[['customer_id', 'name', 'email', 'purchase_date', 'purchase_amount']])
    
    # Strategy: Keep the most recent purchase for each customer
    df_clean = df_clean.sort_values('purchase_date', ascending=False)
    df_clean = df_clean.drop_duplicates(subset=key_columns, keep='first')
    print(f"\nAfter removing duplicates: {len(df_clean)} rows")

# Reset index after dropping duplicates
df_clean = df_clean.reset_index(drop=True)

Step 7: Outlier Detection and Treatment

print("=== OUTLIER DETECTION AND TREATMENT ===")

def detect_outliers_iqr(series, k=1.5):
    """
    Detect outliers using Interquartile Range (IQR) method
    """
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - k * IQR
    upper_bound = Q3 + k * IQR
    
    outliers = (series < lower_bound) | (series > upper_bound)
    return outliers, lower_bound, upper_bound

# Check for outliers in numerical columns
numerical_columns = ['age', 'purchase_amount', 'rating']

for col in numerical_columns:
    if col in df_clean.columns:
        outliers, lower_bound, upper_bound = detect_outliers_iqr(df_clean[col])
        
        print(f"\n--- {col.upper()} Outliers ---")
        print(f"Valid range: {lower_bound:.2f} to {upper_bound:.2f}")
        print(f"Outliers found: {outliers.sum()}")
        
        if outliers.any():
            print("Outlier values:")
            print(df_clean[outliers][col].tolist())
            
            # Treatment strategies
            if col == 'rating':
                # Rating should be 1-5, cap at boundaries
                df_clean.loc[df_clean[col] < 1, col] = 1
                df_clean.loc[df_clean[col] > 5, col] = 5
                print("Capped rating values to 1-5 range")
            
            elif col == 'age':
                # For age, we already handled this in validation
                pass
                
            elif col == 'purchase_amount':
                # For purchase amount, cap extreme values or mark for review
                extreme_purchases = df_clean[col] > 5000
                if extreme_purchases.any():
                    print(f"Flagging {extreme_purchases.sum()} high-value purchases for review")
                    df_clean['high_value_flag'] = extreme_purchases

print("\nFinal outlier summary:")
for col in numerical_columns:
    if col in df_clean.columns:
        print(f"{col}: min={df_clean[col].min():.2f}, max={df_clean[col].max():.2f}")
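
To see the IQR rule in isolation, here is a tiny self-contained check on toy data (not the customer dataset):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 2.0 and 4.0 (linear interpolation)
iqr = q3 - q1                                  # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -1.0 and 7.0
print(s[(s < lower) | (s > upper)].tolist())   # [100]
```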

Step 8: Data Consistency Checks

print("=== DATA CONSISTENCY CHECKS ===")

def run_consistency_checks(df):
    """
    Run various consistency checks on the cleaned data
    """
    issues = []
    
    # Check 1: Customer ID uniqueness
    if df['customer_id'].duplicated().any():
        issues.append("Duplicate customer IDs found")
    
    # Check 2: Email format consistency
    invalid_emails = df[df['email'].notna()]['email'].apply(
        lambda x: '@' not in str(x) or '.' not in str(x).split('@')[-1]
    ).sum()
    if invalid_emails > 0:
        issues.append(f"{invalid_emails} emails with invalid format")
    
    # Check 3: Age reasonableness
    unreasonable_ages = ((df['age'] < 0) | (df['age'] > 120)).sum()
    if unreasonable_ages > 0:
        issues.append(f"{unreasonable_ages} unreasonable age values")
    
    # Check 4: Purchase amount reasonableness
    negative_purchases = (df['purchase_amount'] < 0).sum()
    if negative_purchases > 0:
        issues.append(f"{negative_purchases} negative purchase amounts")
    
    # Check 5: Rating range
    invalid_ratings = ((df['rating'] < 1) | (df['rating'] > 5)).sum()
    if invalid_ratings > 0:
        issues.append(f"{invalid_ratings} ratings outside 1-5 range")
    
    # Check 6: Date consistency
    future_dates = (df['purchase_date'] > pd.Timestamp.now()).sum()
    if future_dates > 0:
        issues.append(f"{future_dates} future purchase dates")
    
    return issues

# Run consistency checks
consistency_issues = run_consistency_checks(df_clean)

print("Consistency check results:")
if consistency_issues:
    for issue in consistency_issues:
        print(f"⚠️  {issue}")
else:
    print("✅ All consistency checks passed!")

# Data quality score
def calculate_data_quality_score(df):
    """
    Calculate overall data quality score
    """
    total_cells = df.shape[0] * df.shape[1]
    missing_cells = df.isnull().sum().sum()
    
    completeness = (total_cells - missing_cells) / total_cells
    
    # Additional quality factors
    duplicate_rate = df.duplicated().sum() / len(df)
    
    # Simple quality score (0-100)
    quality_score = (completeness * 0.7 + (1 - duplicate_rate) * 0.3) * 100
    
    return quality_score, completeness, duplicate_rate

quality_score, completeness, duplicate_rate = calculate_data_quality_score(df_clean)
print(f"\n=== DATA QUALITY METRICS ===")
print(f"Overall Quality Score: {quality_score:.1f}/100")
print(f"Completeness: {completeness:.1%}")
print(f"Duplicate Rate: {duplicate_rate:.1%}")

Step 9: Final Data Validation and Export

print("=== FINAL DATA VALIDATION ===")

# Final dataset summary
def summarize_cleaned_data(df):
    """
    Provide comprehensive summary of cleaned data
    """
    print("--- CLEANED DATASET SUMMARY ---")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
    
    print("\n--- COLUMN SUMMARY ---")
    for col in df.columns:
        dtype = df[col].dtype
        non_null = df[col].count()
        null_pct = (df[col].isnull().sum() / len(df)) * 100
        unique = df[col].nunique()
        
        print(f"{col}: {dtype} | {non_null}/{len(df)} non-null ({100-null_pct:.1f}%) | {unique} unique")
    
    print("\n--- SAMPLE DATA ---")
    print(df.head())
    
    return df

# Summarize our cleaned data
df_final = summarize_cleaned_data(df_clean)

# Create data cleaning report
def create_cleaning_report(original_df, cleaned_df):
    """
    Create a comprehensive data cleaning report
    """
    report = {
        'original_shape': original_df.shape,
        'cleaned_shape': cleaned_df.shape,
        'rows_removed': original_df.shape[0] - cleaned_df.shape[0],
        'columns_added': cleaned_df.shape[1] - original_df.shape[1],
        'original_missing': original_df.isnull().sum().sum(),
        'cleaned_missing': cleaned_df.isnull().sum().sum(),
        'missing_reduction': original_df.isnull().sum().sum() - cleaned_df.isnull().sum().sum()
    }
    
    print("\n=== DATA CLEANING REPORT ===")
    print(f"Original dataset: {report['original_shape'][0]} rows × {report['original_shape'][1]} columns")
    print(f"Cleaned dataset: {report['cleaned_shape'][0]} rows × {report['cleaned_shape'][1]} columns")
    print(f"Rows removed: {report['rows_removed']}")
    print(f"Columns added: {report['columns_added']}")
    print(f"Missing values reduced: {report['missing_reduction']} ({report['missing_reduction']/report['original_missing']*100:.1f}%)")
    
    return report

# Generate cleaning report
cleaning_report = create_cleaning_report(df, df_clean)

# Save cleaned data
print("\n=== EXPORTING CLEANED DATA ===")

# Export to multiple formats
df_clean.to_csv('cleaned_customer_data.csv', index=False)
print("✅ Exported to CSV")

# Excel export requires the openpyxl package (pip install openpyxl)
df_clean.to_excel('cleaned_customer_data.xlsx', index=False)
print("✅ Exported to Excel")

# Save cleaning metadata
metadata = {
    'cleaning_date': datetime.now().isoformat(),
    'original_rows': len(df),
    'cleaned_rows': len(df_clean),
    'quality_score': quality_score,
    'cleaning_steps': [
        'Replaced empty strings with NaN',
        'Converted data types',
        'Standardized name formatting',
        'Validated and cleaned email addresses',
        'Standardized phone number format',
        'Cleaned and validated dates',
        'Removed duplicates',
        'Handled outliers',
        'Added derived features'
    ]
}

import json
with open('cleaning_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print("✅ Saved cleaning metadata")

Advanced Data Cleaning Techniques

Custom Validation Rules

def create_validation_rules():
    """
    Define custom validation rules for different data types
    """
    rules = {
        'customer_id': {
            'required': True,
            'type': 'integer',
            'min': 1000,
            'max': 999999
        },
        'email': {
            'required': False,
            'type': 'string',
            'pattern': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        },
        'age': {
            'required': True,
            'type': 'integer',
            'min': 0,
            'max': 120
        },
        'purchase_amount': {
            'required': True,
            'type': 'float',
            'min': 0,
            'max': 10000
        },
        'rating': {
            'required': False,
            'type': 'integer',
            'min': 1,
            'max': 5
        }
    }
    return rules

def validate_against_rules(df, rules):
    """
    Validate DataFrame against custom rules
    """
    validation_results = {}
    
    for column, rule in rules.items():
        if column not in df.columns:
            validation_results[column] = {'error': 'Column not found'}
            continue
        
        col_results = {'passed': 0, 'failed': 0, 'issues': []}
        
        for idx, value in df[column].items():
            # Check if required
            if rule.get('required', False) and pd.isna(value):
                col_results['failed'] += 1
                col_results['issues'].append(f"Row {idx}: Required field is missing")
                continue
            
            # Skip validation for optional missing values
            if pd.isna(value) and not rule.get('required', False):
                col_results['passed'] += 1
                continue
            
            # Type and range validation
            try:
                if rule['type'] == 'integer':
                    val = int(float(value))
                    if val < rule.get('min', float('-inf')) or val > rule.get('max', float('inf')):
                        col_results['failed'] += 1
                        col_results['issues'].append(f"Row {idx}: Value {val} outside range [{rule.get('min')}, {rule.get('max')}]")
                    else:
                        col_results['passed'] += 1
                        
                elif rule['type'] == 'float':
                    val = float(value)
                    if val < rule.get('min', float('-inf')) or val > rule.get('max', float('inf')):
                        col_results['failed'] += 1
                        col_results['issues'].append(f"Row {idx}: Value {val} outside range [{rule.get('min')}, {rule.get('max')}]")
                    else:
                        col_results['passed'] += 1
                        
                elif rule['type'] == 'string':
                    if 'pattern' in rule:
                        import re
                        if not re.match(rule['pattern'], str(value)):
                            col_results['failed'] += 1
                            col_results['issues'].append(f"Row {idx}: Value doesn't match pattern")
                        else:
                            col_results['passed'] += 1
                    else:
                        col_results['passed'] += 1
                        
            except (ValueError, TypeError):
                col_results['failed'] += 1
                col_results['issues'].append(f"Row {idx}: Invalid data type")
        
        validation_results[column] = col_results
    
    return validation_results

# Apply validation rules
validation_rules = create_validation_rules()
validation_results = validate_against_rules(df_clean, validation_rules)

print("=== VALIDATION RESULTS ===")
for column, results in validation_results.items():
    if 'error' in results:
        print(f"{column}: {results['error']}")
    else:
        total = results['passed'] + results['failed']
        pass_rate = results['passed'] / total * 100 if total > 0 else 0
        print(f"{column}: {pass_rate:.1f}% pass rate ({results['passed']}/{total})")
        
        if results['issues']:
            print(f"  Issues: {len(results['issues'])}")
            for issue in results['issues'][:3]:  # Show first 3 issues
                print(f"    - {issue}")
            if len(results['issues']) > 3:
                print(f"    ... and {len(results['issues']) - 3} more")

Automated Data Profiling

def create_data_profile(df):
    """
    Create comprehensive data profile
    """
    profile = {
        'dataset_info': {
            'name': 'Customer Purchase Data',
            'rows': len(df),
            'columns': len(df.columns),
            'memory_usage_mb': df.memory_usage(deep=True).sum() / (1024**2)
        },
        'columns': {}
    }
    
    for col in df.columns:
        col_profile = {
            'dtype': str(df[col].dtype),
            'non_null_count': df[col].count(),
            'null_count': df[col].isnull().sum(),
            'null_percentage': (df[col].isnull().sum() / len(df)) * 100,
            'unique_count': df[col].nunique(),
            'unique_percentage': (df[col].nunique() / len(df)) * 100
        }
        
        # Add type-specific statistics
        if df[col].dtype in ['int64', 'float64', 'Int64']:
            col_profile.update({
                'min': df[col].min(),
                'max': df[col].max(),
                'mean': df[col].mean(),
                'median': df[col].median(),
                'std': df[col].std(),
                'q25': df[col].quantile(0.25),
                'q75': df[col].quantile(0.75)
            })
        elif df[col].dtype == 'object':
            col_profile.update({
                'avg_length': df[col].astype(str).str.len().mean(),
                'max_length': df[col].astype(str).str.len().max(),
                'most_common': df[col].value_counts().head(3).to_dict()
            })
        elif df[col].dtype.name.startswith('datetime'):
            col_profile.update({
                'min_date': df[col].min(),
                'max_date': df[col].max(),
                'date_range_days': (df[col].max() - df[col].min()).days if df[col].count() > 0 else 0
            })
        
        profile['columns'][col] = col_profile
    
    return profile

# Create and display data profile
data_profile = create_data_profile(df_clean)

print("=== DATA PROFILE SUMMARY ===")
print(f"Dataset: {data_profile['dataset_info']['name']}")
print(f"Dimensions: {data_profile['dataset_info']['rows']} rows × {data_profile['dataset_info']['columns']} columns")
print(f"Memory usage: {data_profile['dataset_info']['memory_usage_mb']:.2f} MB")

print("\n--- COLUMN PROFILES ---")
for col, profile in data_profile['columns'].items():
    print(f"\n{col.upper()}:")
    print(f"  Type: {profile['dtype']}")
    print(f"  Completeness: {100 - profile['null_percentage']:.1f}%")
    print(f"  Unique values: {profile['unique_count']} ({profile['unique_percentage']:.1f}%)")
    
    if 'mean' in profile:
        print(f"  Range: {profile['min']:.2f} - {profile['max']:.2f}")
        print(f"  Mean: {profile['mean']:.2f}, Median: {profile['median']:.2f}")
    elif 'most_common' in profile:
        print(f"  Avg length: {profile['avg_length']:.1f} chars")
        print(f"  Most common: {list(profile['most_common'].keys())[:2]}")
    elif 'date_range_days' in profile:
        print(f"  Date range: {profile['date_range_days']} days")

# Save data profile (json was already imported above)
def convert_for_json(obj):
    """Convert numpy/pandas values to JSON-serializable Python types."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return str(obj)  # fall back for Timestamps and anything else

with open('data_profile.json', 'w') as f:
    json.dump(data_profile, f, indent=2, default=convert_for_json)

print("\n✅ Data profile saved to data_profile.json")

Data Cleaning Best Practices

1. Document Everything

# Create a cleaning log
cleaning_log = [
    {
        'step': 1,
        'action': 'Replace empty strings with NaN',
        'columns_affected': ['customer_id', 'product_category'],
        'rows_affected': 2,
        'reason': 'Standardize missing value representation'
    },
    {
        'step': 2,
        'action': 'Remove duplicate records',
        'columns_affected': ['customer_id', 'email'],
        'rows_affected': 1,
        'reason': 'Prevent double counting in analysis'
    },
    {
        'step': 3,
        'action': 'Standardize name format',
        'columns_affected': ['name'],
        'rows_affected': 10,
        'reason': 'Ensure consistent capitalization'
    }
    # Add more steps as needed
]

# Save cleaning log
cleaning_df = pd.DataFrame(cleaning_log)
cleaning_df.to_csv('data_cleaning_log.csv', index=False)
print("✅ Cleaning log saved")

2. Create Reusable Cleaning Functions

class DataCleaner:
    """
    Reusable data cleaning class
    """
    
    def __init__(self):
        self.cleaning_log = []
        
    def log_action(self, action, columns_affected=None, rows_affected=0, reason=""):
        """Log cleaning action"""
        self.cleaning_log.append({
            'timestamp': datetime.now().isoformat(),
            'action': action,
            'columns_affected': columns_affected or [],
            'rows_affected': rows_affected,
            'reason': reason
        })
    
    def standardize_names(self, df, name_column):
        """Standardize name formatting"""
        original_count = df[name_column].count()
        df[name_column] = df[name_column].apply(
            lambda x: str(x).strip().title() if pd.notna(x) else x
        )
        self.log_action(
            'Standardize name format',
            [name_column],
            original_count,
            'Ensure consistent capitalization'
        )
        return df
    
    def clean_email_addresses(self, df, email_column):
        """Clean and validate email addresses"""
        import re
        email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        
        def clean_email(email):
            if pd.isna(email):
                return np.nan
            email = str(email).strip().lower()
            return email if re.match(email_pattern, email) else np.nan
        
        original_valid = df[email_column].notna().sum()
        df[email_column] = df[email_column].apply(clean_email)
        new_valid = df[email_column].notna().sum()
        
        self.log_action(
            'Clean email addresses',
            [email_column],
            original_valid - new_valid,
            'Remove invalid email formats'
        )
        return df
    
    def remove_outliers_iqr(self, df, column, k=1.5):
        """Remove outliers using IQR method"""
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - k * IQR
        upper_bound = Q3 + k * IQR
        
        outliers = (df[column] < lower_bound) | (df[column] > upper_bound)
        outlier_count = outliers.sum()
        
        df = df[~outliers].reset_index(drop=True)
        
        self.log_action(
            f'Remove outliers (IQR method)',
            [column],
            outlier_count,
            f'Values outside {lower_bound:.2f} - {upper_bound:.2f} range'
        )
        return df
    
    def get_cleaning_summary(self):
        """Get summary of cleaning actions"""
        return pd.DataFrame(self.cleaning_log)

# Example usage
cleaner = DataCleaner()
df_example = df.copy()

# Apply cleaning functions
df_example = cleaner.standardize_names(df_example, 'name')
df_example = cleaner.clean_email_addresses(df_example, 'email')

# Get cleaning summary
cleaning_summary = cleaner.get_cleaning_summary()
print("Cleaning summary:")
print(cleaning_summary)

Key Takeaways

Essential Data Cleaning Steps:

  1. Assess data quality - Understand what you're working with
  2. Handle missing values - Choose appropriate strategies for each column
  3. Validate data types - Ensure columns have correct types
  4. Standardize formats - Make data consistent across records
  5. Remove duplicates - Prevent double counting and analysis errors
  6. Handle outliers - Decide whether to remove, cap, or flag extreme values
  7. Validate consistency - Ensure data makes logical sense
  8. Document everything - Keep track of all cleaning steps
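
The steps above can be sketched as one small, reusable pipeline. This is a minimal illustration with hypothetical column names (`name`, `amount`), not a definitive implementation:

```python
import pandas as pd
import numpy as np

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning pipeline following the steps above."""
    out = df.copy()                                   # never mutate the original
    out = out.replace('', np.nan)                     # unify missing values
    out['amount'] = pd.to_numeric(out['amount'], errors='coerce')  # fix types
    out['name'] = out['name'].str.strip().str.title()              # standardize
    out = out.drop_duplicates().reset_index(drop=True)             # dedupe
    return out

raw = pd.DataFrame({'name': ['  alice ', 'BOB', 'BOB'],
                    'amount': ['10', 'x', 'x']})
print(clean(raw))  # two rows: Alice / 10.0 and Bob / NaN
```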

Best Practices:

  • Always work on a copy of the original data
  • Document every cleaning decision and the reasoning behind it
  • Validate results after each cleaning step
  • Create reusable functions for common cleaning tasks
  • Generate data quality reports to track improvements
  • Save intermediate versions of cleaned data
  • Test your cleaning pipeline on different datasets

Common Pitfalls to Avoid:

  • Removing too much data without understanding the impact
  • Making assumptions about data without validation
  • Not documenting cleaning decisions
  • Applying the same cleaning rules to different contexts
  • Not preserving original data for reference
  • Ignoring domain-specific validation rules

What's Next?

Now that you've mastered data cleaning, you're ready to put these techniques to work in analysis, visualization, and modeling.


Clean data is the foundation of reliable analysis. Invest time in thorough data cleaning - it will pay dividends in the quality and trustworthiness of your insights.
