
Reshaping Data: Pivot, Melt, and Stack Operations

Data doesn't always come in the format you need for analysis. Reshaping - transforming data between wide and long formats - is a critical skill for any data analyst. Pandas provides powerful tools for pivoting, melting, stacking, and unstacking DataFrames to get your data into the perfect shape for visualization and analysis.

Why Reshape Data?

Different analysis tasks require different data structures:

  • Wide format: Good for readability, spreadsheet-style analysis
  • Long format: Required for many plotting libraries, statistical models
  • Pivoted data: Essential for cross-tabulation and summary tables
  • Melted data: Needed for tidy data principles

Understanding Wide vs Long Format

import pandas as pd
import numpy as np

# Wide format: Each variable in its own column
wide_sales = pd.DataFrame({
    'store': ['Store A', 'Store B', 'Store C'],
    'Jan': [100, 150, 120],
    'Feb': [110, 160, 130],
    'Mar': [120, 170, 140]
})

print("Wide Format:")
print(wide_sales)

# Long format: Single column for values, additional column identifies variable
long_sales = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store A', 
              'Store B', 'Store B', 'Store B',
              'Store C', 'Store C', 'Store C'],
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'sales': [100, 110, 120, 150, 160, 170, 120, 130, 140]
})

print("\nLong Format:")
print(long_sales)

Melt: Wide to Long

The melt() function transforms wide data into long format.

Basic Melt

# Melt wide format to long format
melted = pd.melt(wide_sales, 
                 id_vars=['store'],
                 value_vars=['Jan', 'Feb', 'Mar'],
                 var_name='month',
                 value_name='sales')

print("Melted Data:")
print(melted)

# Melt with automatic column detection
melted_auto = wide_sales.melt(id_vars=['store'], 
                              var_name='month', 
                              value_name='sales')
print("\nMelted (automatic):")
print(melted_auto)

Advanced Melt Examples

# Multiple ID variables
employee_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Sales', 'Engineering', 'Sales'],
    'Q1_sales': [100, 0, 120],
    'Q2_sales': [110, 0, 130],
    'Q1_expenses': [20, 50, 25],
    'Q2_expenses': [22, 55, 27]
})

print("Original Employee Data:")
print(employee_data)

# Melt all quarterly metric columns into one long frame
melted_employees = employee_data.melt(
    id_vars=['name', 'department'],
    var_name='metric_quarter',
    value_name='amount'
)

print("\nMelted Employee Data:")
print(melted_employees.head(10))

# Split the metric_quarter column
melted_employees[['quarter', 'metric']] = melted_employees['metric_quarter'].str.split('_', expand=True)
melted_employees = melted_employees.drop('metric_quarter', axis=1)

print("\nCleaned Melted Data:")
print(melted_employees.head(10))
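
The split-and-reassign step above can also be done in a single call with pd.wide_to_long. It expects stub-first column names (sales_Q1 rather than Q1_sales), so this sketch renames the columns first; the data matches the employee example:

```python
import pandas as pd

employee_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Sales', 'Engineering', 'Sales'],
    'Q1_sales': [100, 0, 120],
    'Q2_sales': [110, 0, 130],
    'Q1_expenses': [20, 50, 25],
    'Q2_expenses': [22, 55, 27]
})

# Swap 'Q1_sales' -> 'sales_Q1' so the stub comes first
renamed = employee_data.rename(
    columns=lambda c: '_'.join(c.split('_')[::-1]) if '_' in c else c
)

# One call: split each column into a stub ('sales'/'expenses') and a suffix ('Q1'/'Q2')
tidy = pd.wide_to_long(
    renamed,
    stubnames=['sales', 'expenses'],
    i=['name', 'department'],
    j='quarter',
    sep='_',
    suffix=r'Q\d'
).reset_index()

print(tidy)
```
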

Pivot: Long to Wide

The pivot() function transforms long data into wide format.

Basic Pivot

# Pivot long format back to wide
pivoted = long_sales.pivot(index='store', 
                          columns='month', 
                          values='sales')

print("Pivoted Data:")
print(pivoted)

# Reset index to get store as a column
pivoted_reset = pivoted.reset_index()
print("\nPivoted with Reset Index:")
print(pivoted_reset)

Pivot with Multiple Values

# Data with multiple value columns
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'product': ['A', 'B'] * 3,
    'sales': [100, 150, 110, 160, 120, 170],
    'profit': [20, 30, 22, 32, 24, 34]
})

print("Sales Data:")
print(sales_data)

# pivot_table handles multiple value columns (plain pivot also accepts
# a list of values when every index/column pair is unique)
pivoted_multi = sales_data.pivot_table(
    index='date',
    columns='product',
    values=['sales', 'profit']
)

print("\nPivoted Multiple Values:")
print(pivoted_multi)

Pivot Table: Aggregating While Pivoting

The pivot_table() function is more flexible and handles duplicate entries through aggregation.
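
The difference shows up as soon as an index/column pair repeats; a minimal sketch with toy data:

```python
import pandas as pd

# Two sales for ('A', 'Jan'): pivot() refuses, pivot_table() aggregates
df = pd.DataFrame({
    'store': ['A', 'A', 'A'],
    'month': ['Jan', 'Jan', 'Feb'],
    'sales': [100, 120, 90]
})

try:
    df.pivot(index='store', columns='month', values='sales')
except ValueError as err:
    print("pivot raised:", err)

# pivot_table averages the duplicates by default
agg = df.pivot_table(index='store', columns='month', values='sales', aggfunc='mean')
print(agg)
```
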

Basic Pivot Table

# Create data with duplicate combinations (seeded so the output is reproducible)
np.random.seed(42)
transactions = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'store': np.random.choice(['Store A', 'Store B', 'Store C'], 20),
    'product': np.random.choice(['Product X', 'Product Y'], 20),
    'sales': np.random.randint(50, 200, 20)
})

print("Transaction Data (first 10 rows):")
print(transactions.head(10))

# Create pivot table with aggregation
pivot_agg = pd.pivot_table(
    transactions,
    values='sales',
    index='store',
    columns='product',
    aggfunc='sum'  # Default is 'mean'
)

print("\nPivot Table (sum of sales):")
print(pivot_agg)

Advanced Pivot Table Features

# Multiple aggregation functions
pivot_multi_agg = pd.pivot_table(
    transactions,
    values='sales',
    index='store',
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)

print("Pivot Table with Multiple Aggregations:")
print(pivot_multi_agg)

# Multiple aggregation functions plus totals
pivot_complex = pd.pivot_table(
    transactions,
    values='sales',
    index='store',
    columns='product',
    aggfunc=['sum', 'mean'],
    fill_value=0,  # Fill missing values
    margins=True,  # Add row/column totals
    margins_name='Total'
)

print("\nPivot Table with Margins:")
print(pivot_complex)

Pivot Table with Multiple Indexes

# Add more dimensions
transactions['month'] = transactions['date'].dt.month
transactions['day_of_week'] = transactions['date'].dt.day_name()

# Multi-level index pivot
pivot_multi_index = pd.pivot_table(
    transactions,
    values='sales',
    index=['store', 'month'],
    columns='product',
    aggfunc='sum',
    fill_value=0
)

print("Multi-level Index Pivot:")
print(pivot_multi_index)

# Access specific levels
print("\nStore A data:")
print(pivot_multi_index.loc['Store A'])
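
.loc slices the outer level; to select on the inner level instead, DataFrame.xs is handy. A self-contained sketch with a small hypothetical frame shaped like the pivot above:

```python
import pandas as pd

# Small frame with a (store, month) MultiIndex
idx = pd.MultiIndex.from_product(
    [['Store A', 'Store B'], [1, 2]], names=['store', 'month']
)
df = pd.DataFrame({'sales': [10, 20, 30, 40]}, index=idx)

# Select month 1 across all stores; the 'month' level is dropped from the result
jan = df.xs(1, level='month')
print(jan)
```
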

Stata Commands Comparison

For Stata users:

# Stata: reshape long sales, i(store) j(month) string
wide_sales.melt(id_vars=['store'], var_name='month', value_name='sales')

# Stata: reshape wide sales, i(store) j(month) string
long_sales.pivot(index='store', columns='month', values='sales')

# Stata: collapse (sum) sales, by(store product)
pd.pivot_table(df, values='sales', index='store', columns='product', aggfunc='sum')

# Stata: table store product, c(mean sales)
pd.pivot_table(df, values='sales', index='store', columns='product', aggfunc='mean')

# Stata: tabulate store product
pd.crosstab(df['store'], df['product'])

Stack and Unstack

Stack and unstack operations work with DataFrame indexes to reshape data.

Unstack: Convert Row Index to Columns

# Create multi-index DataFrame
data = {
    ('Sales', 'Q1'): [100, 150, 120],
    ('Sales', 'Q2'): [110, 160, 130],
    ('Profit', 'Q1'): [20, 30, 24],
    ('Profit', 'Q2'): [22, 32, 26]
}

df_multi = pd.DataFrame(data, 
                        index=['Store A', 'Store B', 'Store C'])

print("Multi-level Column DataFrame:")
print(df_multi)

# Unstack the index (pivot innermost level)
unstacked = long_sales.set_index(['store', 'month']).unstack()
print("\nUnstacked:")
print(unstacked)

# Unstack specific level
unstacked_level = long_sales.set_index(['store', 'month']).unstack(level='month')
print("\nUnstacked by Month:")
print(unstacked_level)
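
If some index combinations are missing, unstack leaves NaN in those cells; its fill_value parameter substitutes a default instead. A small sketch with hypothetical data:

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3],
    index=pd.MultiIndex.from_tuples(
        [('A', 'Jan'), ('A', 'Feb'), ('B', 'Jan')],
        names=['store', 'month']
    )
)

# ('B', 'Feb') never occurs, so it would unstack to NaN without fill_value
wide = s.unstack(fill_value=0)
print(wide)
```
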

Stack: Convert Columns to Row Index

# Stack columns into index
stacked = unstacked.stack()
print("Stacked:")
print(stacked)

# Stack with multi-level columns
stacked_multi = df_multi.stack()
print("\nStacked Multi-level:")
print(stacked_multi)

# Flatten the MultiIndex back into ordinary columns
stacked_clean = stacked.reset_index()
print("\nStacked with Reset Index:")
print(stacked_clean)

Cross-Tabulation

Create frequency tables showing relationships between variables.

Basic Crosstab

# Create survey data
survey = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-50', '50+'], 200),
    'product_preference': np.random.choice(['Product A', 'Product B', 'Product C'], 200),
    'satisfaction': np.random.choice(['Low', 'Medium', 'High'], 200)
})

# Basic crosstab
crosstab = pd.crosstab(survey['age_group'], survey['product_preference'])
print("Product Preference by Age Group:")
print(crosstab)

# Crosstab with percentages
crosstab_pct = pd.crosstab(
    survey['age_group'], 
    survey['product_preference'],
    normalize='index'  # or 'columns', 'all'
) * 100

print("\nProduct Preference by Age (%):")
print(crosstab_pct.round(1))

Advanced Crosstab

# Crosstab with multiple variables
crosstab_multi = pd.crosstab(
    [survey['age_group'], survey['satisfaction']],
    survey['product_preference'],
    margins=True,
    margins_name='Total'
)

print("Multi-variable Crosstab:")
print(crosstab_multi)

# Crosstab with aggregation
crosstab_agg = pd.crosstab(
    survey['age_group'],
    survey['product_preference'],
    values=np.random.randint(1, 100, len(survey)),
    aggfunc='mean'
)

print("\nCrosstab with Aggregation:")
print(crosstab_agg.round(1))

Practical Examples

Example 1: Sales Analysis Transformation

# Raw sales data
raw_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=90, freq='D'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 90),
    'product': np.random.choice(['A', 'B', 'C'], 90),
    'sales': np.random.randint(100, 1000, 90)
})

# Add month column
raw_sales['month'] = raw_sales['date'].dt.month

print("Raw Sales Data (first 5 rows):")
print(raw_sales.head())

# Pivot to see sales by region and product
sales_matrix = pd.pivot_table(
    raw_sales,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)

print("\nSales Matrix (Region x Product):")
print(sales_matrix)

# Calculate totals
sales_matrix['Total'] = sales_matrix.sum(axis=1)
sales_matrix.loc['Total'] = sales_matrix.sum(axis=0)

print("\nWith Totals:")
print(sales_matrix)

# Monthly trends
monthly_trends = pd.pivot_table(
    raw_sales,
    values='sales',
    index='month',
    columns='product',
    aggfunc='sum'
)

print("\nMonthly Sales Trends:")
print(monthly_trends)

Example 2: Student Grades Reshaping

# Student grades in wide format
grades_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math': [85, 92, 78, 88],
    'English': [90, 85, 82, 92],
    'Science': [88, 90, 85, 86],
    'History': [82, 88, 90, 84]
})

print("Grades (Wide Format):")
print(grades_wide)

# Melt to long format
grades_long = grades_wide.melt(
    id_vars=['student'],
    var_name='subject',
    value_name='grade'
)

print("\nGrades (Long Format):")
print(grades_long)

# Calculate statistics per subject
subject_stats = grades_long.groupby('subject')['grade'].agg([
    'mean', 'median', 'std', 'min', 'max'
]).round(2)

print("\nSubject Statistics:")
print(subject_stats)

# Pivot to see student rankings
grades_pivot = grades_long.pivot(
    index='student',
    columns='subject',
    values='grade'
)

grades_pivot['Average'] = grades_pivot.mean(axis=1).round(2)
grades_pivot['Rank'] = grades_pivot['Average'].rank(ascending=False)

print("\nStudent Rankings:")
print(grades_pivot.sort_values('Rank'))

Example 3: Time Series Reshaping

# Sensor readings from multiple sensors
sensors = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=100, freq='h'),
    'sensor_id': np.random.choice(['S1', 'S2', 'S3'], 100),
    'temperature': np.random.normal(70, 5, 100),
    'humidity': np.random.normal(60, 10, 100)
})

print("Sensor Data (first 10 rows):")
print(sensors.head(10))

# Pivot to compare sensors side-by-side
temp_comparison = pd.pivot_table(
    sensors,
    values='temperature',
    index='timestamp',
    columns='sensor_id',
    aggfunc='mean'
)

print("\nTemperature Comparison:")
print(temp_comparison.head(10))

# Calculate correlations between sensors
correlations = temp_comparison.corr()
print("\nSensor Correlations:")
print(correlations)

# Melt back for plotting
sensors_melted = temp_comparison.reset_index().melt(
    id_vars=['timestamp'],
    var_name='sensor',
    value_name='temperature'
)

print("\nMelted for Plotting (first 10 rows):")
print(sensors_melted.head(10))

Example 4: Survey Response Analysis

# Survey responses
responses = pd.DataFrame({
    'respondent_id': range(1, 101),
    'age': np.random.randint(18, 70, 100),
    'Q1_rating': np.random.randint(1, 6, 100),
    'Q2_rating': np.random.randint(1, 6, 100),
    'Q3_rating': np.random.randint(1, 6, 100),
    'Q4_rating': np.random.randint(1, 6, 100)
})

# Create age groups
responses['age_group'] = pd.cut(
    responses['age'],
    bins=[0, 25, 35, 50, 100],
    labels=['18-25', '26-35', '36-50', '50+']
)

print("Survey Responses (first 5):")
print(responses.head())

# Melt rating columns
responses_long = responses.melt(
    id_vars=['respondent_id', 'age_group'],
    value_vars=['Q1_rating', 'Q2_rating', 'Q3_rating', 'Q4_rating'],
    var_name='question',
    value_name='rating'
)

# Average ratings by age group and question
age_question_avg = pd.pivot_table(
    responses_long,
    values='rating',
    index='age_group',
    columns='question',
    aggfunc='mean'
)

print("\nAverage Ratings by Age Group:")
print(age_question_avg.round(2))

# Response distribution
response_dist = pd.crosstab(
    responses_long['question'],
    responses_long['rating'],
    normalize='index'
) * 100

print("\nResponse Distribution (%):")
print(response_dist.round(1))

Common Patterns

From Wide to Long and Back

# Start with wide format
original = wide_sales.copy()
print("Original Wide Format:")
print(original)

# Convert to long
long = original.melt(id_vars=['store'], var_name='month', value_name='sales')
print("\nConvert to Long:")
print(long)

# Convert back to wide
back_to_wide = long.pivot(index='store', columns='month', values='sales')
print("\nBack to Wide:")
print(back_to_wide)

# Verify they're the same (pivot sorts columns alphabetically, so restore the original order first)
print("\nAre they equal?",
      original.set_index('store').equals(back_to_wide[['Jan', 'Feb', 'Mar']]))

Multiple Value Columns

# Data with multiple metrics
metrics = pd.DataFrame({
    'store': ['A', 'B', 'C'] * 3,
    'month': ['Jan'] * 3 + ['Feb'] * 3 + ['Mar'] * 3,
    'sales': [100, 150, 120, 110, 160, 130, 120, 170, 140],
    'customers': [50, 75, 60, 55, 80, 65, 60, 85, 70]
})

# Pivot both metrics
pivot_both = metrics.pivot(
    index='store',
    columns='month',
    values=['sales', 'customers']
)

print("Pivoted Multiple Metrics:")
print(pivot_both)

# Flatten multi-level columns
pivot_both.columns = ['_'.join(col) for col in pivot_both.columns]
pivot_both = pivot_both.reset_index()

print("\nFlattened Columns:")
print(pivot_both)

Performance Tips

# For large datasets, consider:

# 1. Use categorical data types for repeated values
df['category_col'] = df['category_col'].astype('category')

# 2. Convert id columns to category before melting; their values get
#    repeated once per melted column, so this shrinks the long result
df['id'] = df['id'].astype('category')
melted = df.melt(id_vars=['id'], var_name='variable', value_name='value')

# 3. Use pivot instead of pivot_table when you don't need aggregation
# (pivot is faster for unique index-column combinations)

# 4. Pre-sort data before pivoting if possible
df_sorted = df.sort_values(['index_col', 'column_col'])
pivoted = df_sorted.pivot(index='index_col', columns='column_col', values='value_col')
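
Tip 1's effect is easy to measure with memory_usage; a quick sketch (the column name and data are made up):

```python
import pandas as pd
import numpy as np

# 100k rows drawn from only three distinct strings
df = pd.DataFrame({'city': np.random.choice(['NY', 'LA', 'SF'], 100_000)})

as_object = df['city'].memory_usage(deep=True)
as_category = df['city'].astype('category').memory_usage(deep=True)

# Categories store each unique string once plus small integer codes
print(f"object: {as_object:,} bytes  category: {as_category:,} bytes")
```
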

Practice Exercises

Exercise 1: Reshape Product Data

Given monthly product sales in wide format:

  • Melt to long format
  • Pivot to show products as rows, months as columns
  • Calculate quarterly totals

Exercise 2: Student Attendance

With daily attendance data:

  • Pivot to show students as rows, dates as columns
  • Calculate attendance percentage per student
  • Identify perfect attendance

Exercise 3: Multi-Region Analysis

Given sales by region and product:

  • Create pivot table showing total sales
  • Add percentage of total for each cell
  • Find top-performing region-product combinations

Sample Solutions

Exercise 1:

# Assuming wide_format has columns: product, Jan, Feb, Mar, ...

# Melt to long
long = wide_format.melt(id_vars=['product'], var_name='month', value_name='sales')

# Pivot back
pivoted = long.pivot(index='product', columns='month', values='sales')

# Calculate quarterly totals
pivoted['Q1'] = pivoted[['Jan', 'Feb', 'Mar']].sum(axis=1)
pivoted['Q2'] = pivoted[['Apr', 'May', 'Jun']].sum(axis=1)

print("Quarterly Sales:")
print(pivoted[['Q1', 'Q2']])
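
Exercise 2:

One possible approach; the attendance DataFrame below is hypothetical sample data, so adapt the column names to your own:

```python
import pandas as pd

# Hypothetical daily attendance log: one row per student per day
attendance = pd.DataFrame({
    'student': ['Alice', 'Alice', 'Bob', 'Bob', 'Cara', 'Cara'],
    'date': ['2024-01-01', '2024-01-02'] * 3,
    'present': [1, 1, 1, 0, 1, 1]
})

# Students as rows, dates as columns
matrix = attendance.pivot(index='student', columns='date', values='present')

# Attendance percentage per student
pct = matrix.mean(axis=1) * 100

# Perfect attendance: present on every recorded day
perfect = pct[pct == 100].index.tolist()

print(pct)
print("Perfect attendance:", perfect)
```
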

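Exercise 3:

One possible approach, again with hypothetical sample data standing in for your region/product sales:

```python
import pandas as pd

# Hypothetical regional sales records
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'sales': [100, 200, 300, 400]
})

# Total sales per region-product cell
totals = pd.pivot_table(sales, values='sales', index='region',
                        columns='product', aggfunc='sum', fill_value=0)

# Each cell as a percentage of the grand total
pct = totals / totals.to_numpy().sum() * 100

# Rank region-product combinations by total sales
ranked = totals.stack().sort_values(ascending=False)

print(pct.round(1))
print(ranked.head())
```
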
Key Takeaways

  • Use melt() to convert wide format to long format (unpivot)
  • Use pivot() to convert long format to wide format (requires unique index-column combinations)
  • Use pivot_table() for aggregation while pivoting (handles duplicates)
  • Use stack() and unstack() for index-based reshaping
  • Use crosstab() for frequency tables
  • Choose format based on analysis needs: long for plotting/modeling, wide for readability
  • Wide format is better for human reading, long format for machine processing
  • Remember: most plotting libraries prefer long format data

What's Next?

Now that you can reshape data effectively, you can feed tidy long-format data to plotting libraries and statistical models, and switch back to wide format whenever readability matters.

Reshaping is a superpower in data analysis. Master these techniques to transform any dataset into the format you need.
