Reshaping Data: Pivot, Melt, and Stack Operations
Data doesn't always come in the format you need for analysis. Reshaping - transforming data between wide and long formats - is a critical skill for any data analyst. Pandas provides powerful tools for pivoting, melting, stacking, and unstacking DataFrames to get your data into the perfect shape for visualization and analysis.
Why Reshape Data?
Different analysis tasks require different data structures:
- Wide format: Good for readability, spreadsheet-style analysis
- Long format: Required for many plotting libraries, statistical models
- Pivoted data: Essential for cross-tabulation and summary tables
- Melted data: Needed for tidy data principles
Understanding Wide vs Long Format
import pandas as pd
import numpy as np
# Wide format: Each variable in its own column
wide_sales = pd.DataFrame({
'store': ['Store A', 'Store B', 'Store C'],
'Jan': [100, 150, 120],
'Feb': [110, 160, 130],
'Mar': [120, 170, 140]
})
print("Wide Format:")
print(wide_sales)
# Long format: Single column for values, additional column identifies variable
long_sales = pd.DataFrame({
'store': ['Store A', 'Store A', 'Store A',
'Store B', 'Store B', 'Store B',
'Store C', 'Store C', 'Store C'],
'month': ['Jan', 'Feb', 'Mar'] * 3,
'sales': [100, 110, 120, 150, 160, 170, 120, 130, 140]
})
print("\nLong Format:")
print(long_sales)
Melt: Wide to Long
The melt() function transforms wide data into long format.
Basic Melt
# Melt wide format to long format
melted = pd.melt(wide_sales,
id_vars=['store'],
value_vars=['Jan', 'Feb', 'Mar'],
var_name='month',
value_name='sales')
print("Melted Data:")
print(melted)
# Melt with automatic column detection
melted_auto = wide_sales.melt(id_vars=['store'],
var_name='month',
value_name='sales')
print("\nMelted (automatic):")
print(melted_auto)
Advanced Melt Examples
# Multiple ID variables
employee_data = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'department': ['Sales', 'Engineering', 'Sales'],
'Q1_sales': [100, 0, 120],
'Q2_sales': [110, 0, 130],
'Q1_expenses': [20, 50, 25],
'Q2_expenses': [22, 55, 27]
})
print("Original Employee Data:")
print(employee_data)
# Melt sales and expenses separately
melted_employees = employee_data.melt(
id_vars=['name', 'department'],
var_name='metric_quarter',
value_name='amount'
)
print("\nMelted Employee Data:")
print(melted_employees.head(10))
# Split the metric_quarter column
melted_employees[['quarter', 'metric']] = melted_employees['metric_quarter'].str.split('_', expand=True)
melted_employees = melted_employees.drop('metric_quarter', axis=1)
print("\nCleaned Melted Data:")
print(melted_employees.head(10))
Pivot: Long to Wide
The pivot() function transforms long data into wide format.
Basic Pivot
# Pivot long format back to wide
pivoted = long_sales.pivot(index='store',
columns='month',
values='sales')
print("Pivoted Data:")
print(pivoted)
# Reset index to get store as a column
pivoted_reset = pivoted.reset_index()
print("\nPivoted with Reset Index:")
print(pivoted_reset)
Pivot with Multiple Values
# Data with multiple value columns
sales_data = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=6, freq='D'),
'product': ['A', 'B'] * 3,
'sales': [100, 150, 110, 160, 120, 170],
'profit': [20, 30, 22, 32, 24, 34]
})
print("Sales Data:")
print(sales_data)
# Pivot with multiple value columns requires pivot_table
pivoted_multi = sales_data.pivot_table(
index='date',
columns='product',
values=['sales', 'profit']
)
print("\nPivoted Multiple Values:")
print(pivoted_multi)
Pivot Table: Aggregating While Pivoting
The pivot_table() function is more flexible and handles duplicate entries through aggregation.
Basic Pivot Table
# Create data with duplicate combinations
transactions = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=20, freq='D'),
'store': np.random.choice(['Store A', 'Store B', 'Store C'], 20),
'product': np.random.choice(['Product X', 'Product Y'], 20),
'sales': np.random.randint(50, 200, 20)
})
print("Transaction Data (first 10 rows):")
print(transactions.head(10))
# Create pivot table with aggregation
pivot_agg = pd.pivot_table(
transactions,
values='sales',
index='store',
columns='product',
aggfunc='sum' # Default is 'mean'
)
print("\nPivot Table (sum of sales):")
print(pivot_agg)
Advanced Pivot Table Features
# Multiple aggregation functions
pivot_multi_agg = pd.pivot_table(
transactions,
values='sales',
index='store',
columns='product',
aggfunc=['sum', 'mean', 'count']
)
print("Pivot Table with Multiple Aggregations:")
print(pivot_multi_agg)
# Multiple values and functions
pivot_complex = pd.pivot_table(
transactions,
values='sales',
index='store',
columns='product',
aggfunc={'sales': ['sum', 'mean']},
fill_value=0, # Fill missing values
margins=True, # Add row/column totals
margins_name='Total'
)
print("\nPivot Table with Margins:")
print(pivot_complex)
Pivot Table with Multiple Indexes
# Add more dimensions
transactions['month'] = transactions['date'].dt.month
transactions['day_of_week'] = transactions['date'].dt.day_name()
# Multi-level index pivot
pivot_multi_index = pd.pivot_table(
transactions,
values='sales',
index=['store', 'month'],
columns='product',
aggfunc='sum',
fill_value=0
)
print("Multi-level Index Pivot:")
print(pivot_multi_index)
# Access specific levels
print("\nStore A data:")
print(pivot_multi_index.loc['Store A'])
Stata Commands Comparison
For Stata users:
# Stata: reshape wide sales, i(store) j(month) string
wide_sales.melt(id_vars=['store'], var_name='month', value_name='sales')
# Stata: reshape long
long_sales.pivot(index='store', columns='month', values='sales')
# Stata: collapse (sum) sales, by(store product)
pd.pivot_table(df, values='sales', index='store', columns='product', aggfunc='sum')
# Stata: table store product, c(mean sales)
pd.pivot_table(df, values='sales', index='store', columns='product', aggfunc='mean')
# Stata: tabulate store product
pd.crosstab(df['store'], df['product'])
Stack and Unstack
Stack and unstack operations work with DataFrame indexes to reshape data.
Unstack: Convert Row Index to Columns
# Create multi-index DataFrame
data = {
('Sales', 'Q1'): [100, 150, 120],
('Sales', 'Q2'): [110, 160, 130],
('Profit', 'Q1'): [20, 30, 24],
('Profit', 'Q2'): [22, 32, 26]
}
df_multi = pd.DataFrame(data,
index=['Store A', 'Store B', 'Store C'])
print("Multi-level Column DataFrame:")
print(df_multi)
# Unstack the index (pivot innermost level)
unstacked = long_sales.set_index(['store', 'month']).unstack()
print("\nUnstacked:")
print(unstacked)
# Unstack specific level
unstacked_level = long_sales.set_index(['store', 'month']).unstack(level='month')
print("\nUnstacked by Month:")
print(unstacked_level)
Stack: Convert Columns to Row Index
# Stack columns into index
stacked = unstacked.stack()
print("Stacked:")
print(stacked)
# Stack with multi-level columns
stacked_multi = df_multi.stack()
print("\nStacked Multi-level:")
print(stacked_multi)
# Remove level names
stacked_clean = stacked.reset_index()
print("\nStacked with Reset Index:")
print(stacked_clean)
Cross-Tabulation
Create frequency tables showing relationships between variables.
Basic Crosstab
# Create survey data
survey = pd.DataFrame({
'age_group': np.random.choice(['18-25', '26-35', '36-50', '50+'], 200),
'product_preference': np.random.choice(['Product A', 'Product B', 'Product C'], 200),
'satisfaction': np.random.choice(['Low', 'Medium', 'High'], 200)
})
# Basic crosstab
crosstab = pd.crosstab(survey['age_group'], survey['product_preference'])
print("Product Preference by Age Group:")
print(crosstab)
# Crosstab with percentages
crosstab_pct = pd.crosstab(
survey['age_group'],
survey['product_preference'],
normalize='index' # or 'columns', 'all'
) * 100
print("\nProduct Preference by Age (%):")
print(crosstab_pct.round(1))
Advanced Crosstab
# Crosstab with multiple variables
crosstab_multi = pd.crosstab(
[survey['age_group'], survey['satisfaction']],
survey['product_preference'],
margins=True,
margins_name='Total'
)
print("Multi-variable Crosstab:")
print(crosstab_multi)
# Crosstab with aggregation
crosstab_agg = pd.crosstab(
survey['age_group'],
survey['product_preference'],
values=np.random.randint(1, 100, len(survey)),
aggfunc='mean'
)
print("\nCrosstab with Aggregation:")
print(crosstab_agg.round(1))
Practical Examples
Example 1: Sales Analysis Transformation
# Raw sales data
raw_sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=90, freq='D'),
'region': np.random.choice(['North', 'South', 'East', 'West'], 90),
'product': np.random.choice(['A', 'B', 'C'], 90),
'sales': np.random.randint(100, 1000, 90)
})
# Add month column
raw_sales['month'] = raw_sales['date'].dt.month
print("Raw Sales Data (first 5 rows):")
print(raw_sales.head())
# Pivot to see sales by region and product
sales_matrix = pd.pivot_table(
raw_sales,
values='sales',
index='region',
columns='product',
aggfunc='sum',
fill_value=0
)
print("\nSales Matrix (Region x Product):")
print(sales_matrix)
# Calculate totals
sales_matrix['Total'] = sales_matrix.sum(axis=1)
sales_matrix.loc['Total'] = sales_matrix.sum(axis=0)
print("\nWith Totals:")
print(sales_matrix)
# Monthly trends
monthly_trends = pd.pivot_table(
raw_sales,
values='sales',
index='month',
columns='product',
aggfunc='sum'
)
print("\nMonthly Sales Trends:")
print(monthly_trends)
Example 2: Student Grades Reshaping
# Student grades in wide format
grades_wide = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Math': [85, 92, 78, 88],
'English': [90, 85, 82, 92],
'Science': [88, 90, 85, 86],
'History': [82, 88, 90, 84]
})
print("Grades (Wide Format):")
print(grades_wide)
# Melt to long format
grades_long = grades_wide.melt(
id_vars=['student'],
var_name='subject',
value_name='grade'
)
print("\nGrades (Long Format):")
print(grades_long)
# Calculate statistics per subject
subject_stats = grades_long.groupby('subject')['grade'].agg([
'mean', 'median', 'std', 'min', 'max'
]).round(2)
print("\nSubject Statistics:")
print(subject_stats)
# Pivot to see student rankings
grades_pivot = grades_long.pivot(
index='student',
columns='subject',
values='grade'
)
grades_pivot['Average'] = grades_pivot.mean(axis=1).round(2)
grades_pivot['Rank'] = grades_pivot['Average'].rank(ascending=False)
print("\nStudent Rankings:")
print(grades_pivot.sort_values('Rank'))
Example 3: Time Series Reshaping
# Sensor readings from multiple sensors
sensors = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=100, freq='H'),
'sensor_id': np.random.choice(['S1', 'S2', 'S3'], 100),
'temperature': np.random.normal(70, 5, 100),
'humidity': np.random.normal(60, 10, 100)
})
print("Sensor Data (first 10 rows):")
print(sensors.head(10))
# Pivot to compare sensors side-by-side
temp_comparison = pd.pivot_table(
sensors,
values='temperature',
index='timestamp',
columns='sensor_id',
aggfunc='mean'
)
print("\nTemperature Comparison:")
print(temp_comparison.head(10))
# Calculate correlations between sensors
correlations = temp_comparison.corr()
print("\nSensor Correlations:")
print(correlations)
# Melt back for plotting
sensors_melted = temp_comparison.reset_index().melt(
id_vars=['timestamp'],
var_name='sensor',
value_name='temperature'
)
print("\nMelted for Plotting (first 10 rows):")
print(sensors_melted.head(10))
Example 4: Survey Response Analysis
# Survey responses
responses = pd.DataFrame({
'respondent_id': range(1, 101),
'age': np.random.randint(18, 70, 100),
'Q1_rating': np.random.randint(1, 6, 100),
'Q2_rating': np.random.randint(1, 6, 100),
'Q3_rating': np.random.randint(1, 6, 100),
'Q4_rating': np.random.randint(1, 6, 100)
})
# Create age groups
responses['age_group'] = pd.cut(
responses['age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+']
)
print("Survey Responses (first 5):")
print(responses.head())
# Melt rating columns
responses_long = responses.melt(
id_vars=['respondent_id', 'age_group'],
value_vars=['Q1_rating', 'Q2_rating', 'Q3_rating', 'Q4_rating'],
var_name='question',
value_name='rating'
)
# Average ratings by age group and question
age_question_avg = pd.pivot_table(
responses_long,
values='rating',
index='age_group',
columns='question',
aggfunc='mean'
)
print("\nAverage Ratings by Age Group:")
print(age_question_avg.round(2))
# Response distribution
response_dist = pd.crosstab(
responses_long['question'],
responses_long['rating'],
normalize='index'
) * 100
print("\nResponse Distribution (%):")
print(response_dist.round(1))
Common Patterns
From Wide to Long and Back
# Start with wide format
original = wide_sales.copy()
print("Original Wide Format:")
print(original)
# Convert to long
long = original.melt(id_vars=['store'], var_name='month', value_name='sales')
print("\nConvert to Long:")
print(long)
# Convert back to wide
back_to_wide = long.pivot(index='store', columns='month', values='sales')
print("\nBack to Wide:")
print(back_to_wide)
# Verify they're the same (except column order might differ)
print("\nAre they equal?", original.set_index('store').equals(back_to_wide))
Multiple Value Columns
# Data with multiple metrics
metrics = pd.DataFrame({
'store': ['A', 'B', 'C'] * 3,
'month': ['Jan'] * 3 + ['Feb'] * 3 + ['Mar'] * 3,
'sales': [100, 150, 120, 110, 160, 130, 120, 170, 140],
'customers': [50, 75, 60, 55, 80, 65, 60, 85, 70]
})
# Pivot both metrics
pivot_both = metrics.pivot(
index='store',
columns='month',
values=['sales', 'customers']
)
print("Pivoted Multiple Metrics:")
print(pivot_both)
# Flatten multi-level columns
pivot_both.columns = ['_'.join(col) for col in pivot_both.columns]
pivot_both = pivot_both.reset_index()
print("\nFlattened Columns:")
print(pivot_both)
Performance Tips
# For large datasets, consider:
# 1. Use categorical data types for repeated values
df['category_col'] = df['category_col'].astype('category')
# 2. Specify dtypes when melting
melted = df.melt(id_vars=['id'], var_name='variable', value_name='value')
# 3. Use pivot instead of pivot_table when you don't need aggregation
# (pivot is faster for unique index-column combinations)
# 4. Pre-sort data before pivoting if possible
df_sorted = df.sort_values(['index_col', 'column_col'])
pivoted = df_sorted.pivot(index='index_col', columns='column_col', values='value_col')
Practice Exercises
Exercise 1: Reshape Product Data
Given monthly product sales in wide format:
- Melt to long format
- Pivot to show products as rows, months as columns
- Calculate quarterly totals
Exercise 2: Student Attendance
With daily attendance data:
- Pivot to show students as rows, dates as columns
- Calculate attendance percentage per student
- Identify perfect attendance
Exercise 3: Multi-Region Analysis
Given sales by region and product:
- Create pivot table showing total sales
- Add percentage of total for each cell
- Find top-performing region-product combinations
Sample Solutions
Exercise 1:
# Assuming wide_format has columns: product, Jan, Feb, Mar, ...
# Melt to long
long = wide_format.melt(id_vars=['product'], var_name='month', value_name='sales')
# Pivot back
pivoted = long.pivot(index='product', columns='month', values='sales')
# Calculate quarterly totals
pivoted['Q1'] = pivoted[['Jan', 'Feb', 'Mar']].sum(axis=1)
pivoted['Q2'] = pivoted[['Apr', 'May', 'Jun']].sum(axis=1)
print("Quarterly Sales:")
print(pivoted[['Q1', 'Q2']])
Key Takeaways
- Use
melt()to convert wide format to long format (unpivot) - Use
pivot()to convert long format to wide format (requires unique index-column combinations) - Use
pivot_table()for aggregation while pivoting (handles duplicates) - Use
stack()andunstack()for index-based reshaping - Use
crosstab()for frequency tables - Choose format based on analysis needs: long for plotting/modeling, wide for readability
- Wide format is better for human reading, long format for machine processing
- Remember: most plotting libraries prefer long format data
What's Next?
Now that you can reshape data effectively:
- GroupBy Operations - Aggregate reshaped data
- Data Visualization - Plot your pivoted data
- Statistical Operations - Analyze reshaped data
Resources
- Pandas Reshaping Documentation: pandas.pydata.org/docs/user_guide/reshaping.html
- Tidy Data Principles: Understanding long vs wide format
- Pivot Table Examples: Real-world pivot table use cases
Reshaping is a superpower in data analysis. Master these techniques to transform any dataset into the format you need.
