Getting Started with Pandas

Welcome to your first steps with pandas! This guide will help you install pandas, understand its basic concepts, and write your first pandas code. By the end of this tutorial, you'll be comfortable creating and manipulating DataFrames and Series.

Installation

Method 1: Using pip (Recommended)

# Install pandas with essential dependencies
pip install pandas

# Install with all optional dependencies for full functionality
# (the quotes stop shells such as zsh from expanding the brackets)
pip install "pandas[all]"

# For working with Excel files
pip install pandas openpyxl xlrd

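# Stata .dta files work out of the box (pd.read_stata needs no extra package);
# pyreadstat is only required for SPSS files (pd.read_spss)
pip install pandas pyreadstat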

Method 2: Using conda

# Install pandas via conda
conda install pandas

# Install from conda-forge (often more up-to-date)
conda install -c conda-forge pandas

Method 3: Using package managers

# On Ubuntu/Debian
sudo apt-get install python3-pandas

# On macOS with Homebrew
brew install python
pip3 install pandas

Verifying Installation

import pandas as pd
print(f"Pandas version: {pd.__version__}")

# Check if installation is working
df = pd.DataFrame({'test': [1, 2, 3]})
print(df)

Essential Imports

Always start your pandas scripts with these imports:

import pandas as pd
import numpy as np  # Often used together with pandas

# Optional but useful
import matplotlib.pyplot as plt  # For basic plotting

The pd alias is the standard convention used throughout the pandas community.

Your First DataFrame

Creating DataFrames from Dictionaries

# Simple dictionary to DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Tokyo', 'Paris']
}

df = pd.DataFrame(data)
print(df)

Output:

      name  age      city
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Tokyo
3    Diana   28     Paris

Creating DataFrames from Lists

# From list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'London'],
    ['Charlie', 35, 'Tokyo'],
    ['Diana', 28, 'Paris']
]

df = pd.DataFrame(data, columns=['name', 'age', 'city'])
print(df)

Creating DataFrames with Custom Index

data = {
    'temperature': [72, 75, 68, 80],
    'humidity': [65, 70, 80, 60]
}

df = pd.DataFrame(data, index=['Monday', 'Tuesday', 'Wednesday', 'Thursday'])
print(df)
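
A custom index pays off when you look rows up by label. The sketch below rebuilds the same weather data under a hypothetical name, weather, so that it runs on its own, and uses .loc for label-based access:

```python
import pandas as pd

# Same weather data as above, indexed by weekday name
weather = pd.DataFrame(
    {"temperature": [72, 75, 68, 80], "humidity": [65, 70, 80, 60]},
    index=["Monday", "Tuesday", "Wednesday", "Thursday"],
)

# Look up one row by its label (returns a Series)
tuesday = weather.loc["Tuesday"]
print(tuesday["temperature"])  # 75

# Look up a single cell by row label and column name
wed_humidity = weather.loc["Wednesday", "humidity"]
print(wed_humidity)  # 80

# Label slices include both endpoints, unlike positional slices
early_week = weather.loc["Monday":"Wednesday"]
print(len(early_week))  # 3
```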

Your First Series

A Series is a one-dimensional labeled array:

# Creating a Series from a list
temperatures = pd.Series([72, 75, 68, 80])
print(temperatures)

# Creating a Series with custom index
temperatures = pd.Series([72, 75, 68, 80], 
                        index=['Mon', 'Tue', 'Wed', 'Thu'])
print(temperatures)

# Creating a Series from a dictionary
temp_dict = {'Mon': 72, 'Tue': 75, 'Wed': 68, 'Thu': 80}
temperatures = pd.Series(temp_dict)
print(temperatures)
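
The labels are more than decoration: arithmetic between two Series lines values up by label rather than by position. A small sketch with made-up morning and afternoon readings (both Series are illustrative, not part of the data above):

```python
import pandas as pd

# Two Series that only partly share labels
morning = pd.Series({"Mon": 60, "Tue": 62, "Wed": 58})
afternoon = pd.Series({"Tue": 75, "Wed": 68, "Thu": 80})

# Subtraction matches values by label; labels present on only
# one side produce NaN in the result
swing = afternoon - morning
print(swing)
```

Here only Tue and Wed appear in both Series, so only those two labels get a numeric result.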

Basic DataFrame Operations

Viewing Your Data

# First and last rows
print(df.head())        # First 5 rows (default)
print(df.head(3))       # First 3 rows
print(df.tail(2))       # Last 2 rows

# Random sampling
print(df.sample(2))     # 2 random rows
print(df.sample(frac=0.5))  # 50% of the data randomly

Getting Information About Your Data

# Basic information
df.info()               # Prints data types, non-null counts, memory usage (returns None, so no print() needed)
print(df.shape)         # Dimensions: (rows, columns)
print(df.columns)       # Column names
print(df.index)         # Row index
print(df.dtypes)        # Data types of each column

# Statistical summary
print(df.describe())    # For numeric columns
print(df.describe(include='all'))  # For all columns

Selecting Data

# Select a single column (returns a Series)
ages = df['age']
print(type(ages))       # <class 'pandas.core.series.Series'>

# Select multiple columns (returns a DataFrame)
subset = df[['name', 'age']]
print(type(subset))     # <class 'pandas.core.frame.DataFrame'>

# Select rows by index
first_row = df.iloc[0]          # First row
first_three = df.iloc[0:3]      # First three rows
last_row = df.iloc[-1]          # Last row
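
.iloc selects by position, while .loc selects by index label. With the default 0, 1, 2, ... index the two look alike, so this sketch uses a hypothetical DataFrame, people, whose labels deliberately differ from the row positions:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie", "Diana"], "age": [25, 30, 35, 28]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

by_position = people.iloc[0]   # first row, whatever its label is
by_label = people.loc[10]      # row whose index label is 10

pos_slice = people.iloc[0:2]     # positions 0 and 1 (end excluded)
label_slice = people.loc[10:30]  # labels 10 through 30 (end included)

print(len(pos_slice), len(label_slice))  # 2 3
```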

Basic Filtering

# Simple conditions
young_people = df[df['age'] < 30]
print(young_people)

# Multiple conditions
young_in_europe = df[(df['age'] < 30) & (df['city'].isin(['London', 'Paris']))]
print(young_in_europe)

# String operations
names_with_a = df[df['name'].str.contains('a', case=False)]
print(names_with_a)
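
Conditions can also be combined with | (or) and negated with ~ (not), and DataFrame.query() accepts the same kind of filter as a string. A short sketch that rebuilds the people data from earlier so it runs on its own (the result variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "London", "Tokyo", "Paris"],
})

# OR: younger than 28, or living in Tokyo
young_or_tokyo = df[(df["age"] < 28) | (df["city"] == "Tokyo")]

# NOT: everyone outside London
not_london = df[~(df["city"] == "London")]

# query() takes the condition as a string expression
young_in_europe = df.query("age < 30 and city in ['London', 'Paris']")

print(young_or_tokyo["name"].tolist())   # ['Alice', 'Charlie']
print(not_london["name"].tolist())       # ['Alice', 'Charlie', 'Diana']
print(young_in_europe["name"].tolist())  # ['Diana']
```

Each condition needs its own parentheses because & and | bind more tightly than comparison operators in Python.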

Working with Real Data

Reading from Files

# CSV files (most common)
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('data.xlsx')
df = pd.read_excel('data.xlsx', sheet_name='Sheet2')  # Specific sheet

# Stata files
df = pd.read_stata('data.dta')

# JSON files
df = pd.read_json('data.json')

# Reading with custom options
df = pd.read_csv('data.csv', 
                 sep=';',           # Different separator
                 header=0,          # Row to use as column names
                 index_col=0,       # Column to use as row index
                 na_values=['N/A', 'NULL'])  # Custom missing values

Saving Data

# Save to CSV
df.to_csv('output.csv', index=False)  # index=False excludes row numbers

# Save to Excel
df.to_excel('output.xlsx', index=False)

# Save to Stata
df.to_stata('output.dta')

# Save to JSON
df.to_json('output.json')
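
To see what index=False changes, it helps to look at the text that to_csv produces. This sketch writes to a string instead of a file (to_csv returns the CSV text when no path is given) and reads it back through io.StringIO; the tiny two-row DataFrame is only for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# With no path argument, to_csv returns the CSV text
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,age
print(without_index.splitlines()[0])  # name,age

# Reading the index-free text back gives the same columns and values
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

The leading comma in the first header is the unnamed index column, which would come back as an extra "Unnamed: 0" column on the next read.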

Common Beginner Tasks

Adding New Columns

# Simple calculation
df['age_in_months'] = df['age'] * 12

# Conditional values
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Older')

# Using numpy where for conditions
df['generation'] = np.where(df['age'] < 30, 'Millennial', 'Gen X')

Modifying Existing Data

# Update all values in a column
df['name'] = df['name'].str.upper()

# Update specific rows
df.loc[df['age'] > 30, 'status'] = 'Senior'

# Apply functions
df['age'] = df['age'].apply(lambda x: x + 1)  # Everyone ages one year

Handling Missing Data

# Check for missing values
print(df.isnull().sum())

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing values
df_filled = df.fillna(0)  # Fill with 0
df_filled = df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their mean (numeric_only=True skips text columns)
df_filled = df.fillna({'age': 0, 'name': 'Unknown'})  # Different values per column
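
Here is a small worked example of those calls on data that actually contains gaps. The DataFrame is made up for illustration, with one missing name and one missing age:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Diana"],
    "age": [25, np.nan, 35, 30],
})

# Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# dropna keeps only the rows with no missing values
complete_rows = df.dropna()
print(len(complete_rows))  # 2

# Fill each column with a value that suits its type
filled = df.fillna({"age": df["age"].mean(), "name": "Unknown"})
print(filled)
```

The mean ignores the missing entry, so Bob's age is filled with (25 + 35 + 30) / 3 = 30.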

Common Patterns and Best Practices

Method Chaining

# Instead of multiple steps
df_temp = df[df['age'] > 25]
df_temp = df_temp[['name', 'age']]
df_result = df_temp.sort_values('age')

# Use method chaining
df_result = (
    df.loc[df['age'] > 25, ['name', 'age']]  # Filter rows and pick columns in one step
      .sort_values('age')
)

Memory-Efficient Data Types

# Check memory usage
print(df.memory_usage(deep=True))

# Convert to more efficient types
df['age'] = df['age'].astype('int8')  # Only if every age fits in int8 (-128 to 127)
df['city'] = df['city'].astype('category')  # For repeated string values
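
How much the category type saves depends on how often values repeat. A rough way to check is to measure before and after; the Series below is synthetic (four city names repeated 2,500 times), so treat the numbers as an illustration rather than a benchmark:

```python
import numpy as np
import pandas as pd

# 10,000 strings drawn from only four distinct values
cities = pd.Series(["New York", "London", "Tokyo", "Paris"] * 2500)
as_category = cities.astype("category")

before = cities.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(f"strings: {before} bytes, category: {after} bytes")

# int8 can only hold values in this range
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```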

Safe Operations

# Use .copy() to avoid modifying original data
df_modified = df.copy()
df_modified['new_column'] = df_modified['age'] * 2

# Use .loc and .iloc for indexing
df.loc[0, 'name']           # Safe way to access single value
df.iloc[0:3, 1:3]           # Safe way to slice
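
A quick way to convince yourself that .copy() protects the source data is to change the copy and then inspect the original. The names original and working are placeholders for this sketch:

```python
import pandas as pd

original = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Work on an independent copy
working = original.copy()
working["age_doubled"] = working["age"] * 2
working.loc[0, "age"] = 99

# The original keeps its columns and values
print(list(original.columns))   # ['name', 'age']
print(original.loc[0, "age"])   # 25
```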

Quick Reference: Essential Methods

Method        Purpose                  Example
head()        View first n rows        df.head(5)
tail()        View last n rows         df.tail(3)
info()        Data types and memory    df.info()
describe()    Statistical summary      df.describe()
shape         Dimensions               df.shape
columns       Column names             df.columns
dtypes        Data types               df.dtypes
isnull()      Check missing values     df.isnull().sum()
dropna()      Remove missing values    df.dropna()
fillna()      Fill missing values      df.fillna(0)

First Project: Analyzing Sample Data

Let's put it all together with a mini-project:

import pandas as pd
import numpy as np

# Create sample sales data
np.random.seed(42)  # For reproducible results
data = {
    'product': ['A', 'B', 'C', 'D'] * 25,
    'sales': np.random.randint(100, 1000, 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'month': np.random.choice(['Jan', 'Feb', 'Mar'], 100)
}

df = pd.DataFrame(data)

# Basic analysis
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst 5 rows:")
print(df.head())

print("\nSales statistics:")
print(df['sales'].describe())

print("\nSales by product:")
print(df.groupby('product')['sales'].mean())

print("\nSales by region:")
print(df.groupby('region')['sales'].sum())

# Save results
df.to_csv('sample_sales_analysis.csv', index=False)
print("\nResults saved to 'sample_sales_analysis.csv'")

Troubleshooting Common Issues

ModuleNotFoundError: No module named 'pandas'

# Install pandas into the same interpreter you run your scripts with
python -m pip install pandas

# Check if you're using the right Python environment
which python
python -m pip list | grep pandas

Memory Errors with Large Files

# Read in chunks for large files
chunk_list = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Process each chunk
    processed_chunk = chunk[chunk['column'] > 0]
    chunk_list.append(processed_chunk)

# Combine all chunks
df = pd.concat(chunk_list, ignore_index=True)
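
You can try the chunking pattern without a large file on disk. This sketch feeds read_csv a ten-row CSV held in memory through io.StringIO; the column name value and the chunk size of 4 are arbitrary choices for the demo:

```python
import io
import pandas as pd

# Stand-in for a large file: ten rows of CSV text, values -3 to 6
csv_text = "value\n" + "\n".join(str(n) for n in range(-3, 7))

kept = []
running_total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    positive = chunk[chunk["value"] > 0]   # Filter each chunk on its own
    running_total += int(positive["value"].sum())
    kept.append(positive)

result = pd.concat(kept, ignore_index=True)
print(len(kept))                  # 3 chunks: 4 + 4 + 2 rows
print(result["value"].tolist())   # [1, 2, 3, 4, 5, 6]
print(running_total)              # 21
```

If you only need an aggregate such as running_total, you can skip the list and the final concat entirely, which keeps memory use bounded by the chunk size.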

Performance Issues

# Use efficient data types
df = pd.read_csv('data.csv', dtype={'id': 'int32', 'category': 'category'})

# Use vectorized operations instead of loops
# Slow: using loops
results = []
for index, row in df.iterrows():
    results.append(row['value'] * 2)

# Fast: using vectorization
results = df['value'] * 2
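
To see the difference on your own machine, you can time both versions. This is a rough sketch using time.perf_counter on a synthetic 20,000-row column; exact timings vary by machine and pandas version, so no specific numbers are claimed here:

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(20_000)})

# Row-by-row loop
start = time.perf_counter()
looped = [row["value"] * 2 for _, row in df.iterrows()]
loop_seconds = time.perf_counter() - start

# Vectorized: one operation on the whole column
start = time.perf_counter()
vectorized = df["value"] * 2
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")

# Both approaches compute the same values
print(vectorized.tolist() == looped)  # True
```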

Next Steps

Congratulations! You now know the basics of pandas. Here's what to learn next:

  1. DataFrame and Series Basics - Deep dive into data structures
  2. Data Inspection - Advanced techniques for exploring data
  3. Working with Stata Files - Special focus on Stata data import/export

Practice Exercises

  1. Create a DataFrame with your favorite movies, including title, year, and rating
  2. Load a CSV file from the internet using pd.read_csv('https://...')
  3. Calculate summary statistics for a numeric column
  4. Filter data based on multiple conditions
  5. Add a new calculated column to your DataFrame


Now that you have pandas installed and understand the basics, you're ready to start working with real data. The next tutorials will build on these fundamentals to help you become proficient in data analysis with Python.
