
DataFrame and Series Basics

Understanding DataFrames and Series is fundamental to mastering pandas. These two data structures form the backbone of all pandas operations. In this comprehensive guide, we'll explore their properties, methods, and best practices for working with structured data.

Understanding Pandas Data Structures

Pandas provides two primary data structures:

  • Series: One-dimensional labeled array
  • DataFrame: Two-dimensional labeled data structure (like a table or spreadsheet)

Both structures are built on top of NumPy arrays, providing labeled axes (indexes) and powerful data manipulation capabilities.
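To make this concrete, here's a minimal sketch (using a throwaway Series) showing that a Series really is a NumPy array plus a labeled index:

```python
import pandas as pd

# A Series wraps a NumPy array (the values) plus an index (the labels)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# .to_numpy() exposes the underlying array; .index exposes the labels
print(type(s.to_numpy()))   # <class 'numpy.ndarray'>
print(list(s.index))        # ['a', 'b', 'c']
```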

Series: The Foundation

A Series is essentially a column of data with an index. Think of it as a single column from a spreadsheet with row labels.

Creating Series

import pandas as pd
import numpy as np

# From a list
temperatures = pd.Series([72, 75, 68, 80])
print(temperatures)
print(f"Type: {type(temperatures)}")

# From a list with custom index
temps_labeled = pd.Series([72, 75, 68, 80], 
                         index=['Mon', 'Tue', 'Wed', 'Thu'])
print(temps_labeled)

# From a dictionary (keys become index)
temp_dict = {'Mon': 72, 'Tue': 75, 'Wed': 68, 'Thu': 80}
temps_from_dict = pd.Series(temp_dict)
print(temps_from_dict)

# From a NumPy array
np_array = np.array([1, 2, 3, 4, 5])
series_from_numpy = pd.Series(np_array)
print(series_from_numpy)

Series Attributes and Properties

# Create a sample series
grades = pd.Series([85, 92, 78, 96, 88], 
                   index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
                   name='Math_Grades')

# Basic properties
print(f"Values: {grades.values}")        # Underlying NumPy array
print(f"Index: {grades.index}")          # Index labels
print(f"Name: {grades.name}")            # Series name
print(f"Shape: {grades.shape}")          # Dimensions
print(f"Size: {grades.size}")            # Number of elements
print(f"Data type: {grades.dtype}")      # Data type
print(f"Memory usage: {grades.memory_usage()}")

Series Operations

# Mathematical operations
print("Original grades:")
print(grades)

print("\nGrades + 5 (curve):")
print(grades + 5)

print("\nGrades * 1.1 (10% bonus):")
print(grades * 1.1)

print("\nSquare root of grades:")
print(np.sqrt(grades))

# Boolean operations
print("\nStudents with grades above 85:")
print(grades[grades > 85])

print("\nDid everyone pass (>= 70)?")
print((grades >= 70).all())

print("\nDid anyone get an A (>= 90)?")
print((grades >= 90).any())

Series Indexing and Selection

# Label-based indexing
print(f"Alice's grade: {grades['Alice']}")
print(f"Bob's grade: {grades.loc['Bob']}")

# Position-based indexing
print(f"First student's grade: {grades.iloc[0]}")
print(f"Last student's grade: {grades.iloc[-1]}")

# Multiple selection
print("Alice and Charlie's grades:")
print(grades[['Alice', 'Charlie']])

# Slice selection
print("First three students:")
print(grades.iloc[0:3])

print("Students Bob through Diana:")
print(grades.loc['Bob':'Diana'])

Series Methods

# Statistical methods
print(f"Mean: {grades.mean():.2f}")
print(f"Median: {grades.median()}")
print(f"Standard deviation: {grades.std():.2f}")
print(f"Min: {grades.min()}")
print(f"Max: {grades.max()}")

# Ranking and sorting
print("\nGrades ranked (1 = highest):")
print(grades.rank(ascending=False))

print("\nGrades sorted (ascending):")
print(grades.sort_values())

print("\nStudents sorted alphabetically:")
print(grades.sort_index())

# Value counts and unique values
subjects = pd.Series(['Math', 'Science', 'Math', 'English', 'Science', 'Math'])
print("\nSubject counts:")
print(subjects.value_counts())
print(f"\nUnique subjects: {subjects.unique()}")
print(f"Number of unique subjects: {subjects.nunique()}")

DataFrame: The Powerhouse

A DataFrame is a two-dimensional labeled data structure with columns that can contain different data types. Think of it as a collection of Series that share the same index.
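You can verify the "collection of Series" idea directly: pulling out any single column gives you a Series, and that Series carries the DataFrame's own index. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]},
                  index=['r1', 'r2', 'r3'])

col = df['x']                      # each column is a Series
print(isinstance(col, pd.Series))  # True
print(col.index.equals(df.index))  # True: the column shares the DataFrame's index
```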

Creating DataFrames

# From a dictionary of lists
student_data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [20, 21, 19, 22, 20],
    'major': ['Math', 'Physics', 'Chemistry', 'Biology', 'Math'],
    'gpa': [3.8, 3.6, 3.9, 3.7, 3.5]
}

df = pd.DataFrame(student_data)
print("DataFrame from dictionary:")
print(df)
print(f"Type: {type(df)}")

# From a list of dictionaries
students_list = [
    {'name': 'Alice', 'age': 20, 'major': 'Math', 'gpa': 3.8},
    {'name': 'Bob', 'age': 21, 'major': 'Physics', 'gpa': 3.6},
    {'name': 'Charlie', 'age': 19, 'major': 'Chemistry', 'gpa': 3.9}
]

df_from_list = pd.DataFrame(students_list)
print("\nDataFrame from list of dictionaries:")
print(df_from_list)

# From a 2D NumPy array
data_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_from_array = pd.DataFrame(data_array, 
                            columns=['A', 'B', 'C'],
                            index=['row1', 'row2', 'row3'])
print("\nDataFrame from NumPy array:")
print(df_from_array)

DataFrame Attributes and Properties

# Basic properties
print(f"Shape: {df.shape}")              # (rows, columns)
print(f"Size: {df.size}")                # Total number of elements
print(f"Columns: {list(df.columns)}")    # Column names
print(f"Index: {list(df.index)}")        # Row index
print(f"Data types:\n{df.dtypes}")       # Data type of each column
print(f"Memory usage:\n{df.memory_usage()}")

# More detailed information
print("\nDetailed info:")
df.info()

DataFrame Structure and Navigation

# Viewing data
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nRandom sample of 2 rows:")
print(df.sample(2))

# Statistical summary
print("\nStatistical summary:")
print(df.describe())

print("\nStatistical summary (including non-numeric):")
print(df.describe(include='all'))

Selecting Data from DataFrames

# Column selection
print("Single column (returns Series):")
names = df['name']
print(f"Type: {type(names)}")
print(names)

print("\nSingle column (returns DataFrame):")
names_df = df[['name']]
print(f"Type: {type(names_df)}")
print(names_df)

print("\nMultiple columns:")
subset = df[['name', 'gpa']]
print(subset)

# Row selection
print("\nFirst row (by position):")
first_student = df.iloc[0]
print(f"Type: {type(first_student)}")
print(first_student)

print("\nFirst 3 rows:")
print(df.iloc[0:3])

print("\nSpecific rows and columns:")
print(df.iloc[1:4, 0:2])  # Rows 1-3, Columns 0-1

Label-Based Selection with .loc

# Set a meaningful index
df_indexed = df.set_index('name')
print("DataFrame with name as index:")
print(df_indexed)

print("\nSelect Alice's data:")
print(df_indexed.loc['Alice'])

print("\nSelect multiple students:")
print(df_indexed.loc[['Alice', 'Charlie']])

print("\nSelect students and specific columns:")
print(df_indexed.loc[['Alice', 'Bob'], ['age', 'gpa']])

# Boolean indexing
print("\nStudents with GPA > 3.7:")
high_achievers = df[df['gpa'] > 3.7]
print(high_achievers)

print("\nMath majors with GPA > 3.6:")
math_students = df[(df['major'] == 'Math') & (df['gpa'] > 3.6)]
print(math_students)

DataFrame Operations

# Adding new columns
df['grade_letter'] = df['gpa'].apply(lambda x: 'A' if x >= 3.7 else 'B' if x >= 3.0 else 'C')
print("DataFrame with letter grades:")
print(df)

# Mathematical operations on columns
df['gpa_scaled'] = df['gpa'] * 100
df['age_next_year'] = df['age'] + 1

print("\nDataFrame with calculated columns:")
print(df[['name', 'age', 'age_next_year', 'gpa', 'gpa_scaled']])

# Conditional operations
df['status'] = np.where(df['age'] >= 21, 'Senior', 'Junior')
print("\nDataFrame with status:")
print(df[['name', 'age', 'status']])

Working with Missing Data

# Create DataFrame with missing values
data_with_na = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, np.nan]
}

df_na = pd.DataFrame(data_with_na)
print("DataFrame with missing values:")
print(df_na)

# Check for missing values
print("\nMissing value check:")
print(df_na.isnull())

print("\nMissing values per column:")
print(df_na.isnull().sum())

print("\nRows with any missing values:")
print(df_na[df_na.isnull().any(axis=1)])

# Handle missing values
print("\nDrop rows with any missing values:")
print(df_na.dropna())

print("\nDrop columns with any missing values:")
print(df_na.dropna(axis=1))

print("\nFill missing values with 0:")
print(df_na.fillna(0))

print("\nFill with column mean:")
print(df_na.fillna(df_na.mean()))

DataFrame Aggregation and Grouping

# Basic aggregation
print("Basic statistics:")
print(f"Mean GPA: {df['gpa'].mean():.2f}")
print(f"Average age: {df['age'].mean():.1f}")
print(f"GPA standard deviation: {df['gpa'].std():.2f}")

# Group by operations
print("\nAverage GPA by major:")
gpa_by_major = df.groupby('major')['gpa'].mean()
print(gpa_by_major)

print("\nMultiple statistics by major:")
major_stats = df.groupby('major').agg({
    'gpa': ['mean', 'std', 'count'],
    'age': ['mean', 'min', 'max']
})
print(major_stats)

# Value counts
print("\nStudents per major:")
print(df['major'].value_counts())

print("\nAge distribution:")
print(df['age'].value_counts().sort_index())

Combining DataFrames and Series

# Create additional data
scholarship_data = pd.Series([1000, 1500, 800, 1200, 900], 
                            index=df.index, 
                            name='scholarship')

# Add Series as new column
df['scholarship'] = scholarship_data
print("DataFrame with scholarship column:")
print(df[['name', 'gpa', 'scholarship']])

# Create another DataFrame to merge
course_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Diana', 'Frank'],
    'course_load': [15, 18, 16, 12],
    'semester': ['Fall', 'Spring', 'Fall', 'Spring']
})

# Merge DataFrames
merged_df = pd.merge(df, course_data, on='name', how='left')
print("\nMerged DataFrame:")
print(merged_df[['name', 'major', 'gpa', 'course_load', 'semester']])

Advanced DataFrame and Series Concepts

MultiIndex (Hierarchical Indexing)

# Create MultiIndex
arrays = [
    ['Math', 'Math', 'Science', 'Science'],
    ['Algebra', 'Calculus', 'Physics', 'Chemistry']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Department', 'Course'])

# Create Series with MultiIndex
enrollment = pd.Series([30, 25, 20, 35], index=index)
print("Series with MultiIndex:")
print(enrollment)

# Access data in MultiIndex
print("\nMath department enrollment:")
print(enrollment['Math'])

print("\nAlgebra enrollment:")
print(enrollment['Math', 'Algebra'])

DataFrame Methods Chaining

# Method chaining for complex operations
result = (df
          .query('age >= 20')                    # Filter rows
          .groupby('major')                      # Group by major
          .agg({'gpa': 'mean', 'age': 'max'})   # Aggregate
          .round(2)                              # Round results
          .sort_values('gpa', ascending=False))  # Sort by GPA

print("Method chaining result:")
print(result)

Performance Considerations

# Efficient data types
print("Original memory usage:")
print(df.memory_usage(deep=True).sum())

# Convert to more efficient types
df_efficient = df.copy()
df_efficient['age'] = df_efficient['age'].astype('int8')
df_efficient['major'] = df_efficient['major'].astype('category')

print("\nOptimized memory usage:")
print(df_efficient.memory_usage(deep=True).sum())

print("\nData types comparison:")
print("Original:", df.dtypes)
print("Optimized:", df_efficient.dtypes)

Best Practices

1. Use Vectorized Operations

# Slow: Using loops
slow_result = []
for index, row in df.iterrows():
    slow_result.append(row['gpa'] * 4.0)

# Fast: Using vectorized operations
fast_result = df['gpa'] * 4.0

2. Chain Operations Efficiently

# Good: Method chaining
result = (df
          .drop_duplicates()
          .query('gpa > 3.5')
          .sort_values('age')
          .reset_index(drop=True))

# Less efficient: Multiple assignments
df_temp1 = df.drop_duplicates()
df_temp2 = df_temp1.query('gpa > 3.5')
df_temp3 = df_temp2.sort_values('age')
result = df_temp3.reset_index(drop=True)

3. Handle Missing Data Appropriately

# Check for missing data first
print("Missing data summary:")
print(df.isnull().sum())

# Choose appropriate strategy
# For small amounts of missing data: drop
clean_df = df.dropna()

# For systematic missing data: fill strategically
filled_df = df.fillna({
    'age': df['age'].median(),
    'gpa': df.groupby('major')['gpa'].transform('mean')
})

Common Patterns and Idioms

Creating Sample Data

# Quick sample DataFrame
def create_sample_df(n=100):
    np.random.seed(42)
    return pd.DataFrame({
        'id': range(1, n+1),
        'value': np.random.randn(n),
        'category': np.random.choice(['A', 'B', 'C'], n),
        'date': pd.date_range('2023-01-01', periods=n)
    })

sample_df = create_sample_df(20)
print(sample_df.head())

Conditional Column Creation

# Multiple conditions
def classify_student(row):
    if row['gpa'] >= 3.8:
        return 'Excellent'
    elif row['gpa'] >= 3.5:
        return 'Good'
    elif row['gpa'] >= 3.0:
        return 'Average'
    else:
        return 'Below Average'

df['performance'] = df.apply(classify_student, axis=1)
print(df[['name', 'gpa', 'performance']])

Safe Data Access

# Safe dictionary-like access
print("Safe access to potentially missing key:")
print(df.get('nonexistent_column', 'Default Value'))

# Boolean masks never raise; they return an empty DataFrame, so check .empty
student_data = df.loc[df['name'] == 'NonExistentStudent']
if student_data.empty:
    print("Student not found")
else:
    print(student_data)

# Label lookups, in contrast, raise KeyError for missing labels
try:
    print(df_indexed.loc['NonExistentStudent'])
except KeyError as e:
    print(f"Key error: {e}")

Quick Reference

Series Methods

| Method           | Purpose             | Example                       |
|------------------|---------------------|-------------------------------|
| .head(n)         | First n values      | series.head(5)                |
| .value_counts()  | Count unique values | series.value_counts()         |
| .unique()        | Get unique values   | series.unique()               |
| .sort_values()   | Sort by values      | series.sort_values()          |
| .apply(func)     | Apply function      | series.apply(lambda x: x*2)   |
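A quick runnable demonstration of these Series methods, on a throwaway example series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a'])

print(s.head(2).tolist())           # ['a', 'b']
print(s.value_counts()['a'])        # 3
print(sorted(s.unique()))           # ['a', 'b', 'c']
print(s.sort_values().tolist())     # ['a', 'a', 'a', 'b', 'c']
print(s.apply(str.upper).tolist())  # ['A', 'B', 'A', 'C', 'A']
```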

DataFrame Methods

| Method       | Purpose             | Example               |
|--------------|---------------------|-----------------------|
| .info()      | Data summary        | df.info()             |
| .describe()  | Statistical summary | df.describe()         |
| .groupby()   | Group data          | df.groupby('column')  |
| .merge()     | Join DataFrames     | pd.merge(df1, df2)    |
| .query()     | Filter with string  | df.query('age > 20')  |
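And a minimal sketch of the two methods from this table that are easiest to get wrong, `merge` and `query`, using tiny made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
right = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})

# pd.merge defaults to an inner join: only keys present in both survive
merged = pd.merge(left, right, on='key')
print(merged.to_dict('records'))  # [{'key': 2, 'a': 'y', 'b': 'p'}]

# query filters rows with a string expression
filtered = left.query('key > 1')
print(len(filtered))              # 1
```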

Next Steps

Now that you understand DataFrames and Series fundamentals, explore these related topics:

  1. Data Inspection - Advanced techniques for exploring your data
  2. Selection and Filtering - Master data selection techniques
  3. Data Cleaning - Clean and prepare messy data

DataFrames and Series are the foundation of pandas. Master these concepts, and you'll be able to handle any data analysis task with confidence.
