
Selecting and Filtering Data in Pandas

One of the most fundamental skills in data analysis is selecting the right data to work with. Pandas provides multiple powerful methods for selecting columns, filtering rows, and accessing specific elements in your DataFrames. Mastering these techniques will make your data analysis workflow much more efficient.

Why Selection and Filtering Matter

Before you can analyze data, you need to:

  • Extract specific columns or rows
  • Filter data based on conditions
  • Select subsets for further analysis
  • Remove irrelevant data
  • Focus on data that meets certain criteria
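Each of these tasks maps to a short pandas expression. As a quick preview (using a small throwaway DataFrame; each technique is explained in detail below):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'age': [25, 41],
                   'notes': ['x', 'y']})

cols = df[['name', 'age']]            # extract specific columns
adults = df[df['age'] > 30]           # filter rows on a condition
subset = df.head(1)                   # select a subset for further analysis
trimmed = df.drop(columns=['notes'])  # remove irrelevant data
print(adults)
```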

Basic Column Selection

Select Single Column

import pandas as pd
import numpy as np

# Create sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'city': ['New York', 'London', 'Tokyo', 'Paris', 'Berlin'],
    'salary': [50000, 60000, 75000, 55000, 68000],
    'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Select single column (returns Series)
names = df['name']
print("\nNames column (Series):")
print(type(names))
print(names)

# Alternative: dot notation -- convenient, but it only works when the
# column name is a valid identifier and doesn't shadow a DataFrame attribute
ages = df.age
print("\nAges (dot notation):")
print(ages.head())
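Dot notation breaks down in ways bracket notation doesn't. A small sketch (with a hypothetical frame, not the one above) showing why brackets are the safe default:

```python
import pandas as pd

# Hypothetical columns: one with a space, one that shadows a DataFrame method
df2 = pd.DataFrame({'first name': ['Alice', 'Bob'], 'count': [1, 2]})

print(df2['first name'])   # bracket notation handles any column name

# df2.first name would be a SyntaxError: spaces aren't valid attribute names
# df2.count returns the built-in DataFrame.count method, not the column
print(type(df2.count))
```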

Select Multiple Columns

# Select multiple columns (returns DataFrame)
subset = df[['name', 'salary', 'department']]
print("\nMultiple columns:")
print(subset)

# Reorder columns while selecting
reordered = df[['salary', 'name', 'age']]
print("\nReordered columns:")
print(reordered)

# Select all but certain columns
all_but_city = df.drop('city', axis=1)
print("\nAll columns except city:")
print(all_but_city)

# Drop multiple columns
minimal = df.drop(['city', 'department'], axis=1)
print("\nDropped multiple columns:")
print(minimal)

Row Selection by Position

Using iloc (Integer Location)

# Select first row
first_row = df.iloc[0]
print("First row:")
print(first_row)

# Select multiple rows by position
first_three = df.iloc[0:3]
print("\nFirst three rows:")
print(first_three)

# Select specific rows
specific = df.iloc[[0, 2, 4]]
print("\nRows 0, 2, and 4:")
print(specific)

# Select last n rows
last_two = df.iloc[-2:]
print("\nLast two rows:")
print(last_two)

# Select rows and columns by position
subset = df.iloc[0:3, 0:2]  # First 3 rows, first 2 columns
print("\nFirst 3 rows, first 2 columns:")
print(subset)

# Select non-consecutive rows and columns
custom = df.iloc[[0, 2], [0, 3]]  # Rows 0,2 and columns 0,3
print("\nCustom row and column selection:")
print(custom)

Row Selection by Label

Using loc (Label Location)

# Set a meaningful index
df_indexed = df.set_index('name')
print("DataFrame with name as index:")
print(df_indexed)

# Select by index label
alice = df_indexed.loc['Alice']
print("\nAlice's data:")
print(alice)

# Select multiple rows by label
subset = df_indexed.loc[['Alice', 'Charlie', 'Eve']]
print("\nMultiple employees:")
print(subset)

# Select rows and columns by label
bob_salary = df_indexed.loc['Bob', 'salary']
print(f"\nBob's salary: ${bob_salary:,}")

# Select range of rows and specific columns
selected = df_indexed.loc['Alice':'Diana', ['age', 'salary']]
print("\nAge and salary for Alice through Diana:")
print(selected)

Boolean Filtering (Conditional Selection)

Simple Conditions

# Filter rows where age > 30
older_than_30 = df[df['age'] > 30]
print("Employees older than 30:")
print(older_than_30)

# Filter by exact match
engineers = df[df['department'] == 'Engineering']
print("\nEngineers:")
print(engineers)

# Filter by salary range
mid_salary = df[(df['salary'] >= 55000) & (df['salary'] <= 65000)]
print("\nMid-salary range ($55k-$65k):")
print(mid_salary)

Multiple Conditions

# AND condition: Both conditions must be True
# Use & operator and parentheses
young_engineers = df[(df['age'] < 35) & (df['department'] == 'Engineering')]
print("Young engineers:")
print(young_engineers)

# OR condition: At least one condition must be True
# Use | operator
sales_or_marketing = df[(df['department'] == 'Sales') | (df['department'] == 'Marketing')]
print("\nSales or Marketing employees:")
print(sales_or_marketing)

# NOT condition: Negate with ~
not_engineering = df[~(df['department'] == 'Engineering')]
print("\nNon-engineering employees:")
print(not_engineering)

# Complex conditions
complex_filter = df[
    ((df['age'] >= 30) & (df['salary'] > 60000)) |
    (df['department'] == 'Sales')
]
print("\nComplex filter (age>=30 AND salary>60k) OR Sales:")
print(complex_filter)

String Filtering

# Contains substring
names_with_e = df[df['name'].str.contains('e', case=False)]
print("Names containing 'e':")
print(names_with_e)

# Starts with
names_start_a = df[df['name'].str.startswith('A')]
print("\nNames starting with 'A':")
print(names_start_a)

# Ends with
cities_end_n = df[df['city'].str.endswith('n')]
print("\nCities ending with 'n':")
print(cities_end_n)

# String length
long_names = df[df['name'].str.len() > 5]
print("\nLong names (>5 characters):")
print(long_names)

# Pattern matching with regex (.str.match takes the pattern directly,
# so no separate "import re" is needed)
pattern_match = df[df['name'].str.match(r'^[A-C]')]  # Starts with A, B, or C
print("\nNames starting with A, B, or C:")
print(pattern_match)

Using isin() for Multiple Values

# Filter by multiple specific values
selected_cities = df[df['city'].isin(['New York', 'Tokyo', 'Paris'])]
print("Selected cities:")
print(selected_cities)

# Filter by multiple departments
tech_depts = df[df['department'].isin(['Engineering', 'IT', 'Data Science'])]
print("\nTech departments:")
print(tech_depts)

# Exclude specific values (negation)
not_these_cities = df[~df['city'].isin(['London'])]
print("\nAll cities except London:")
print(not_these_cities)

Filtering with query()

The query() method provides a more readable syntax for filtering:

# Simple query
result = df.query('age > 30')
print("Query: age > 30")
print(result)

# Multiple conditions
result = df.query('age > 28 and salary < 70000')
print("\nQuery: age > 28 and salary < 70000")
print(result)

# Using variables in query
min_age = 30
result = df.query('age >= @min_age')
print(f"\nQuery with variable: age >= {min_age}")
print(result)

# String matching in query
result = df.query('department == "Engineering"')
print("\nQuery: department == Engineering")
print(result)

# IN operator
result = df.query('city in ["New York", "Paris"]')
print("\nQuery: city in list")
print(result)

Selecting with Conditions on Multiple Columns

# Create more complex dataset
np.random.seed(42)
sales_data = pd.DataFrame({
    'salesperson': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'] * 4,
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q1', 
                'Q2', 'Q2', 'Q2', 'Q2', 'Q2',
                'Q3', 'Q3', 'Q3', 'Q3', 'Q3',
                'Q4', 'Q4', 'Q4', 'Q4', 'Q4'],
    'sales': np.random.randint(10000, 50000, 20),
    'expenses': np.random.randint(2000, 10000, 20)
})

sales_data['profit'] = sales_data['sales'] - sales_data['expenses']

print("Sales Data:")
print(sales_data.head(10))

# Filter high-profit quarters
high_profit = sales_data[sales_data['profit'] > 30000]
print("\nHigh profit quarters:")
print(high_profit)

# Filter by salesperson and performance
alice_good = sales_data[
    (sales_data['salesperson'] == 'Alice') & 
    (sales_data['profit'] > 20000)
]
print("\nAlice's high-performing quarters:")
print(alice_good)

# Top 2 performers in each quarter (sort, then take the first 2 per group;
# avoids the groupby().apply() deprecation warnings in newer pandas)
top_per_quarter = (sales_data.sort_values('profit', ascending=False)
                   .groupby('quarter').head(2))
print("\nTop 2 performers per quarter:")
print(top_per_quarter[['salesperson', 'quarter', 'profit']])

Filtering with Missing Values

# Create data with missing values
df_with_na = df.copy()
df_with_na.loc[1, 'salary'] = np.nan
df_with_na.loc[3, 'age'] = np.nan
df_with_na.loc[4, 'city'] = np.nan

print("Data with missing values:")
print(df_with_na)

# Find rows with any missing values
rows_with_na = df_with_na[df_with_na.isnull().any(axis=1)]
print("\nRows with any missing values:")
print(rows_with_na)

# Find rows with specific column missing
missing_salary = df_with_na[df_with_na['salary'].isnull()]
print("\nRows with missing salary:")
print(missing_salary)

# Find rows with no missing values
complete_rows = df_with_na[df_with_na.notnull().all(axis=1)]
print("\nComplete rows:")
print(complete_rows)

# Filter out rows with missing values in specific columns
has_salary = df_with_na[df_with_na['salary'].notnull()]
print("\nRows with salary data:")
print(has_salary)

Advanced Selection Techniques

Select dtypes

# Select only numeric columns
numeric_cols = df.select_dtypes(include=[np.number])
print("Numeric columns:")
print(numeric_cols)

# Select only object (string) columns
string_cols = df.select_dtypes(include=['object'])
print("\nString columns:")
print(string_cols)

# Exclude certain types
no_strings = df.select_dtypes(exclude=['object'])
print("\nExcluding string columns:")
print(no_strings)

Filter by Column Names

# Select columns matching pattern
salary_age = df.filter(regex='(age|salary)')
print("Columns matching pattern:")
print(salary_age)

# Select columns containing substring
name_cols = df.filter(like='name')
print("\nColumns containing 'name':")
print(name_cols)

# Select columns by list of names
specific = df.filter(items=['name', 'salary'])
print("\nSpecific columns:")
print(specific)

Sample Random Rows

# Random sample
random_3 = df.sample(n=3, random_state=42)
print("Random 3 rows:")
print(random_3)

# Random percentage
random_pct = df.sample(frac=0.4, random_state=42)
print(f"\n40% random sample ({len(random_pct)} rows):")
print(random_pct)

# Sample with replacement
with_replacement = df.sample(n=10, replace=True, random_state=42)
print(f"\nSample with replacement ({len(with_replacement)} rows):")
print(with_replacement)

Practical Examples

Example 1: Customer Segmentation

# Customer data
customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 150000, 100),
    'purchases': np.random.randint(0, 50, 100),
    'loyalty_years': np.random.randint(0, 10, 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})

# High-value customers
high_value = customers[
    (customers['income'] > 75000) &
    (customers['purchases'] > 20) &
    (customers['loyalty_years'] >= 3)
]

print(f"High-value customers: {len(high_value)}")
print(high_value.head())

# At-risk customers (long-time customers with low recent activity)
at_risk = customers[
    (customers['loyalty_years'] >= 5) &
    (customers['purchases'] < 10)
]

print(f"\nAt-risk customers: {len(at_risk)}")
print(at_risk.head())

# Regional breakdown of young high-earners
young_high_earners = customers[
    (customers['age'] < 35) &
    (customers['income'] > 80000)
]

regional_breakdown = young_high_earners.groupby('region').size()
print("\nYoung high-earners by region:")
print(regional_breakdown)

Example 2: Sales Performance Analysis

# Sales transactions
transactions = pd.DataFrame({
    'transaction_id': range(1, 1001),
    'date': pd.date_range('2024-01-01', periods=1000, freq='h'),  # hourly; 'H' is deprecated
    'product': np.random.choice(['A', 'B', 'C', 'D'], 1000),
    'amount': np.random.uniform(10, 500, 1000),
    'customer_type': np.random.choice(['New', 'Returning', 'VIP'], 1000, p=[0.3, 0.5, 0.2]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 1000)
})

# Large transactions
large_transactions = transactions[transactions['amount'] > 400]
print(f"Large transactions (>$400): {len(large_transactions)}")

# VIP customer analysis
vip_sales = transactions[transactions['customer_type'] == 'VIP']
vip_stats = vip_sales['amount'].agg(['sum', 'mean', 'count'])
print("\nVIP Customer Statistics:")
print(vip_stats)

# Product performance by region
product_a_north = transactions[
    (transactions['product'] == 'A') &
    (transactions['region'] == 'North')
]

print(f"\nProduct A sales in North: {len(product_a_north)} transactions")
print(f"Total revenue: ${product_a_north['amount'].sum():,.2f}")

# Find outliers (transactions > 3 standard deviations)
mean_amount = transactions['amount'].mean()
std_amount = transactions['amount'].std()
outliers = transactions[
    (transactions['amount'] > mean_amount + 3*std_amount) |
    (transactions['amount'] < mean_amount - 3*std_amount)
]

print(f"\nOutlier transactions: {len(outliers)}")

Example 3: Time-Based Filtering

# Event data with timestamps
events = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000, freq='h'),  # hourly; 'H' is deprecated
    'event_type': np.random.choice(['Click', 'View', 'Purchase'], 1000, p=[0.6, 0.3, 0.1]),
    'user_id': np.random.randint(1, 101, 1000),
    'value': np.random.uniform(0, 100, 1000)
})

# Filter by date range
january = events[
    (events['timestamp'] >= '2024-01-01') &
    (events['timestamp'] < '2024-02-01')
]
print(f"January events: {len(january)}")

# Business hours only (9 AM - 5 PM)
events['hour'] = events['timestamp'].dt.hour
business_hours = events[
    (events['hour'] >= 9) &
    (events['hour'] < 17)
]
print(f"\nBusiness hours events: {len(business_hours)}")

# Weekend activity (a boolean column can be used directly; no "== True" needed)
events['is_weekend'] = events['timestamp'].dt.dayofweek >= 5
weekend_purchases = events[
    events['is_weekend'] &
    (events['event_type'] == 'Purchase')
]
print(f"\nWeekend purchases: {len(weekend_purchases)}")

# Recent high-value purchases
recent_high_value = events[
    (events['timestamp'] >= events['timestamp'].max() - pd.Timedelta(days=7)) &
    (events['event_type'] == 'Purchase') &
    (events['value'] > 75)
]
print(f"\nRecent high-value purchases: {len(recent_high_value)}")

Performance Tips

Efficient Filtering

# Use boolean indexing instead of iterating
# SLOW: Iterate through rows
# filtered = []
# for idx, row in df.iterrows():
#     if row['age'] > 30:
#         filtered.append(row)

# FAST: Boolean indexing
filtered = df[df['age'] > 30]

# Use query() for complex conditions (can be faster)
result = df.query('age > 30 and salary > 60000')

# Combine conditions into one mask instead of chaining filters;
# chaining repeats the indexing work, and masks built from the original
# df no longer align with the already-filtered frame
mask = (df['age'] > 30) & (df['salary'] > 60000) & (df['department'] == 'Engineering')
result = df[mask]
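Whether query() is actually faster depends on data size, machine, and pandas version (it benefits from the optional numexpr package), so measure rather than assume. A rough benchmark sketch with made-up data:

```python
import timeit

import numpy as np
import pandas as pd

# One million rows of synthetic data for a rough comparison
rng = np.random.default_rng(0)
big = pd.DataFrame({'age': rng.integers(18, 70, 1_000_000),
                    'salary': rng.integers(30_000, 120_000, 1_000_000)})

t_bool = timeit.timeit(lambda: big[(big['age'] > 30) & (big['salary'] > 60000)], number=10)
t_query = timeit.timeit(lambda: big.query('age > 30 and salary > 60000'), number=10)
print(f"boolean indexing: {t_bool:.3f}s  query(): {t_query:.3f}s")
```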

Memory-Efficient Selection

# Select only needed columns early
# INEFFICIENT: Load all columns then filter
# full_df = pd.read_csv('large_file.csv')
# subset = full_df[['name', 'age']]

# EFFICIENT: Load only needed columns
# subset = pd.read_csv('large_file.csv', usecols=['name', 'age'])

# Use categorical for repeated string values
df['department'] = df['department'].astype('category')
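The savings from the category dtype come from storing each distinct string only once. A small sketch (with made-up repeated values) to see the effect:

```python
import pandas as pd

# Columns with few distinct, heavily repeated values are where 'category' pays off
s = pd.Series(['Sales', 'Engineering', 'Marketing'] * 10_000)
before = s.memory_usage(deep=True)
after = s.astype('category').memory_usage(deep=True)
print(f"object: {before:,} bytes -> category: {after:,} bytes")
```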

Stata Commands Comparison

For Stata users, here's how pandas selection relates to Stata:

# Stata: keep if age > 30
df[df['age'] > 30]

# Stata: keep if age > 30 & salary > 60000
df[(df['age'] > 30) & (df['salary'] > 60000)]

# Stata: keep name age salary
df[['name', 'age', 'salary']]

# Stata: drop if missing(salary)
df[df['salary'].notnull()]

# Stata: keep if inlist(city, "New York", "Paris")
df[df['city'].isin(['New York', 'Paris'])]

# Stata: keep in 1/10
df.iloc[0:10]

# Stata: gsort -salary
df.sort_values('salary', ascending=False)

Practice Exercises

Exercise 1: Employee Analysis

Given employee data:

  • Find all senior employees (age > 40) with salaries above $80k
  • List employees in Sales or Marketing departments
  • Identify employees whose names start with vowels

Exercise 2: Sales Filtering

With transaction data:

  • Find all purchases above $1000 made on weekends
  • Identify returning customers with total purchases > $5000
  • Filter transactions from Q1 2024 in the Western region

Exercise 3: Data Quality

Given a dataset with missing values:

  • Find rows with any missing critical fields
  • Identify duplicate records based on customer ID
  • Filter complete records for analysis

Sample Solutions

Exercise 1:

# Senior high-earners
senior_high = df[(df['age'] > 40) & (df['salary'] > 80000)]

# Sales or Marketing
sales_marketing = df[df['department'].isin(['Sales', 'Marketing'])]

# Names starting with vowels
vowel_names = df[df['name'].str[0].isin(['A', 'E', 'I', 'O', 'U'])]

print(f"Senior high-earners: {len(senior_high)}")
print(f"Sales/Marketing: {len(sales_marketing)}")
print(f"Vowel names: {len(vowel_names)}")
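Exercises 2 and 3 can be sketched the same way. The data below is made up (hypothetical columns mirroring the earlier examples), so treat these as outlines rather than the one right answer:

```python
import numpy as np
import pandas as pd

# Exercise 2: hypothetical transaction data
np.random.seed(0)
tx = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=500, freq='6h'),
    'amount': np.random.uniform(10, 2000, 500),
    'customer_type': np.random.choice(['New', 'Returning'], 500),
    'customer_id': np.random.randint(1, 20, 500),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 500),
})

# Purchases above $1000 made on weekends
weekend_big = tx[(tx['date'].dt.dayofweek >= 5) & (tx['amount'] > 1000)]

# Returning customers with total purchases > $5000
returning = tx[tx['customer_type'] == 'Returning']
totals = returning.groupby('customer_id')['amount'].sum()
big_spenders = totals[totals > 5000]

# Q1 2024 transactions in the Western region
q1_west = tx[(tx['date'] < '2024-04-01') & (tx['region'] == 'West')]

# Exercise 3: hypothetical data with gaps and duplicates
dq = pd.DataFrame({'customer_id': [1, 2, 2, 3],
                   'email': ['a@x.com', None, 'b@x.com', 'c@x.com']})

rows_with_gaps = dq[dq.isnull().any(axis=1)]                 # any missing field
dupes = dq[dq.duplicated(subset='customer_id', keep=False)]  # duplicate IDs
complete = dq.dropna()                                       # complete records only
```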

Key Takeaways

  • Use [] brackets for basic column and row selection
  • Use iloc[] for integer position-based selection
  • Use loc[] for label-based selection
  • Use boolean indexing for conditional filtering
  • Use &, |, ~ for combining conditions (with parentheses)
  • Use query() for readable complex filters
  • Use isin() for filtering multiple values
  • Use .str accessor for string operations
  • Select data types with select_dtypes()
  • Sample data with sample()

What's Next?

Now that you can select and filter data effectively, you're ready to build on these skills with grouping, aggregating, and reshaping your data.

Effective data selection is the foundation of data analysis. Master these techniques to work efficiently with any dataset.
