Pandas Basics

Pandas primarily solves the problem of processing labeled tabular data. NumPy is more like an array toolkit, while Pandas is more like a table and time-series toolkit built on top of it.

Two Core Objects

Series

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s)
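What makes a Series more than a plain array is the index: you can select by label or by position, and arithmetic keeps the labels aligned. A small sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Select by label vs. by integer position
print(s["a"])     # label-based
print(s.iloc[0])  # position-based

# Vectorized arithmetic preserves the index
print(s * 2)
```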

DataFrame

df = pd.DataFrame(
    {
        "name": ["alice", "bob", "charlie"],
        "score": [95, 88, 91],
        "city": ["Chengdu", "Beijing", "Shanghai"],
    }
)

Most day-to-day processing revolves around DataFrame.

Reading and Saving

df = pd.read_csv("users.csv")
df.to_csv("out.csv", index=False)

With the relevant optional dependencies installed (e.g. openpyxl for Excel, pyarrow for Parquet), you can also read and write Excel, Parquet, and other formats.

Understand the table structure first

print(df.head())
print(df.info())
print(df.shape)
print(df.columns)

This is almost always the first thing I do when I receive a new dataset.

Selecting columns, rows, and filtering

print(df["name"])
print(df[["name", "score"]])
print(df.loc[0, "name"])
print(df.iloc[0, 1])
print(df[df["score"] >= 90])
  • loc: Select by label
  • iloc: Select by position
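The difference only becomes visible once the index is not 0, 1, 2, …. A sketch with a string index (the frame here is made up):

```python
import pandas as pd

df = pd.DataFrame({"score": [95, 88, 91]}, index=["alice", "bob", "charlie"])

# loc uses index labels
print(df.loc["bob", "score"])

# iloc uses integer positions, regardless of labels
print(df.iloc[1, 0])

# Slicing differs too: loc slices include both endpoints
print(df.loc["alice":"bob"])  # two rows
print(df.iloc[0:1])           # one row
```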

Handling Missing Values

print(df.isna().sum())

cleaned = df.fillna({"score": 0, "city": "unknown"})

Common operations:

  • isna() / notna()
  • dropna()
  • fillna()
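A minimal sketch of these three on a toy frame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "score": [95, None, 91],
    "city": ["Chengdu", "Beijing", None],
})

print(df.isna().sum())                 # missing count per column
print(df.dropna())                     # drop rows with any missing value
print(df.dropna(subset=["score"]))     # only require score to be present
print(df.fillna({"score": 0, "city": "unknown"}))  # fill per column
```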

New Columns and Transformations

df["passed"] = df["score"] >= 60
df["score_level"] = df["score"].map(lambda x: "A" if x >= 90 else "B")

map(), apply(), and replace() often appear together, but I recommend understanding them this way:

  • Simple mapping: prefer map() or replace()
  • Processing by column or row: apply()
  • Element-wise processing on the entire table: applymap() (renamed to DataFrame.map() in pandas 2.1), but prefer a vectorized operation when one exists
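Sketches of each, on the example frame from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "charlie"],
    "score": [95, 88, 91],
    "city": ["Chengdu", "Beijing", "Shanghai"],
})

# Simple value mapping: replace() with a dict
df["city_code"] = df["city"].replace({"Chengdu": "CD", "Beijing": "BJ", "Shanghai": "SH"})

# Element-wise on one column: map()
df["level"] = df["score"].map(lambda x: "A" if x >= 90 else "B")

# Row-wise: apply() with axis=1
df["summary"] = df.apply(lambda row: f"{row['name']}: {row['score']}", axis=1)
```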

Group Aggregation

result = df.groupby("city")["score"].mean()
print(result)

This is one of Pandas' most valuable features: making "group by a column and then compute statistics" very smooth.
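The same pattern scales to several statistics per group via agg(); a sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Chengdu", "Beijing", "Chengdu"],
    "score": [95, 88, 91],
})

# Compute several statistics per group in one call
stats = df.groupby("city")["score"].agg(["mean", "max", "count"])
print(stats)
```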

Sorting and Deduplication

print(df.sort_values("score", ascending=False))
print(df.drop_duplicates(subset=["name"]))

My own processing order

When processing a table, I typically follow this order:

  1. head() / info() / shape (an attribute, not a method) to understand the structure
  2. Handle missing values and types
  3. Do filtering, mapping, and new columns
  4. Then do grouping, aggregation, and export
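The four steps above can be sketched end to end (the data and output file name are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "bob", "charlie"],
    "score": [95, None, 88, 91],
    "city": ["Chengdu", "Beijing", "Beijing", "Chengdu"],
})

# 1. Understand the structure
print(df.shape, list(df.columns))

# 2. Handle missing values and types
df = df.fillna({"score": 0})
df["score"] = df["score"].astype(int)

# 3. Filter, map, add new columns
df = df.drop_duplicates(subset=["name"])
df["passed"] = df["score"] >= 60

# 4. Group, aggregate, export
result = df.groupby("city")["score"].mean()
result.to_csv("city_means.csv")
```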