Pandas Basics

Pandas primarily solves the problem of processing labeled tabular data. NumPy is more like an array toolkit, while Pandas is more like a table and time-series toolkit built on top of it.

Two Core Objects

Series

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s)
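What makes a Series more than a plain array is the index: you can select by label or by position, and arithmetic keeps the labels aligned. A small sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Select by label vs. by integer position
print(s["a"])     # label-based
print(s.iloc[0])  # position-based

# Vectorized arithmetic preserves the index
print(s * 2)
```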

DataFrame

df = pd.DataFrame(
    {
        "name": ["alice", "bob", "charlie"],
        "score": [95, 88, 91],
        "city": ["Chengdu", "Beijing", "Shanghai"],
    }
)

Most day-to-day processing revolves around DataFrame.

Reading and Saving

df = pd.read_csv("users.csv")
df.to_csv("out.csv", index=False)

With the relevant optional dependencies installed (e.g. openpyxl for Excel, pyarrow for Parquet), you can also read and write Excel, Parquet, and other formats.

Understand the table structure first

print(df.head())
print(df.info())
print(df.shape)
print(df.columns)

This is almost always the first thing I do when I receive a new dataset.

Selecting columns, rows, and filtering

print(df["name"])
print(df[["name", "score"]])
print(df.loc[0, "name"])
print(df.iloc[0, 1])
print(df[df["score"] >= 90])
  • loc: Select by label
  • iloc: Select by position
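The difference only becomes visible once the index is not 0, 1, 2, …. A sketch with a string index (the frame here is made up):

```python
import pandas as pd

df = pd.DataFrame({"score": [95, 88, 91]}, index=["alice", "bob", "charlie"])

# loc uses index labels
print(df.loc["bob", "score"])

# iloc uses integer positions, regardless of labels
print(df.iloc[1, 0])

# Slicing differs too: loc slices include both endpoints
print(df.loc["alice":"bob"])  # two rows
print(df.iloc[0:1])           # one row
```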

Handling Missing Values

print(df.isna().sum())

cleaned = df.fillna({"score": 0, "city": "unknown"})

Common operations:

  • isna() / notna()
  • dropna()
  • fillna()
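A minimal sketch of these three on a toy frame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "score": [95, None, 91],
    "city": ["Chengdu", "Beijing", None],
})

print(df.isna().sum())                 # missing count per column
print(df.dropna())                     # drop rows with any missing value
print(df.dropna(subset=["score"]))     # only require score to be present
print(df.fillna({"score": 0, "city": "unknown"}))  # fill per column
```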

New Columns and Transformations

df["passed"] = df["score"] >= 60
df["score_level"] = df["score"].map(lambda x: "A" if x >= 90 else "B")

map(), apply(), and replace() often appear together, but I recommend understanding them this way:

  • Simple mapping: prefer map() or replace()
  • Processing by column or row: apply()
  • Element-wise processing on the entire table: applymap() (renamed to DataFrame.map() in pandas 2.1), but prefer a vectorized operation when one exists
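Sketches of each, on the example frame from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "charlie"],
    "score": [95, 88, 91],
    "city": ["Chengdu", "Beijing", "Shanghai"],
})

# Simple value mapping: replace() with a dict
df["city_code"] = df["city"].replace({"Chengdu": "CD", "Beijing": "BJ", "Shanghai": "SH"})

# Element-wise on one column: map()
df["level"] = df["score"].map(lambda x: "A" if x >= 90 else "B")

# Row-wise: apply() with axis=1
df["summary"] = df.apply(lambda row: f"{row['name']}: {row['score']}", axis=1)
```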

Group Aggregation

result = df.groupby("city")["score"].mean()
print(result)

This is one of Pandas' most valuable features: making "group by a column and then compute statistics" very smooth.
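The same pattern scales to several statistics per group via agg(); a sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Chengdu", "Beijing", "Chengdu"],
    "score": [95, 88, 91],
})

# Compute several statistics per group in one call
stats = df.groupby("city")["score"].agg(["mean", "max", "count"])
print(stats)
```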

Sorting and Deduplication

print(df.sort_values("score", ascending=False))
print(df.drop_duplicates(subset=["name"]))

My own processing order

When processing a table, I typically follow this order:

  1. head() / info() / shape (an attribute, not a method) to understand the structure
  2. Handle missing values and types
  3. Do filtering, mapping, and new columns
  4. Then do grouping, aggregation, and export
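The four steps above can be sketched end to end (the data and output file name are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "bob", "charlie"],
    "score": [95, None, 88, 91],
    "city": ["Chengdu", "Beijing", "Beijing", "Chengdu"],
})

# 1. Understand the structure
print(df.shape, list(df.columns))

# 2. Handle missing values and types
df = df.fillna({"score": 0})
df["score"] = df["score"].astype(int)

# 3. Filter, map, add new columns
df = df.drop_duplicates(subset=["name"])
df["passed"] = df["score"] >= 60

# 4. Group, aggregate, export
result = df.groupby("city")["score"].mean()
result.to_csv("city_means.csv")
```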