Pandas Basics
Pandas primarily solves "labeled tabular data processing" problems. NumPy is more like an array toolkit, while Pandas is more like a table and time series toolkit.
Two Core Objects
Series
import pandas as pd
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s)
DataFrame
df = pd.DataFrame(
    {
        "name": ["alice", "bob", "charlie"],
        "score": [95, 88, 91],
        "city": ["Chengdu", "Beijing", "Shanghai"],
    }
)
Most day-to-day processing revolves around DataFrame.
Reading and Saving
df = pd.read_csv("users.csv")
df.to_csv("out.csv", index=False)
If the environment has the relevant dependencies installed, you can also read and write Excel, Parquet, and other formats.
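As a sketch of what that looks like for Parquet (assuming the optional pyarrow dependency is available; the file name is a placeholder):

```python
import importlib.util
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [95, 88]})

# Parquet support needs pyarrow (or fastparquet); Excel needs openpyxl.
if importlib.util.find_spec("pyarrow"):
    df.to_parquet("out.parquet")
    back = pd.read_parquet("out.parquet")
    assert back.equals(df)
```

The read/write API is symmetric across formats: read_csv/to_csv, read_excel/to_excel, read_parquet/to_parquet.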
Understand the table structure first
print(df.head())
print(df.info())
print(df.shape)
print(df.columns)
This is almost always the first thing I do when I receive a new dataset.
Selecting columns, rows, and filtering
print(df["name"])
print(df[["name", "score"]])
print(df.loc[0, "name"])
print(df.iloc[0, 1])
print(df[df["score"] >= 90])
- loc: select by label
- iloc: select by position
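The difference is easiest to see with a non-default index; a small sketch:

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [95, 88, 91]},
    index=["alice", "bob", "charlie"],
)

# loc uses index labels, iloc uses integer positions.
by_label = df.loc["bob", "score"]
by_position = df.iloc[1, 0]
assert by_label == by_position  # both are 88
```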
Handling Missing Values
print(df.isna().sum())
cleaned = df.fillna({"score": 0, "city": "unknown"})
Common operations:
- isna() / notna()
- dropna()
- fillna()
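A small sketch of how these combine on a table with gaps in both columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [95, None, 91], "city": ["Chengdu", "Beijing", None]}
)

missing_per_column = df.isna().sum()   # NaN count per column
dropped = df.dropna()                  # drop rows containing any NaN
filled = df.fillna({"score": 0, "city": "unknown"})  # per-column defaults
```

dropna() shrinks the table (only the fully populated row survives here), while fillna() keeps every row and substitutes defaults.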
New Columns and Transformations
df["passed"] = df["score"] >= 60
df["score_level"] = df["score"].map(lambda x: "A" if x >= 90 else "B")
map(), apply(), and replace() often appear together, but I recommend understanding them this way:
- Simple mapping: prefer map() or replace()
- Processing by column or row: apply()
- Element-wise processing on the entire table: applymap(), but avoid it if possible
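The first two cases can be sketched side by side (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [95, 58]})

# Simple value mapping on a Series: map() or replace()
df["level"] = df["score"].map(lambda x: "A" if x >= 90 else "B")
df["name"] = df["name"].replace({"bob": "bobby"})

# Row-wise processing: apply() with axis=1 sees one row at a time
df["summary"] = df.apply(lambda row: f'{row["name"]}:{row["score"]}', axis=1)
```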
Group Aggregation
result = df.groupby("city")["score"].mean()
print(result)
This is one of Pandas' most valuable features: making "group by a column and then compute statistics" very smooth.
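Beyond a single mean, agg() computes several statistics per group in one pass; a sketch with a small inline table:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Chengdu", "Beijing", "Chengdu"], "score": [95, 88, 91]}
)

mean_by_city = df.groupby("city")["score"].mean()

# Several statistics at once: one row per group, one column per statistic
stats = df.groupby("city")["score"].agg(["mean", "max", "count"])
```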
Sorting and Deduplication
print(df.sort_values("score", ascending=False))
print(df.drop_duplicates(subset=["name"]))
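Both take options worth knowing: sort_values() accepts multiple keys with per-key directions, and drop_duplicates() lets you choose which duplicate to keep. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob", "alice"], "score": [95, 88, 91]})

# Sort by name ascending, then score descending within each name
ordered = df.sort_values(["name", "score"], ascending=[True, False])

# keep="first" (the default) keeps the first row per duplicate key
unique_names = df.drop_duplicates(subset=["name"], keep="first")
```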
My own processing order
When processing a table, I typically follow this order:
- head() / info() / shape to understand the structure
- Handle missing values and types
- Do filtering, mapping, and new columns
- Then do grouping, aggregation, and export
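The whole order can be sketched end to end; a small inline table stands in for pd.read_csv("users.csv"), and the output file name is a placeholder:

```python
import pandas as pd

# In place of pd.read_csv("users.csv"): a small inline table
df = pd.DataFrame(
    {
        "name": ["alice", "bob", "charlie"],
        "score": [95, None, 91],
        "city": ["Chengdu", "Beijing", "Chengdu"],
    }
)

# 1. Understand the structure
print(df.shape)

# 2. Handle missing values and types
df["score"] = df["score"].fillna(0).astype(int)

# 3. Filtering, mapping, new columns
df["passed"] = df["score"] >= 60
passed = df[df["passed"]]

# 4. Grouping, aggregation, export
by_city = passed.groupby("city")["score"].mean()
by_city.to_csv("city_scores.csv")
```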