How to reduce pandas DataFrame memory usage

Reducing pandas DataFrame memory usage means measuring the loaded data, changing only the dtypes that are safe to change, and proving the smaller frame still holds the values the analysis needs. This matters when a frame barely fits in local RAM or when notebooks and batch jobs create large intermediate copies.

The most direct path is to inspect memory_usage(deep=True), convert repeated labels to category, and downcast numeric columns with to_numeric(). deep=True matters for text-heavy columns because the default memory report can undercount Python-backed values.

Dtype changes are data contracts, not cosmetic cleanup. Convert columns after checking their value range and label cardinality, keep a reload path until checks pass, and compare row counts plus selected values before replacing production code.
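The value-range and cardinality check mentioned above can be sketched as a small profiling helper. This is a minimal sketch, not a pandas feature: `profile_columns`, the `unique_ratio` field, and the sample frame are illustrative names chosen here.

```python
import pandas as pd


def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column's dtype, numeric range, and share of unique values."""
    records = []
    for name in frame.columns:
        column = frame[name]
        numeric = pd.api.types.is_numeric_dtype(column)
        records.append(
            {
                "column": name,
                "dtype": str(column.dtype),
                # Range only makes sense for numeric columns.
                "min": column.min() if numeric else None,
                "max": column.max() if numeric else None,
                # Share of distinct values; a low ratio suggests category may help.
                "unique_ratio": column.nunique(dropna=True) / max(len(column), 1),
            }
        )
    return pd.DataFrame.from_records(records).set_index("column")


# Hypothetical sample data standing in for the real frame.
sample = pd.DataFrame(
    {
        "order_id": [100000, 100001, 100002, 100003],
        "region": ["EMEA", "APAC", "EMEA", "EMEA"],
        "quantity": [1, 2, 3, 4],
    }
)
profile = profile_columns(sample)
print(profile)
```

A low `unique_ratio` points at category candidates, while `min` and `max` show whether a smaller integer dtype could hold the column.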

Steps to reduce pandas DataFrame memory usage:

Save a memory reduction check script.

memory_reduce_check.py

```
import pandas as pd


# Build a sample frame with repeated labels and small numeric ranges.
rows = 10000
df = pd.DataFrame(
    {
        "order_id": range(100000, 100000 + rows),
        "region": ["EMEA", "APAC", "AMER", "EMEA"] * (rows // 4),
        "priority": ["low", "normal", "urgent", "normal"] * (rows // 4),
        "quantity": [1, 2, 3, 4] * (rows // 4),
        "revenue": [149.95, 89.50, 212.25, 65.00] * (rows // 4),
    }
)

# Measure the starting footprint, including string contents.
before = df.memory_usage(deep=True).sum()

# Convert on a copy so the source frame stays available for comparison.
optimized = df.copy()
optimized["region"] = optimized["region"].astype("category")
optimized["priority"] = optimized["priority"].astype("category")
optimized["order_id"] = pd.to_numeric(optimized["order_id"], downcast="unsigned")
optimized["quantity"] = pd.to_numeric(optimized["quantity"], downcast="unsigned")
optimized["revenue"] = pd.to_numeric(optimized["revenue"], downcast="float")

# Measure again and compute the relative saving.
after = optimized.memory_usage(deep=True).sum()
percent = (1 - after / before) * 100

# Report memory, dtypes, and basic value checks.
print(f"pandas {pd.__version__}")
print()
print("source memory bytes")
print(before)
print()
print("optimized memory bytes")
print(after)
print()
print("memory reduction")
print(f"{percent:.1f}%")
print()
print("optimized dtypes")
print(optimized.dtypes)
print()
print("row count preserved")
print(len(df) == len(optimized))
print()
print("key rows preserved")
print(
    optimized.iloc[:3]
    .filter(["order_id", "region", "priority", "quantity"])
    .to_string(index=False)
)
print()
print("total revenue difference")
print(f"{abs(df['revenue'].sum() - optimized['revenue'].sum()):.6f}")
```

Replace the in-file df assignment with the frame already loaded in the working script. Keep the original frame or source file available until the memory, dtype, and value checks match the expected data rules.

Run the check script.

```
$ python3 memory_reduce_check.py
pandas 3.0.3

source memory bytes
492632

optimized memory bytes
110209

memory reduction
77.6%

optimized dtypes
order_id      uint32
region      category
priority    category
quantity       uint8
revenue      float32
dtype: object

row count preserved
True

key rows preserved
 order_id region priority  quantity
   100000   EMEA      low         1
   100001   APAC   normal         2
   100002   AMER   urgent         3

total revenue difference
0.000000
```

The smaller total should come with expected dtypes and unchanged business values, not only a lower byte count.

Record baseline memory and key fields before changing the working frame.

```
baseline_memory = df.memory_usage(deep=True).sum()
baseline_rows = len(df)
baseline_keys = df.filter(["order_id", "region", "priority", "quantity"]).copy()

print(df.memory_usage(deep=True))
print(df.dtypes)
```

memory_usage(deep=True) returns bytes per column, plus the index. For object-dtype columns it also counts the Python objects the column references instead of only the pointers to them.
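The gap between the default and deep reports can be seen on a single text column. This sketch forces `dtype="object"` so the column holds Python strings. That is an assumption made for illustration, and on pandas versions that store text in a dedicated string dtype the gap may be smaller.

```python
import pandas as pd

# Force object dtype so each value is a separate Python string.
labels = pd.Series(["EMEA", "APAC", "AMER", "EMEA"] * 2500, dtype="object")

# Default report: size of the pointer array only.
shallow_bytes = labels.memory_usage(index=False)
# Deep report: pointer array plus the strings it references.
deep_bytes = labels.memory_usage(index=False, deep=True)

print(f"shallow: {shallow_bytes} bytes")
print(f"deep:    {deep_bytes} bytes")
```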

Convert repeated label columns to category.
```
df["region"] = df["region"].astype("category")
df["priority"] = df["priority"].astype("category")
```
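category stores each distinct label once plus a small integer code per row, so it saves memory when labels repeat. When most values are unique it can use as much memory or more, so compare memory_usage(deep=True) before and after the conversion before keeping it.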
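A before-and-after comparison along these lines can help decide whether a column is worth converting. It is a rough sketch: `category_saving` is a hypothetical helper, and both sample series are made up for illustration.

```python
import pandas as pd


def category_saving(column: pd.Series) -> int:
    """Return bytes saved by converting to category (negative means it grew)."""
    before = column.memory_usage(index=False, deep=True)
    after = column.astype("category").memory_usage(index=False, deep=True)
    return int(before - after)


# Few distinct labels repeated many times.
repeated = pd.Series(["EMEA", "APAC", "AMER", "EMEA"] * 2500)
# Every value distinct, so the category table is as large as the data.
mostly_unique = pd.Series([f"order-{i}" for i in range(10000)])

print("repeated labels saved:", category_saving(repeated))
print("unique labels saved:  ", category_saving(mostly_unique))
```

Keeping the conversion only when the saving is clearly positive avoids paying for a category table that is as large as the original column.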
Downcast nonnegative whole-number columns.
```
df["order_id"] = pd.to_numeric(df["order_id"], downcast="unsigned")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="unsigned")
```
Use downcast="integer" or downcast="signed" for columns that can contain negative values.
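The signed variant can be sketched on a column that contains negative values. The `adjustments` series is a hypothetical example. The second call shows why it is worth printing the resulting dtype rather than assuming an unsigned downcast took effect.

```python
import pandas as pd

# Hypothetical column that includes a negative value.
adjustments = pd.Series([-3, 0, 5, 120])

# Signed downcast picks the smallest signed dtype that holds the values.
signed = pd.to_numeric(adjustments, downcast="integer")
# Unsigned downcast cannot represent -3, so inspect what came back.
unsigned_attempt = pd.to_numeric(adjustments, downcast="unsigned")

print(signed.dtype)
print(unsigned_attempt.dtype)
print(signed.tolist() == adjustments.tolist())
```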
Downcast floating-point columns only when lower precision is acceptable.
```
df["revenue"] = pd.to_numeric(df["revenue"], downcast="float")
```
float32 can change precision for large or highly precise values. Compare totals, thresholds, and downstream calculations before replacing float64 columns.
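The precision risk can be measured directly before committing to float32. This sketch uses astype("float32") rather than to_numeric() so the conversion is forced and the loss is visible. The sample values and the `tolerance` threshold are assumptions for illustration.

```python
import pandas as pd

# Illustrative values: a typical price, a large whole number, a precise fraction.
amounts = pd.Series([149.95, 16_777_217.0, 0.1234567891])
as_float32 = amounts.astype("float32")

# Widen back to float64 so the subtraction measures what float32 lost.
max_abs_error = float((amounts - as_float32.astype("float64")).abs().max())
print(f"largest absolute change: {max_abs_error}")

# Assumed business tolerance, for example half a cent.
tolerance = 0.005
print("within tolerance:", max_abs_error <= tolerance)
```

Here the large value 16,777,217 cannot be represented in float32 and shifts by a whole unit, which is the kind of change a totals-only check can miss.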

Recheck memory and dtypes after conversion.

```
optimized_memory = df.memory_usage(deep=True).sum()
reduction = (1 - optimized_memory / baseline_memory) * 100

print(df.dtypes)
print(f"memory reduction: {reduction:.1f}%")
```

The dtype output should show the intended category, unsigned integer, or float32 columns before the optimized frame is reused.
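The dtype check can also be written as an explicit comparison instead of reading the printed output. This is a minimal sketch: the `expected_dtypes` mapping and the small `frame` are hypothetical and should be replaced with the project's own columns.

```python
import pandas as pd

# Hypothetical expectations; adjust to the columns and dtypes the project requires.
expected_dtypes = {
    "region": "category",
    "quantity": "uint8",
    "revenue": "float32",
}

# Stand-in for the frame after conversion.
frame = pd.DataFrame(
    {
        "region": pd.Series(["EMEA", "APAC", "EMEA"], dtype="category"),
        "quantity": pd.Series([1, 2, 3], dtype="uint8"),
        "revenue": pd.Series([149.95, 89.50, 212.25], dtype="float32"),
    }
)

# Compare dtype names as strings and collect any column that differs.
actual_dtypes = {name: str(dtype) for name, dtype in frame.dtypes.items()}
mismatches = {
    name: (expected, actual_dtypes.get(name))
    for name, expected in expected_dtypes.items()
    if actual_dtypes.get(name) != expected
}
print(mismatches or "all dtypes match")
```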

Verify that required rows and values survived the dtype changes.

```
assert len(df) == baseline_rows

key_columns = ["order_id", "region", "priority", "quantity"]
# Cast both sides to the same dtypes so only the values are compared.
common_dtypes = {
    "order_id": "uint64",
    "region": "object",
    "priority": "object",
    "quantity": "uint64",
}

pd.testing.assert_frame_equal(
    baseline_keys.astype(common_dtypes),
    df.filter(key_columns).astype(common_dtypes),
)

print(df.iloc[:3].filter(key_columns))
```

Compare identifiers, labels, counts, totals, and any columns used for joins or filters before saving the optimized frame.
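Label counts and totals can be compared with a few lines like the following. It is a sketch under assumptions: the `original` and `optimized` frames are small stand-ins, and the 0.01 tolerance on revenue is an illustrative choice, not a recommended value.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame before conversion.
original = pd.DataFrame(
    {
        "region": ["EMEA", "APAC", "AMER", "EMEA"],
        "quantity": [1, 2, 3, 4],
        "revenue": [149.95, 89.50, 212.25, 65.00],
    }
)
# Stand-in for the frame after conversion.
optimized = original.astype(
    {"region": "category", "quantity": "uint8", "revenue": "float32"}
)

# Label counts should match once both sides are plain dictionaries.
counts_match = (
    original["region"].value_counts().to_dict()
    == optimized["region"].value_counts().to_dict()
)

# Integer totals should match exactly; float totals within an assumed tolerance.
quantity_match = int(original["quantity"].sum()) == int(optimized["quantity"].sum())
revenue_match = bool(
    np.isclose(
        float(original["revenue"].sum()),
        float(optimized["revenue"].sum()),
        rtol=0,
        atol=0.01,
    )
)

print(counts_match, quantity_match, revenue_match)
```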

Remove the temporary check script after the project code includes the same safeguards.
```
$ rm memory_reduce_check.py
```

Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.