Reducing pandas DataFrame memory usage means measuring the loaded data, changing only safe dtypes, and proving the smaller frame still holds the values the analysis needs. This matters when a frame barely fits in local RAM or when notebooks and batch jobs create large intermediate copies.
The most direct path is to inspect memory_usage(deep=True), convert repeated labels to category, and downcast numeric columns with to_numeric(). deep=True matters for text-heavy columns because the default memory report can undercount Python-backed values.
Dtype changes are data contracts, not cosmetic cleanup. Convert columns after checking their value range and label cardinality, keep a reload path until checks pass, and compare row counts plus selected values before replacing production code.
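The range and cardinality checks mentioned above could be sketched as follows. This is a minimal illustration, not the article's method: the helper names fits_unsigned and label_ratio are hypothetical, and the small sample frame only mirrors the columns used in this guide.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the column names mirror this guide's example.
df = pd.DataFrame(
    {
        "order_id": range(100000, 100008),
        "region": ["EMEA", "APAC", "AMER", "EMEA"] * 2,
        "quantity": [1, 2, 3, 4] * 2,
    }
)


def fits_unsigned(series, dtype):
    """Hypothetical helper: True when every value fits the unsigned dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


def label_ratio(series):
    """Hypothetical helper: distinct labels divided by row count."""
    return series.nunique() / len(series)


quantity_fits_uint8 = fits_unsigned(df["quantity"], np.uint8)
order_id_fits_uint8 = fits_unsigned(df["order_id"], np.uint8)
region_ratio = label_ratio(df["region"])

print(quantity_fits_uint8, order_id_fits_uint8, region_ratio)
```

A low label ratio suggests category is worth testing, and a failed range check means the column needs a wider integer type than the one tested.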
Steps to reduce pandas DataFrame memory usage:
- Save a memory reduction check script.
memory_reduce_check.py

    import pandas as pd

    rows = 10000

    # Sample frame with repeated labels and small numeric ranges.
    df = pd.DataFrame(
        {
            "order_id": range(100000, 100000 + rows),
            "region": ["EMEA", "APAC", "AMER", "EMEA"] * (rows // 4),
            "priority": ["low", "normal", "urgent", "normal"] * (rows // 4),
            "quantity": [1, 2, 3, 4] * (rows // 4),
            "revenue": [149.95, 89.50, 212.25, 65.00] * (rows // 4),
        }
    )

    before = df.memory_usage(deep=True).sum()

    # Convert a copy so the source frame stays available for comparison.
    optimized = df.copy()
    optimized["region"] = optimized["region"].astype("category")
    optimized["priority"] = optimized["priority"].astype("category")
    optimized["order_id"] = pd.to_numeric(optimized["order_id"], downcast="unsigned")
    optimized["quantity"] = pd.to_numeric(optimized["quantity"], downcast="unsigned")
    optimized["revenue"] = pd.to_numeric(optimized["revenue"], downcast="float")

    after = optimized.memory_usage(deep=True).sum()
    percent = (1 - after / before) * 100

    # Report memory, dtypes, and value checks.
    print(f"pandas {pd.__version__}")
    print()
    print("source memory bytes")
    print(before)
    print()
    print("optimized memory bytes")
    print(after)
    print()
    print("memory reduction")
    print(f"{percent:.1f}%")
    print()
    print("optimized dtypes")
    print(optimized.dtypes)
    print()
    print("row count preserved")
    print(len(df) == len(optimized))
    print()
    print("key rows preserved")
    print(
        optimized.iloc[:3]
        .filter(["order_id", "region", "priority", "quantity"])
        .to_string(index=False)
    )
    print()
    print("total revenue difference")
    print(f"{abs(df['revenue'].sum() - optimized['revenue'].sum()):.6f}")
Replace the in-file df assignment with the frame already loaded in the working script. Keep the original frame or source file available until the memory, dtype, and value checks match the expected data rules.
- Run the check script.
    $ python3 memory_reduce_check.py
    pandas 3.0.3

    source memory bytes
    492632

    optimized memory bytes
    110209

    memory reduction
    77.6%

    optimized dtypes
    order_id      uint32
    region      category
    priority    category
    quantity       uint8
    revenue      float32
    dtype: object

    row count preserved
    True

    key rows preserved
     order_id region priority  quantity
       100000   EMEA      low         1
       100001   APAC   normal         2
       100002   AMER   urgent         3

    total revenue difference
    0.000000
The smaller total should come with expected dtypes and unchanged business values, not only a lower byte count.
- Record baseline memory and key fields before changing the working frame.
    baseline_memory = df.memory_usage(deep=True).sum()
    baseline_rows = len(df)
    baseline_keys = df.filter(["order_id", "region", "priority", "quantity"]).copy()

    print(df.memory_usage(deep=True))
    print(df.dtypes)
memory_usage(deep=True) returns the bytes used by each column and by the index. With deep=True, pandas also counts the Python objects behind object-dtype columns, which the default report leaves out.
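To see how much the default report can miss, the shallow and deep figures can be compared side by side. The sketch below assumes an object-dtype text column, which is forced here with dtype="object" so the effect is visible; newer string dtypes may report the two figures differently.

```python
import pandas as pd

labels = ["EMEA", "APAC", "AMER", "EMEA"] * 250

df = pd.DataFrame(
    {
        # Assumption: object dtype is forced so the text is Python-backed.
        "region": pd.Series(labels, dtype="object"),
        "quantity": [1, 2, 3, 4] * 250,
    }
)

shallow = df.memory_usage()          # counts only the array buffers
deep = df.memory_usage(deep=True)    # also counts the Python string objects

print(pd.DataFrame({"shallow": shallow, "deep": deep}))
```

The numeric column reports the same size in both modes, while the object column grows once the strings themselves are counted.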
- Convert repeated label columns to category.
    df["region"] = df["region"].astype("category")
    df["priority"] = df["priority"].astype("category")
category saves memory when labels repeat. It can use the same or more memory when most values are unique, so compare memory after conversion before keeping it.
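One way to apply the "compare before keeping" advice is a small guard that keeps the category version only when it is actually smaller. This is a sketch under stated assumptions: to_category_if_smaller is a hypothetical helper, and the two sample series are made up to contrast repeated and mostly unique labels.

```python
import pandas as pd

rows = 10_000
repeated = pd.Series(["EMEA", "APAC", "AMER", "EMEA"] * (rows // 4), dtype="object")
mostly_unique = pd.Series([f"order-{i}" for i in range(rows)], dtype="object")


def to_category_if_smaller(series):
    """Hypothetical helper: return the category version only if it uses fewer bytes."""
    candidate = series.astype("category")
    if candidate.memory_usage(deep=True) < series.memory_usage(deep=True):
        return candidate
    return series


repeated_result = to_category_if_smaller(repeated)
unique_result = to_category_if_smaller(mostly_unique)

print(repeated_result.dtype, repeated_result.memory_usage(deep=True))
print(unique_result.dtype, unique_result.memory_usage(deep=True))
```

The guard never increases memory, because the original series is returned whenever the conversion does not pay off.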
- Downcast nonnegative whole-number columns.
    df["order_id"] = pd.to_numeric(df["order_id"], downcast="unsigned")
    df["quantity"] = pd.to_numeric(df["quantity"], downcast="unsigned")
Use downcast="integer" or downcast="signed" for columns that can contain negative values.
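For columns with negative values, a short sketch shows what the signed option selects. The series names temperature_delta and stock_change are hypothetical and stand in for any whole-number column that can go below zero.

```python
import pandas as pd

# Hypothetical columns that contain negative whole numbers.
temperature_delta = pd.Series([-40, -3, 0, 12, 85])
stock_change = pd.Series([-20000, 150, 31000])

# downcast="integer" picks the smallest signed dtype that holds the range.
small = pd.to_numeric(temperature_delta, downcast="integer")
medium = pd.to_numeric(stock_change, downcast="integer")

# downcast="unsigned" does not apply when negatives are present,
# so the column keeps its original dtype.
unchanged = pd.to_numeric(temperature_delta, downcast="unsigned")

print(small.dtype, medium.dtype, unchanged.dtype)
```

Checking the resulting dtype after each call is a quick way to confirm that the downcast actually happened.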
Related: How to convert data types in pandas
- Downcast floating-point columns only when lower precision is acceptable.
    df["revenue"] = pd.to_numeric(df["revenue"], downcast="float")
float32 can change precision for large or highly precise values. Compare totals, thresholds, and downstream calculations before replacing float64 columns.
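The precision risk can be made concrete with a round trip through float32. In this sketch the sample amounts are invented, and astype("float32") is used only to force the cast for illustration so that the change is visible.

```python
import pandas as pd

# Invented values: one large whole number and two ordinary decimals.
amounts = pd.Series([16_777_217.0, 0.1, 149.95], dtype="float64")

# Force the cast, then return to float64 to measure what changed.
as_float32 = amounts.astype("float32")
round_trip = as_float32.astype("float64")

max_abs_error = (amounts - round_trip).abs().max()

print(round_trip.iloc[0])   # 16777216.0, the nearest float32 value
print(max_abs_error)        # 1.0
```

Whole numbers above 2**24 are no longer all representable in float32, which is why large totals and identifiers stored as floats deserve a check before conversion.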
- Recheck memory and dtypes after conversion.
    optimized_memory = df.memory_usage(deep=True).sum()
    reduction = (1 - optimized_memory / baseline_memory) * 100

    print(df.dtypes)
    print(f"memory reduction: {reduction:.1f}%")
The dtype output should show the intended category, unsigned integer, or float32 columns before the optimized frame is reused.
- Verify that required rows and values survived the dtype changes.
    key_columns = ["order_id", "region", "priority", "quantity"]

    # Cast both sides to common dtypes so the comparison checks values,
    # not storage types (category vs. text, uint8 vs. int64).
    compare_dtypes = {
        "order_id": "uint64",
        "quantity": "uint64",
        "region": "object",
        "priority": "object",
    }

    assert len(df) == baseline_rows
    pd.testing.assert_frame_equal(
        baseline_keys.astype(compare_dtypes),
        df.filter(key_columns).astype(compare_dtypes),
        check_dtype=False,
    )
    print(df.iloc[:3].filter(key_columns))
Compare identifiers, labels, counts, totals, and any columns used for joins or filters before saving the optimized frame.
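The comparisons listed above might look like the following sketch. The frame is a small stand-in, the variable names are illustrative, and the 0.01 revenue tolerance is an assumption that should be replaced by whatever the analysis can accept.

```python
import pandas as pd

original = pd.DataFrame(
    {
        "order_id": range(100000, 100008),
        "region": ["EMEA", "APAC", "AMER", "EMEA"] * 2,
        "revenue": [149.95, 89.50, 212.25, 65.00] * 2,
    }
)

optimized = original.copy()
optimized["region"] = optimized["region"].astype("category")
optimized["order_id"] = pd.to_numeric(optimized["order_id"], downcast="unsigned")
optimized["revenue"] = optimized["revenue"].astype("float32")

# Identifiers: same values in the same order, still usable as a join key.
ids_match = original["order_id"].tolist() == optimized["order_id"].tolist()
ids_unique = bool(optimized["order_id"].is_unique)

# Labels: per-label row counts are unchanged.
region_counts_match = (
    original["region"].value_counts().to_dict()
    == optimized["region"].astype("object").value_counts().to_dict()
)

# Totals: allow a small gap for float32 (assumed tolerance).
revenue_tolerance = 0.01
revenue_gap = abs(float(original["revenue"].sum()) - float(optimized["revenue"].sum()))
revenue_ok = revenue_gap <= revenue_tolerance

print(ids_match, ids_unique, region_counts_match, revenue_ok)
```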
- Remove the temporary check script after the project code includes the same safeguards.
$ rm memory_reduce_check.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.