Reducing pandas DataFrame memory usage means measuring the loaded data, changing only safe dtypes, and proving the smaller frame still holds the values the analysis needs. This matters when a frame barely fits in local RAM or when notebooks and batch jobs create large intermediate copies.
The most direct path is to inspect memory_usage(deep=True), convert repeated labels to category, and downcast numeric columns with to_numeric(). deep=True matters for text-heavy columns because the default memory report can undercount Python-backed values.
Dtype changes are data contracts, not cosmetic cleanup. Convert columns after checking their value range and label cardinality, keep a reload path until checks pass, and compare row counts plus selected values before replacing production code.
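The range and cardinality checks described above can be sketched as two small helpers. The names fits_integer_dtype and cardinality_ratio, and the sample frame, are hypothetical and not part of pandas; treat this as one possible pre-check rather than a required step.

```python
import numpy as np
import pandas as pd


def fits_integer_dtype(col: pd.Series, dtype: str) -> bool:
    """Return True when every value in an integer column fits the target dtype."""
    info = np.iinfo(dtype)
    return bool(col.min() >= info.min and col.max() <= info.max)


def cardinality_ratio(col: pd.Series) -> float:
    """Unique labels divided by row count; a low ratio favors category."""
    return col.nunique() / len(col)


# Hypothetical sample data for illustration only.
sample = pd.DataFrame(
    {
        "quantity": [1, 2, 3, 4] * 25,
        "region": ["EMEA", "APAC", "AMER", "EMEA"] * 25,
    }
)

print(fits_integer_dtype(sample["quantity"], "uint8"))
print(cardinality_ratio(sample["region"]))
```

A column that passes the range check can be downcast without overflow, and a label column with a ratio close to zero is a good category candidate.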
import pandas as pd

rows = 10000
df = pd.DataFrame(
    {
        "order_id": range(100000, 100000 + rows),
        "region": ["EMEA", "APAC", "AMER", "EMEA"] * (rows // 4),
        "priority": ["low", "normal", "urgent", "normal"] * (rows // 4),
        "quantity": [1, 2, 3, 4] * (rows // 4),
        "revenue": [149.95, 89.50, 212.25, 65.00] * (rows // 4),
    }
)

before = df.memory_usage(deep=True).sum()

optimized = df.copy()
optimized["region"] = optimized["region"].astype("category")
optimized["priority"] = optimized["priority"].astype("category")
optimized["order_id"] = pd.to_numeric(optimized["order_id"], downcast="unsigned")
optimized["quantity"] = pd.to_numeric(optimized["quantity"], downcast="unsigned")
optimized["revenue"] = pd.to_numeric(optimized["revenue"], downcast="float")

after = optimized.memory_usage(deep=True).sum()
percent = (1 - after / before) * 100

print(f"pandas {pd.__version__}")
print()
print("source memory bytes")
print(before)
print()
print("optimized memory bytes")
print(after)
print()
print("memory reduction")
print(f"{percent:.1f}%")
print()
print("optimized dtypes")
print(optimized.dtypes)
print()
print("row count preserved")
print(len(df) == len(optimized))
print()
print("key rows preserved")
print(
    optimized.iloc[:3]
    .filter(["order_id", "region", "priority", "quantity"])
    .to_string(index=False)
)
print()
print("total revenue difference")
print(f"{abs(df['revenue'].sum() - optimized['revenue'].sum()):.6f}")
Replace the in-file df assignment with the frame already loaded in the working script. Keep the original frame or source file available until the memory, dtype, and value checks match the expected data rules.
$ python3 memory_reduce_check.py
pandas 3.0.3

source memory bytes
492632

optimized memory bytes
110209

memory reduction
77.6%

optimized dtypes
order_id       uint32
region       category
priority     category
quantity        uint8
revenue       float32
dtype: object

row count preserved
True

key rows preserved
 order_id region priority  quantity
   100000   EMEA      low         1
   100001   APAC   normal         2
   100002   AMER   urgent         3

total revenue difference
0.000000
The smaller total should come with expected dtypes and unchanged business values, not only a lower byte count.
baseline_memory = df.memory_usage(deep=True).sum()
baseline_rows = len(df)
baseline_keys = df.filter(["order_id", "region", "priority", "quantity"]).copy()

print(df.memory_usage(deep=True))
print(df.dtypes)
memory_usage(deep=True) returns bytes per column. With deep=True, it also counts the Python objects behind object-dtype text columns rather than only the references to them.
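The gap between the default report and the deep report can be seen on a single text column. This is a minimal sketch with hypothetical sample labels; dtype=object is set explicitly so the column holds Python strings regardless of the default string dtype in the installed pandas version.

```python
import pandas as pd

# Hypothetical sample column; dtype=object forces Python-backed strings.
labels = pd.Series(["EMEA", "APAC", "AMER", "EMEA"] * 2500, dtype=object)

# Default report: counts only the references stored in the column.
shallow = labels.memory_usage(index=False)

# Deep report: also counts the string objects those references point to.
deep = labels.memory_usage(index=False, deep=True)

print(f"shallow: {shallow} bytes")
print(f"deep:    {deep} bytes")
```

The exact byte counts depend on the Python build and pandas version, so compare the two numbers on the real frame rather than relying on fixed figures.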
df["region"] = df["region"].astype("category")
df["priority"] = df["priority"].astype("category")
category saves memory when labels repeat. It can use the same or more memory when most values are unique, so compare memory after conversion before keeping it.
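The effect of cardinality can be checked before committing to the conversion. The sketch below uses a hypothetical helper, category_saving, and made-up sample columns to compare a repeated-label column with an all-unique one.

```python
import pandas as pd

rows = 10_000

# Hypothetical columns: four repeating labels versus one unique label per row.
repeated = pd.Series(["EMEA", "APAC", "AMER", "EMEA"] * (rows // 4))
unique = pd.Series([f"order-{i}" for i in range(rows)])


def category_saving(col: pd.Series) -> int:
    """Bytes saved by converting to category; a negative value means it grew."""
    before = col.memory_usage(index=False, deep=True)
    after = col.astype("category").memory_usage(index=False, deep=True)
    return int(before - after)


repeated_saving = category_saving(repeated)
unique_saving = category_saving(unique)

print(f"repeated labels saved: {repeated_saving} bytes")
print(f"unique labels saved:   {unique_saving} bytes")
```

A negative result for the unique column reflects the extra integer codes that category stores on top of the full set of labels.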
df["order_id"] = pd.to_numeric(df["order_id"], downcast="unsigned")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="unsigned")
Use downcast="integer" or downcast="signed" for columns that can contain negative values.
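The difference between the downcast options shows up as soon as a column holds a negative value. This is a small sketch with a hypothetical adjustment column; it relies on pd.to_numeric() leaving the dtype unchanged when the requested downcast is not possible, which is the documented behavior but worth confirming on the installed version.

```python
import pandas as pd

# Hypothetical column containing one negative value.
adjustment = pd.Series([-2, 0, 5, 120])

# An unsigned dtype cannot hold -2, so the dtype is expected to stay as it was.
as_unsigned = pd.to_numeric(adjustment, downcast="unsigned")

# A signed downcast picks the smallest signed integer dtype that fits.
as_signed = pd.to_numeric(adjustment, downcast="integer")

print(as_unsigned.dtype)
print(as_signed.dtype)
```

Checking the resulting dtype after each call catches columns that silently kept their original width.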
df["revenue"] = pd.to_numeric(df["revenue"], downcast="float")
float32 can change precision for large or highly precise values. Compare totals, thresholds, and downstream calculations before replacing float64 columns.
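The precision limits of float32 can be made visible with two values. The sketch below uses hypothetical amounts and calls astype("float32") instead of pd.to_numeric() so the conversion is forced for illustration.

```python
import pandas as pd

# Hypothetical values: 16_777_217 (2**24 + 1) is the smallest positive
# integer that float32 cannot represent exactly; 0.1 has no exact binary form.
amounts = pd.Series([16_777_217.0, 0.1])

# Force the narrower dtype so the rounding is visible.
as_float32 = amounts.astype("float32")

# Cast back to float64 to measure the drift against the original values.
drift = (as_float32.astype("float64") - amounts).abs()

print(drift)
```

The large value shifts by a whole unit while the small one moves only in the ninth decimal place, so the acceptable drift depends on the magnitude of the data and on how the column is used downstream.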
optimized_memory = df.memory_usage(deep=True).sum()
reduction = (1 - optimized_memory / baseline_memory) * 100

print(df.dtypes)
print(f"memory reduction: {reduction:.1f}%")
The dtype output should show the intended category, unsigned integer, or float32 columns before the optimized frame is reused.
assert len(df) == baseline_rows

pd.testing.assert_frame_equal(
    baseline_keys.astype({"order_id": "uint64", "quantity": "uint64"}),
    df.filter(["order_id", "region", "priority", "quantity"]).astype(
        {"order_id": "uint64", "quantity": "uint64"}
    ),
    check_dtype=False,
)

print(df.iloc[:3].filter(["order_id", "region", "priority", "quantity"]))
Compare identifiers, labels, counts, totals, and any columns used for joins or filters before saving the optimized frame.
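A totals comparison can be sketched as a small helper. The name totals_match, the sample frames, and the relative tolerance of 1e-6 are assumptions for illustration; choose a tolerance that matches the precision the analysis actually needs.

```python
import numpy as np
import pandas as pd


def totals_match(original, optimized, columns, rtol=1e-6):
    """Compare column sums, accumulating the optimized side in float64."""
    return {
        col: bool(
            np.isclose(
                original[col].sum(),
                optimized[col].astype("float64").sum(),
                rtol=rtol,
            )
        )
        for col in columns
    }


# Hypothetical frames: the optimized copy uses narrower dtypes.
original = pd.DataFrame(
    {"quantity": [1, 2, 3, 4], "revenue": [149.95, 89.50, 212.25, 65.00]}
)
optimized = original.astype({"quantity": "uint8", "revenue": "float32"})

result = totals_match(original, optimized, ["quantity", "revenue"])
print(result)
```

Any column reported as False should be reverted to its original dtype or investigated before the optimized frame is saved.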
$ rm memory_reduce_check.py