Large CSV files can exceed available memory when pandas loads every row into one DataFrame. Reading the file in chunks keeps only one manageable batch in memory while still allowing row counts, aggregations, filters, or exports to run across the whole file.
Passing the chunksize argument to pandas.read_csv() makes it return a TextFileReader instead of a single DataFrame. The reader can be used as a context manager (pandas 1.2 and later) and yields one DataFrame per iteration. Process each chunk inside the loop and keep only small summaries or output files once that chunk is finished.
Chunked reads fit streaming reductions, filtered exports, and row validation where the final result is smaller than the source file. Work that needs a full-table sort, global deduplication, or arbitrary joins may still need a database, Parquet workflow, or another out-of-core data tool.
Related: How to read CSV files with pandas
Related: How to reduce pandas DataFrame memory usage
order_id,region,amount
1001,North,120
1002,South,95
1003,North,80
1004,West,130
1005,South,110
1006,North,75
1007,West,60
import pandas as pd

row_count = 0
amount_total = 0
region_parts = []  # one small per-region summary per chunk

with pd.read_csv(
    "orders.csv",
    usecols=["region", "amount"],
    dtype={"region": "string", "amount": "int64"},
    chunksize=3,
) as reader:
    for chunk_number, chunk in enumerate(reader, start=1):
        chunk_amount = chunk["amount"].sum()
        row_count += len(chunk)
        amount_total += chunk_amount
        # Keep only the aggregated totals, not the chunk itself.
        region_parts.append(chunk.groupby("region")["amount"].sum())
        print(f"chunk {chunk_number}: rows={len(chunk)} amount={chunk_amount}")

# Combine the per-chunk summaries into one total per region.
region_totals = pd.concat(region_parts).groupby(level=0).sum().sort_index()

print(f"rows processed={row_count}")
print(f"amount total={amount_total}")
print("region totals:")
print(region_totals.to_string())
usecols limits the columns loaded into each chunk. dtype skips type inference for known columns, which avoids the cost of guessing and keeps column types consistent from one chunk to the next.
$ python read_orders_in_chunks.py
chunk 1: rows=3 amount=295
chunk 2: rows=3 amount=315
chunk 3: rows=1 amount=60
rows processed=7
amount total=670
region totals:
region
North    275
South    205
West     190
The three chunk lines show that pandas iterated the file in batches. The final rows processed=7 line should match the number of data rows expected from the source file.
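One way to obtain an expected row count without loading the file is to count its lines directly and compare that number with the chunked total. The sketch below is illustrative only: the helper names count_data_lines and count_rows_in_chunks are hypothetical, the sample data is written to a temporary file, and the line count assumes that no quoted field contains an embedded newline.

```python
import os
import tempfile

import pandas as pd

SAMPLE = (
    "order_id,region,amount\n"
    "1001,North,120\n1002,South,95\n1003,North,80\n1004,West,130\n"
    "1005,South,110\n1006,North,75\n1007,West,60\n"
)


def count_data_lines(path):
    """Count non-empty lines after the header without parsing the file.

    Assumption: no quoted field spans more than one line.
    """
    with open(path, "r", encoding="utf-8", newline="") as handle:
        next(handle, None)  # skip the header row
        return sum(1 for line in handle if line.strip())


def count_rows_in_chunks(path, chunksize):
    """Count rows by iterating the file chunk by chunk with pandas."""
    total = 0
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for chunk in reader:
            total += len(chunk)
    return total


# Demo on a temporary copy of the sample data.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "orders.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)

    expected_rows = count_data_lines(csv_path)
    chunked_rows = count_rows_in_chunks(csv_path, chunksize=3)

print(f"expected={expected_rows} chunked={chunked_rows}")
if expected_rows != chunked_rows:
    print("row counts differ; check for blank lines or multi-line fields")
```

A mismatch does not always mean lost data. It can also indicate multi-line quoted fields, in which case the plain line count is the unreliable number.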
Raise chunksize to a value such as 50000 or 100000 only after confirming that the per-chunk code stays within memory. Avoid appending every chunk to a list of DataFrame objects unless the combined result is intentionally small.
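For a filtered export, matching rows can be written to the output file as each chunk is processed, so no list of DataFrame objects builds up. The following is a minimal sketch under stated assumptions: export_region is a hypothetical helper, the output name north_orders.csv is made up for the demo, and the sample data is written to a temporary directory so the block runs on its own.

```python
import os
import tempfile

import pandas as pd

SAMPLE = (
    "order_id,region,amount\n"
    "1001,North,120\n1002,South,95\n1003,North,80\n1004,West,130\n"
    "1005,South,110\n1006,North,75\n1007,West,60\n"
)


def export_region(src_path, dst_path, region, chunksize):
    """Stream the rows for one region from src_path into dst_path.

    Only one chunk is held in memory at a time. The header is written
    with the first chunk, even when that chunk has no matching rows.
    """
    rows_written = 0
    with open(dst_path, "w", encoding="utf-8", newline="") as out:
        with pd.read_csv(src_path, chunksize=chunksize) as reader:
            for index, chunk in enumerate(reader):
                matches = chunk[chunk["region"] == region]
                matches.to_csv(out, header=(index == 0), index=False)
                rows_written += len(matches)
    return rows_written


# Demo on a temporary copy of the sample data.
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "orders.csv")
    dst = os.path.join(tmp_dir, "north_orders.csv")
    with open(src, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)

    rows_written = export_region(src, dst, "North", chunksize=3)
    exported = pd.read_csv(dst)  # small result, safe to load whole

print(f"rows written={rows_written}")
print(exported.to_string(index=False))
```

Opening the output file once and passing the handle to to_csv avoids reopening the file for every chunk and keeps the header logic to a single condition.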