How to read a large CSV in chunks with pandas

Large CSV files can exceed available memory when pandas loads every row into one DataFrame. Reading the file in chunks keeps only one manageable batch in memory while still allowing row counts, aggregations, filters, or exports to run across the whole file.

Passing chunksize to pandas.read_csv() returns a TextFileReader instead of a single DataFrame. The reader can be used as a context manager (pandas 1.2 and later) and yields one DataFrame per iteration. Process each chunk inside the loop, and keep only small summaries or output files once that chunk is finished.

Chunked reads fit streaming reductions, filtered exports, and row validation where the final result is smaller than the source file. Work that needs a full-table sort, global deduplication, or arbitrary joins may still need a database, Parquet workflow, or another out-of-core data tool.
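
As an illustration of a filtered export, the sketch below streams a CSV in chunks and appends only the matching rows to a second file. The export_region helper, the file names, and the sample rows are assumptions made for this example, not part of the steps that follow.

```python
import os
import tempfile

import pandas as pd


def export_region(source, target, region, chunksize=3):
    """Write rows for one region to target, one chunk at a time (hypothetical helper)."""
    rows_written = 0
    with pd.read_csv(source, chunksize=chunksize) as reader:
        for chunk_number, chunk in enumerate(reader):
            matches = chunk[chunk["region"] == region]
            # Create the file with a header on the first chunk, then append.
            matches.to_csv(
                target,
                mode="w" if chunk_number == 0 else "a",
                header=chunk_number == 0,
                index=False,
            )
            rows_written += len(matches)
    return rows_written


# Assumed sample data, written to a temporary directory for the demo.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "orders.csv")
target_path = os.path.join(workdir, "north_orders.csv")
with open(source_path, "w", newline="") as handle:
    handle.write(
        "order_id,region,amount\n"
        "1001,North,120\n1002,South,95\n1003,North,80\n1004,West,130\n"
    )

north_rows = export_region(source_path, target_path, "North")
print(f"rows written={north_rows}")
```

Only the current chunk and its filtered rows are held in memory, so the output file can grow well beyond what a single DataFrame could hold.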

Steps to read a large CSV in pandas chunks:

Save a representative CSV file.

orders.csv

order_id,region,amount
1001,North,120
1002,South,95
1003,North,80
1004,West,130
1005,South,110
1006,North,75
1007,West,60

Create a chunk reader script.

read_orders_in_chunks.py

import pandas as pd

# Accumulators: two scalars and a list of small per-chunk subtotals.
row_count = 0
amount_total = 0
region_parts = []

# chunksize=3 is deliberately tiny so the sample file yields several chunks.
with pd.read_csv(
    "orders.csv",
    usecols=["region", "amount"],
    dtype={"region": "string", "amount": "int64"},
    chunksize=3,
) as reader:
    for chunk_number, chunk in enumerate(reader, start=1):
        chunk_amount = chunk["amount"].sum()
        row_count += len(chunk)
        amount_total += chunk_amount
        # Keep only the per-region subtotal, not the chunk itself.
        region_parts.append(chunk.groupby("region")["amount"].sum())
        print(f"chunk {chunk_number}: rows={len(chunk)} amount={chunk_amount}")

# Merge the per-chunk subtotals into one total per region.
region_totals = pd.concat(region_parts).groupby(level=0).sum().sort_index()

print(f"rows processed={row_count}")
print(f"amount total={amount_total}")
print("region totals:")
print(region_totals.to_string())

usecols limits the columns loaded into each chunk. dtype skips type inference for the listed columns, which saves work and keeps the column types identical from one chunk to the next.
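
To see the effect on a single chunk, a rough check like the following compares the memory of a chunk read with and without usecols and dtype. The wide sample file with an extra notes column and the first_chunk helper are assumptions for illustration; actual savings depend on the real columns.

```python
import os
import tempfile

import pandas as pd

# Assumed sample file with columns that the totals do not need.
path = os.path.join(tempfile.mkdtemp(), "orders_wide.csv")
with open(path, "w", newline="") as handle:
    handle.write("order_id,region,amount,notes\n")
    for i in range(1000):
        handle.write(f"{1001 + i},North,{100 + i % 50},free-text note {i}\n")


def first_chunk(**read_options):
    """Return only the first chunk so the comparison stays small (hypothetical helper)."""
    with pd.read_csv(path, chunksize=500, **read_options) as reader:
        return next(iter(reader))


wide = first_chunk()
narrow = first_chunk(
    usecols=["region", "amount"],
    dtype={"region": "string", "amount": "int64"},
)

# deep=True includes the memory held by string values, not just pointers.
wide_bytes = int(wide.memory_usage(deep=True).sum())
narrow_bytes = int(narrow.memory_usage(deep=True).sum())
print(f"all columns: {wide_bytes} bytes, selected columns: {narrow_bytes} bytes")
print(narrow.dtypes.to_string())
```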

Run the chunk reader.

$ python read_orders_in_chunks.py
chunk 1: rows=3 amount=295
chunk 2: rows=3 amount=315
chunk 3: rows=1 amount=60
rows processed=7
amount total=670
region totals:
region
North    275
South    205
West     190

Check the chunk lines before trusting the final totals.

The three chunk lines show that pandas iterated the file in batches. The final rows processed=7 line should match the number of data rows expected from the source file.
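
One way to get the expected row count independently of pandas is to count records with the standard csv module, which also reads the file incrementally. This is a sketch: the count_data_rows helper and the temporary sample file are assumptions for illustration, and it assumes the file has exactly one header row.

```python
import csv
import os
import tempfile


def count_data_rows(path):
    """Count CSV records after the header without loading the file (hypothetical helper)."""
    with open(path, newline="") as handle:
        records = csv.reader(handle)
        next(records, None)  # skip the header row
        # pandas skips blank lines by default, so ignore empty records here too.
        return sum(1 for record in records if record)


# Assumed sample data matching the seven-row orders.csv used above.
path = os.path.join(tempfile.mkdtemp(), "orders.csv")
with open(path, "w", newline="") as handle:
    handle.write(
        "order_id,region,amount\n"
        "1001,North,120\n1002,South,95\n1003,North,80\n1004,West,130\n"
        "1005,South,110\n1006,North,75\n1007,West,60\n"
    )

expected_rows = count_data_rows(path)
print(f"expected rows={expected_rows}")
```

Counting parsed records rather than raw newlines keeps the figure correct when quoted fields contain line breaks.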

Set a production chunk size for the real CSV.

Raise chunksize to a larger value such as 50000 or 100000 only after confirming that the per-chunk code stays within memory at that size. Avoid appending every chunk to a list of DataFrame objects unless the combined result is intentionally small, because that rebuilds the full table in memory.
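
A rough starting point for chunksize can be derived from a sample chunk: measure bytes per row with memory_usage(deep=True) and divide a memory budget by that figure. The suggest_chunksize helper, the 64 MiB budget, and the generated sample file are assumptions for illustration. Per-chunk processing often needs several times the memory of the raw chunk, so treat the result as an upper bound to test rather than a guarantee.

```python
import os
import tempfile

import pandas as pd


def suggest_chunksize(path, budget_bytes, sample_rows=1000, **read_options):
    """Estimate how many rows per chunk fit a memory budget (hypothetical helper)."""
    with pd.read_csv(path, chunksize=sample_rows, **read_options) as reader:
        sample = next(iter(reader))  # assumes the file has at least one data row
    bytes_per_row = sample.memory_usage(deep=True).sum() / max(len(sample), 1)
    return max(1, int(budget_bytes // bytes_per_row))


# Assumed sample data: 2000 generated rows in a temporary file.
path = os.path.join(tempfile.mkdtemp(), "orders.csv")
regions = ["North", "South", "West"]
with open(path, "w", newline="") as handle:
    handle.write("order_id,region,amount\n")
    for i in range(2000):
        handle.write(f"{1001 + i},{regions[i % 3]},{60 + i % 80}\n")

read_options = {
    "usecols": ["region", "amount"],
    "dtype": {"region": "string", "amount": "int64"},
}
suggested = suggest_chunksize(path, 64 * 1024 * 1024, **read_options)
print(f"suggested chunksize={suggested}")
```

Passing the same usecols and dtype options as the real reader keeps the estimate close to what each production chunk will occupy.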

Author: Mohd Shakir Zakaria