Large CSV files can exceed available memory when pandas loads every row into one DataFrame. Reading the file in chunks keeps only one manageable batch in memory while still allowing row counts, aggregations, filters, or exports to run across the whole file.
Passing chunksize to pandas.read_csv() returns a TextFileReader instead of a DataFrame. The reader works as a context manager and yields one DataFrame per iteration. Process each chunk inside the loop and keep only small summaries or output files once that chunk is finished.
Chunked reads fit streaming reductions, filtered exports, and row validation where the final result is smaller than the source file. Work that needs a full-table sort, global deduplication, or arbitrary joins may still need a database, Parquet workflow, or another out-of-core data tool.
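A filtered export is one of the simpler cases to sketch. The example below is a minimal illustration, not a definitive implementation: it assumes a small sample file written to a temporary directory, a hypothetical output name north_orders.csv, and a filter on the region column. Each chunk is filtered and appended to the output, so only one chunk is held in memory at a time.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; a real source file would be far larger.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "orders.csv")
export_path = os.path.join(workdir, "north_orders.csv")  # assumed output name

with open(source_path, "w", newline="") as handle:
    handle.write(
        "order_id,region,amount\n"
        "1001,North,120\n1002,South,95\n1003,North,80\n"
        "1004,West,130\n1005,South,110\n1006,North,75\n1007,West,60\n"
    )

rows_written = 0
with pd.read_csv(source_path, chunksize=3) as reader:
    for chunk in reader:
        north = chunk[chunk["region"] == "North"]
        if north.empty:
            continue
        # Create the file with a header on the first write, then append.
        first_write = rows_written == 0
        north.to_csv(
            export_path,
            mode="w" if first_write else "a",
            header=first_write,
            index=False,
        )
        rows_written += len(north)

exported = pd.read_csv(export_path)
print(f"rows exported={rows_written}")
```

Writing the header only on the first write keeps the exported file a valid CSV regardless of how many chunks contribute rows.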
Related: How to read CSV files with pandas
Related: How to reduce pandas DataFrame memory usage
Steps to read a large CSV in pandas chunks:
- Save a representative CSV file.
- orders.csv
order_id,region,amount
1001,North,120
1002,South,95
1003,North,80
1004,West,130
1005,South,110
1006,North,75
1007,West,60
- Create a chunk reader script.
- read_orders_in_chunks.py
import pandas as pd

row_count = 0
amount_total = 0
region_parts = []

with pd.read_csv(
    "orders.csv",
    usecols=["region", "amount"],
    dtype={"region": "string", "amount": "int64"},
    chunksize=3,
) as reader:
    for chunk_number, chunk in enumerate(reader, start=1):
        chunk_amount = chunk["amount"].sum()
        row_count += len(chunk)
        amount_total += chunk_amount
        region_parts.append(chunk.groupby("region")["amount"].sum())
        print(f"chunk {chunk_number}: rows={len(chunk)} amount={chunk_amount}")

region_totals = pd.concat(region_parts).groupby(level=0).sum().sort_index()

print(f"rows processed={row_count}")
print(f"amount total={amount_total}")
print("region totals:")
print(region_totals.to_string())
usecols limits the columns loaded into each chunk. dtype skips type inference for known columns, which also keeps column types consistent from one chunk to the next.
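The effect of usecols and dtype can be checked on a single chunk before running the whole file. The sketch below assumes the same sample rows held in an in-memory string (csv_text is a stand-in for the real file) and reads the first chunk twice, once with defaults and once with the narrowed options, then reports the memory each one uses.

```python
import io

import pandas as pd

# Stand-in for the real file: the sample rows as an in-memory string.
csv_text = (
    "order_id,region,amount\n"
    "1001,North,120\n1002,South,95\n1003,North,80\n"
    "1004,West,130\n1005,South,110\n1006,North,75\n1007,West,60\n"
)

# First chunk with default options: every column, inferred types.
with pd.read_csv(io.StringIO(csv_text), chunksize=3) as reader:
    full_chunk = reader.get_chunk()

# First chunk with only the needed columns and explicit types.
with pd.read_csv(
    io.StringIO(csv_text),
    usecols=["region", "amount"],
    dtype={"region": "string", "amount": "int64"},
    chunksize=3,
) as reader:
    narrow_chunk = reader.get_chunk()

# deep=True includes the memory held by string values.
full_bytes = int(full_chunk.memory_usage(deep=True).sum())
narrow_bytes = int(narrow_chunk.memory_usage(deep=True).sum())

print(f"default columns: {list(full_chunk.columns)} bytes={full_bytes}")
print(f"narrow columns:  {list(narrow_chunk.columns)} bytes={narrow_bytes}")
print(narrow_chunk.dtypes.to_string())
```

The exact byte counts depend on the pandas version and string storage backend, so they are best compared against each other rather than against a fixed number.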
- Run the chunk reader.
$ python read_orders_in_chunks.py
chunk 1: rows=3 amount=295
chunk 2: rows=3 amount=315
chunk 3: rows=1 amount=60
rows processed=7
amount total=670
region totals:
region
North    275
South    205
West     190
- Check the chunk lines before trusting the final totals.
The three chunk lines show that pandas iterated the file in batches. The final rows processed=7 line should match the number of data rows expected from the source file.
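One way to get the expected row count is to count the file's lines without parsing them. The sketch below is a rough check under stated assumptions: the file has exactly one header line, no blank lines, and no quoted fields that contain line breaks. If any of those assumptions fails, the line count will not equal the row count. The sample file is written to a temporary directory for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file for illustration.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "orders.csv")

with open(source_path, "w", newline="") as handle:
    handle.write(
        "order_id,region,amount\n"
        "1001,North,120\n1002,South,95\n1003,North,80\n"
        "1004,West,130\n1005,South,110\n1006,North,75\n1007,West,60\n"
    )

# Stream the file line by line; subtract one for the header.
with open(source_path, "r", newline="") as handle:
    expected_rows = sum(1 for _ in handle) - 1

# Count the rows pandas actually yields across all chunks.
row_count = 0
with pd.read_csv(source_path, chunksize=3) as reader:
    for chunk in reader:
        row_count += len(chunk)

print(f"expected rows={expected_rows}")
print(f"rows processed={row_count}")
if row_count != expected_rows:
    print("row count mismatch: inspect the file for blank or multi-line rows")
```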
- Set a production chunk size for the real CSV.
Raise chunksize to a value such as 50000 or 100000 only after confirming that the per-chunk processing stays within available memory at that size. Avoid appending every chunk to a list of DataFrame objects unless the combined result is intentionally small.
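Both points can be sketched together: track the largest chunk's memory while folding each chunk's summary into a running total, so no list of per-chunk results grows with the file. This is an illustrative variant of the script above, not a replacement for it. The names peak_bytes and csv_text are assumptions, and chunksize=3 is only suitable for the sample data.

```python
import io

import pandas as pd

# Stand-in for the real file: the sample rows as an in-memory string.
csv_text = (
    "order_id,region,amount\n"
    "1001,North,120\n1002,South,95\n1003,North,80\n"
    "1004,West,130\n1005,South,110\n1006,North,75\n1007,West,60\n"
)

region_totals = None  # running Series of per-region sums
peak_bytes = 0        # largest chunk seen, in bytes

with pd.read_csv(
    io.StringIO(csv_text),
    usecols=["region", "amount"],
    chunksize=3,
) as reader:
    for chunk in reader:
        peak_bytes = max(peak_bytes, int(chunk.memory_usage(deep=True).sum()))
        partial = chunk.groupby("region")["amount"].sum()
        if region_totals is None:
            region_totals = partial
        else:
            # fill_value=0 treats a region missing from either side as zero.
            region_totals = region_totals.add(partial, fill_value=0)

# Alignment can promote the sums to float; cast back to integers.
region_totals = region_totals.astype("int64").sort_index()

print(f"peak chunk bytes={peak_bytes}")
print(region_totals.to_string())
```

If the peak chunk size at a trial chunksize is a small fraction of available memory, a larger chunksize is a reasonable next test. The running total stays as small as the number of distinct regions.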
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.