How to read a large CSV in chunks with pandas

Large CSV files can exceed available memory when pandas loads every row into one DataFrame. Reading the file in chunks keeps only one manageable batch in memory while still allowing row counts, aggregations, filters, or exports to run across the whole file.

Passing chunksize to pandas.read_csv() makes the call return a TextFileReader instead of a single DataFrame. The reader can be used as a context manager (pandas 1.2 and later) and yields one DataFrame per iteration. Process each chunk inside the loop, and keep only small summaries or output files once that chunk is finished.

Chunked reads fit streaming reductions, filtered exports, and row validation where the final result is smaller than the source file. Work that needs a full-table sort, global deduplication, or arbitrary joins may still need a database, Parquet workflow, or another out-of-core data tool.
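A filtered export is one of the cases where chunking fits well, because only the matching rows are ever written out. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: the in-memory sample data, the "North" filter, and the StringIO output buffer are all hypothetical stand-ins for a real source file and output path.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for a large CSV file.
source = io.StringIO(
    "order_id,region,amount\n"
    "1001,North,120\n"
    "1002,South,95\n"
    "1003,North,80\n"
    "1004,West,130\n"
)
# Hypothetical stand-in for an output file opened once for writing.
output = io.StringIO()

with pd.read_csv(source, chunksize=2) as reader:
    for chunk_number, chunk in enumerate(reader):
        # Keep only the rows of interest from this chunk.
        matches = chunk[chunk["region"] == "North"]
        # Write the header with the first chunk only, then append rows.
        matches.to_csv(output, index=False, header=(chunk_number == 0))

filtered_csv = output.getvalue()
print(filtered_csv)
```

With a real file, the same pattern would likely open the output file once before the loop and pass that handle to to_csv(), so that every chunk appends to the same export.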

Steps to read a large CSV in pandas chunks:

  1. Save a representative CSV file.
    orders.csv
    order_id,region,amount
    1001,North,120
    1002,South,95
    1003,North,80
    1004,West,130
    1005,South,110
    1006,North,75
    1007,West,60
  2. Create a chunk reader script.
    read_orders_in_chunks.py
    import pandas as pd

    # Running totals that stay small no matter how large the file is.
    row_count = 0
    amount_total = 0
    region_parts = []  # one small per-region Series per chunk

    # chunksize=3 is deliberately tiny so the sample file yields several chunks.
    with pd.read_csv(
        "orders.csv",
        usecols=["region", "amount"],
        dtype={"region": "string", "amount": "int64"},
        chunksize=3,
    ) as reader:
        for chunk_number, chunk in enumerate(reader, start=1):
            chunk_amount = chunk["amount"].sum()
            row_count += len(chunk)
            amount_total += chunk_amount
            # Keep only the aggregated result, not the chunk itself.
            region_parts.append(chunk.groupby("region")["amount"].sum())
            print(f"chunk {chunk_number}: rows={len(chunk)} amount={chunk_amount}")

    # Combine the per-chunk partial sums into one total per region.
    region_totals = pd.concat(region_parts).groupby(level=0).sum().sort_index()

    print(f"rows processed={row_count}")
    print(f"amount total={amount_total}")
    print("region totals:")
    print(region_totals.to_string())

    usecols limits the columns loaded into each chunk. dtype skips type inference for known columns and keeps their types consistent from one chunk to the next.

  3. Run the chunk reader.
    $ python read_orders_in_chunks.py
    chunk 1: rows=3 amount=295
    chunk 2: rows=3 amount=315
    chunk 3: rows=1 amount=60
    rows processed=7
    amount total=670
    region totals:
    region
    North    275
    South    205
    West     190
  4. Check the chunk lines before trusting the final totals.

    The three chunk lines confirm that pandas read the file in batches of at most three rows. The final rows processed=7 line should match the number of data rows in the source file, not counting the header line.

  5. Set a production chunk size for the real CSV.

    Move to a larger chunksize such as 50000 or 100000 only after confirming that processing a single chunk of that size stays within available memory. Avoid appending every chunk to a list of DataFrame objects unless the combined result is intentionally small, because that rebuilds the full table in memory.
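One way to choose a chunk size is to measure how much memory a small sample chunk uses per row and scale that to a budget. The sketch below is a rough estimate under stated assumptions: the in-memory sample data and the 200 MiB per-chunk budget are hypothetical, and the figure covers only the loaded chunk, not any temporary copies made by the per-chunk processing.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for the real CSV file.
sample = io.StringIO(
    "order_id,region,amount\n"
    "1001,North,120\n"
    "1002,South,95\n"
    "1003,North,80\n"
    "1004,West,130\n"
)

with pd.read_csv(
    sample,
    usecols=["region", "amount"],
    dtype={"region": "string", "amount": "int64"},
    chunksize=3,
) as reader:
    # Read a single chunk and stop; the rest of the file is not loaded.
    first_chunk = next(reader)

# deep=True also counts the memory held by string values.
bytes_per_row = first_chunk.memory_usage(deep=True).sum() / len(first_chunk)

# Hypothetical budget for one chunk: 200 MiB.
memory_budget_bytes = 200 * 1024**2
suggested_chunksize = int(memory_budget_bytes // bytes_per_row)

print(f"approx bytes per row: {bytes_per_row:.1f}")
print(f"suggested chunksize: {suggested_chunksize}")
```

Treat the result as an upper bound to test against rather than a final setting, since groupby results, filters, and copies made inside the loop add to peak memory.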