How to read and write Parquet files with pandas

Parquet stores tabular data in a columnar format that pandas can write and read directly, without flattening every value into CSV text. The format preserves column types and lets a reader load only the columns it needs, which helps when DataFrame output feeds the next Python or analytics job.

Passing engine="pyarrow" keeps the read and write path explicit, while compression="snappy" matches the default pandas compression choice for Parquet writes. Setting index=False keeps the written file focused on data columns instead of adding an index field that non-pandas consumers would have to handle.

A round-trip read should return the same rows and the expected dtypes, and only the requested columns when the columns argument limits the read. Keep the same checks when replacing the small DataFrame with a production export, especially when indexes, categorical columns, or object-heavy columns could change the file schema.

Steps to read and write Parquet files with pandas:

Install the Parquet engine package if the active Python environment does not already have one.
```
$ python3 -m pip install pyarrow
```
pandas needs either pyarrow or fastparquet installed to read or write Parquet files. Installing pyarrow and passing engine="pyarrow" pins the engine instead of relying on automatic selection.
Related: How to install pandas with pip
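
To see whether an engine is already present before installing anything, a short check with the standard library's importlib.util.find_spec is enough. This is a sketch; the variable names are illustrative.

```python
import importlib.util

# Engines pandas can use for Parquet, in the order engine="auto" tries them.
CANDIDATE_ENGINES = ("pyarrow", "fastparquet")

# find_spec returns None when a top-level package is not installed.
available = [
    name for name in CANDIDATE_ENGINES if importlib.util.find_spec(name) is not None
]

if available:
    print(f"Parquet engines found: {', '.join(available)}")
else:
    print("No Parquet engine found; install pyarrow or fastparquet.")
```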

Create a Parquet round-trip script.

parquet_roundtrip.py

from pathlib import Path
 
import pandas as pd
 
 
path = Path("orders.parquet")
 
orders = pd.DataFrame(
    {
        "order_id": ["A100", "A101", "A102"],
        "customer": ["Ada", "Lin", "Maya"],
        "region": ["EMEA", "APAC", "AMER"],
        "total_usd": [149.50, 88.00, 212.25],
    }
)
 
orders.to_parquet(
    path,
    engine="pyarrow",
    compression="snappy",
    index=False,
)
 
round_trip = pd.read_parquet(path, engine="pyarrow")
selected = pd.read_parquet(
    path,
    engine="pyarrow",
    columns=["order_id", "total_usd"],
)
 
print(round_trip.to_string(index=False))
print()
print(round_trip.dtypes)
print()
print(f"rows match: {len(round_trip) == len(orders)}")
print(f"columns: {', '.join(round_trip.columns)}")
print(f"selected columns: {', '.join(selected.columns)}")
print(f"order IDs match: {round_trip['order_id'].tolist() == orders['order_id'].tolist()}")

index=False omits the DataFrame index from the Parquet file. Omit the argument (the default saves the index) or set index=True only when the index carries business data that another reader needs.

Run the script to write the Parquet file and read it back.

$ python3 parquet_roundtrip.py
order_id customer region  total_usd
    A100      Ada   EMEA     149.50
    A101      Lin   APAC      88.00
    A102     Maya   AMER     212.25

order_id         str
customer         str
region           str
total_usd    float64
dtype: object

rows match: True
columns: order_id, customer, region, total_usd
selected columns: order_id, total_usd
order IDs match: True

The row check, selected-column check, and matching order IDs confirm that to_parquet() wrote the file and read_parquet() loaded the expected data.

Remove the temporary files after the round-trip behavior is confirmed.
```
$ rm orders.parquet parquet_roundtrip.py
```

Author: Mohd Shakir Zakaria