Parquet files store tabular data in a columnar format that pandas can write and read without flattening every value into CSV text. Parquet keeps typed columns and supports selective column reads when DataFrame output moves to the next Python or analytics job.
The pyarrow engine keeps the read and write path explicit, while compression="snappy" matches the default pandas compression choice for Parquet writes. Setting index=False keeps the written file focused on data columns instead of adding an index field for non-pandas consumers.
A round-trip read should show the same rows, the expected dtypes, and only the selected columns when the columns argument limits the read. Keep the same checks when replacing the small DataFrame with a production export, especially when indexes, categorical columns, or object-heavy columns could change the file schema.
Related: How to read CSV files with pandas
Related: How to write a CSV file with pandas
Related: How to read and write JSON with pandas
Steps to read and write Parquet files with pandas:
- Install the Parquet engine package if the active Python environment does not already have one.
$ python3 -m pip install pyarrow
pandas requires pyarrow or fastparquet for Parquet files. Using pyarrow keeps the engine behavior explicit.
Related: How to install pandas with pip
- Create a Parquet round-trip script.
parquet_roundtrip.py
from pathlib import Path

import pandas as pd

path = Path("orders.parquet")

# Small sample frame with string and float columns.
orders = pd.DataFrame(
    {
        "order_id": ["A100", "A101", "A102"],
        "customer": ["Ada", "Lin", "Maya"],
        "region": ["EMEA", "APAC", "AMER"],
        "total_usd": [149.50, 88.00, 212.25],
    }
)

# Write the frame without its index, using the pyarrow engine.
orders.to_parquet(
    path,
    engine="pyarrow",
    compression="snappy",
    index=False,
)

# Read the whole file back, then read only two columns.
round_trip = pd.read_parquet(path, engine="pyarrow")
selected = pd.read_parquet(
    path,
    engine="pyarrow",
    columns=["order_id", "total_usd"],
)

# Report the data, dtypes, and round-trip checks.
print(round_trip.to_string(index=False))
print()
print(round_trip.dtypes)
print()
print(f"rows match: {len(round_trip) == len(orders)}")
print(f"columns: {', '.join(round_trip.columns)}")
print(f"selected columns: {', '.join(selected.columns)}")
print(f"order IDs match: {round_trip['order_id'].tolist() == orders['order_id'].tolist()}")
index=False omits the DataFrame index from the Parquet file. Omit the argument or set index=True only when the index carries business data that another reader needs.
- Run the script to write the Parquet file and read it back.
$ python3 parquet_roundtrip.py
order_id customer region  total_usd
    A100      Ada   EMEA     149.50
    A101      Lin   APAC      88.00
    A102     Maya   AMER     212.25

order_id         str
customer         str
region           str
total_usd    float64
dtype: object

rows match: True
columns: order_id, customer, region, total_usd
selected columns: order_id, total_usd
order IDs match: True
The row check, selected-column check, and matching order IDs confirm that to_parquet() wrote the file and read_parquet() loaded the expected data.
- Remove the temporary files after the round-trip behavior is confirmed.
$ rm orders.parquet parquet_roundtrip.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.