Aggregating data with pandas groupby collapses detailed rows into totals, counts, and averages by one or more keys. Analysts use it when transaction, event, or measurement data needs a per-region, per-product, or per-period summary before reporting.
DataFrame.groupby() splits rows by the chosen key columns, and agg() applies one or more reductions to each group. Named aggregation gives each output column a clear name while pointing it at the source column and aggregation function.
Keep group keys as columns with as_index=False when the summary will be exported, joined, or printed like a normal table. Compare grouped totals with source totals before using the summary, especially when filters, missing group keys, or categorical columns may change which rows appear.
Steps to aggregate data with pandas groupby:
- Create a Python script named groupby_sales.py.
- groupby_sales.py
import pandas as pd orders = pd.DataFrame( { "region": [ "East", "East", "East", "West", "West", "West", "North", "North", ], "channel": [ "online", "store", "online", "online", "store", "store", "online", "partner", ], "order_id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008], "sales_usd": [1200, 750, 980, 1430, 1020, 650, 500, 780], "units": [12, 8, 10, 14, 9, 5, 4, 7], } ) group_columns = ["region", "channel"] summary = ( orders.groupby(group_columns, as_index=False, sort=False) .agg( order_count=("order_id", "count"), total_sales_usd=("sales_usd", "sum"), avg_order_usd=("sales_usd", "mean"), total_units=("units", "sum"), ) ) summary["avg_order_usd"] = summary["avg_order_usd"].round(2) source_total = orders["sales_usd"].sum() summary_total = summary["total_sales_usd"].sum() expected_groups = len(orders.loc[:, group_columns].drop_duplicates()) assert len(summary) == expected_groups assert source_total == summary_total print(f"pandas {pd.__version__}") print() print("GROUPED_SALES") print(summary.to_string(index=False)) print() print("VERIFY_TOTALS") print(f"source rows: {len(orders)}") print(f"group rows: {len(summary)}") print(f"source sales: {source_total}") print(f"summary sales: {summary_total}") print("verification: grouped totals match source rows")
group_columns holds the keys that define each output row. The summary uses one row for each unique region and channel pair.
- Run the script and confirm the grouped summary.
$ python3 groupby_sales.py pandas 3.0.3 GROUPED_SALES region channel order_count total_sales_usd avg_order_usd total_units East online 2 2180 1090.0 22 East store 1 750 750.0 8 West online 1 1430 1430.0 14 West store 2 1670 835.0 14 North online 1 500 500.0 4 North partner 1 780 780.0 7 VERIFY_TOTALS source rows: 8 group rows: 6 source sales: 7310 summary sales: 7310 verification: grouped totals match source rows
- Set the grouping keys with a column list.
group_columns = ["region", "channel"]
Use one column for a simple summary, or a list of columns when each output row needs a combined key such as region plus channel.
- Keep the group keys as normal output columns.
orders.groupby(group_columns, as_index=False, sort=False)
as_index=False returns SQL-style grouped output, which keeps region and channel in the result columns instead of moving them into the index.
- Name each aggregated output column with agg().
summary = ( orders.groupby(group_columns, as_index=False, sort=False) .agg( order_count=("order_id", "count"), total_sales_usd=("sales_usd", "sum"), avg_order_usd=("sales_usd", "mean"), total_units=("units", "sum"), ) )
The first value in each tuple is the source column. The second value is the aggregation function applied to that column.
- Preserve first-seen group order when report row order matters.
orders.groupby(group_columns, as_index=False, sort=False)
pandas sorts group keys by default. With sort=False, groups appear in the order each key first appears in the source DataFrame.
- Include missing group keys only when they should form their own group.
orders.groupby(group_columns, as_index=False, sort=False, dropna=False)
dropna=True is the default, so rows with missing group-key values are left out of grouped results unless dropna=False is set.
- Verify grouped totals against the source rows.
source_total = orders["sales_usd"].sum() summary_total = summary["total_sales_usd"].sum() expected_groups = len(orders.loc[:, group_columns].drop_duplicates()) assert len(summary) == expected_groups assert source_total == summary_total
Compare counts, totals, or other business-critical measures before exporting the grouped summary.
- Remove the sample script after adapting the grouping pattern.
$ rm groupby_sales.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.