Aggregating data with pandas groupby collapses detailed rows into totals, counts, and averages by one or more keys. Analysts use it when transaction, event, or measurement data needs a per-region, per-product, or per-period summary before reporting.
DataFrame.groupby() splits rows by the chosen key columns, and agg() applies one or more reductions to each group. Named aggregation gives each output column a clear name while pointing it at the source column and aggregation function.
Keep group keys as columns with as_index=False when the summary will be exported, joined, or printed like a normal table. Compare grouped totals with source totals before using the summary, especially when filters, missing group keys, or categorical columns may change which rows appear.
import pandas as pd orders = pd.DataFrame( { "region": [ "East", "East", "East", "West", "West", "West", "North", "North", ], "channel": [ "online", "store", "online", "online", "store", "store", "online", "partner", ], "order_id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008], "sales_usd": [1200, 750, 980, 1430, 1020, 650, 500, 780], "units": [12, 8, 10, 14, 9, 5, 4, 7], } ) group_columns = ["region", "channel"] summary = ( orders.groupby(group_columns, as_index=False, sort=False) .agg( order_count=("order_id", "count"), total_sales_usd=("sales_usd", "sum"), avg_order_usd=("sales_usd", "mean"), total_units=("units", "sum"), ) ) summary["avg_order_usd"] = summary["avg_order_usd"].round(2) source_total = orders["sales_usd"].sum() summary_total = summary["total_sales_usd"].sum() expected_groups = len(orders.loc[:, group_columns].drop_duplicates()) assert len(summary) == expected_groups assert source_total == summary_total print(f"pandas {pd.__version__}") print() print("GROUPED_SALES") print(summary.to_string(index=False)) print() print("VERIFY_TOTALS") print(f"source rows: {len(orders)}") print(f"group rows: {len(summary)}") print(f"source sales: {source_total}") print(f"summary sales: {summary_total}") print("verification: grouped totals match source rows")
group_columns holds the keys that define each output row. The summary uses one row for each unique region and channel pair.
$ python3 groupby_sales.py pandas 3.0.3 GROUPED_SALES region channel order_count total_sales_usd avg_order_usd total_units East online 2 2180 1090.0 22 East store 1 750 750.0 8 West online 1 1430 1430.0 14 West store 2 1670 835.0 14 North online 1 500 500.0 4 North partner 1 780 780.0 7 VERIFY_TOTALS source rows: 8 group rows: 6 source sales: 7310 summary sales: 7310 verification: grouped totals match source rows
group_columns = ["region", "channel"]
Use one column for a simple summary, or a list of columns when each output row needs a combined key such as region plus channel.
orders.groupby(group_columns, as_index=False, sort=False)
as_index=False returns SQL-style grouped output, which keeps region and channel in the result columns instead of moving them into the index.
summary = ( orders.groupby(group_columns, as_index=False, sort=False) .agg( order_count=("order_id", "count"), total_sales_usd=("sales_usd", "sum"), avg_order_usd=("sales_usd", "mean"), total_units=("units", "sum"), ) )
The first value in each tuple is the source column. The second value is the aggregation function applied to that column.
orders.groupby(group_columns, as_index=False, sort=False)
pandas sorts group keys by default. With sort=False, groups appear in the order each key first appears in the source DataFrame.
orders.groupby(group_columns, as_index=False, sort=False, dropna=False)
dropna=True is the default, so rows with missing group-key values are left out of grouped results unless dropna=False is set.
source_total = orders["sales_usd"].sum() summary_total = summary["total_sales_usd"].sum() expected_groups = len(orders.loc[:, group_columns].drop_duplicates()) assert len(summary) == expected_groups assert source_total == summary_total
Compare counts, totals, or other business-critical measures before exporting the grouped summary.
$ rm groupby_sales.py