How to aggregate data with pandas groupby

Aggregating data with pandas groupby collapses detailed rows into totals, counts, and averages by one or more keys. Analysts use it when transaction, event, or measurement data needs a per-region, per-product, or per-period summary before reporting.

DataFrame.groupby() splits rows by the chosen key columns, and agg() applies one or more reductions to each group. Named aggregation gives each output column a clear name while pointing it at the source column and aggregation function.

Keep group keys as columns with as_index=False when the summary will be exported, joined, or printed like a normal table. Compare grouped totals with source totals before using the summary, especially when filters, missing group keys, or categorical columns may change which rows appear.

Steps to aggregate data with pandas groupby:

Create a Python script named groupby_sales.py.

groupby_sales.py

import pandas as pd
 
orders = pd.DataFrame(
    {
        "region": [
            "East",
            "East",
            "East",
            "West",
            "West",
            "West",
            "North",
            "North",
        ],
        "channel": [
            "online",
            "store",
            "online",
            "online",
            "store",
            "store",
            "online",
            "partner",
        ],
        "order_id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
        "sales_usd": [1200, 750, 980, 1430, 1020, 650, 500, 780],
        "units": [12, 8, 10, 14, 9, 5, 4, 7],
    }
)
 
group_columns = ["region", "channel"]
 
summary = (
    orders.groupby(group_columns, as_index=False, sort=False)
    .agg(
        order_count=("order_id", "count"),
        total_sales_usd=("sales_usd", "sum"),
        avg_order_usd=("sales_usd", "mean"),
        total_units=("units", "sum"),
    )
)
 
summary["avg_order_usd"] = summary["avg_order_usd"].round(2)
 
source_total = orders["sales_usd"].sum()
summary_total = summary["total_sales_usd"].sum()
expected_groups = len(orders.loc[:, group_columns].drop_duplicates())
 
assert len(summary) == expected_groups
assert source_total == summary_total
 
print(f"pandas {pd.__version__}")
print()
print("GROUPED_SALES")
print(summary.to_string(index=False))
print()
print("VERIFY_TOTALS")
print(f"source rows: {len(orders)}")
print(f"group rows: {len(summary)}")
print(f"source sales: {source_total}")
print(f"summary sales: {summary_total}")
print("verification: grouped totals match source rows")

group_columns holds the keys that define each output row. The summary uses one row for each unique region and channel pair.

Run the script and confirm the grouped summary.

$ python3 groupby_sales.py
pandas 3.0.3

GROUPED_SALES
region channel  order_count  total_sales_usd  avg_order_usd  total_units
  East  online            2             2180         1090.0           22
  East   store            1              750          750.0            8
  West  online            1             1430         1430.0           14
  West   store            2             1670          835.0           14
 North  online            1              500          500.0            4
 North partner            1              780          780.0            7

VERIFY_TOTALS
source rows: 8
group rows: 6
source sales: 7310
summary sales: 7310
verification: grouped totals match source rows

Set the grouping keys with a column list.
```
group_columns = ["region", "channel"]
```
Use one column for a simple summary, or a list of columns when each output row needs a combined key such as region plus channel.
Keep the group keys as normal output columns.
```
orders.groupby(group_columns, as_index=False, sort=False)
```
as_index=False returns SQL-style grouped output, which keeps region and channel in the result columns instead of moving them into the index.

Name each aggregated output column with agg().

summary = (
    orders.groupby(group_columns, as_index=False, sort=False)
    .agg(
        order_count=("order_id", "count"),
        total_sales_usd=("sales_usd", "sum"),
        avg_order_usd=("sales_usd", "mean"),
        total_units=("units", "sum"),
    )
)

The first value in each tuple is the source column. The second value is the aggregation function applied to that column.

Preserve first-seen group order when report row order matters.
```
orders.groupby(group_columns, as_index=False, sort=False)
```
pandas sorts group keys by default. With sort=False, groups appear in the order each key first appears in the source DataFrame.
Include missing group keys only when they should form their own group.
```
orders.groupby(group_columns, as_index=False, sort=False, dropna=False)
```
dropna=True is the default, so rows with missing group-key values are left out of grouped results unless dropna=False is set.

Verify grouped totals against the source rows.

source_total = orders["sales_usd"].sum()
summary_total = summary["total_sales_usd"].sum()
expected_groups = len(orders.loc[:, group_columns].drop_duplicates())
 
assert len(summary) == expected_groups
assert source_total == summary_total

Compare counts, totals, or other business-critical measures before exporting the grouped summary.

Remove the sample script after adapting the grouping pattern.
```
$ rm groupby_sales.py
```

Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.