How to convert columns to categorical data in pandas

Converting a pandas column to the category dtype stores each distinct label once and represents every row as a small integer code, instead of repeating ordinary string values. Use it for fields such as teams, regions, states, ratings, or priorities when a limited set of labels drives grouping, sorting, plotting, or memory use.

A plain astype("category") conversion lets pandas infer the labels that are already present in a column. Use CategoricalDtype when the allowed labels or sort order must be explicit, especially for ordered values such as low, normal, and urgent.
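
The difference between inferred and explicit categories can be sketched as follows. The sample values are hypothetical and deliberately leave out the low label, so the two conversions end up with different category lists.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample: no row currently carries the "low" priority.
priority = pd.Series(["normal", "urgent", "normal"])

# Inferred: pandas collects only the labels that are present.
inferred = priority.astype("category")

# Explicit: the declared labels and their order are kept, used or not.
explicit = priority.astype(
    CategoricalDtype(categories=["low", "normal", "urgent"], ordered=True)
)

print(list(inferred.cat.categories))  # ['normal', 'urgent']
print(list(explicit.cat.categories))  # ['low', 'normal', 'urgent']
print(inferred.cat.ordered, explicit.cat.ordered)  # False True
```

An inferred conversion is unordered by default, which is why the explicit dtype is the safer choice when a label may be absent from the current data or when order matters.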

Categories are not a general replacement for every text column. They work best when values repeat; if most rows contain unique strings, the category metadata can use as much or more memory than the original column. Values outside an explicit category list become missing, so check the converted columns before using them downstream.
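
As a rough illustration of the repetition point, the sketch below compares a column of repeated labels with a column of unique strings. The helper name category_bytes and both sample columns are assumptions made for this example, and exact byte counts depend on the pandas version and string backend, so none are shown.

```python
import pandas as pd


def category_bytes(labels: pd.Series) -> tuple[int, int]:
    """Return (bytes as stored, bytes after conversion to category)."""
    before = int(labels.memory_usage(deep=True))
    after = int(labels.astype("category").memory_usage(deep=True))
    return before, after


# Hypothetical columns: three repeated labels versus 4,000 unique strings.
repeated = pd.Series(["api", "frontend", "ops", "api"] * 1000)
unique = pd.Series([f"ticket-{i}" for i in range(4000)])

for name, column in [("repeated", repeated), ("unique", unique)]:
    before, after = category_bytes(column)
    distinct = column.nunique()
    print(f"{name}: {before} -> {after} bytes ({distinct} distinct labels)")
```

The repeated column should shrink noticeably, while the unique column typically does not, because every distinct string still has to be stored once in the category list alongside the integer codes.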

Steps to convert pandas columns to category dtype:

Save a short categorical conversion script.

categorical_dtype.py

```
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(
    {
        "ticket": [101, 102, 103, 104, 105, 106],
        "team": ["api", "frontend", "api", "ops", "frontend", "api"],
        "priority": ["normal", "urgent", "low", "normal", "low", "urgent"],
    }
)

print(f"pandas {pd.__version__}")
print()

print("source dtypes")
print(df.dtypes)
print()

df["team"] = df["team"].astype("category")

print("team dtype")
print(df["team"].dtype)
print()

print("team categories")
print(df["team"].cat.categories)
print()

label_sample = pd.Series(["api", "frontend", "ops", "api"] * 1000, dtype="str")
memory = pd.Series(
    {
        "str_bytes": label_sample.memory_usage(deep=True),
        "category_bytes": label_sample.astype("category").memory_usage(deep=True),
    }
)

print("memory check")
print(memory)
print()

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df["priority"] = df["priority"].astype(priority_dtype)

print("priority dtype")
print(df["priority"].dtype)
print()

print("priority categories")
print(df["priority"].cat.categories)
print()

print("priority ordered")
print(df["priority"].cat.ordered)
print()

print("sorted by priority")
print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
print()

print("missing after conversion")
print(df.filter(["team", "priority"]).isna().sum())
```

Replace the sample df with the DataFrame already loaded in the working script. Keep the CategoricalDtype categories in the intended sort order.

Run the script and confirm that the source label columns become categorical.

```
$ python3 categorical_dtype.py
pandas 3.0.3

source dtypes
ticket      int64
team          str
priority      str
dtype: object

team dtype
category

team categories
Index(['api', 'frontend', 'ops'], dtype='str')

memory check
str_bytes         49132
category_bytes     4171
dtype: int64

priority dtype
category

priority categories
Index(['low', 'normal', 'urgent'], dtype='str')

priority ordered
True

sorted by priority
 ticket priority
    103      low
    105      low
    101   normal
    104   normal
    102   urgent
    106   urgent

missing after conversion
team        0
priority    0
dtype: int64
```

The small memory check uses repeated labels to show the kind of column where category can reduce memory. Measure the real column before treating memory savings as guaranteed.

Convert an unordered label column when the existing labels are the allowed categories.
```
df["team"] = df["team"].astype("category")
```
pandas infers the category labels from the values present in the column. Inspect df["team"].cat.categories before relying on the label set for validation or export.
Convert an ordered column with an explicit category list.
```
priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df["priority"] = df["priority"].astype(priority_dtype)
```
Any non-missing value that is not listed in categories becomes missing after conversion. Add or correct allowed labels before converting production data.
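
One way to catch unlisted labels before they turn into missing values is to compare the source labels with the allowed list first. This is a minimal sketch; the misspelled label and the correction mapping are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

allowed = ["low", "normal", "urgent"]

# Hypothetical source column containing one misspelled label.
priority = pd.Series(["normal", "urgnet", "low", "urgent"])

# Labels present in the data but absent from the allowed list.
unlisted = sorted(set(priority.dropna()) - set(allowed))
print(unlisted)  # ['urgnet']

# Correct the source values, then convert once every label is allowed.
priority = priority.replace({"urgnet": "urgent"})
remaining = sorted(set(priority.dropna()) - set(allowed))
converted = priority.astype(CategoricalDtype(categories=allowed, ordered=True))

print(remaining)                    # []
print(int(converted.isna().sum()))  # 0
```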
Verify the converted dtype, categories, and order flag.
```
print(df["priority"].dtype)
print(df["priority"].cat.categories)
print(df["priority"].cat.ordered)
```
cat.ordered must be True for ordered comparisons such as < and >= and for min() and max(). sort_values() follows the category order whether or not the flag is set.
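
The effect of the order flag can be sketched with a small standalone Series. The sample values are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
priority = pd.Series(["normal", "urgent", "low"], dtype=priority_dtype)

# Ordered categoricals can be compared against one of their own labels.
at_least_normal = priority >= "normal"
print(at_least_normal.tolist())  # [True, True, False]

# min() and max() follow the declared category order.
print(priority.min(), priority.max())  # low urgent

# The same comparison on an unordered categorical raises TypeError.
unordered = priority.cat.as_unordered()
try:
    unordered >= "normal"
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)  # True
```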
Check that conversion did not create unexpected missing values.
```
print(df.filter(["team", "priority"]).isna().sum())
```
A nonzero count after conversion usually means the explicit category list missed at least one source value.
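
If the source column already contains missing values, a raw count cannot tell old gaps from new ones. A possible approach, assuming the unconverted column is still available, is to compare the two columns row by row. The sample data and the critical label are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical source: one row is already missing, one label is not allowed.
source = pd.Series(["low", None, "critical", "urgent"])
priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
converted = source.astype(priority_dtype)

# Rows that held a value before conversion but are missing afterwards.
newly_missing = source.notna() & converted.isna()
lost_labels = sorted(set(source[newly_missing]))

print(int(newly_missing.sum()))  # 1
print(lost_labels)               # ['critical']
```

The labels reported here are the ones to add to the category list or correct in the source data.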
Sort or compare the ordered column before using it in downstream output.
```
print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
```
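The sorted output should follow the category order low, normal, urgent rather than source-row order. These three labels also happen to be in alphabetical order, so the sample alone does not distinguish category order from a plain string sort.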
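
To see category order diverge from alphabetical order, the sketch below uses a hypothetical severity column whose labels are not part of the script above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels whose intended order differs from alphabetical order.
severity = pd.Series(["medium", "high", "low", "medium"])

# Plain strings sort alphabetically.
alphabetical = severity.sort_values().tolist()
print(alphabetical)  # ['high', 'low', 'medium', 'medium']

# The categorical column sorts by the declared category order.
severity_dtype = CategoricalDtype(
    categories=["low", "medium", "high"],
    ordered=True,
)
by_category = severity.astype(severity_dtype).sort_values().tolist()
print(by_category)  # ['low', 'medium', 'medium', 'high']
```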
Related: How to sort a pandas DataFrame

Author: Mohd Shakir Zakaria