Converting a column to the category dtype in pandas stores each distinct label once and represents the rows as compact integer codes instead of ordinary string values. Use it for fields such as teams, regions, states, ratings, or priorities when a limited set of labels drives grouping, sorting, plotting, or memory use.
A plain astype("category") conversion lets pandas infer the labels that are already present in a column. Use CategoricalDtype when the allowed labels or sort order must be explicit, especially for ordered values such as low, normal, and urgent.
Categories are not a general replacement for every text column. They work best when values repeat; if most rows contain unique strings, the category metadata can use as much or more memory than the original column. Values outside an explicit category list become missing, so check the converted columns before using them downstream.
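A quick way to judge whether a text column repeats enough is to compare its number of distinct labels with its row count. The sketch below is only illustrative: unique_ratio is a hypothetical helper, the sample data is made up, and the point at which a ratio counts as "too unique" is a judgment call rather than a pandas rule.

```python
import pandas as pd


def unique_ratio(column: pd.Series) -> float:
    """Return distinct non-missing labels divided by non-missing rows."""
    non_missing = column.dropna()
    if non_missing.empty:
        return 0.0
    return non_missing.nunique() / len(non_missing)


# Illustrative data: one column with repeated labels, one with unique strings.
sample = pd.DataFrame(
    {
        "team": ["api", "frontend", "ops", "api"] * 250,
        "request_id": [f"req-{i:05d}" for i in range(1000)],
    }
)

for name in sample.columns:
    print(name, round(unique_ratio(sample[name]), 3))
```

A ratio near 0 suggests a good category candidate; a ratio near 1 suggests the column is better left as text.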
Related: How to convert data types in pandas
Related: How to reduce pandas DataFrame memory usage
Related: How to sort a pandas DataFrame
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(
    {
        "ticket": [101, 102, 103, 104, 105, 106],
        "team": ["api", "frontend", "api", "ops", "frontend", "api"],
        "priority": ["normal", "urgent", "low", "normal", "low", "urgent"],
    }
)

print(f"pandas {pd.__version__}")
print()
print("source dtypes")
print(df.dtypes)
print()

df["team"] = df["team"].astype("category")
print("team dtype")
print(df["team"].dtype)
print()
print("team categories")
print(df["team"].cat.categories)
print()

label_sample = pd.Series(["api", "frontend", "ops", "api"] * 1000, dtype="str")
memory = pd.Series(
    {
        "str_bytes": label_sample.memory_usage(deep=True),
        "category_bytes": label_sample.astype("category").memory_usage(deep=True),
    }
)
print("memory check")
print(memory)
print()

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df["priority"] = df["priority"].astype(priority_dtype)
print("priority dtype")
print(df["priority"].dtype)
print()
print("priority categories")
print(df["priority"].cat.categories)
print()
print("priority ordered")
print(df["priority"].cat.ordered)
print()

print("sorted by priority")
print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
print()

print("missing after conversion")
print(df.filter(["team", "priority"]).isna().sum())
Replace the sample df with the DataFrame already loaded in the working script. Keep the CategoricalDtype categories in the intended sort order.
$ python3 categorical_dtype.py
pandas 3.0.3
source dtypes
ticket int64
team str
priority str
dtype: object
team dtype
category
team categories
Index(['api', 'frontend', 'ops'], dtype='str')
memory check
str_bytes 49132
category_bytes 4171
dtype: int64
priority dtype
category
priority categories
Index(['low', 'normal', 'urgent'], dtype='str')
priority ordered
True
sorted by priority
ticket priority
103 low
105 low
101 normal
104 normal
102 urgent
106 urgent
missing after conversion
team 0
priority 0
dtype: int64
The small memory check uses repeated labels to show the kind of column where category can reduce memory. Measure the real column before treating memory savings as guaranteed.
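To measure a real column, the same before-and-after comparison can be wrapped in a small function. This is a minimal sketch: category_savings is a hypothetical helper name, and the labels Series stands in for a column from the working DataFrame.

```python
import pandas as pd


def category_savings(column: pd.Series) -> dict:
    """Compare deep memory use before and after a category conversion."""
    before = int(column.memory_usage(deep=True))
    after = int(column.astype("category").memory_usage(deep=True))
    return {
        "before_bytes": before,
        "after_bytes": after,
        "saved_bytes": before - after,
    }


# Stand-in for a real column: a few labels repeated many times.
labels = pd.Series(["api", "frontend", "ops", "api"] * 1000)

report = category_savings(labels)
print(report)
```

A negative saved_bytes value would mean the category version is larger, which is the signal to leave that column as text.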
df["team"] = df["team"].astype("category")
pandas infers the category labels from the values present in the column. Inspect df["team"].cat.categories before relying on the label set for validation or export.
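One way to inspect the inferred labels is to compare them with a reference list. In the sketch below, expected_teams is a hypothetical reference set and the stray "Frontend" value is included on purpose to show how a casing mismatch surfaces as its own category.

```python
import pandas as pd

df = pd.DataFrame({"team": ["api", "frontend", "api", "ops", "Frontend"]})
df["team"] = df["team"].astype("category")

# Hypothetical reference list of labels expected in this column.
expected_teams = {"api", "frontend", "ops"}

inferred = set(df["team"].cat.categories)
unexpected = sorted(inferred - expected_teams)
absent = sorted(expected_teams - inferred)

print("unexpected labels:", unexpected)
print("expected but absent:", absent)
```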
priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df["priority"] = df["priority"].astype(priority_dtype)
Any non-missing value that is not listed in categories becomes missing after conversion. Add or correct allowed labels before converting production data.
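A check before conversion can list the source values that the category list does not cover. The following is a sketch under assumptions: the raw Series and its misspelled "Low" label are illustrative, and lowercasing is only one possible cleanup step.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)

# Illustrative source column with one miscased label and one missing value.
raw = pd.Series(["normal", "urgent", "Low", None, "low"], dtype="object")

# Non-missing values that the category list does not cover.
not_allowed = raw[raw.notna() & ~raw.isin(priority_dtype.categories)]
print(not_allowed.value_counts())

# Hypothetical cleanup: normalize case, then confirm every label is covered.
cleaned = raw.str.lower()
covered = bool(cleaned.dropna().isin(priority_dtype.categories).all())
print("all labels covered:", covered)

# Only the value that was already missing stays missing after conversion.
converted = cleaned.astype(priority_dtype)
print("missing after conversion:", int(converted.isna().sum()))
```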
print(df["priority"].dtype)
print(df["priority"].cat.categories)
print(df["priority"].cat.ordered)
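cat.ordered must be True for ordered comparisons such as < and >= and for min() and max(). sort_values() follows the category order even on an unordered categorical, but ordered=True declares that the order is meaningful.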
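Ordered comparisons can be illustrated with a short, self-contained sketch. The sample values are made up; the comparison and min()/max() calls rely on the column being an ordered categorical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
priority = pd.Series(["normal", "urgent", "low", "normal"]).astype(priority_dtype)

# Comparing against a label uses the category order, not alphabetical order.
at_least_normal = priority >= "normal"
print(at_least_normal.tolist())

# min() and max() also follow the declared order.
lowest, highest = priority.min(), priority.max()
print(lowest, highest)
```

The boolean result can be used directly as a row filter, for example to keep only tickets at or above a given priority.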
print(df.filter(["team", "priority"]).isna().sum())
A nonzero count after conversion usually means the explicit category list missed at least one source value.
print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
The sorted output should follow the category order low, normal, urgent rather than alphabetical or source-row order.
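The same category order applies when the sort is reversed. The sketch below uses a small illustrative DataFrame, sorts with the most urgent rows first, and prints cat.codes to show each label's position in the category list.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df = pd.DataFrame(
    {
        "ticket": [101, 102, 103, 104],
        "priority": ["normal", "urgent", "low", "normal"],
    }
)
df["priority"] = df["priority"].astype(priority_dtype)

# Most urgent first: descending by category position, not by spelling.
urgent_first = df.sort_values("priority", ascending=False)
print(urgent_first.to_string(index=False))

# cat.codes exposes each label's position in the category list.
codes = urgent_first["priority"].cat.codes.tolist()
print(codes)
```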