Converting a pandas column to the category dtype stores each distinct label once and represents every row as a small integer code, instead of storing an ordinary string value per row. Use it for fields such as teams, regions, states, ratings, or priorities when a limited set of labels drives grouping, sorting, plotting, or memory use.
A plain astype("category") conversion lets pandas infer the labels that are already present in a column. Use CategoricalDtype when the allowed labels or sort order must be explicit, especially for ordered values such as low, normal, and urgent.
Categories are not a general replacement for every text column. They work best when values repeat; if most rows contain unique strings, the category metadata can use as much or more memory than the original column. Values outside an explicit category list become missing, so check the converted columns before using them downstream.
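The memory trade-off described above can be checked with a rough sketch like the one below. Both Series are hypothetical sample data, and category_ratio is an illustrative helper rather than a pandas function. Exact byte counts depend on the pandas version and string storage, so only the direction of the ratio is meaningful.

```python
import pandas as pd

# Hypothetical sample data: 4,000 rows in each Series.
repeated = pd.Series(["api", "frontend", "ops", "api"] * 1000)
all_unique = pd.Series([f"request-{i:05d}" for i in range(4000)])


def category_ratio(labels: pd.Series) -> float:
    """Return category bytes divided by original bytes (below 1.0 means savings)."""
    original = labels.memory_usage(deep=True)
    converted = labels.astype("category").memory_usage(deep=True)
    return converted / original


repeated_ratio = category_ratio(repeated)
unique_ratio = category_ratio(all_unique)

print(f"repeated labels: {repeated_ratio:.2f}")
print(f"unique labels:   {unique_ratio:.2f}")
```

A ratio below 1.0 suggests the conversion saves memory for that column. A ratio at or above 1.0 suggests the category list costs as much as the strings it replaced.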
Related: How to convert data types in pandas
Related: How to reduce pandas DataFrame memory usage
Related: How to sort a pandas DataFrame
Steps to convert pandas columns to category dtype:
- Save a short categorical conversion script.
categorical_dtype.py

import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(
    {
        "ticket": [101, 102, 103, 104, 105, 106],
        "team": ["api", "frontend", "api", "ops", "frontend", "api"],
        "priority": ["normal", "urgent", "low", "normal", "low", "urgent"],
    }
)

print(f"pandas {pd.__version__}")
print()
print("source dtypes")
print(df.dtypes)
print()

df["team"] = df["team"].astype("category")
print("team dtype")
print(df["team"].dtype)
print()
print("team categories")
print(df["team"].cat.categories)
print()

label_sample = pd.Series(["api", "frontend", "ops", "api"] * 1000, dtype="str")
memory = pd.Series(
    {
        "str_bytes": label_sample.memory_usage(deep=True),
        "category_bytes": label_sample.astype("category").memory_usage(deep=True),
    }
)
print("memory check")
print(memory)
print()

priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df["priority"] = df["priority"].astype(priority_dtype)
print("priority dtype")
print(df["priority"].dtype)
print()
print("priority categories")
print(df["priority"].cat.categories)
print()
print("priority ordered")
print(df["priority"].cat.ordered)
print()

print("sorted by priority")
print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
print()

print("missing after conversion")
print(df.filter(["team", "priority"]).isna().sum())
Replace the sample df with the DataFrame already loaded in the working script. Keep the CategoricalDtype categories in the intended sort order.
- Run the script and confirm that the source label columns become categorical.
$ python3 categorical_dtype.py
pandas 3.0.3

source dtypes
ticket      int64
team          str
priority      str
dtype: object

team dtype
category

team categories
Index(['api', 'frontend', 'ops'], dtype='str')

memory check
str_bytes         49132
category_bytes     4171
dtype: int64

priority dtype
category

priority categories
Index(['low', 'normal', 'urgent'], dtype='str')

priority ordered
True

sorted by priority
 ticket priority
    103      low
    105      low
    101   normal
    104   normal
    102   urgent
    106   urgent

missing after conversion
team        0
priority    0
dtype: int64

The small memory check uses repeated labels to show the kind of column where category can reduce memory. Measure the real column before treating memory savings as guaranteed.
- Convert an unordered label column when the existing labels are the allowed categories.
df["team"] = df["team"].astype("category")
pandas infers the category labels from the values present in the column. Inspect df["team"].cat.categories before relying on the label set for validation or export.
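As a minimal sketch of that inspection, the snippet below rebuilds the team column from the sample script and looks at the inferred labels and the per-row integer codes. The "backend" label is a hypothetical value, used only to show that a label absent from the data is not part of the inferred set.

```python
import pandas as pd

teams = pd.Series(["api", "frontend", "api", "ops", "frontend", "api"])
team_cat = teams.astype("category")

# Inferred categories are the unique values found in the column.
inferred_labels = list(team_cat.cat.categories)
# Each row stores an integer code that indexes into the category list.
row_codes = team_cat.cat.codes.tolist()

# "backend" never appears in the data, so it is not an inferred category.
backend_known = "backend" in team_cat.cat.categories

print(inferred_labels)
print(row_codes)
print(backend_known)
```

Because the label set comes only from the data, a team that has no rows yet is not a valid category until it is added explicitly.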
- Convert an ordered column with an explicit category list.
priority_dtype = CategoricalDtype(
    categories=["low", "normal", "urgent"],
    ordered=True,
)
df["priority"] = df["priority"].astype(priority_dtype)
Any non-missing value that is not listed in categories becomes missing after conversion. Add or correct allowed labels before converting production data.
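One way to catch unlisted labels before they turn into missing values is to compare the source values with the allowed list first. The sketch below assumes a hypothetical source column that contains an extra "critical" label. The names allowed, source, and unexpected are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

allowed = ["low", "normal", "urgent"]
# Hypothetical source column with one label ("critical") missing from the list.
source = pd.Series(["normal", "urgent", "critical", "low"])

# Labels present in the data but absent from the allowed list.
unexpected = sorted(set(source.dropna().unique()) - set(allowed))
print(unexpected)

# Converting anyway turns the unlisted label into a missing value.
converted = source.astype(CategoricalDtype(categories=allowed, ordered=True))
lost_rows = int(converted.isna().sum())
print(lost_rows)
```

An empty unexpected list suggests the conversion will not introduce new missing values.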
- Verify the converted dtype, categories, and order flag.
print(df["priority"].dtype)
print(df["priority"].cat.categories)
print(df["priority"].cat.ordered)
cat.ordered must be True for order comparisons such as < and >, and for min() and max(). Sorting follows the category order whether or not the flag is set.
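The effect of the order flag can be sketched as follows, using the same priority labels as the sample script. The variable names are illustrative. The unordered case is wrapped in try/except because pandas rejects rank comparisons on unordered categoricals.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = ["normal", "urgent", "low", "normal", "low", "urgent"]
ordered_dtype = CategoricalDtype(categories=["low", "normal", "urgent"], ordered=True)
priority = pd.Series(labels).astype(ordered_dtype)

# Ordered categoricals support rank comparisons against a listed label.
above_low = (priority > "low").tolist()
highest = priority.max()

# The same comparison on an unordered categorical raises TypeError.
unordered = pd.Series(labels).astype("category")
try:
    unordered > "low"
    unordered_error = None
except TypeError as exc:
    unordered_error = type(exc).__name__

print(above_low)
print(highest)
print(unordered_error)
```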
- Check that conversion did not create unexpected missing values.
print(df.filter(["team", "priority"]).isna().sum())
A nonzero count after conversion usually means the explicit category list missed at least one source value.
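A nonzero count can also come from values that were already missing in the source. The sketch below assumes a hypothetical column with one pre-existing missing value and one unlisted label, and separates the two cases by comparing the missing-value masks before and after conversion.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical column: one value is already missing, one label is unlisted.
source = pd.Series(["low", None, "critical", "urgent"])
dtype = CategoricalDtype(categories=["low", "normal", "urgent"], ordered=True)
converted = source.astype(dtype)

missing_before = int(source.isna().sum())
missing_after = int(converted.isna().sum())

# Rows that had a value before conversion but lost it afterwards.
lost_mask = source.notna() & converted.isna()
lost_labels = source[lost_mask].drop_duplicates().tolist()

print(missing_before, missing_after)
print(lost_labels)
```

The labels in lost_labels are the ones to add to the category list or to correct in the source data.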
- Sort or compare the ordered column before using it in downstream output.
print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
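The sorted output should follow the category order low, normal, urgent rather than source-row order. For these particular labels the category order happens to match alphabetical order, so the effect of the dtype on sorting only becomes visible with labels such as low, medium, high.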
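To see category order and alphabetical order diverge, the sketch below uses a hypothetical severity column with the labels low, medium, and high. These labels are not part of the sample script. It sorts the column once as plain strings and once as an ordered categorical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels whose intended order differs from alphabetical order.
severity = pd.Series(["medium", "high", "low", "medium"])

# Plain strings sort alphabetically.
alphabetical = severity.sort_values().tolist()

# An ordered categorical sorts by the position of each label in the category list.
severity_dtype = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
by_category = severity.astype(severity_dtype).sort_values().tolist()

print(alphabetical)
print(by_category)
```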
Related: How to sort a pandas DataFrame
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.